Anzacathon: the benefits of Open Data


25 April is Anzac Day in Australia, New Zealand and many communities around the world where Anzacs have served. Given the shift to online collaboration in 2020, Anzacathon has been set up to help people engage online.

One of the key themes of Anzacathon is data: finding new ways to use the data and also demonstrating the benefits of community engagement with open data.

With that in mind, I'm providing some tutorials for users of PostgreSQL and R to access the data and some simple examples to use it.

Data access over IPFS

The raw data files are shared over IPFS. They are in SQLite format, as this provides a convenient mechanism to query the data with SQL commands directly over IPFS.

The pgloader tool provides a convenient way to load SQLite databases directly from IPFS into a PostgreSQL schema. There is no need to download the SQLite files, we can simply mount IPFS like a filesystem. A sample configuration for PostgreSQL users is provided.

For R users, there is an RSQLite module that allows R to access SQLite data. Once again, there is no need to download the data, any file in IPFS can be opened directly using fuse.

R users who want to use this method first need to follow the first part of the PostgreSQL setup instructions, up to the point where you start the IPFS daemon process.

Finding the lone Anzacs

Every year, there are huge gatherings at Gallipoli and other sites where large numbers of Anzacs died. Due to Coronavirus, those gatherings won't take place in 2020.

Therefore, we have the opportunity to discover lesser known places where Anzacs have died. In one place I discovered in the French alps, there is a single Australian airman buried alongside his British colleagues. After a quick search of the National Archives and a hike up the mountain with my camera, I created a visual story about their mystery.

How many more cemeteries in France and other places contain a single Anzac like this? We can't ask that question through any of the web sites but it is very easy with SQLite or PostgreSQL:

SELECT cemeterymemorial, country, COUNT(cemeterymemorial)
FROM cwgc_casualty
GROUP BY cemeterymemorial, country
HAVING COUNT(cemeterymemorial) = 1;

The query lists 1,143 cemeteries around the world having a single Anzac.

We could further refine that to focus on a single country, such as France:

SELECT cemeterymemorial, country, COUNT(cemeterymemorial)
FROM cwgc_casualty
WHERE country = 'France'
GROUP BY cemeterymemorial, country
HAVING COUNT(cemeterymemorial) = 1;

Of 1,143 cemeteries around the world, 192 are in France.

Using a subquery, we can see the names, services numbers and dates when those Anzacs died:

SELECT cemeterymemorial, country, forename, surname, TRIM(servicenumberexport, '''') as servicenumber, date_of_death, date_of_death2
FROM cwgc_casualty
WHERE cemeterymemorial IN (
SELECT cemeterymemorial
FROM cwgc_casualty
GROUP BY cemeterymemorial
HAVING COUNT(cemeterymemorial) = 1);

For convenience, I'm providing that list as a spreadsheet that you can download and explore.

With that list, it is fairly easy to get the name and service number of the relevant Anzac and look up the scanned copy of his paper file at the National Archives. Given the unique circumstances of these casualties, their files often contain something notable. For example, some of them were on particularly dangerous missions deep behind enemy lines during the time France was occupied.

Anzacs within a given radius (PostgreSQL / PostGIS)

The CWGC web site allows people to search for Anzacs by specificying regions, such as the department within France. There are 95 departments, far more than the 6 states in Australia. It can be a lot more convenient to search by distance/radius from of a point where you live or plan to travel.

Fortunately, we have the PostGIS extension. The PostGIS FAQ includes a specific example, the best way to find all objects within a radius of another object, using the ST_DWithin function.

The PostgreSQL setup documentation includes the necessary code to enable PostGIS and add extra columns to the tables containing the PostGIS objects encapsulating latitude ang longitude values.

This is a basic example of how to use ST_DWithin to obtain sites within a specified radius, returning the distances in kilometers:

SELECT cemetery_desc,
ROUND(ST_Distance(location_gis, ('SRID=4326;POINT(46.5535 6.6523)')::geography)/1000) AS d
FROM iwmcemeteries
WHERE ST_DWithin(location_gis,('SRID=4326;POINT(46.5535 6.6523)')::geography, 100000)
ORDER BY d;

The schema also includes a view, anzac_sites, that uses the SQL UNION mechanism to concatenate both the CWGC and TracesOfWar tables. We can access results from both tables using a single query like this:

SELECT source, description,
ROUND(ST_Distance(location_gis, ('SRID=4326;POINT(46.5535 6.6523)')::geography)/1000) AS d
FROM anzac_sites
WHERE ST_DWithin(location_gis,('SRID=4326;POINT(46.5535 6.6523)')::geography, 100000)
ORDER BY d;

Anzac family names (using R)

To begin, it is necessary to install the R modules RSQLite and Plyr. On a Debian system, that can be done with:

apt install r-cran-rsqlite r-cran-plyr

and on any other type of system:

install.packages("RSQLite")
install.packages("plyr")

As discussed above, make sure that the IPFS daemon is running. See the first part of the PostgreSQL setup instructions.

Start the R command line and from there, it is possible to verify you have access to the data over IPFS:

library(DBI)
cwgc_cemeteries <- dbConnect(RSQLite::SQLite(), "/ipfs/QmRgD8xHJXKGwE1S1YySUKQGt25EK8R2W1HTCCEA7sKDLy")
dbListTables(cwgc_cemeteries)
cwgc_casualty <- dbConnect(RSQLite::SQLite(), "/ipfs/QmPVkHJrSYoeig71EzrxU45zTMbFefQgwdRDSHv7fpjChA")
dbListTables(cwgc_casualty)

To verify this, we can load the entire content of one table into a data frame and count the frequency of surnames:

data <- dbReadTable(cwgc_casualty, "cwgc_casualty")
summary(data)

library(plyr)
fd = count(data, 'surname')
index <- with(fd, order(freq, surname))
tail(fd[index,])

The results will look something like this:

       surname freq
18130 TAYLOR 467
19939 WILSON 549
9558 JONES 590
2303 BROWN 628
19901 WILLIAMS 635
17182 SMITH 1376

While this may be a trivial example, it demonstrates that we can use datasets from the IPFS cloud directly in R.

Making data personal

A British police chief suggested on 30 March that messages about flattening the curve were not getting through to many people. Five days later, the British PM was admitted to hospital with Coronavirus. Most of Britain spent the next week intensely following the news as he went in and out of intensive care. This example demonstrates the power of personal stories and examples over statistics and charts.

Using the tools described above, we can examine this data set at scale and also hone in on personal stories and acute examples of tragedy. Reading through the National Archives, there are plenty of personal letters from parents, spouses and children of missing Anzacs. It is particularly important to be respectful with a data set like this.

Why not attack Coronavirus this weekend?

The EUvsVirus hackathon takes place the same weekend as Anzac Day. Some people asked me why I don't put energy into that instead.

There are many answers to that question.

One that is on the top of my mind is that I like to finish things. Together with a dedicated group of volunteers in Kosovo, we had started looking at this data in 2019 and I feel this is a great way to take what I've learnt and hand it off to the crowd.

There have already been a number of dedicated events like the Bio-hackathon. For people unfamiliar with the science, it can be difficult to simultaneously learn about data science and bioinformatics. The Anzac data set provides an opportunity to boost skills with data science while working with an important data set that many people can already understand.

Innovations from this hackathon, such as the use of IPFS to share data sets between participants are directly transferrable to bio-hackathons and many other use cases.

Initial directions for studying the data

Here are some thoughts that come to mind: