In Switzerland, opendata.swiss is the go-to location for open datasets from federal, cantonal or municipal sources. From a societal and economic perspective, the portal is an important asset in line with the “protect private data, make use of public data” mantra, and it has already led to digital innovation through many third-party applications. In this research blog post, we look at some numbers associated with the portal.
According to the website, 6994 datasets are present. However, the API reveals that the actual count is already slightly higher at 7008. Each dataset can reference several resources, which are downloadable files or pointers to other APIs. With a total of 28436 resources, each dataset thus links to about four resources on average. Each resource has a particular type. In total, opendata.swiss contains resources of 78 types, including some that are aliases for others. This reflects a consistency issue; our recommendation is to enforce uniform tagging, especially in the mime-type metadata.
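The averages above can be reproduced with a few lines of Python. The following is a minimal sketch, assuming the dataset records have already been fetched from the portal API as a list of dictionaries; the field names `resources`, `media_type` and `format` are assumptions about the record layout, and the sample records are made up for illustration.

```python
from collections import Counter

def summarize(datasets):
    """Return (dataset count, resource count, mean resources per dataset, type counter)."""
    types = Counter()
    n_resources = 0
    for ds in datasets:
        for res in ds.get("resources", []):
            n_resources += 1
            # Assumed fields: fall back to the free-text format when no media type is set.
            types[res.get("media_type") or res.get("format") or "unknown"] += 1
    mean = n_resources / len(datasets) if datasets else 0.0
    return len(datasets), n_resources, mean, types

# Made-up sample records, only to show the expected input shape.
sample = [
    {"name": "a", "resources": [{"media_type": "text/csv"}, {"format": "CSV"}]},
    {"name": "b", "resources": [{"media_type": "text/html"}]},
]
n_ds, n_res, mean, types = summarize(sample)
print(n_ds, n_res, mean, dict(types))

# With the figures from the text: 28436 resources over 7008 datasets.
print(round(28436 / 7008, 2))
```

Note that the sample deliberately contains both `text/csv` and `CSV`, which end up as two separate types unless aliases are merged.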
The top 14 resource types each have at least a three-digit number of downloadable or queryable resources. They represent only 14 of the 78 resource types, but 85% of resources. By far the largest type is HTML pages (text/html) with more than 10’000 resources. The complete top 14 follows here:
- text/html: 10149 (web pages)
- application/vnd.openxmlformats-officedocument.spreadsheetml.sheet: 4021 (newer Excel files, i.e. XLSX)
- application/vnd.ms-excel: 2366 (older Excel files, i.e. XLS)
- application/pdf: 2089 (PDF files)
- text/csv: 1533 (including 143 inconsistently tagged as CSV)
- application/zip: 846 (including 390 inconsistently tagged as ZIP)
- application/json: 729 (including 144 inconsistently tagged as JSON)
- MULTIFORMAT: 563
- WMS: 464 (OGC/Web Map Service)
- WFS: 341 (OGC/Web Feature Service)
- SHP/ESRI Shapefile: 304 (without proper mime-type; these are GIS files)
- dxf/DXF: 268 (again without proper mime-type)
- text/xml: 211
- gpkg: 168 (OGC/GeoPackage, an SQLite format with GIS content)
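Merging aliases and computing the share of the top types is simple to sketch. The alias table below is hypothetical (the exact spelling of the inconsistent tags on the portal is an assumption), while the counts are taken from the list above.

```python
from collections import Counter

# Hypothetical alias table: free-text tags mapped to one canonical label.
ALIASES = {
    "csv": "text/csv",
    "zip": "application/zip",
    "json": "application/json",
}

def canonical(tag):
    """Map a tag to its canonical label; unknown tags are kept unchanged."""
    cleaned = tag.strip()
    return ALIASES.get(cleaned.lower(), cleaned)

def merge_aliases(counts):
    """Sum the counts of all tags that share a canonical label."""
    merged = Counter()
    for tag, n in counts.items():
        merged[canonical(tag)] += n
    return merged

def top_share(counts, k, total):
    """Fraction of `total` covered by the k most frequent types."""
    return sum(n for _, n in counts.most_common(k)) / total

# Properly tagged vs. inconsistently tagged counts, derived from the list above.
raw = Counter({
    "text/csv": 1390, "CSV": 143,
    "application/zip": 456, "ZIP": 390,
    "application/json": 585, "JSON": 144,
})
merged = merge_aliases(raw)

# The remaining eleven entries of the top 14, as listed above.
listed = merged + Counter({
    "text/html": 10149,
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": 4021,
    "application/vnd.ms-excel": 2366,
    "application/pdf": 2089,
    "MULTIFORMAT": 563, "WMS": 464, "WFS": 341,
    "SHP": 304, "DXF": 268, "text/xml": 211, "gpkg": 168,
})
share = top_share(listed, 14, 28436)
print(f"top 14 cover {share:.1%} of all resources")
```

A portal-side fix would apply such a mapping at registration time instead of after the fact.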
The number of easily machine-readable resources (in XML, JSON or CSV formats) is quite low compared to the unstructured or semi-structured formats (PDF, HTML) and the structured but proprietary formats (XLS(X)). Given the focus on openness, it is somewhat surprising that not more data is made available in text file and open binary file formats; indeed, approximately one quarter of all resources have non-open mime-types. However, at least read-only support is available for most of them one way or another.
Besides the mime-type consistency issues mentioned above, there are more issues that could be fixed on the portal side or on the data supplier side. Sometimes, URLs were registered as IRIs (internationalized resource identifiers) with umlauts; presumably somebody then had an issue downloading the files, and they were renamed on the server without changing the registration in opendata.swiss. An example from the city of Bern: registration, actual file. Both differ only in a vs. ä.
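A client can work around such mismatches by trying more than one spelling of a registered address. The sketch below percent-encodes non-ASCII characters in the path and query (the host name is deliberately left untouched) and additionally generates an umlaut-folded variant, mirroring the a vs. ä case. The folding table and the example URL are hypothetical, not taken from the portal.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Hypothetical folding table for the "renamed on the server" case.
UMLAUT_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "Ä": "A", "Ö": "O", "Ü": "U"})

def iri_to_uri(iri):
    """Percent-encode non-ASCII characters in path and query as UTF-8."""
    parts = urlsplit(iri)
    # "%" stays safe so that already-encoded sequences are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

def candidate_urls(iri):
    """Return the encoded address first, then an umlaut-folded variant if it differs."""
    candidates = [iri_to_uri(iri)]
    folded = iri.translate(UMLAUT_FOLD)
    if folded != iri:
        candidates.append(iri_to_uri(folded))
    return candidates

# Made-up address, for illustration only.
for url in candidate_urls("https://example.org/daten/bäume.csv"):
    print(url)
```

Trying the candidates in order keeps the registered spelling as the first choice and only falls back to the folded one when the first request fails.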
Some of the content furthermore suffers from spelling issues. The word ‘initiative’ is quite important in Swiss democracy, yet the canton of Zug manages to provide three different spellings, including two wrong ones on five occasions: example1, example2.
Some resources cannot be accessed (perhaps no longer). For this case, a systematic and regular HTTP check should be performed, and a resource should be tagged as broken if the check fails. Among the 1390 CSV files correctly tagged with the “text/csv” mime-type alone, 1 is invalid due to the absence of a URL field in the portal (another consistency issue), 26 are empty and have to be discarded, 23 require smart rewriting (e.g. a missing http(s):// prefix at the start) and can be rescued this way, and 96 fail for various reasons, most with error 404 (not found). Hence, only 1267 (91%) can be retrieved, which is a somewhat low number in the cloud era with its expectation of 99.99% availability.
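Such a check could look roughly as follows. This is a minimal sketch using only the Python standard library, not the procedure used for the numbers above: the status labels and the choice of a HEAD request are assumptions, and detecting empty files would additionally require downloading the content.

```python
import http.client
import urllib.error
import urllib.request

def repair_url(raw):
    """Return a fetchable URL, or None if the URL field is missing or blank."""
    url = (raw or "").strip()
    if not url:
        return None
    if url.startswith("//"):
        return "https:" + url      # protocol-relative address
    if "://" in url:
        return url                 # a scheme is already present
    return "https://" + url        # assume HTTPS when the scheme is missing

def check_resource(raw, timeout=10):
    """Classify a resource as 'ok', 'missing-url', 'http-<code>' or 'unreachable'."""
    url = repair_url(raw)
    if url is None:
        return "missing-url"
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return "ok" if response.status < 400 else f"http-{response.status}"
    except urllib.error.HTTPError as err:
        return f"http-{err.code}"  # e.g. http-404
    except (OSError, ValueError, http.client.HTTPException):
        return "unreachable"       # DNS failure, timeout, malformed address, ...

# Tally from the text: 1390 files, minus 1 without URL, 26 empty, 96 failing.
retrievable = 1390 - 1 - 26 - 96
print(retrievable, f"{retrievable / 1390:.0%}")
```

Run on a schedule, the result of `check_resource` could be stored next to each resource so that broken entries are flagged rather than silently kept.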
The 1267 CSV files weigh in at 57 GB in total. Among them, the VBZ tram and bus punctuality measurements, collected in 225 files, account for 53 GB alone. Another noteworthy subset is the laws and initiatives from three cantons (ZH, GR, ZG), with 159 files but only 4 MB in size. The single biggest file is the GA/half-tax fare information with 347 MB, which, like many other datasets, is provided by SBB.
Looking closer at the content of the CSV files (some of this applies to JSON, XML and XLS files as well), we can distinguish certain types depending on the presence of a temporal, spatial or other order. Datasets on laws are usually static. The VBZ data is incremental in series, with one file per week, i.e. the historic files are static. Others are updated in place: the archive of parliamentary votes is extended chronologically; the SBB employees per canton and the weather history by MeteoSwiss are extended semi-chronologically, where the date is secondary to another attribute, e.g. the work canton or the weather station. The SBB train locations are updated by overwriting the previous data in real time. And finally, the places of residence of the members of the Zurich parliament (KRDB) do not show any apparent order, and yearly updates appear in random locations.
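The distinction between chronological, semi-chronological and unordered files can be checked mechanically. The following sketch assumes rows parsed into dictionaries with ISO-formatted dates (which sort correctly as plain strings); the column names and sample rows are made up for illustration.

```python
def classify_order(rows, date_key, group_key=None):
    """Classify the date ordering of a table.

    'chronological'      : dates are sorted over the whole file
    'semi-chronological' : dates are sorted only within each group
    'unordered'          : neither holds
    """
    dates = [row[date_key] for row in rows]
    if dates == sorted(dates):
        return "chronological"
    if group_key is not None:
        by_group = {}
        for row in rows:
            by_group.setdefault(row[group_key], []).append(row[date_key])
        if all(d == sorted(d) for d in by_group.values()):
            return "semi-chronological"
    return "unordered"

# Made-up rows: the date is secondary to the weather station.
weather = [
    {"station": "BER", "date": "2020-01-01"},
    {"station": "BER", "date": "2020-01-02"},
    {"station": "ZRH", "date": "2020-01-01"},
    {"station": "ZRH", "date": "2020-01-02"},
]
print(classify_order(weather, "date", "station"))
```

Knowing the class in advance tells a consumer whether appending new rows is enough or whether the whole file has to be re-read on every update.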
We had some fun analysing the data and hope that it will help not only us, but also others, to think about how to increase data quality and ease of handling in order to drive the next generation of data-exploiting applications. We also hope that the weaknesses identified in the portal and in some datasets lead to improvements in the offering.