The idea for this workshop is to extract conceptual structures that provide insights about Wikipedia data or allow users to explore Wikipedia data. Participants can use any one or any combination of the data sets below or even combine these data with other data about Wikipedia.
The files are pipe-delimited CSV files. Further descriptions are provided below the table. In all likelihood, all of the files are too large to be processed directly by FCA and CG software and will first require data selection or mining techniques.
|file name||zipped filesize||unzipped filesize||n-tuple||format||nr of rows||nr of objects||nr of attributes|
|article_category.csv||135MB||700MB||pair||page|category||12,161,691||3 million||0.6 million|
|cat_concept||4MB||17MB||single||names of categories||632,615||N/A||N/A|
|cat_related.csv||209K||1MB||pair||category|see also category||19,072||12,805||13,134|
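One possible data selection step is to stream a pair file row by row and keep only the rows whose category belongs to a chosen set, which yields a formal context small enough for FCA tools. The following is a minimal sketch, not a prescribed method. It assumes each row has the form `page|category`, that there is no header row, and that names contain no pipe characters. The function name `select_context` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict

def select_context(lines, wanted_categories):
    """Stream pipe-delimited page|category rows and keep only wanted categories.

    Returns a dict mapping each page (object) to its set of categories (attributes).
    """
    context = defaultdict(set)
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip malformed rows rather than guess their meaning
        page, category = row
        if category in wanted_categories:
            context[page].add(category)
    return dict(context)

# Made-up sample rows standing in for article_category.csv
sample = io.StringIO(
    "Albert_Einstein|Physicists\n"
    "Albert_Einstein|Nobel_laureates\n"
    "Marie_Curie|Physicists\n"
    "Paris|Capitals\n"
)
small = select_context(sample, {"Physicists", "Nobel_laureates"})
```

For the real file, an open file handle can be passed in place of the sample, for example `open("article_category.csv", encoding="utf-8", newline="")`; the UTF-8 encoding is an assumption. Because the rows are read one at a time, only the selected subset is held in memory.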
The data in article_category.csv was provided by DBpedia as triples, but because the middle element is always <http://purl.org/dc/terms/subject>, we deleted it and turned the data into a binary relation. On the left are DBpedia resources (i.e. regular Wikipedia pages) and on the right are DBpedia/Wikipedia categories.
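For participants who start from the original DBpedia triples, the reduction to a binary relation described above could be sketched as follows. This is an illustration in Python, not the script that produced the workshop files. It assumes the triples are in N-Triples form (`<subject> <predicate> <object> .`) and that resources share the prefix `http://dbpedia.org/resource/`. The helper names are hypothetical, and whether the prepared files keep a `Category:` prefix on category names is not something this sketch settles.

```python
import re

SUBJECT = "http://purl.org/dc/terms/subject"
RESOURCE_PREFIX = "http://dbpedia.org/resource/"  # assumed common prefix
# Assumed N-Triples shape: <subject> <predicate> <object> .
TRIPLE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")

def strip_prefix(url, prefix=RESOURCE_PREFIX):
    """Drop the resource prefix if present; otherwise return the URL unchanged."""
    return url[len(prefix):] if url.startswith(prefix) else url

def triples_to_pairs(lines):
    """Yield 'page|category' strings from dc:subject triples."""
    for line in lines:
        match = TRIPLE.match(line)
        if match is None:
            continue  # comments, blank lines, or triples with literal objects
        subj, pred, obj = match.groups()
        if pred != SUBJECT:
            continue  # keep only the subject relation
        yield f"{strip_prefix(subj)}|{strip_prefix(obj)}"

# Made-up sample input
sample = [
    "<http://dbpedia.org/resource/Paris> <http://purl.org/dc/terms/subject> "
    "<http://dbpedia.org/resource/Category:Capitals_in_Europe> .",
    "# a comment line",
]
pairs = list(triples_to_pairs(sample))
```

Stripping a known prefix, rather than cutting at the last `/`, keeps page names that themselves contain a slash intact.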
In DBpedia the category data was originally a single file, but we split it into three files and omitted some of the information. The first file (cat_broader.csv) contains categories and their broader categories. The second (cat_concept) contains just the names of categories. These should be the same as the categories (attributes) in the file article_category.csv and a superset of the objects and attributes in the file cat_broader.csv. The last file (cat_related.csv) contains categories and their "see also" linked categories.
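The expected relationship between the category files can be checked before relying on it. The sketch below, in Python, reports any category that appears in the broader relation but is missing from the list of category names. It assumes that cat_broader.csv rows have the form `category|broader category` and that cat_concept holds one category name per line; both layouts are assumptions, as are the function name and the sample data.

```python
def missing_from_concepts(concept_lines, broader_lines):
    """Return category names used in the broader relation but absent from the concept list."""
    concepts = {line.strip() for line in concept_lines if line.strip()}
    used = set()
    for line in broader_lines:
        parts = line.rstrip("\n").split("|")
        if len(parts) == 2:
            used.update(parts)  # both the category and its broader category
    return used - concepts

# Made-up sample data
concept_sample = ["Physics\n", "Science\n", "Optics\n"]
broader_sample = ["Optics|Physics\n", "Physics|Science\n", "Alchemy|Science\n"]
missing = missing_from_concepts(concept_sample, broader_sample)
```

An empty result would confirm the superset relationship for the files at hand; a non-empty one lists the names to investigate.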
The original DBpedia file also contained mappings from the URL encoded names to the plain names of categories. These have been omitted in our data set.
This is the data from Wikipedia infoboxes, which are the boxes on the top right-hand side of many Wikipedia pages. The data is represented as triples: first the name of the Wikipedia page, then the name of the property, then the value of the property. The names of the properties are described using terms from an ontology. According to DBpedia, this ontology is more consistent than the terms used on the original Wikipedia pages, which vary somewhat. The values of the properties can be literal values (followed by ^^ and the name of a unit) or links to other pages.
In principle, each infobox type could be considered a conceptual graph or a many-valued formal context.
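To illustrate the many-valued reading, the infobox triples can be grouped into a table-like structure with pages as objects, properties as attributes, and property values as entries. The following Python sketch assumes pipe-delimited rows of the form `page|property|value` and literal values written as `literal^^unit`. The property and unit names in the sample are made up, and grouping by infobox type would be an additional step not shown here.

```python
from collections import defaultdict

def parse_value(raw):
    """Split 'literal^^unit' into (literal, unit); links to pages get unit None."""
    literal, sep, unit = raw.rpartition("^^")
    if sep:
        return literal, unit
    return raw, None

def many_valued_context(lines):
    """Build {page: {property: [(value, unit), ...]}} from page|property|value rows."""
    context = defaultdict(lambda: defaultdict(list))
    for line in lines:
        parts = line.rstrip("\n").split("|", 2)
        if len(parts) != 3:
            continue  # skip rows that do not look like triples
        page, prop, raw = parts
        context[page][prop].append(parse_value(raw))
    return {page: dict(props) for page, props in context.items()}

# Made-up sample rows
sample = [
    "Paris|country|France\n",
    "Paris|populationTotal|2243833^^integer\n",
    "Berlin|country|Germany\n",
]
ctx = many_valued_context(sample)
```

Values are kept in a list per property because a page may carry several values for the same property. Turning this structure into a one-valued formal context would still require a scaling step chosen by the participant.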
The original DBpedia files are about 3-4 times larger than our files because each entry is represented as a URL. Perl scripts were used to strip the URLs and thus reduce the file sizes. Participants who prefer to may use the original DBpedia files instead of the ones we prepared.