…nein, nicht ganz, denn wir befinden uns noch mitten im April. Trotzdem hat sich das d:swarm-Team an die Arbeit gemacht, die aktuelle d:swarm-Webseite etwas zu erneuern und aktualisieren. Es sollte nun eine (hoffentlich) klare Struktur existieren und die Inhalte und Informationen auch entsprechend gut erreichbar sein. Bitte teilt uns gern mit, ob ich die neue Seitenstruktur und deren Inhalte so gefällt, oder auch nicht, oder ob Ihr noch irgendwelche Inhalte und Informationen vermisst bzw. nicht richtig oder vollständig dargestellt seht.
Archiv für den Autor: Thomas Gängler
d:swarm – v0.9.1 release
Wir sind froh, eine (weitere) neue Version von d:swarm präsentieren zu können, die interessante neue Funktionen mitbringt. Weiterhin ist d:swarm erfolgreich an der SLUB Dresden implementiert und beliefert ab sofort den Bibliothekskatalog im Produktivbetrieb. Weil wir unseren letzten Release zur ELAG 2015 hier im Blog gar nicht angekündigt haben, versuchen wir jetzt alle Neuerungen seit dem letzten offiziellen Release im April aufzuführen.
Also hier ist die Liste der Funktionalitäten, die seit dem letzten Release hinzugekommen sind oder erweitert wurden:
- JSON import
- extend support of inbuilt schemata, e.g.,
- OAI-PMH + DC Terms, OAI-PMH + DC Elements, OAI-PMH + MARCXML, PNX, MARCXML, MABXML (since ELAG 2015 release)
- OAI-PMH + DC Elements + Europeana (since v0.9.1)
- task processing “on-the-fly” (streaming variant via Task Processing Unit (TPU) + streaming enabled at d:swarm backend):
- enables a large(r) scale data processing at an acceptable amount of time
- mappings can be taken from multiple projects
- TPU tasks can be executed in parallel threads, when a data source is split into multiple chunks
- XML can be exported on-the-fly (i.e. the result of the task processing is not stored centrally in the Data Hub)
- input schemata can be re-utilised (e.g. useful for CSV data)
- new transformation functions:
- regex map lookup (regexlookup): a lookup map that allows regular expression as keys
- dewey validating/parsing (dewey)
- HTTP API request function (http-api-request)
- JSON parse function (parse-json)
- (real) collect function (collect)
- numeric filter (numfilter)
- ISSN validation/parsing (issn)
- convert value (convert-value): to be able to create, e.g., resources (instead of literals) as object
- enhanced transformation functions:
- SQL map (sqlmap)
- new filter functions:
- numeric filter
- (regex was default)
- skip filter: for whole job, i.e., only records that match the criteria of the filter will be emitted
- record search/selection: to be able to execute preview/test tasks @ Backoffice with specific records
- various maintenance/productivity improvements, e.g.,
- copy mappings (to new project with input data model with very similar schema, e.g., a schema that was generated by given example data to an inbuilt schema)
- migrate mappings (to new project with input data model with somehow similar schema, e.g., OAI-PMH MARCXML to MARCXML)
- enhanced error handling/ more expressive logging + more expressive messages for user
- various bug fixes and improvements over all parts of the application (frontend, backend, graph extension, …)
- XML enhancing step (experimental)
- (deprecate data model (backend only))
- (deprecate records in a data model (backend only))
- scale Data Hub (mainly optimize storage footprint + re-write some indexing approaches)
- transformation logic widget (arrow drawing) has been redesigned
Wir freuen uns über Tests der Erweiterungen und Verbesserungen auf unserem Live-Demo-Server und über jede Form von Feedback über einen der vielen Kommunikationskanäle: Twitter, Gitter, Google+, Mailingliste, Issue Tracker or einfach per E-Mail an team[at]dswarm[dot]org.
Viel Spaß beim Testen und Evaluieren!
Data Hub Scaling Challenges
preface: This blog post is generally intended for a more technical audience, i.e., software developers, administrators or knowledge engineers. Albeit, all other interested people are also invited to read it 😉
d:swarm is successfully deployed and in production mode at SLUB Dresden right now. At the current state we make use of the streaming variant. This variant has especially been chosen to overcome the shortcomings that have been experienced via applying larger amounts of data through the Data Hub of d:swarm. This blog post describes the current state and experiences that has been made with the Data Hub in d:swarm and, finally, the challenges that needs to be tackled when scaling it to approach the vision of a central knowledge graph.
Current State of the Data Hub
Currently, all content data (bibliographic records – a collection of statements) in d:swarm is stored in a graph database, which is Neo4j right now. We’ve implemented a graph data model (GDM) on top of the Property Graph data model that leverages features of the RDF data model (see also here for a comparison of our GDM vs RDF). We’ve chosen this database type first, since we could fit very naturally the graph data model that fits all our requirements into the the Property Graph data model. Amongst other following demands needs to be fulfilled:
- storing data from arbitrary data sources in a generic format without information loss,
- qualified attributes (e.g. order or index position) or
- versioning at a statement-level base
All content data will be processed via an unmanaged extension that runs at the Neo4j database server to provide custom HTTP APIs to process content data more efficiently between the d:swarm backend and the database that was chosen as Data Hub.
Testing, Testing, Testing
Since we started intensively working at scaling the write performance at our Data Hub implementation in April 2015, i.e., first testing the Data Hub with some larger data sets via different write opportunities, we tackled some challenges that turned out through our testing. Neo4j support two different write modi: batch insert, where the database needs to be offline and transactional, i.e. at a running database instance. We could identify two main demands from our testing observations:
- reduce the storage amount of the database itself (initially we observed storage footprint with a ratio from 1:72 (original source data vs. processed GDM data in the Data Hub))
- increase the speed (throughput) at write operations
… and Iterate
So we tried to reduce the amount of information that really needs to be stored in the data hub as much as possible. One the one hand, we eliminated all unnecessary or duplicated information, e.g., write the type (/class) information of resources or bnodes only as node label and not as separate relationship to a type node. Those often occurring type relationships can „end“ in a super nodes (/dense nodes), i.e. nodes with many ingoing (or outgoing) relationships. On the other hand, we tried to make necessary information as compact as possible, e.g., via hashing. We are applying robust and fast hashing algorithms. Currently, we make use of SipHash. We even hash UUIDs Another approach that we’ve implemented to tackle information reduction is compressing. For example, we make use of prefixed URIs (instead of full-fledged ones).
Finally, we made use of another index engine (MapDB) for exact-match indices, which are the majority of our indices. For example, an index that guarantees that each statement does only exists once per data model and version range. So this index acts like a constraint or entry barrier.
… to Improve and Be Prepared for Scaling
With all those improvements we were able to reduce the storage footprint of the Data Hub by a factor from 3 to 7, i.e., depending on the structure of the original data source (e.g. flat or more hierarchical) we now have ratios from 1:10 to ~ 1:23. The resulting data storage footprint ratio are applicable for us right now, since we need to keep in mind that we usually add additional information to the data in its original format, e.g., versioning or provenance information. The major storage reduction was achieved by the introduction of prefixed URIs. This slackened the processing a bit and made it a bit more complex, i.e., a mechanism and index to do the namespace/prefix resolving was necessary.
… but Are We There Yet?
All the improvements that we’ve implemented so far mainly dealt with reducing the amount of the database itself. However, we weren’t really able to get on the second challenge – increasing the speed at write operations. Since, we prefer to do everything at a running database instance, we can only rely on the transactional processing. So we cannot take advantages of the batch mode at this time. A processing (import + transform + export) of rather flat XML source data – 1.2 million records – still took ~ 13.7 hours. To illustrate this (one of many) dataset a bit more, these are ~ 121 million statements. Whereby, a records consists of ~ 20 – 250 statements.
This duration was not acceptable for SLUB having the amount of information in mind that should finally be processed:
~ 100 million records (~ 2 billion statements)
That’s why, we decided to bring forward implementing the streaming variant that bypasses the Data Hub completely and, which was requested by other interested institutions (e.g. Dortmund University Library). Plain data processing without storing anything in between took only 429 seconds for the mentioned 1.2 million records data set. Nevertheless, there’s still a challenge left
d:swarm – a Challenge to Modern Databases?
So what’s it what we are looking for (ideally)?
- arbitrary data can be stored in a generic data format (without information loss)
- statements (simple subject-predicate-object sentences) are our atomic unit of information
- records (a collection of statements) are our main unit of processing (that can be flat or hierarchical)
- qualified attributes can be added to statements, e.g., order or provenance
- versioning can be done at a statement-level base
- the data hub can be partitioned semantically, e.g., all data that belong to a certain data model can be stored and retrieved rather easily
- an intuitive query language to work with the data
- an indexing engine to speed up read operations
- constraints can be defined to guarantee data integrity
- data is stored in an RDF-compatible format or can be converted to RDF rather easily
- ~ 100 million records (~ 2 billion statements) can be written within a day
A system consisting of one or more (, different) storage solutions (e.g. a Data Hub powered by polyglot persistence) that fulfil all these requirements and demands might be an opportunity or a re-evaluation of the use cases that should be possible with such a system may reduce one or more features.
d:swarm Data Hub in the Future
SLUB Dresden together with the d:swarm team will re-evaluate the requirements and demands that should be satisfied by a central data hub. Afterwards, improvements or a re-implementation of the Data Hub component can be planned. This can probably be done together with ScaDS Dresden/Leipzig (Competence Center for Scalable Data Services and Solutions). The people at ScaDS are interested in use cases for storing and processing large knowledge graphs, which fits perfectly to our feature list of or (ideal) Data Hub. We discussed our use case and the scability challenge at ScaDS project meeting in June 2015.
Besides an evaluation (and maybe cooperation) with other, existing projects (that arised and evolved from the beginning of the d:swarm project and later on) that needs to tackle similar challanges might be a good option. Amongst others, Wikidata is a very interesting project, since it is founded on a very similar (graph) data model (see this comparison). Most recently, they published a (graph) query service that is powered by a triple store (Blazegraph). Wikibase the data hub behind Wikidata has a notion of record and also supports qualified attributes at statements and statement-level based versioning. A database with rather promising features (and that’s why also a look worth ) is AsterixDB. It has an open and flexible data model (a superset of JSON) that fits good to our demand of a generic data format, and this system was developed and implemented with scale in mind.
Let’s see what the future will bring us to achieve the goal to have a central data hub for bibliographic records and related information.
PS: It looks like that we are not the only one fighting this challenge …