What can I do with d:swarm today?
With d:swarm in its current state you can …
- import and configure (example) CSV, XML and JSON data resources for further processing (see this overview of configuration options)
- create projects, define mappings, transformations (see a list of available functions) and filters
- transform data
- export data in JSON, RDF or XML (e.g., for feeding Solr indices).
You can configure data resources, create mappings to target schemata and export the transformation result with the d:swarm Back Office web application. See our getting started guide for a brief manual on how to utilise the d:swarm Back Office UI. Small example data sets can be processed completely within the Back Office.
Besides, you can …
- explore the graph data model of your imported and transformed data
- define skip filters (for the whole job, i.e., only records that match the filter criteria will be emitted)
- search records (in the ingested data model) and define a record selection (to execute preview or test tasks in the Back Office with specific records)
- copy mappings (to a new project whose input data model has a very similar schema, e.g., from a schema that was generated from given example data to an inbuilt schema)
- migrate mappings (to a new project whose input data model has a somewhat similar schema, e.g., from OAI-PMH MARCXML to MARCXML)
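The skip filter semantics mentioned in the list above can be illustrated with a minimal sketch. This is not d:swarm code or its API; the record layout, the function name `apply_skip_filter` and the filter criterion are assumptions chosen purely for illustration. It only shows the behaviour described: records that match the criteria are emitted, all others are skipped.

```python
from typing import Callable, Dict, Iterable, Iterator

# Hypothetical record shape: a flat mapping of field names to values.
Record = Dict[str, str]


def apply_skip_filter(
    records: Iterable[Record], matches: Callable[[Record], bool]
) -> Iterator[Record]:
    """Yield only the records that satisfy the filter; skip all others."""
    for record in records:
        if matches(record):
            yield record


# Made-up example data, not taken from any real data set.
records = [
    {"id": "1", "type": "book"},
    {"id": "2", "type": "journal"},
    {"id": "3", "type": "book"},
]

# Illustrative criterion: emit only records of type "book".
emitted = list(apply_skip_filter(records, lambda r: r["type"] == "book"))
emitted_ids = [r["id"] for r in emitted]
print(emitted_ids)
```

In d:swarm itself, skip filters are defined in the Back Office and apply to the whole job; the sketch only mirrors the emit-or-skip decision per record.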
Batch-processing large amounts of data can be done with the Task Processing Unit for d:swarm (TPU). This part of the d:swarm data management platform was initially developed by UB Dortmund. You have the choice between two options – the Streaming and the Data Hub variant – when processing data with d:swarm via the TPU.
d:swarm Streaming Variant
The d:swarm Streaming variant offers fast processing of large amounts of data and is applicable to many scenarios. You can already utilise it today. The Streaming variant converts the source data into our generic data format while executing the transformations and directly outputs the transformation result (as XML or in various RDF serialization formats). Unlike the Data Hub variant, this processing variant does not support versioning or archiving (i.e., the Data Hub is not involved in this processing). See how SLUB Dresden employs d:swarm for transforming and integrating bibliographic data sources.
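The general idea of streaming processing (read a record, transform it, write it out, keep nothing) can be sketched as follows. This is a conceptual illustration only, not the TPU implementation: the function names, the CSV input, the mapping to `dc:title` and the JSON-lines output (used here as a simple stand-in for the XML or RDF output of d:swarm) are all assumptions.

```python
import csv
import io
import json
from typing import Dict, Iterator, TextIO


def read_records(source: TextIO) -> Iterator[Dict[str, str]]:
    """Lazily parse CSV rows into generic key/value records."""
    yield from csv.DictReader(source)


def transform(record: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical mapping: rename 'title' to 'dc:title' and trim whitespace."""
    return {"id": record["id"], "dc:title": record["title"].strip()}


def stream(source: TextIO, sink: TextIO) -> int:
    """Transform records one at a time and write each result immediately.

    Nothing is stored between records, so there is no versioning or archiving.
    """
    count = 0
    for record in read_records(source):
        sink.write(json.dumps(transform(record)) + "\n")
        count += 1
    return count


# Made-up input data; in-memory streams stand in for real files.
source = io.StringIO("id,title\n1, Faust \n2,Effi Briest\n")
sink = io.StringIO()
processed = stream(source, sink)
output_lines = sink.getvalue().splitlines()
print(processed, output_lines)
```

Because each record is written as soon as it is transformed, memory use stays roughly constant regardless of input size, which is the property that makes a streaming approach suitable for large data sets.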
d:swarm Data Hub Variant
The Data Hub variant is intended to store all data (versioned) in a generic data format (incl. provenance etc.) in the Data Hub. Archiving and versioning of data are only possible with this processing variant. It could also form the basis for upcoming functionality such as deduplication, FRBR-ization and other data quality improvements. Currently, the Data Hub variant still faces some scalability issues. We therefore do not recommend utilising it for large amounts of data right now.