Random Data Generator

The TopBraid products include a SHACL-based random data generator that produces graphs of instances conforming to constraints defined in SHACL. The feature is available in TopBraid Composer via Model > Generate Random Triples... and in all TopBraid products through the SPARQLMotion module sml:GenerateRandomData.

To get started with this feature, open the file edg.topbraidlive.org\1.0\samples\datagen\data-assets-instances.ttl. This file is nearly empty, but it imports the file data-assets.ttl, which contains the instructions that tell the generator what to produce. With the "instances" file open, run the generator from the Model menu. You will then see many new instances, for example by using the Find all locally defined resources button in the tool bar.

To produce random instances for your own data models, start by creating a new RDF file that imports your schema/ontology and the file TopBraid/SHACL/datagen.ttl. Let's call the RDF graph from this file the data generator spec graph. The schema must include SHACL property shape definitions so that the generator knows which properties it is supposed to produce values for. You can add triples to these schema definitions in your data generator spec graph, for example to narrow down the cardinalities or value types. For example, when you add a sh:class constraint to a property shape, the generator will only link that property to instances of the specified class.
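As a rough illustration of the first step above, the following sketch renders the Turtle header of a data generator spec graph that imports a schema and the datagen vocabulary. All three graph IRIs are hypothetical placeholders, not the real ones: look up the actual graph IRI of your schema and of TopBraid/SHACL/datagen.ttl before using the output. The helper name spec_graph_turtle is likewise made up for this example.

```python
from typing import List

# All IRIs below are placeholders for illustration only. Replace them with
# the real graph IRIs of your schema and of TopBraid/SHACL/datagen.ttl.
SPEC_GRAPH = "http://example.org/my-datagen-spec"    # hypothetical
SCHEMA_GRAPH = "http://example.org/my-schema"        # hypothetical
DATAGEN_GRAPH = "http://example.org/datagen"         # hypothetical stand-in


def spec_graph_turtle(spec: str, imports: List[str]) -> str:
    """Render a Turtle header declaring `spec` as a graph that imports `imports`."""
    import_lines = " ,\n".join(f"        <{iri}>" for iri in imports)
    return (
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n\n"
        f"<{spec}>\n"
        "    a owl:Ontology ;\n"
        "    owl:imports\n"
        f"{import_lines} .\n"
    )


turtle_text = spec_graph_turtle(SPEC_GRAPH, [SCHEMA_GRAPH, DATAGEN_GRAPH])
print(turtle_text)
```

The printed text can be saved as a .ttl file and extended with the additional triples described above, such as narrower cardinalities or value types.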

To instruct the engine on which classes to instantiate, add the properties datagen:minInstanceCount and datagen:maxInstanceCount to these classes. As the first step, the engine generates new instances for all classes that have such values. Each of these instances will get an rdfs:label.

After that step, the engine adds triples that have the new instances as subject. The predicates of these triples are selected from the values of sh:path in all property shapes associated with the instance's type classes or their superclasses. The number of values of these properties depends on the defined sh:minCount and sh:maxCount constraints, but also on the values of datagen:minValueCount and datagen:maxValueCount, which can be added to existing property shapes (if they have a URI) or to new property shapes that are local to the data generator spec graph. The engine generates values for a property only if a maximum count has been defined or the minimum count is at least 1. For object-valued properties, it picks random values from the existing instances, including instances from the imports closure of the current graph as well as newly created instances. For data-valued properties, it tries to generate suitable random values using some heuristics, taking into account some SHACL constraints such as sh:minLength and sh:maxInclusive.
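The rule for how many values a property receives can be sketched as follows. This is not the engine's actual code. The text above only states that the sh: and datagen: counts both play a role and that values are generated when a maximum is defined or the minimum is at least 1. The precedence of the datagen: counts over the sh: counts, the defaults for missing bounds, and the function name pick_value_count are assumptions made for illustration.

```python
import random
from typing import Optional


def pick_value_count(
    sh_min: Optional[int] = None,
    sh_max: Optional[int] = None,
    datagen_min: Optional[int] = None,
    datagen_max: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> int:
    """Return how many values to generate for one property of one instance.

    Assumptions (not confirmed by the documentation): datagen counts take
    precedence over sh counts, a missing minimum defaults to 0, and a
    missing maximum defaults to the minimum.
    """
    rng = rng or random.Random()
    lo = datagen_min if datagen_min is not None else sh_min
    hi = datagen_max if datagen_max is not None else sh_max

    # Documented rule: generate values only if a maximum count is defined
    # or the minimum count is at least 1.
    if hi is None and (lo is None or lo < 1):
        return 0

    lo = 0 if lo is None else lo
    hi = lo if hi is None else hi
    if hi < lo:
        # Inconsistent bounds: this sketch simply falls back to the minimum.
        hi = lo
    return rng.randint(lo, hi)


print(pick_value_count())                      # no counts defined
print(pick_value_count(sh_min=1, sh_max=3))    # somewhere between 1 and 3
```

Under these assumptions, a property shape without any counts yields no values, which matches the condition stated above.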

The generator has an option to perform SHACL validation as part of the triple generation process. If activated, all new instances are checked against the defined SHACL constraints. The engine then attempts to replace values that violate constraints with different random values. It tries this a few times but eventually gives up if no suitable random value can be produced. This step may significantly slow down execution but produces higher-quality data.
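The replace-and-retry behavior described above resembles a bounded rejection loop, sketched here under stated assumptions. The attempt limit of 5 is hypothetical, since the text only says "a few times". The names generate_valid_value, generate and is_valid are made up for the example, and a simple Python predicate stands in for real SHACL validation. What the engine does with a value after giving up is not stated, so the sketch just returns None.

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

MAX_ATTEMPTS = 5  # hypothetical; the documentation only says "a few times"


def generate_valid_value(
    generate: Callable[[], T],
    is_valid: Callable[[T], bool],
    max_attempts: int = MAX_ATTEMPTS,
) -> Optional[T]:
    """Draw random candidates until one validates or the attempts run out."""
    for _ in range(max_attempts):
        candidate = generate()
        if is_valid(candidate):
            return candidate
    return None  # give up: no suitable random value was found


rng = random.Random(42)
# Stand-in for a constraint such as sh:maxInclusive 10 on an integer property.
value = generate_valid_value(lambda: rng.randint(0, 100), lambda v: v <= 10)
print(value)
```

Each extra attempt costs another round of generation and validation, which is consistent with the slowdown mentioned above when validation is switched on.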