@@ -184,11 +184,11 @@ We take our initial \emph{knowledge graphs} from the \emph{CommonCrawl} datasets
\subsection{Generating Constraints}
\label{generatingconstraints}
To shape a \emph{knowledge graph} we need to infer constraints on the previously fetched subgraph.
For the generation of constraints, we used the adaption of the tool \emph{RDF2Graph}\cite{original_rdf2graph_git} by Werkmeister\cite{werkmeister2018}\cite{werkmeister_rdf2graph_git}. \Valerian{(this is the old version, we use the fork)}\Kristina{I adapted the sentence, do you think this works now, Valerian?} and adapted it for our purposes. As input, \emph{RDF2Graph} takes a \emph{knowledge graph} from \emph{CommonCrawl}\Jamie{Why don't we have the fetched KG as input?}\Danielle{Jamie makes a great point}. The properties of the graph are read out with several \emph{SPARQL} queries. These properties are saved in a new \emph{RDF} graph. As output, we receive a graph containing constraints for the initial input data. We use a tool offered by \emph{RDF2Graph} to extract the constraints in \emph{ShEx} syntax.\Valerian{RDF2Graph offers a tool to export the constraints to ShEx syntax.}\Kristina{I adapted the sentence, do you think this is okay now, Valerian?}
For the generation of constraints, we used Werkmeister's adaptation\cite{werkmeister2018}\cite{werkmeister_rdf2graph_git} of the tool \emph{RDF2Graph}\cite{original_rdf2graph_git} and adapted it further for our purposes. As input, \emph{RDF2Graph} takes a \emph{knowledge graph} from \emph{CommonCrawl}\Jamie{Why don't we have the fetched KG as input?}\Danielle{Jamie makes a great point}. The properties of the graph are read out with several \emph{SPARQL} queries and saved in a new \emph{RDF} graph. As output, we receive a graph containing constraints for the initial input data. We use a tool offered by \emph{RDF2Graph} to export the constraints in \emph{ShEx} syntax.
\missingfigure{Add query to graph (chosen by Philipp), e.g. multiplicity of argument etc.}
%\subsubsection{Integrating RDF2Graph with our framework}
We implemented the following steps in order to integrate \emph{RDF2Graph} into our project. We added \emph{RDF2Graph} to our framework so that they could be compiled together\Valerian{, and in the process minimally updated it to be compatible with our version of Java and Jena}. \Kristina{I would leave out the minimally, no need to downplay the work you did on this here, I think. Apart from that, I like it, feel free to put it in} In addition, we changed some of the initial parameters of the \emph{RDF2Graph}, since it originally was intended as a stand-alone application. As we are handling \emph{Models}\todo{Add explanation for Model? Maybe in glossary?} in our software, we changed the input from a \emph{RDF2Graph} to a \emph{Model}. In our application, \emph{RDF2Graph} does not use any other storage apart from the \emph{Model} data structure. Previously, such a \emph{Model} needed to be created by \emph{RDF2Graph}, now it is provided by our framework. We did this so we could have full control over the files handled by \emph{RDF2Graph}. \emph{RDF2Graph} allows for multithreaded execution, which requires a thread pool. This thread pool was initially created by \emph{RDF2Graph}. In our framework, it is provided by our application. In addition, resources which are used by \emph{RDF2Graph} had to be provided in a different way so that they are still available when running from a server environment. We also changed some of the queries. \emph{RDF2Graph} supports multiple output graphs, however, this did not work \todo{should we explain this in more detail?}. As we only work on one Model at a time, we only use one output graph.
We implemented the following steps in order to integrate \emph{RDF2Graph} into our project. We added \emph{RDF2Graph} to our framework so that they could be compiled together\Valerian{, and in the process minimally updated it to be compatible with our version of Java and Jena}. \Kristina{I would leave out the minimally, no need to downplay the work you did on this here, I think. Apart from that, I like it, feel free to put it in}\Valerian{@Kristina: We only changed so much that it works, but did not rewrite old, possibly deprecated parts that did not lead to a noticeable error.} In addition, we changed some of the initial parameters of \emph{RDF2Graph}, since it was originally intended as a stand-alone application. As we are handling \emph{Models}\todo{Add explanation for Model? Maybe in glossary?} in our software, we changed the input of \emph{RDF2Graph} to a \emph{Model}. In our application, \emph{RDF2Graph} does not use any storage apart from the \emph{Model} data structure. Previously, such a \emph{Model} needed to be created by \emph{RDF2Graph}; now it is provided by our framework. We did this so we could have full control over the files handled by \emph{RDF2Graph}. \emph{RDF2Graph} allows for multithreaded execution, which requires a thread pool. This thread pool was initially created by \emph{RDF2Graph}; in our framework, it is provided by our application. In addition, resources used by \emph{RDF2Graph} had to be provided in a different way so that they remain available when running in a server environment. We also changed some of the queries. \emph{RDF2Graph} supports multiple output graphs; however, this did not work \todo{should we explain this in more detail?}. As we only work on one \emph{Model} at a time, we only use one output graph.
\todo{Add explanation of limit to this section?}
...
...
@@ -199,7 +199,7 @@ Given a \emph{RDF} graph and a set of constraints, the validation consists of ve
For the implementation of this process, an \emph{RDF} subgraph and \emph{ShEx} constraints are required as input. From these, we generate a \emph{shape map}, which contains all of the types that need to be validated. For the actual validation, we used the \emph{ShExValidator} provided by the \emph{Jena} library. \todo{add reference to Jena library here?} The validator requires a set of constraints defined in valid \emph{ShEx} syntax and a \emph{shape map}.
\Danielle{the following is repetitive, should be removed.}
The \emph{shape map} describes which types of nodes need to be validated against which \emph{ShEx} constraint definitions. \Valerian{We construct the shape map depending on the types available in the subgraph. See Figure \ref{fig:shape_map} for an example.}
The \emph{shape map} describes which types of nodes need to be validated against which \emph{ShEx} constraint definitions. \Valerian{We construct the shape map depending on the types of nodes in the subgraph. See Figure \ref{fig:shape_map} for an example.}
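As a minimal sketch of how such a \emph{shape map} could be assembled, the following Python snippet collects the distinct types that occur in a list of triples and emits one association per type in ShapeMap query syntax. The function name, the representation of triples as tuples and the \texttt{shape\_for\_type} callback are assumptions made for illustration; they are not part of our framework or of \emph{Jena}. In particular, how shape names are derived from types depends on the \emph{ShEx} export.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def build_shape_map(triples, shape_for_type):
    """Return one ShapeMap query association per rdf:type found in the triples.

    triples: iterable of (subject, predicate, object) strings.
    shape_for_type: hypothetical callback mapping a type IRI to a shape IRI.
    """
    # Collect the distinct types that actually occur in the subgraph.
    types = sorted({o for (_, p, o) in triples if p == RDF_TYPE})
    # One association per type: every node of that type is the FOCUS node
    # and is validated against the shape chosen for the type.
    return [
        f"{{FOCUS <{RDF_TYPE}> <{t}>}}@<{shape_for_type(t)}>"
        for t in types
    ]
```

A type that does not occur in the subgraph produces no entry, so under this sketch the resulting \emph{shape map} only covers nodes that can actually be validated.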
\begin{figure}
\centering
...
...
@@ -238,7 +238,7 @@ For taking measurements, the application was started locally on our hardware. \D
This was done to minimise side-effects of other applications running on the virtual machine where the live-instance is deployed. Additionally, the JVM was set up to use up to 16 GB of main memory for its heap. This allows parallel queries to run without their runtime being degraded by extensive swap usage.
\subsection{Runtime}
The measurements were taken on a local machine, which uses a \emph{Ryzen 9 3900x} CPU with 12x3.8GHz processors, DDR4 RAM and an SSD. The subgraphs fit into memory.
The measurements were taken on a local machine with a \emph{Ryzen 9 3900X} CPU (12 cores at 3.8\,GHz), DDR4 RAM and an SSD. The subgraphs fit into memory.
Figures~\ref{fig:exec_times_per_limit} and \ref{fig:exec_times_per_triples} show the measurements we obtained by changing the \emph{LIMIT} input parameter. This parameter limits the size of the start-node subset, from which connected nodes are queried. All the measurements are shown in Tables~\ref{table:runtimes_wo_limit} and \ref{table:runtimes_w_limit}.
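To illustrate the effect of the \emph{LIMIT} parameter, the following Python sketch selects at most \emph{LIMIT} start nodes of a given type and then collects the triples attached to them. It is a simplified in-memory illustration under the assumption that only outgoing triples of the start nodes are followed. The function and variable names are hypothetical and do not reflect the actual queries.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def fetch_subgraph(triples, rdf_type, limit=None):
    """Return the triples whose subject is one of at most `limit` start nodes."""
    # Start nodes: subjects typed as rdf_type, in order of first appearance.
    start_nodes = []
    for s, p, o in triples:
        if p == RDF_TYPE and o == rdf_type and s not in start_nodes:
            start_nodes.append(s)
    # LIMIT caps the start-node subset, not the number of returned triples.
    if limit is not None:
        start_nodes = start_nodes[:limit]
    selected = set(start_nodes)
    # Assumption: only outgoing triples of the start nodes are collected.
    return [t for t in triples if t[0] in selected]
```

Under this reading, the cap applies to the start nodes only, so the number of triples in the resulting subgraph can still differ between types for the same \emph{LIMIT}.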
...
...
@@ -321,7 +321,7 @@ Therefore, we checked the generated constraints manually for small subgraphs (se
Firstly, if the dataset consists only of stand-alone blank nodes, as seen in Figure~\ref{code:blank_nodes}, then \emph{RDF2Graph} does not infer any \emph{ShEx}-constraints. This was the case for the generated subgraph using RiverBodyOfWater with a \emph{LIMIT} of 50, and the resulting \emph{ShEx} can be seen in Figure~\ref{fig:shex_rbow_50}.
Secondly, optional properties are not always inferred and therefore missing from the generated \emph{ShEx}-constraints. This also happens for unlimited subgraphs (see Figures~\ref{fig:shex_canal_wo_limit}, \ref{fig:shex_rbow_wo_limit} and \ref{fig:shex_service_wo_limit}), with the exception of the RiverBodyOfWater-RDFtype, where it looks like the constraints are complete, \Kristina{Would put a full stop here and start new sentence "However,..."} however due to the large graph manually checking for correctness is infeasible. We did not see a correlation between missing constraint-properties and the shape of the graph.
Secondly, optional properties are not always inferred and are therefore missing from the generated \emph{ShEx}-constraints. This also happens for unlimited subgraphs (see Figures~\ref{fig:shex_canal_wo_limit}, \ref{fig:shex_rbow_wo_limit} and \ref{fig:shex_service_wo_limit}), with the exception of the RiverBodyOfWater-RDFtype, where it looks like the constraints are complete. However, due to the size of the graph, manually checking for correctness is infeasible. We did not see a correlation between missing constraint-properties and the shape of the graph.