This shaping is done by inferring constraints over the data and validating all nodes in the graph based on these constraints. Validating a graph against constraints can give important insight into the structure of the data. For instance, when all nodes of a type conform to the given constraints, it may be useful to define these constraints as required attributes for all future nodes to ensure uniformity in the data. Non-conforming nodes may also deliver important insight into where information is missing. For example, if 99\% of nodes of a given type conform to some constraints, it may be worthwhile to investigate the remaining 1\% to see whether they are missing necessary information or are otherwise corrupt. \\
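To make this concrete, the following is a minimal, hand-written \emph{ShEx} shape; the \emph{schema.org} properties and cardinalities are purely illustrative and not taken from our generated output. It requires every conforming node to have exactly one name, while allowing any number of containing places:
\begin{verbatim}
PREFIX schema: <http://schema.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

<RiverShape> {
  schema:name xsd:string ;        # exactly one (default cardinality)
  schema:containedInPlace IRI *   # zero or more
}
\end{verbatim}
A node lacking \texttt{schema:name} would then fail validation with a cardinality violation, pointing to exactly the kind of missing information described above.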
Building on this idea, our task was to implement a framework for shaping \emph{knowledge graphs}. This consisted of three major steps: fetching a subset of a \emph{knowledge graph}, inferring constraints, and validating the \emph{knowledge graph} against these constraints. We also provide a user interface for this purpose. These steps are described in Section~\ref{section:approach}. We then evaluated our framework with respect to runtime and correctness, as outlined in Section~\ref{section:evaluation}. The results of our project are shown in Section~\ref{section:results}. A future outlook is given in Section~\ref{section:future_work}. Finally, a conclusion of our work is provided in Section~\ref{section:conclusion}.
\section{Related Work}
...
...
Because \emph{knowledge graphs} can be very large and contain many nodes, we concentrated on querying smaller subgraphs and working only on those. With this method, the relevant subgraph is extracted from a \emph{knowledge graph} and can be processed in isolation.
We take our initial \emph{knowledge graphs} from the \emph{CommonCrawl} datasets and import them as static files.
Figure~\ref{fig:query_construct_subgraph} shows the query we used to create such a subgraph. In line 7 we use \emph{property paths}\footnote{\url{https://www.w3.org/TR/2013/REC-sparql11-query-20130321/\#propertypaths}} to query all nodes connected to those of an initial subset (lines 10 to 13). This subset can optionally be limited to a certain size, but is always limited to nodes of a certain type.
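The following is only a minimal sketch of this pattern, not the exact query of Figure~\ref{fig:query_construct_subgraph}; the type \texttt{schema:RiverBodyOfWater} and the limit value are illustrative. The property path \texttt{(<>|!<>)*} is a common \emph{SPARQL~1.1} idiom for following any predicate zero or more times:
\begin{verbatim}
PREFIX schema: <http://schema.org/>

CONSTRUCT { ?s ?p ?o }
WHERE {
  # follow any predicate, any number of times, from a start node
  ?start (<>|!<>)* ?s .
  ?s ?p ?o .
  {
    # the initial subset: nodes of one type, optionally limited in size
    SELECT ?start WHERE {
      ?start a schema:RiverBodyOfWater .
    } LIMIT 100
  }
}
\end{verbatim}
The \texttt{LIMIT} on the inner subquery corresponds to the \emph{LIMIT} input parameter discussed in Section~\ref{section:runtime}.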
\begin{figure}
\centering
...
...
\end{figure}
\section{Evaluation}\label{section:evaluation}
In this section we evaluate our tool. We first explain the methodology in Section~\ref{section:methodology}, then measure the runtime of our tool with different input parameters in Section~\ref{section:runtime}. Finally, we test the correctness of the generated \emph{ShEx} constraints and cross-validate them in Section~\ref{section:correctness}.
\subsection{Methodology}
\label{section:methodology}
To take measurements, the application was run locally on our hardware rather than on the virtual machine where the live instance is deployed, in order to minimise side effects from other applications running there. We used a machine with a \emph{Ryzen 9 3900X} CPU (12 cores at 3.8\,GHz), DDR4 RAM, and an SSD.
Additionally, the JVM was configured to use up to 16\,GB of main memory for its heap. This allows parallel queries to run without extensive swap usage compromising the measured runtimes.
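Concretely, such a heap limit can be set with the standard \texttt{-Xmx} JVM flag when launching the application; the jar name below is a placeholder:
\begin{verbatim}
java -Xmx16g -jar shaping-framework.jar
\end{verbatim}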
\subsection{Runtime}
\label{section:runtime}
Figures~\ref{fig:exec_times_per_limit} and \ref{fig:exec_times_per_triples} show the measurements we obtained by changing the \emph{LIMIT} input parameter. This parameter limits the size of the start-node subset from which connected nodes are queried. All measurements are listed in Tables~\ref{table:runtimes_wo_limit} and \ref{table:runtimes_w_limit}.
\begin{figure}[h!]
...
...
\end{figure}
\subsection{Correctness}
\label{section:correctness}
\subsubsection{ShEx Generation}
We considered \emph{Shexer}, already mentioned in Section~\ref{section:related_work}, a good fit for cross-validating our \emph{ShEx} generation. However, due to our limited knowledge of operating this tool, we did not manage to generate proper constraints for our RiverBodyOfWater dataset. Our attempt at using this tool is shown in Figure~\ref{code:shexer}; it generated only the trivial, non-restrictive constraints shown in Figure~\ref{fig:shexer_output}.