removed some whitespace

Commit dce168e0 authored 3 years ago by Kristina Magnussen
--- a/CTCS_report_template/CTCS_template.pdf
+++ b/CTCS_report_template/CTCS_template.pdf
--- a/CTCS_report_template/CTCS_template.tex
+++ b/CTCS_report_template/CTCS_template.tex
@@ -129,7 +129,6 @@ For this reason, we implemented a framework for shaping \emph{knowledge graphs}.
 \section{Related Work}
 \label{section:related_work}

-
 The need for automatic tools that are able to infer meta information on the structure of \emph{knowledge graphs} has already been recognized by different researchers. This stems from the fact that manual constraint inference becomes infeasible for large datasets.

 One tool which can be used to automatically infer constraints over a \emph{knowledge graph} is \emph{RDF2Graph} \cite{vanDam2015,original_rdf2graph_git}. Our framework makes use of an adapted version of this tool by Werkmeister \cite{werkmeister2018,werkmeister_rdf2graph_git}, which uses several \emph{SPARQL} queries to gather the structural information of each node in the underlying graph in a first phase. Subsequently, the queried information is gathered and simplified. This is achieved by merging constraint information of classes belonging to the same type and predicates. While Van Dam et al. used the \emph{RDF2Graph} tool on the \emph{UniProt RDF} \cite{uniprot} resource, Werkmeister adapted it to also infer constraints on Wikidata. To achieve this, the updated \emph{RDF2Graph} is able to work with local (preselected and prefetched) datasets and provides better performance with larger datasets, due to adapted simplification steps. Furthermore, Werkmeister added support for cyclic type hierarchies and an additional reduction step on the schema once the constraint inference is done, improving the performance of validations.
@@ -152,7 +151,7 @@ The repository also includes a README file describing how to set-up and install

 %Our framework offers a way to evaluate a \emph{knowledge graph} in an automated way. For this, we used \emph{knowledge graphs} from the \emph{CommonCrawl} datasets as a basis. The \emph{knowledge graphs} are imported as a static file. After this, our framework infers constraints over this data set (see Section~\ref{generatingconstraints}). These are validated automatically in the last step, see Section~\ref{validatingconstraints}. The user can interact with this framework over the front-end, see Section~\ref{frontend}. These different steps were implemented and tested separately. Once this was done, we consolidated them. The structure of our project can be seen in Fig.~\ref{fig:uml}. \todo{update figure}

-\begin{figure}[ht]
+\begin{figure}[htp]
 	\centering
 	\includegraphics[scale=0.35]{kg_shapes_uml.pdf}
 	\caption{UML diagram of the framework structure}
@@ -173,7 +172,7 @@ We take our initial \emph{knowledge graphs} from the \emph{CommonCrawl} datasets

 Figure~\ref{fig:query_construct_subgraph} shows the query we used to create a subgraph. At line 7 we used \emph{property paths}\footnote{\url{https://www.w3.org/TR/2013/REC-sparql11-query-20130321/\#propertypaths}} to query all nodes connected to those of an initial subset (lines 10 to 13). This subset can optionally be limited to a certain size, but is always limited to nodes of a certain type.

-\begin{figure}
+\begin{figure}[htp]
    \centering
    \lstinputlisting{code_snippets/subgraph.sparql}
    \caption{The \emph{SPARQL} query creating the subgraph. The \emph{\%s} placeholders are substituted before the query is executed.}
@@ -201,21 +200,20 @@ Given a \emph{RDF} graph and a set of constraints, the validation consists of ve

 For the implementation of this process, an \emph{RDF} subgraph and \emph{ShEx} constraints are required as input. From the subgraph, we generate a \emph{shape map}, which contains all of the types that need to be validated. For the actual validation, we use the \emph{ShExValidator} provided by the \emph{Jena} library. The validator requires a set of constraints defined in valid \emph{ShEx} syntax and a \emph{shape map}. We query the subgraph for its types of nodes (see Figure~\ref{fig:shape_map_query}), and construct the \emph{shape map} from that. Figure~\ref{fig:shape_map} shows an example.

-\begin{figure}
+\begin{figure}[htp]
    \centering
    \lstinputlisting{code_snippets/RiverBodyOfWater_RiverBodyOfWater_null.shapemap_query.sparql}
    \caption{The simple query that retrieves the different types to be used in the \emph{shape map}.}
    \label{fig:shape_map_query}
 \end{figure}

-\begin{figure}
+\begin{figure}[htp]
    \centering
    \lstinputlisting{code_snippets/RiverBodyOfWater_RiverBodyOfWater_null.shapemap}
    \caption{The shape map used for validating the full subgraph starting with RiverBodyOfWater-nodes.}
    \label{fig:shape_map}
 \end{figure}

-
 The class \emph{ShexValidationRecord} stores the result of the validation for each node of the graph. Additionally, the percentage of nodes that conform to their constraints is calculated and stored.

 \subsection{Front-end}
@@ -223,24 +221,24 @@ The class \emph{ShexValidationRecord} stores the result of the validation for ea
 We implemented a front-end where the user can choose a \emph{knowledge graph} as well as its type (see Figure~\ref{fig:frontend}). In addition, the user can also set a limit on the number of nodes of the specified type that they wish to have constraints generated for. As output (see Figure~\ref{fig:frontend_shex}), \emph{ShEx} constraints as well as a validation of the subgraph against those constraints are returned. The constraints can be edited by the user and the selected subgraph can be re-validated against the newly edited constraints.
 If a node is deemed invalid, a reason is given, e.g.\ ``Cardinality violation (min=1): 0''. The user can download the subgraph that was validated. The interaction between user, front-end and server can also be seen in Figure~\ref{fig:sequence}. The code for the front-end can be found in our git repository \cite{git_shapes_frontend}.

-\begin{figure}[h]
+\begin{figure}[htp]
 	\centering
-	\includegraphics[scale=0.18]{frontend/frontend_edit_done.png}
-	\caption{The frontend, showing the calculated \emph{ShEx}-constraints and validation-results.}
-	\label{fig:frontend_shex}
+	\includegraphics[scale=0.18]{frontend/frontend_edit_cropped.png}
+	\caption{The frontend, showing a selection of dataset, RDFType, and \emph{LIMIT} of starting-nodes.}
+	\label{fig:frontend}
 \end{figure}

-\begin{figure}[h]
+\begin{figure}[htp]
 	\centering
-	\includegraphics[scale=0.18]{frontend/frontend_edit.png}
-	\caption{The frontend, showing a selection of dataset, RDFType, and \emph{LIMIT} of starting-nodes.}
-	\label{fig:frontend}
+	\includegraphics[scale=0.20]{frontend/frontend_edit_done.png}
+	\caption{The frontend, showing the calculated \emph{ShEx}-constraints and validation-results.}
+	\label{fig:frontend_shex}
 \end{figure}


-\begin{figure}[h]
+\begin{figure}[htp]
 	\centering
-	\includegraphics[scale=0.5]{kgshapes_sequence.pdf}
+	\includegraphics[scale=0.6]{kgshapes_sequence.pdf}
 	\caption{Sequence diagram showing the interaction between web application, user and server}
 	\label{fig:sequence}
 \end{figure}
@@ -260,7 +258,7 @@ Additionally, the JVM was set up to use up to 16 GB of main memory for its heap
 \label{section:runtime}
 Figures~\ref{fig:exec_times_per_limit} and \ref{fig:exec_times_per_triples} show the measurements we obtained by changing the \emph{LIMIT} input parameter. This parameter limits the size of the start-node subset, from which connected nodes are queried. All the measurements are shown in Tables~\ref{table:runtimes_wo_limit} and \ref{table:runtimes_w_limit}.

-\begin{figure}[h!]
+\begin{figure}[htp]
 \begin{subfigure}{\textwidth}
 \centering
 \includegraphics[width=0.65\linewidth]{img/limit_legend.pdf}
@@ -290,7 +288,7 @@ Secondly, the runtime of constructing the subgraph scales with the \emph{LIMIT}.
 To understand the behaviour shown in Figure~\ref{fig:limit_rbow}, we look at Figure~\ref{fig:exec_times_per_triples}, which shows the same runtimes, but grouped by the number of triples in the subgraph on which the constraints are created. As opposed to Figures~\ref{fig:triple_canal} and \ref{fig:triple_service}, the maximum number of triples (shown on the x-axis in Figure~\ref{fig:triple_rbow}) is 1769.
 This is also the number of triples contained in the subgraph that we get without providing any limit. Therefore, providing a limit larger than 200 does not enrich the constructed graph, keeping the time almost constant with regard to the \emph{LIMIT} parameter.

-\begin{figure}[h!]
+\begin{figure}[htp]
 \begin{subfigure}{\textwidth}
 \centering
 \includegraphics[width=0.65\linewidth]{img/triple_legend.pdf}
@@ -317,7 +315,7 @@ This is also the amount of triples contained in the subgraph that we get without

 Figure~\ref{fig:exec_times_no_limit} shows the runtime without limiting the construction of the subgraph. Note the much larger runtime needed for querying the graph, even though this yields the same number of triples as providing a large enough \emph{LIMIT}.

-\begin{figure}[h!]
+\begin{figure}[htp]
 \begin{subfigure}{\textwidth}
 \centering
 \includegraphics[width=0.65\linewidth]{img/no_limit_legend.pdf}
@@ -341,12 +339,11 @@ First of all, if the dataset consists of only stand-alone blank nodes, as seen i

 Secondly, optional properties are not always inferred and are therefore missing from the generated \emph{ShEx}-constraints. This also happens for unlimited subgraphs (see Figures~\ref{fig:shex_canal_wo_limit}, \ref{fig:shex_rbow_wo_limit} and \ref{fig:shex_service_wo_limit}), with the exception of the RiverBodyOfWater-RDFType, where the constraints appear to be complete. However, due to the size of the graph, manually checking for correctness is infeasible. We did not see a correlation between missing constraint-properties and the shape of the graph.

-
 \subsubsection{ShEx Validation}

 The generated \emph{ShEx}-constraints for small subgraphs (Canal with \emph{LIMIT} 50, Service with \emph{LIMIT} 50) were cross-validated using the online tool \emph{RDFShape} \cite{RDFShape}. The validation result was the same as in our tool.

-\begin{figure}[h]
+\begin{figure}[htp]
 \centering
 \lstinputlisting{code_snippets/blank_nodes.ttl}
 \caption{Blank Nodes in Turtle File}
@@ -356,7 +353,6 @@ The generated \emph{ShEx}-constraints for small subgraphs (Canal with \emph{LIMI
 \section{Results} \label{section:results}
 Our framework automatically infers constraints and validates the given data based on those constraints. This can be done on two different \emph{CommonCrawl} datasets. The user can choose one of those datasets and a limit using the front-end. In addition, the user can also edit the constraints. Our evaluation in Section~\ref{section:evaluation} showed that the validation of a subgraph works as expected. However, the constraints generated are prone to missing an optional \emph{url} attribute and would benefit from more tests and tuning. In addition, we could also see that the runtime of the tool is rather long. Possible improvements to our framework are discussed in more depth in Section~\ref{section:future_work}. Generally, our tool works as intended, even though there is still some room for improvement.

-
 \section{Conclusion}
 \label{section:conclusion}
 %\Jamie{Which challenges did we face during the implementation? (Maybe depth of SPARQL query, outdated RDF2Graph?) + add challenges to respective section}
@@ -407,56 +403,56 @@ This appendix shows \emph{ShEx}-output and tables of data that were too large to
    \item Figures~\ref{fig:shex_canal_wo_limit}, \ref{fig:shex_rbow_wo_limit} and \ref{fig:shex_service_wo_limit} show our generated \emph{ShEx} for subgraphs without any \emph{LIMIT}.
 \end{itemize}

-\begin{figure}[h]
+\begin{figure}[htp]
    \centering
    \lstinputlisting{code_snippets/shexer_out.shex}
    \caption{Shexer output}
    \label{fig:shexer_output}
 \end{figure}

-\begin{figure}[h]
+\begin{figure}[htp]
    \centering
    \lstinputlisting[language=python]{code_snippets/shexer.py}
    \caption{Running shexer on the full graph}
    \label{code:shexer}
 \end{figure}

-\begin{figure}[h]
+\begin{figure}[htp]
    \centering
    \lstinputlisting{code_snippets/RiverBodyOfWater_Canal_50.shex}
    \caption{Generated \emph{ShEx}-constraints of Canal with \emph{LIMIT} 50}
    \label{fig:shex_canal_50}
 \end{figure}

-\begin{figure}[h]
+\begin{figure}[htp]
    \centering
    \lstinputlisting{code_snippets/RiverBodyOfWater_RiverBodyOfWater_50.shex}
    \caption{Generated \emph{ShEx}-constraints of RiverBodyOfWater with \emph{LIMIT} 50}
    \label{fig:shex_rbow_50}
 \end{figure}

-\begin{figure}[h]
+\begin{figure}[htp]
    \centering
    \lstinputlisting{code_snippets/RiverBodyOfWater_Service_50.shex}
    \caption{Generated \emph{ShEx}-constraints of Service with \emph{LIMIT} 50}
    \label{fig:shex_service_50}
 \end{figure}

-\begin{figure}[h]
+\begin{figure}[htp]
    \centering
    \lstinputlisting{code_snippets/RiverBodyOfWater_Canal_null.shex}
    \caption{Generated \emph{ShEx}-constraints of Canal without a \emph{LIMIT}}
    \label{fig:shex_canal_wo_limit}
 \end{figure}

-\begin{figure}[h]
+\begin{figure}[htp]
    \centering
    \lstinputlisting{code_snippets/RiverBodyOfWater_RiverBodyOfWater_null.shex}
    \caption{Generated \emph{ShEx}-constraints of RiverBodyOfWater without a \emph{LIMIT}}
    \label{fig:shex_rbow_wo_limit}
 \end{figure}

-\begin{figure}[h]
+\begin{figure}[htp]
    \centering
    \lstinputlisting{code_snippets/RiverBodyOfWater_Service_null.shex}
    \caption{Generated \emph{ShEx}-constraints of Service without a \emph{LIMIT}}

--- a/CTCS_report_template/img/frontend/frontend_edit_cropped.png
+++ b/CTCS_report_template/img/frontend/frontend_edit_cropped.png