% !TEX encoding = UTF-8 Unicode
\documentclass[10pt,parskip=half]{scrartcl}

\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage{amssymb,amsmath}
\usepackage{fullpage}
\usepackage{graphicx}
\usepackage{epstopdf}
\usepackage{xspace}
\usepackage{lmodern}
\usepackage{textcomp}
\usepackage{bm}
\usepackage{layouts}
\usepackage{enumitem}
\usepackage{multirow}
\usepackage[table,xcdraw]{xcolor}
\usepackage[percent]{overpic}
\usepackage{tikz}
\usepgflibrary{arrows}% for more options on arrows
\usepackage[colorinlistoftodos]{todonotes}
\usepackage{color,soul}
\usepackage{booktabs}
\usepackage{todonotes}
\usepackage[nottoc]{tocbibind} %to make the references appear in the ToC
\usepackage{subcaption}
\usepackage{placeins}
\usepackage{hyperref} % For clickable references/links
\graphicspath{{./img/}}
	% to mark ideas for the text
\newcommand{\Kristina}[1]{\textcolor{violet}{\textbf{\textit{<#1>}}}}
\newcommand{\Danielle}[1]{\textcolor{blue}{\textbf{\textit{<#1>}}}}
\newcommand{\Jamie}[1]{\textcolor{green}{\textbf{\textit{<#1>}}}}
\newcommand{\Philipp}[1]{\textcolor{purple}{\textbf{\textit{<#1>}}}}
\newcommand{\Valerian}[1]{\textcolor{orange}{\textbf{\textit{<#1>}}}}


\usepackage{lipsum}

\def\bf{\bfseries}

% start: code settings
\usepackage{listings}
\usepackage{xcolor}

\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.95,0.95,0.92}

\lstdefinestyle{mystyle}{
    backgroundcolor=\color{backcolour},   
    commentstyle=\color{codegreen},
    keywordstyle=\color{magenta},
    numberstyle=\tiny\color{codegray},
    stringstyle=\color{codepurple},
    basicstyle=\ttfamily\footnotesize,
    breakatwhitespace=false,         
    breaklines=true,                 
    captionpos=b,                    
    keepspaces=true,                 
    numbers=left,                    
    numbersep=5pt,                  
    showspaces=false,                
    showstringspaces=false,
    showtabs=false,                  
    tabsize=1
}

\lstset{style=mystyle}

% end: code settings


\begin{document}

%----------------------------------------------------------------------------------------
% Title page
%----------------------------------------------------------------------------------------
	
	\begin{center}\Large
		\includegraphics[width=5cm]{uibk_logo_4c_cmyk.pdf}\\[5mm]
		
		\begin{large}\bf
			PS 703301 --  WS 2021/22\\
			Current Topics in Computer Science\\[5mm]
		\end{large}
		Final Report\\[20mm]
		{\titlefont \huge Shaping Knowledge Graphs}\\[20mm]
		 
		Philipp Gritsch \\
		Jamie Hochrainer \\
		Kristina Magnussen \\
		Danielle McKenney\\
		Valerian Wintner\\[35mm]
		
		supervised by\\
		M.Sc. Elwin	Huaman\\
		\vfill
	\end{center}
\thispagestyle{empty}	
\pagebreak
% ------------------------------------------------------------------------
\tableofcontents
\pagebreak
% ------------------------------------------------------------------------
\todo{remove this part} 
I added a comment colour for everyone. 
\Jamie{Comment colour Jamie}
\Danielle{Comment colour Danielle}
\Philipp{Comment colour Philipp}
\Valerian{Comment colour Valerian}
\Kristina{Comment colour Kristina}

%\Philipp{some small abstract idea:}
%With rising interest in Knowledge Graphs and their use, the necessity of maintaining the structured information emerges. \Kristina{maybe "necessity of obtaining the structure of the information stored" instead?} This can be achieved by ...  Our tool aims to integrate all the essential processes into one framework.
\section{Introduction}
\label{introduction}
With the massive amount of data available on the internet, which grows every day, a convenient, flexible, and efficient way of storing data becomes increasingly important. In addition, concrete objects, abstract ideas, and the connections and relationships between entities have to be represented.
This is where \emph{knowledge graphs} become important. There are various definitions of \emph{knowledge graphs}, but as the name indicates, they are knowledge models structured as graphs. Such a graph can contain types, entities, literals, and relationships. A \emph{knowledge graph} allows for flexible data structures and can make it easier to find and process relevant data. However, the datasets stored in such a graph are often inconsistent and prone to containing errors.
Working with such datasets can be greatly facilitated by defining a consistent shape for the data, based on the type of entity it represents.
This shaping is done by inferring constraints over the data and validating all nodes in the graph against these constraints. Validating a graph against constraints can give important insight into the structure of the data. For instance, when all nodes of a type conform to the given constraints, it may be useful to define these as required attributes for all future nodes to ensure uniformity in the data. Non-conforming nodes may also show where information is missing. For example, if 99\% of the nodes of a given type conform to some constraints, it may be worthwhile to investigate the remaining 1\% to see whether they are missing necessary information or are otherwise corrupt.

Our task was to implement a framework for shaping \emph{knowledge graphs}. This consisted of three major steps: fetching a subset of a \emph{knowledge graph}, inferring constraints, and validating the \emph{knowledge graph} against these constraints. We also provide a user interface for this purpose. These steps are described in Section~\ref{section:approach}. We then evaluated our framework with respect to runtime and correctness, as outlined in Section~\ref{section:evaluation}. The results of our project are presented in Section~\ref{section:results}. An outlook on future work is given in Section~\ref{section:future_work}. Finally, we conclude in Section~\ref{section:conclusion}.

\section{Related Work}
\label{section:related_work}
The need for automatic tools that are able to infer meta information on the structure of knowledge graphs has already been recognized by different researchers. This stems from the fact that manual constraint inference becomes infeasible for large datasets.
Our framework makes use of Werkmeister's adaptation~\cite{werkmeister2018,werkmeister_rdf2graph_git} of \emph{RDF2Graph}~\cite{vanDam2015,original_rdf2graph_git}. In a first phase, \emph{RDF2Graph} uses different \emph{SPARQL} queries to gather the structural information of each node of the underlying graph. Subsequently, the queried information is aggregated and simplified by merging the constraint information of classes belonging to the same type and predicates. While Van Dam et al. applied \emph{RDF2Graph} to the UniProt RDF resource, Werkmeister adapted it to also infer Wikidata constraints.
Fernandez-{Álvarez} et al. have taken a different approach with their tool \emph{Shexer}~\cite{Shexer}. In contrast to the aforementioned tool, they avoid querying the whole underlying graph by using an iterative approach that determines whether the currently iterated (sub-)set of triples is relevant for the constraint generation process. Given a target shape, the preselected triples are used to decorate each target instance with its constraints.
Another constraint generator, \emph{ABSTAT}~\cite{Spahiu2016ABSTATOL}, has been introduced by Spahiu et al. This tool uses an approach similar to that of \emph{RDF2Graph}: it collects structural information using \emph{SPARQL} queries and summarizes the resulting constraints afterwards.

\section{Approach}
\label{section:approach}
%You may add any subsections you deem appropriate for your specific project. Some examples for your reference: Technology stack, Training strategy, Data, Experiments, etc.
To construct a framework that offers a way to evaluate a \emph{knowledge graph} in an automated way, we divided our project into three main subtasks.
First, we fetch a subgraph of a \emph{knowledge graph} from the \emph{CommonCrawl} datasets, as explained in greater detail in Section~\ref{section:fetchingKG}.
After this, our framework infers constraints over this data set (see Section~\ref{generatingconstraints}).
In the last step, the subgraph is validated against the constraints (see Section~\ref{validatingconstraints}).
The structure of the framework can be seen in Fig.~\ref{fig:uml}.
The user can interact with this framework through the front-end (see Section~\ref{frontend}).

The framework can be found in our git repository~\cite{git_shapes}.
The repository also includes a README file describing how to set up and install the project.
%Our framework offers a way to evaluate a \emph{knowledge graph} in an automated way. For this, we used \emph{knowledge graphs} from the \emph{CommonCrawl} datasets as a basis. The \emph{knowledge graphs} are imported as a static file. After this, our framework infers constraints over this data set (see Section~\ref{generatingconstraints}). These are validated automatically in the last step, see Section~\ref{validatingconstraints}. The user can interact with this framework over the front-end, see Section~\ref{frontend}. These different steps were implemented and tested separately. Once this was done, we consolidated them. The structure of our project can be seen in Fig.~\ref{fig:uml}. \todo{update figure}

\begin{figure}[ht]
	\centering
	\includegraphics[scale=0.35]{kg_shapes_uml.pdf}
	\caption{UML diagram of the framework structure}
	\label{fig:uml}
\end{figure}
\subsection{Technology Stack}
In this section, we summarise the main technologies that we used in this project.
The framework was implemented in \emph{Java}, with \emph{Maven} as the project management tool. We also used the Java framework \emph{Jena}, which offers an \emph{RDF} API as well as support for \emph{SPARQL} queries and the \emph{ShEx} language. The front-end was implemented using \emph{Vue3}~\cite{Vue} as the front-end framework and \emph{PrimeVue} as a library for the different UI components. For the deployment of our application, we used a single virtual machine. The front-end is served by a single \emph{Apache} server and accesses the back-end via an internal \emph{REST} API.
\FloatBarrier

\subsection{Constructing Subgraph}
\label{section:fetchingKG}
Because \emph{knowledge graphs} can be very large and contain many nodes, we concentrated on querying smaller subgraphs and working only on those. With this method, the relevant subgraph is extracted from a \emph{knowledge graph} and can be processed in isolation.
We take our initial \emph{knowledge graphs} from the \emph{CommonCrawl} datasets and import them as static files.
Figure~\ref{fig:query_construct_subgraph} shows the query we used to create a subgraph. In line 7, we use \emph{property paths}\footnote{\url{https://www.w3.org/TR/2013/REC-sparql11-query-20130321/\#propertypaths}} to query all nodes connected to those of an initial subset (lines 10 to 13). This subset can optionally be limited to a certain size, but it is always restricted to nodes of a certain type.

\begin{figure}
    \centering
    \lstinputlisting{code_snippets/subgraph.sparql}
    \caption{The \emph{SPARQL} query creating the subgraph. The \emph{\%s} placeholders are substituted before the query is executed.}
    \label{fig:query_construct_subgraph}
\end{figure}
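
The traversal behind this query can also be illustrated outside of \emph{SPARQL}. The following is a minimal Python sketch, not the query from Figure~\ref{fig:query_construct_subgraph}: it assumes that triples are plain tuples, that only outgoing edges are followed, and that the start nodes are cut off in sorted order. The function name \texttt{collect\_subgraph} and its parameters are hypothetical.

\begin{lstlisting}[language=Python]
from collections import deque

RDF_TYPE = "rdf:type"


def collect_subgraph(triples, start_type, limit=None):
    """Return all triples reachable from start nodes of the given type."""
    # Index outgoing edges by subject for the traversal.
    outgoing = {}
    for s, p, o in triples:
        outgoing.setdefault(s, []).append((s, p, o))

    # Start nodes: subjects typed as start_type, optionally cut off at limit.
    starts = sorted(
        {s for s, p, o in triples if p == RDF_TYPE and o == start_type}
    )
    if limit is not None:
        starts = starts[:limit]

    # Breadth-first traversal; every node is expanded at most once.
    seen, queue, result = set(starts), deque(starts), []
    while queue:
        node = queue.popleft()
        for triple in outgoing.get(node, []):
            result.append(triple)
            target = triple[2]
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return result
\end{lstlisting}

Because every node is expanded only once, cycles in the graph do not cause the traversal to loop.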

\FloatBarrier
\subsection{Generating Constraints}
\label{generatingconstraints}
To shape a \emph{knowledge graph} we need to infer constraints on the previously fetched subgraph.
For the generation of constraints, we used Werkmeister's adaptation~\cite{werkmeister2018,werkmeister_rdf2graph_git} of the tool \emph{RDF2Graph}~\cite{original_rdf2graph_git} and adapted it further for our purposes. As input, \emph{RDF2Graph} takes a subgraph constructed as described in Section~\ref{section:fetchingKG}.
The properties of the graph are read out with several \emph{SPARQL} queries. These properties are saved in a new \emph{RDF} graph. As output, we receive a graph containing constraints for the initial input data. We use a tool offered by \emph{RDF2Graph} to extract the constraints in \emph{ShEx} syntax.
%\subsubsection{Integrating RDF2Graph with our framework}
We implemented the following steps in order to integrate \emph{RDF2Graph} into our project. 
We added \emph{RDF2Graph} to our framework so that both could be compiled together, and in the process updated it as far as necessary to be compatible with our versions of \emph{Java} and \emph{Jena}. In addition, we changed some of the initial parameters of \emph{RDF2Graph}, since it was originally intended as a stand-alone application.
As we are handling \emph{Models} in our software, we changed the input of \emph{RDF2Graph} to a \emph{Model}. In our application, \emph{RDF2Graph} does not use any storage other than the \emph{Model} data structure. Previously, such a \emph{Model} had to be created by \emph{RDF2Graph}; now it is provided by our framework. We did this to have full control over the files handled by \emph{RDF2Graph}. \emph{RDF2Graph} allows for multithreaded execution, which requires a thread pool. This thread pool was initially created by \emph{RDF2Graph}; in our framework, it is provided by our application. In addition, resources used by \emph{RDF2Graph} had to be provided in a different way so that they remain available when running in a server environment. We also changed some of the queries. \emph{RDF2Graph} supports multiple output graphs; however, this did not work. As we only work on one \emph{Model} at a time, we use a single output graph.
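
Passing the thread pool in from the outside, rather than letting the library create it, can be sketched as follows. This is an illustrative Python analogue and not the actual Java code of \emph{RDF2Graph} or of our framework; \texttt{run\_queries} and the two query functions are hypothetical stand-ins.

\begin{lstlisting}[language=Python]
from concurrent.futures import ThreadPoolExecutor


def run_queries(queries, model, executor):
    """Run each query function against the model on a caller-supplied pool."""
    futures = [executor.submit(query, model) for query in queries]
    # Collect the results in submission order.
    return [future.result() for future in futures]


def count_triples(model):
    return len(model)


def count_predicates(model):
    return len({p for _, p, _ in model})


if __name__ == "__main__":
    model = [("c1", "rdf:type", "Canal"), ("c1", "name", "n1")]
    # The application owns the pool and decides its size and lifetime.
    with ThreadPoolExecutor(max_workers=2) as pool:
        print(run_queries([count_triples, count_predicates], model, pool))
\end{lstlisting}

With this arrangement, the caller controls how many threads exist and when they are shut down, which matters when the code runs inside a long-lived server process.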
\subsection{Validating Constraints}
\label{validatingconstraints}
Given an \emph{RDF} graph and a set of constraints, the validation consists of verifying that every node in the graph fulfils the requirements given in the constraints. A graph may contain nodes of different types. Each node must conform to the definition of its type given in the constraints. The result of the validation is a list containing, for every node, its id, a boolean flag, and an optional ``reason'' entry. The boolean flag indicates whether the node conforms to its type's constraints. In case of nonconformity, a reason is given.
For the implementation of this process, an \emph{RDF} subgraph and \emph{ShEx} constraints are required as input. For the actual validation, we used the \emph{ShExValidator} provided by the \emph{Jena} library~\cite{Jena}. The validator requires a set of constraints defined in valid \emph{ShEx} syntax and a \emph{shape map}.
The \emph{shape map} describes which types of nodes need to be validated against which \emph{ShEx} constraint definitions. We query the subgraph for its types of nodes (see Figure~\ref{fig:shape_map_query}) and construct the \emph{shape map} from the result. Figure~\ref{fig:shape_map} shows an example.

\begin{figure}
    \centering
    \lstinputlisting{code_snippets/RiverBodyOfWater_RiverBodyOfWater_null.shapemap_query.sparql}
    \caption{The query that retrieves the different node types to be used in the \emph{shape map}.}
    \label{fig:shape_map_query}
\end{figure}

\begin{figure}
    \centering
    \lstinputlisting{code_snippets/RiverBodyOfWater_RiverBodyOfWater_null.shapemap}
    \caption{The shape map used for validating the full subgraph starting with RiverBodyOfWater-nodes.}
    \label{fig:shape_map}
\end{figure}
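
The construction of the \emph{shape map} can be sketched in a few lines of Python. The sketch assumes triples as plain tuples and, purely for illustration, uses the type IRI itself as the shape label; the labels actually produced by \emph{RDF2Graph} may differ. The function names are hypothetical.

\begin{lstlisting}[language=Python]
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def distinct_types(triples):
    """Collect the distinct rdf:type objects found in the triples."""
    return sorted({o for _, p, o in triples if p == RDF_TYPE})


def build_shape_map(types, shape_for=lambda type_iri: type_iri):
    """Pair all nodes of each type with the shape assumed to describe it."""
    entries = [
        f"{{FOCUS <{RDF_TYPE}> <{t}>}}@<{shape_for(t)}>" for t in types
    ]
    # Shape map entries are separated by commas.
    return ",\n".join(entries)
\end{lstlisting}

Each entry selects every node that has the given type as focus node and associates it with one shape, so the validator knows which constraint definition to apply to which nodes.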

The class \emph{ShexValidationRecord} stores the result of the validation for each node of the graph. Additionally, the percentage of nodes that conform to their constraints is calculated and stored.
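
The shape of such a record can be sketched as follows. This is a simplified Python analogue and not the Java class \emph{ShexValidationRecord} itself; the class names \texttt{NodeResult} and \texttt{ValidationRecord} and their fields are hypothetical and only mirror the three entries described above (node id, boolean flag, optional reason).

\begin{lstlisting}[language=Python]
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeResult:
    node_id: str
    conforms: bool
    reason: Optional[str] = None  # only set for non-conforming nodes


@dataclass
class ValidationRecord:
    results: List[NodeResult]

    def percentage_conforming(self) -> float:
        """Share of conforming nodes in percent (0.0 for an empty graph)."""
        if not self.results:
            return 0.0
        conforming = sum(1 for r in self.results if r.conforms)
        return 100.0 * conforming / len(self.results)

    def failures(self) -> List[NodeResult]:
        """All non-conforming nodes, together with their reasons."""
        return [r for r in self.results if not r.conforms]
\end{lstlisting}

Keeping the individual results next to the aggregate makes it possible to report a single percentage and still inspect the non-conforming nodes one by one.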

\subsection{Front-end}
\label{frontend}
We implemented a front-end in which the user can choose a \emph{knowledge graph} as well as a type (see Figure~\ref{fig:frontend}). In addition, the user can set a limit on the number of nodes of the specified type for which constraints are generated. As output (see Figure~\ref{fig:frontend_shex}), the \emph{ShEx} constraints as well as a validation of the subgraph against those constraints are returned. The constraints can be edited by the user, and the selected subgraph can be re-validated against the edited constraints.
If a node is deemed invalid, a reason is given, e.g.\ ``Cardinality violation (min=1): 0''. The user can download the subgraph that was validated. The interaction between user, front-end, and server is shown in Fig.~\ref{fig:sequence}.
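
A reason such as the one quoted above results from comparing the number of values a node has for a property with the bounds given in the constraint. A minimal Python sketch of such a check follows. The message format imitates the example above, while the function \texttt{check\_cardinality}, its parameters, and the wording of the upper-bound message are assumptions and are not taken from \emph{Jena}.

\begin{lstlisting}[language=Python]
def check_cardinality(triples, node, predicate, min_count=1, max_count=None):
    """Return (conforms, reason) for one predicate of one node."""
    # Count how many values the node has for the predicate.
    count = sum(1 for s, p, _ in triples if s == node and p == predicate)
    if count < min_count:
        return False, f"Cardinality violation (min={min_count}): {count}"
    if max_count is not None and count > max_count:
        return False, f"Cardinality violation (max={max_count}): {count}"
    return True, None
\end{lstlisting}

In the quoted example, the constraint requires at least one value (\texttt{min=1}) and the node has none, hence the trailing \texttt{0}.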
\begin{figure}[ht]
	\centering
	\includegraphics[scale=0.18]{frontend/frontend_edit.png}
	\caption{The frontend, showing a selection of dataset, RDFtype, and \emph{LIMIT} of starting-nodes.}
	\label{fig:frontend}
\end{figure}

\begin{figure}[h!]
	\centering
	\includegraphics[scale=0.18]{frontend/frontend_edit_done.png}
	\caption{The frontend, showing the calculated \emph{ShEx}-constraints and validation-results.}
	\label{fig:frontend_shex}
\end{figure}


\begin{figure}[ht]
	\centering
	\includegraphics[scale=0.5]{kgshapes_sequence.pdf}
	\caption{Sequence diagram showing the interaction between web application, user and server}
	\label{fig:sequence}
\end{figure}

\section{Evaluation}
\label{section:evaluation}
In this section, we evaluate our tool. We first explain the methodology in Section~\ref{section:methodology} and then measure the runtime of our tool with different input parameters in Section~\ref{section:runtime}. Finally, we test the correctness of the generated \emph{ShEx} constraints and cross-validate them in Section~\ref{section:correctness}.
\subsection{Methodology}
\label{section:methodology}
For taking measurements, the application was started locally on our own hardware. This was done to minimise side effects of other applications running on the virtual machine where the live instance is deployed. We used a machine with a \emph{Ryzen 9 3900X} CPU (12 cores at 3.8\,GHz), DDR4 RAM, and an SSD.
Additionally, the JVM was configured to use up to 16\,GB of main memory for its heap. This allows queries to run in parallel without the slowdown that extensive swap usage would cause.
\subsection{Runtime}
\label{section:runtime}
Figures~\ref{fig:exec_times_per_limit} and \ref{fig:exec_times_per_triples} show the measurements we obtained by changing the \emph{LIMIT} input parameter. This parameter limits the size of the start-node subset, from which connected nodes are queried. All the measurements are shown in Tables~\ref{table:runtimes_wo_limit} and \ref{table:runtimes_w_limit}.
\begin{figure}[h!]
\begin{subfigure}{\textwidth}
\centering
\includegraphics[width=0.65\linewidth]{img/limit_legend.pdf}
\end{subfigure}
\begin{subfigure}{0.33\textwidth}
\includegraphics[width=0.9\linewidth]{img/Canal_limit.pdf}
\caption{RDFType = Canal}
\label{fig:limit_canal}
\end{subfigure}
\begin{subfigure}{0.33\textwidth}
\includegraphics[width=0.9\linewidth]{img/RiverBodyOfWater_limit.pdf}
\caption{RDFType = RiverBodyOfWater}
\label{fig:limit_rbow}
\end{subfigure}
\begin{subfigure}{0.33\textwidth}
\includegraphics[width=0.9\linewidth]{img/Service_limit.pdf}
\caption{RDFType = Service}
\label{fig:limit_service}
\end{subfigure}
\caption{Execution times per RDFType, per size of start-node subset on RiverBodyOfWater dataset}
\label{fig:exec_times_per_limit}
\end{figure}

The results shown in Figure~\ref{fig:exec_times_per_limit} are as expected. First, the runtime of constructing the desired subset of the graph is considerably larger than the time needed to create the \emph{ShEx} constraints or to validate the graph against them.
Second, the runtime of constructing the subgraph scales with the \emph{LIMIT}. This becomes especially evident in Figures~\ref{fig:limit_canal} and \ref{fig:limit_service}.
To understand the behaviour shown in Figure~\ref{fig:limit_rbow}, we look at Figure~\ref{fig:exec_times_per_triples}, which shows the same runtimes grouped by the number of triples in the subgraph on which the constraints are created. Unlike in Figures~\ref{fig:triple_canal} and \ref{fig:triple_service}, the maximum number of triples in Figure~\ref{fig:triple_rbow} (shown on the x-axis) is 1769.
This is also the number of triples contained in the subgraph that we obtain without providing any limit. Therefore, providing a limit larger than 200 does not enrich the constructed graph, which keeps the runtime almost constant with regard to the \emph{LIMIT} parameter.

\begin{figure}[h!]
\begin{subfigure}{\textwidth}
\centering
\includegraphics[width=0.65\linewidth]{img/triple_legend.pdf}
\end{subfigure}
\begin{subfigure}{0.33\textwidth}
\includegraphics[width=0.9\linewidth]{img/Canal_triple.pdf}
\caption{RDFType = Canal}
\label{fig:triple_canal}
\end{subfigure}
\begin{subfigure}{0.33\textwidth}
\includegraphics[width=0.9\linewidth]{img/RiverBodyOfWater_triple.pdf}
\caption{RDFType = RiverBodyOfWater}
\label{fig:triple_rbow}
\end{subfigure}
\begin{subfigure}{0.33\textwidth}
\includegraphics[width=0.9\linewidth]{img/Service_triple.pdf}
\caption{RDFType = Service}
\label{fig:triple_service}
\end{subfigure}

\caption{Execution times per RDFType, per number of triples on RiverBodyOfWater dataset}
\label{fig:exec_times_per_triples}
\end{figure}

Figure~\ref{fig:exec_times_no_limit} shows the runtime without limiting the construction of the subgraph. Note the much larger runtime needed for querying the graph, even though it yields the same number of triples as a sufficiently large \emph{LIMIT}.
\begin{figure}[h!]
\begin{subfigure}{\textwidth}
\centering
\includegraphics[width=0.65\linewidth]{img/no_limit_legend.pdf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\includegraphics[width=0.6\linewidth]{img/no_limit.pdf}
\end{subfigure}
\caption{Execution times per RDFType of the RiverBodyOfWater dataset (containing 49915 triples)}
\label{fig:exec_times_no_limit}
\end{figure}

\subsection{Correctness}
\label{section:correctness}
\subsubsection{ShEx Generation}
We considered \emph{Shexer}, which was already mentioned in Section~\ref{section:related_work}, a good fit for cross-validating our \emph{ShEx} generation. However, due to our limited experience with this tool, we did not manage to generate proper constraints for our RiverBodyOfWater dataset. Our attempt at using the tool is shown in Figure~\ref{code:shexer}; it generated the trivial, non-restrictive constraints shown in Figure~\ref{fig:shexer_output}.
Therefore, we manually checked the generated constraints for small subgraphs (see Figures~\ref{fig:shex_canal_50}, \ref{fig:shex_rbow_50} and \ref{fig:shex_service_50}) and identified two issues with our tool.
First, if the dataset consists only of stand-alone blank nodes, as seen in Figure~\ref{code:blank_nodes}, then \emph{RDF2Graph} does not infer any \emph{ShEx} constraints. This was the case for the subgraph generated using RiverBodyOfWater with a \emph{LIMIT} of 50; the resulting \emph{ShEx} can be seen in Figure~\ref{fig:shex_rbow_50}.
Second, optional properties are not always inferred and are therefore missing from the generated \emph{ShEx} constraints. This also happens for unlimited subgraphs (see Figures~\ref{fig:shex_canal_wo_limit}, \ref{fig:shex_rbow_wo_limit} and \ref{fig:shex_service_wo_limit}), with the exception of the RiverBodyOfWater RDFType, where the constraints appear to be complete. However, due to the size of the graph, manually checking for correctness is infeasible. We did not see a correlation between missing constraint properties and the shape of the graph.


\subsubsection{ShEx Validation}
\todo{No subsection for just 2 sentences}

The generated \emph{ShEx} constraints for small subgraphs (Canal with \emph{LIMIT} 50, Service with \emph{LIMIT} 50) were cross-validated using the online tool \emph{RDFShape}~\cite{RDFShape}. The validation results were the same as those of our tool.

\begin{figure}
    \centering
    \lstinputlisting{code_snippets/blank_nodes.ttl}
    \caption{Blank Nodes in Turtle File}
    \label{code:blank_nodes}
\end{figure}

\section{Results} \label{section:results}
Our framework automatically infers constraints and validates the given data against those constraints. This can be done on two different \emph{CommonCrawl} datasets. Using the front-end, the user can choose one of those datasets and a limit, and can also edit the constraints. While the validation of the subgraph works as expected, the generated constraints are prone to missing optional properties (see Section~\ref{section:correctness}) and would benefit from more tests and tuning.
\todo{describe results of benchmark tests here}

\section{Future work}
\label{section:future_work}
Our application currently only handles two different datasets. For future work, this could be expanded so that the framework could handle more and bigger datasets. Currently, the size of the datasets that can be handled is limited by the RAM on the virtual machine. One possible solution for this could be to only work on parts of the graph.
One problem we encountered when handling datasets from \emph{CommonCrawl} was the quality of these datasets. Many datasets include \emph{non-unicode} characters, which are replaced by Jena with \emph{unicode} characters. This takes a lot of computing time. In addition, many files include invalid \emph{RDF} syntax or are otherwise damaged. This means that in order to handle additional datasets, some way of processing these datasets would have to be implemented. Processing could include filtering for broken files and invalid syntax and fixing this before handling the dataset in the framework.
In addition, more possibilities for user interaction could be added. For instance, a feature could be added where a user can upload their own dataset and have it validated.
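
A first preprocessing step of the kind suggested above could separate files that are valid UTF-8 from those that are not before they reach \emph{Jena}. The following Python sketch shows the idea. It assumes that ``non-unicode'' refers to byte sequences that are not valid UTF-8, it only checks the encoding and not the \emph{RDF} syntax, and the function names are hypothetical.

\begin{lstlisting}[language=Python]
def is_valid_utf8(data):
    """Return True if the byte string decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def partition_by_encoding(files):
    """Split a {name: bytes} mapping into (clean, broken) lists of names."""
    clean, broken = [], []
    for name, data in sorted(files.items()):
        (clean if is_valid_utf8(data) else broken).append(name)
    return clean, broken
\end{lstlisting}

Files in the second list could then be repaired or skipped once, ahead of time, instead of being corrected on every import.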
\section{Conclusion}
\label{section:conclusion}
\Jamie{Which challenges did we face during the implementation? (Maybe depth of SPARQL query, outdated RDF2Graph?) + add challenges to respective section}
\Jamie{Did we achieve what we wanted to do? How well and reliably does the framework work?}
% ------------------------------------------------------------------------
% Bibliography
% ------------------------------------------------------------------------
\bibliography{./bibliography.bib}
\bibliographystyle{abbrv}

\appendix
\section{Contribution Statements}
Each member of the group contributed in an enthusiastic and equal manner, leveraging their individual skills to contribute to the parts of the project where they could make the most impact. Additionally, everyone was eager to learn new skills and teach others what they knew. We are all satisfied with how much everyone contributed and how well we worked together. The following presents a brief summary of each group member's contribution to the project:
\begin{itemize}
  \item Danielle's programming and organizational skills were a great asset to the team. She ensured that meetings had structure, with clear goals, responsibilities, and deadlines defined. She worked mostly on implementing the \emph{ShEx} validation and developing the back-end app, leading several pair programming sessions with her team members. She also participated in the final presentation and parts of the report.
  \item Jamie contributed most with her teamwork skills and adaptability. Although she did not have as much experience programming in Java and JavaScript as other team members, she readily made an effort to learn and thrived with the pair programming method we used, working largely on the web app and parts of the \emph{RDF2Graph} implementation. Additionally, she was heavily involved in designing the presentations and reviewing the report.
  \item Kristina contributed most to the research and planning of the project. Her research skills were heavily utilized in the initial phase of the project, which greatly helped the others when it came to choosing libraries and overcoming technical difficulties in the implementation. She worked mostly on the parts of the project requiring knowledge of \emph{ShEx}. Her additional research also proved useful in delivering thorough and well-thought-out presentations and in drafting the report.
  \item Philipp's technical skills were highly useful in the programming part of the project. He advised on the selection of the tech stack and led many pair programming sessions, readily sharing his technical knowledge with the other team members. This also came in handy when contributing to the evaluation and related work sections of the report.
  \item Valerian, like Philipp, had strong technical skills that he applied in various areas of the project. He worked on the \emph{SPARQL} parts of the programming and on creating the front-end app, often leading pair programming sessions. He also contributed to the evaluation and to writing up its results.
\end{itemize}

\section{Appendix}
You may use appendices to include any auxiliary results you would like to share but cannot include in the main text due to the page limit.
\subsection{Code listings and additional data}
\begin{figure}
    \centering
    \lstinputlisting[language=python]{code_snippets/shexer.py}
    \caption{Running shexer on the full graph}
    \label{code:shexer}
\end{figure}

\begin{figure}
    \centering
    \lstinputlisting{code_snippets/shexer_out.shex}
    \caption{Shexer output}
    \label{fig:shexer_output}
\end{figure}
\begin{figure}
    \centering
    \lstinputlisting{code_snippets/RiverBodyOfWater_Canal_50.shex}
    \caption{Generated \emph{ShEx}-constraints of Canal with \emph{LIMIT} 50}
    \label{fig:shex_canal_50}
\end{figure}

\begin{figure}
    \centering
    \lstinputlisting{code_snippets/RiverBodyOfWater_RiverBodyOfWater_50.shex}
    \caption{Generated \emph{ShEx}-constraints of RiverBodyOfWater with \emph{LIMIT} 50}
    \label{fig:shex_rbow_50}
\end{figure}

\begin{figure}
    \centering
    \lstinputlisting{code_snippets/RiverBodyOfWater_Service_50.shex}
    \caption{Generated \emph{ShEx}-constraints of Service with \emph{LIMIT} 50}
    \label{fig:shex_service_50}
\end{figure}

\begin{figure}
    \centering
    \lstinputlisting{code_snippets/RiverBodyOfWater_Canal_null.shex}
    \caption{Generated \emph{ShEx}-constraints of Canal without a \emph{LIMIT}}
    \label{fig:shex_canal_wo_limit}
\end{figure}

\begin{figure}
    \centering
    \lstinputlisting{code_snippets/RiverBodyOfWater_RiverBodyOfWater_null.shex}
    \caption{Generated \emph{ShEx}-constraints of RiverBodyOfWater without a \emph{LIMIT}}
    \label{fig:shex_rbow_wo_limit}
\end{figure}

\begin{figure}
    \centering
    \lstinputlisting{code_snippets/RiverBodyOfWater_Service_null.shex}
    \caption{Generated \emph{ShEx}-constraints of Service without a \emph{LIMIT}}
    \label{fig:shex_service_wo_limit}
\end{figure}

\begin{table}[!ht]
    \centering
    \begin{tabular}{|l|r|r|r|r|}
    \hline
        RDF type & Triples & $t_{\text{graph}}$ [ms] & $t_{\text{shex}}$ [ms] & $t_{\text{validation}}$ [ms] \\ \hline
        Canal & 16961 & 360000 & 737 & 45 \\ \hline
        GeoCoordinates & 204 & 468000 & 585 & 4 \\ \hline
        RiverBodyOfWater & 1769 & 468000 & 613 & 15 \\ \hline
        Service & 7334 & 462000 & 618 & 19 \\ \hline
    \end{tabular}
    \caption{Execution times per RDF-Type, queried on full graph of the RiverBodyOfWater dataset (containing 49915 triples)}
    \label{table:runtimes_wo_limit}
\end{table}

\begin{table}
    \centering
    \begin{tabular}{|l|r|r|r|r|r|}
    \hline
        RDF type & Limit & Triples & $t_{\text{graph}}$ [ms] & $t_{\text{shex}}$ [ms] & $t_{\text{validation}}$ [ms] \\ \hline
        Canal & 50 & 226 & 2420 & 923 & 44 \\ \hline
        Canal & 100 & 328 & 2260 & 709 & 13 \\ \hline
        Canal & 200 & 765 & 1740 & 637 & 8 \\ \hline
        Canal & 400 & 1588 & 2020 & 559 & 9 \\ \hline
        Canal & 800 & 3176 & 3270 & 661 & 8 \\ \hline
        Canal & 1600 & 6817 & 5030 & 654 & 17 \\ \hline
        Canal & 3200 & 13504 & 9100 & 736 & 33 \\ \hline
        Canal & 6400 & 16961 & 10790 & 665 & 25 \\ \hline
        RiverBodyOfWater & 50 & 192 & 2680 & 615 & 5 \\ \hline
        RiverBodyOfWater & 75 & 291 & 2490 & 586 & 4 \\ \hline
        RiverBodyOfWater & 100 & 1187 & 2700 & 624 & 7 \\ \hline
        RiverBodyOfWater & 200 & 1769 & 4890 & 643 & 17 \\ \hline
        RiverBodyOfWater & 400 & 1769 & 4840 & 641 & 8 \\ \hline
        RiverBodyOfWater & 800 & 1769 & 4980 & 619 & 18 \\ \hline
        Service & 50 & 615 & 1640 & 602 & 7 \\ \hline
        Service & 100 & 1022 & 1790 & 562 & 6 \\ \hline
        Service & 200 & 1852 & 2300 & 577 & 9 \\ \hline
        Service & 400 & 3041 & 2880 & 601 & 6 \\ \hline
        Service & 800 & 5437 & 4500 & 674 & 30 \\ \hline
        Service & 1600 & 7334 & 5350 & 639 & 26 \\ \hline
    \end{tabular}
    \caption{Execution times per RDF-Type, limited size of start-node subset (using the RiverBodyOfWater dataset)}
    \label{table:runtimes_w_limit}
\end{table}
\end{document}