% !TEX encoding = UTF-8 Unicode
\documentclass[10pt,parskip=half]{scrartcl}

\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage{amssymb,amsmath}
\usepackage{fullpage}
\usepackage{graphicx}
\usepackage{epstopdf}
\usepackage{xspace}
\usepackage{lmodern}
\usepackage{textcomp}
\usepackage{bm}
\usepackage{layouts}
\usepackage{enumitem}
\usepackage{multirow}
\usepackage[table,xcdraw]{xcolor}
\usepackage[percent]{overpic}
\usepackage{tikz}
\usepgflibrary{arrows}% for more options on arrows
\usepackage[colorinlistoftodos]{todonotes}
\usepackage{color,soul}
\usepackage{subfig}
\usepackage{booktabs}
\usepackage{todonotes}
\usepackage[nottoc]{tocbibind} %to make the references appear in the ToC

\usepackage{lipsum}

\def\bf{\bfseries}

\begin{document}

%----------------------------------------------------------------------------------------
% Title page
%----------------------------------------------------------------------------------------
	
	\begin{center}\Large
		\includegraphics[width=5cm]{uibk_logo_4c_cmyk.pdf}\\[5mm]
		
		\begin{large}\bf
			PS 703301 --  WS 2021/22\\
			Current Topics in Computer Science\\[5mm]
		\end{large}
		Final Report\\[20mm]
		{\titlefont \huge Shaping Knowledge Graphs}\\[20mm]
		 
		Philipp Gritsch \\
		Jamie Hochrainer \\
		Kristina Magnussen \\
		Danielle McKenney\\
		Valerian Wintner\\[35mm]
		
		supervised by\\
		M.Sc. Elwin	Huaman\\
		\vfill
	\end{center}
\thispagestyle{empty}	
\pagebreak
% ------------------------------------------------------------------------
\tableofcontents
\pagebreak
% ------------------------------------------------------------------------
\section{Introduction}
\label{introduction}
\todo{add introduction to KGs}
%Our task was to implement a framework for shaping \emph{knowledge graphs}. This consisted of three major steps. First of all, we had to fetch a subset of data using \emph{SPARQL} queries, see Section~\ref{fetchingdata}. After this, we had to infer constraints over this data set (see Section~\ref{generatingconstraints}). These were validated automatically in the last step, see Section~\ref{validatingconstraints}. In addition, we also implemented a front-end so that a user could interact with the given framework. \\


We used \emph{CommonCrawl} datasets as the basis for the \emph{knowledge graph} that we wanted to assess. The data in these datasets is often inconsistent and may contain errors. To work with this data properly, it is necessary to shape the \emph{knowledge graph} that contains it. Shaping is done by inferring constraints over the data and validating the data against these constraints. Validating a graph against constraints gives important insight into the structure of the data. For instance, when all nodes of a type conform to a set of constraints, it may be useful to define these constraints as required attributes for all future nodes of that type to ensure uniformity in the data. Non-conforming nodes may also show where information is missing. For example, if 99\% of the nodes of a given type conform to some constraints, it may be worthwhile to investigate the remaining 1\% to see whether they are missing necessary information or are otherwise corrupt. \\


\section{Related Work}
\todo{Add thesis Werkmeister + RDF2Graph, also add another work, maybe from sources in thesis}

\section{Approach}
%You may add any subsections you deem appropriate for your specific project. Some examples for your reference: Technology stack, Training strategy, Data, Experiments, etc.
Our framework evaluates a \emph{knowledge graph} automatically. It first fetches a subset of the data using \emph{SPARQL} queries (see Section~\ref{fetchingdata}). It then infers constraints over this data set (see Section~\ref{generatingconstraints}). In the last step, the data is validated against these constraints automatically (see Section~\ref{validatingconstraints}). The user can interact with the framework through the front-end (see Section~\ref{frontend}). The individual steps were implemented and tested separately and then consolidated. The structure of our project can be seen in \todo{add reference to UML diagram}

\missingfigure{add UML diagram of structure here????}

\subsection{Technology Stack}
The framework was implemented in \emph{Java}, with \emph{Maven} as the project management tool. We also used \emph{Jena}, which offers an \emph{RDF} API as well as support for \emph{SPARQL} queries and the \emph{ShEx} language. The front-end was implemented using \emph{Vue.js}~\cite{Vue} and \emph{JavaScript}.




\subsection{Fetching a Data Set}
\label{fetchingdata}
We used one of the \emph{CommonCrawl} datasets as input. To fetch a data set, we used a \emph{SPARQL} query that returns a subgraph of the initial \emph{RDF} graph. \todo{query depth fixed?}
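
As an illustration only, the following minimal sketch shows how such a query could be assembled in \emph{Java}. It is not the query used in the project: the class \texttt{SubgraphQuery}, the method \texttt{build}, and the example type IRI are hypothetical, and the sketch assumes a \texttt{CONSTRUCT} query that collects the triples of all subjects of one type and is capped by a \texttt{LIMIT}.

\begin{verbatim}
// Hypothetical helper, not taken from the project code base.
final class SubgraphQuery {

    private SubgraphQuery() { }

    /**
     * Builds a SPARQL CONSTRUCT query that returns the triples of all
     * subjects with the given rdf:type.
     * LIMIT caps the number of matched triples, not the number of subjects.
     * The type IRI is inserted verbatim, so it must come from a trusted source.
     */
    static String build(String typeIri, int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        return String.join("\n",
            "CONSTRUCT { ?s ?p ?o }",
            "WHERE {",
            "  ?s a <" + typeIri + "> .",
            "  ?s ?p ?o .",
            "}",
            "LIMIT " + limit);
    }

    public static void main(String[] args) {
        // Example type IRI chosen for illustration only.
        System.out.println(build("http://schema.org/Hotel", 100));
    }
}
\end{verbatim}

Because the limit applies to the solutions of the \texttt{WHERE} pattern, a small limit may return only some of the triples of a subject; this is one reason why the limit influences the inferred constraints.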

\missingfigure{add query as code snippet} 

\subsection{Generating Constraints}
\label{generatingconstraints}

\subsection{Validating Constraints}
\label{validatingconstraints}
Given an \emph{RDF} graph and a set of constraints, validation consists of verifying that every node in the graph fulfils the requirements given in the constraints. A graph contains nodes of several different types, and each node must conform to the definition of its type outlined in the constraints. The result of the validation is a Boolean flag for every node in the graph, indicating whether or not it conforms to its type's constraints. In case of nonconformity, a reason is given. \\

In our code, this is implemented as follows. As input, we receive an \emph{RDF} subgraph and a set of constraints. From these we generate a \emph{shape map}, which contains all of the types that need to be validated. For the actual validation, we use the \emph{ShExValidator} provided by the \emph{Jena} library. \todo{add reference to Jena library here?} The validator requires a set of constraints defined in valid \emph{ShEx} syntax and a \emph{shape map}. The \emph{shape map} describes which types of nodes are validated against which \emph{ShEx} constraint definitions.
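
A \emph{shape map} of this kind could, for example, be rendered as text. The sketch below assumes the query shape map syntax of the \emph{ShEx} shape map specification, in which an association of the form \verb|{FOCUS <predicate> <object>}@<shape>| selects every node with a matching triple as a focus node. The class \texttt{ShapeMapSketch}, its method, and the example IRIs are hypothetical and are not taken from our implementation.

\begin{verbatim}
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not taken from the project code base.
final class ShapeMapSketch {

    static final String RDF_TYPE =
        "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>";

    private ShapeMapSketch() { }

    /** Renders one association per (type IRI, shape IRI) pair. */
    static String render(Map<String, String> typeToShape) {
        return typeToShape.entrySet().stream()
            .map(e -> "{FOCUS " + RDF_TYPE + " <" + e.getKey() + ">}"
                    + "@<" + e.getValue() + ">")
            .collect(Collectors.joining(",\n"));
    }

    public static void main(String[] args) {
        // Example IRIs chosen for illustration only.
        Map<String, String> typeToShape = new LinkedHashMap<>();
        typeToShape.put("http://schema.org/Hotel",
                        "http://example.org/shapes/HotelShape");
        typeToShape.put("http://schema.org/Person",
                        "http://example.org/shapes/PersonShape");
        System.out.println(render(typeToShape));
    }
}
\end{verbatim}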

\missingfigure{add picture/code snippet of shapeMap} 

The class \emph{ShexValidationRecord} stores the result of the validation for every node of the graph. In addition to the individual result of each node, we calculate the percentage of nodes that conform to their constraints.
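
This bookkeeping can be sketched as follows. The class and method names are hypothetical and simplified; they do not reproduce \emph{ShexValidationRecord}. Each node carries a Boolean flag and, if it does not conform, a reason; the conformance rate is the share of conforming nodes among all validated nodes.

\begin{verbatim}
import java.util.List;

// Hypothetical, simplified model of per-node validation results.
final class ConformanceSketch {

    /** Validation outcome for a single node. */
    static final class NodeResult {
        final String nodeIri;
        final boolean conforms;
        final String reason; // null when the node conforms

        NodeResult(String nodeIri, boolean conforms, String reason) {
            this.nodeIri = nodeIri;
            this.conforms = conforms;
            this.reason = reason;
        }
    }

    private ConformanceSketch() { }

    /** Returns the percentage of conforming nodes, or 0.0 for no nodes. */
    static double percentageConforming(List<NodeResult> results) {
        if (results.isEmpty()) {
            return 0.0;
        }
        long conforming = results.stream().filter(r -> r.conforms).count();
        return 100.0 * conforming / results.size();
    }

    public static void main(String[] args) {
        // Example data chosen for illustration only.
        List<NodeResult> results = List.of(
            new NodeResult("http://example.org/hotel/1", true, null),
            new NodeResult("http://example.org/hotel/2", true, null),
            new NodeResult("http://example.org/hotel/3", false,
                           "missing required property"));
        System.out.println(percentageConforming(results));
    }
}
\end{verbatim}

Returning 0.0 for an empty result list is a design choice of this sketch; an implementation may prefer to report that no nodes were validated.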

\subsection{Front-end}
\label{frontend}
\todo{explain how different limit influences data output}
\missingfigure{add screenshot of the frontend} 
\missingfigure{add sequence diagram here} 


%\subsection{Citation}
%
%As a computer science student, you should read \cite{turing1950}. A useful \LaTeX\ companion is \cite{mittelbach2004}.
%
%\subsection{Table}
%\begin{table}[ht]
%	\centering
%	\begin{tabular}{lll}
%		\toprule
%		%Row1
%		X & Y1 & Y2\\
%		\midrule
%		%Row2
%		X1 & 1 &2\\
%		%Row3
%		X2 & 3 & 4\\
%		%Row4
%		X3 &5& 6\\
%		\bottomrule
%	\end{tabular}
%	\caption{This is a very simple table.} \label{tab:example1}
%\end{table}
%
%\begin{table}[ht]
%\centering
%	\begin{tabular}{lcc}
%		\toprule
%		%Row1
%		X & Y1 & Y2\\
%		\midrule
%		%Row2
%		X1 & \multicolumn{2}{c}{1}\\
%		%Row3
%		X2 & \multirow{2}{*}{2} & 3\\
%		%Row4
%		X3 && 4\\
%		\bottomrule
%	\end{tabular}
%\caption{This is another table.}
%\label{tab:example2}
%\end{table}
%
%You can refer to Table \ref{tab:example2} or Table \ref{tab:example1}.
%
%\subsection{Figure}
%
%\begin{figure}[ht]
%	\centering
%	\includegraphics[width=100pt]{mannequin.jpg}
%	\caption{This is a figure.}
%	\label{fig:example}
%\end{figure}
%
%There is a beautiful figure (Fig.~\ref{fig:example}).

\section{Results}
Our framework automatically infers constraints and validates the given data against those constraints. It currently supports two different \emph{CommonCrawl} datasets. Using the front-end, the user can choose one of those datasets and a limit \todo{explain this limit in more depth, maybe in front-end?}.

\missingfigure{Maybe add small figure that shows workflow of project here? Something similar like we did in presentation but more professional?}

\todo{describe results of benchmark tests here} 

\subsection{Future work}
\todo{Possible future work could be: more data sets, more possibilities for user inputs}


\section{Conclusion}
\todo{Which challenges did we face during the implementation? (Maybe depth of SPARQL query, outdated RDF2Graph?)}
\todo{Did we achieve what we wanted to do? How well and reliably does the framework work?}


% ------------------------------------------------------------------------
% Bibliography
% ------------------------------------------------------------------------
\bibliography{bibliography}
\bibliographystyle{abbrv}

\appendix
\section{Contribution Statements}
Please write down a short contribution statement for each member of your group. You may evaluate the contribution along the three common categories:
i) conception (i.\,e., problem framing, ideation, validation, and method selection), ii) operational work (e.\,g., setting up your tech stack, algorithm implementation, data analysis, and interpretation), and iii) writing \& reporting (i.\,e., report drafting, literature review, revision of comments, presentation preparations, etc.).
\section{Appendix}
You may use appendices to include any auxiliary results you would like to share but cannot include in the main text due to the page limit.
\end{document}