\Jamie{We need an abstract!}
\section{Introduction}
\label{introduction}
Ever more devices collect ever larger amounts of data. Since such volumes of data quickly become unmanageable, we need a way of representing them usefully. This is where \emph{knowledge graphs} become important. Various definitions of \emph{knowledge graphs} exist, but as the name indicates, they are essentially knowledge models structured as graphs. Such a knowledge model contains types, entities, and literals, as well as relationships between them. A \emph{knowledge graph} can make it easier to find and process facts in which one might be interested.
However, large datasets can be inconsistent and contain errors. In order to work with such data properly, it is necessary to shape the \emph{knowledge graph} that contains it. Shaping is done by inferring constraints over the data and validating the data against these constraints. Validating a graph against constraints gives important insight into the structure of the data. For instance, when all nodes of a type conform to certain constraints, it may be useful to make the corresponding properties required attributes for all future nodes, ensuring uniformity in the data. Non-conforming nodes can also reveal where information is missing. For example, if 99\% of nodes of a given type conform to some constraints, it may be worthwhile to investigate the remaining 1\% to see whether they are missing necessary information or are otherwise corrupt. \\
Our task was to implement a framework for shaping \emph{knowledge graphs}. This consists of three major steps, namely fetching \emph{knowledge graphs}, inferring constraints, and validating \emph{knowledge graphs} against them, for which we provide a user interface. These steps are described in Section~\ref{section:approach}. We then evaluated our framework with respect to runtime and correctness, as outlined in Section~\ref{section:evaluation}. The results of our project are shown in Section~\ref{section:results}, and a conclusion of our work is provided in Section~\ref{section:conclusion}.
\section{Related Work}
\section{Approach} \label{section:approach}
%You may add any subsections you deem appropriate for your specific project. Some examples for your reference: Technology stack, Training strategy, Data, Experiments, etc.
To build a framework that evaluates a \emph{knowledge graph} in an automated way, we divided our project into subtasks.
First, we fetch a subgraph of a \emph{knowledge graph} from the \emph{CommonCrawl} datasets, as explained in Section~\ref{fetchingKG}.
Next, our framework infers constraints over this dataset (see Section~\ref{generatingconstraints}).
In the last step, these constraints are validated automatically, see Section~\ref{validatingconstraints}.
The structure of the framework is shown in Fig.~\ref{fig:uml}. \todo{update figure}
The user interacts with the framework through the front-end, see Section~\ref{frontend}.
\begin{figure}[ht]
\centering
\todo{add reference to our github repo!}
\subsection{Technology Stack}
\Jamie{In general, it would be nice to have an introductory sentence at the beginning of each section}
In this section, we summarise the main technologies that we used to realise this project.
The framework was implemented in \emph{Java}, with \emph{Maven} as the project management tool. We also used \emph{Jena}, which offers an \emph{RDF} API as well as support for \emph{SPARQL} queries and the \emph{ShEx} language. The front-end was implemented using \emph{Vue3}\cite{Vue} as the front-end framework and \emph{PrimeVue} as a library for the various UI components. The application is deployed on a single virtual machine; the front-end is served by an \emph{Apache} server and accesses the back-end via an internal \emph{REST} API.
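To give an impression of how \emph{Jena} is used throughout our framework, the following minimal sketch loads an \emph{RDF} file into an in-memory \emph{Model} and runs a \emph{SPARQL} query against it. The file name and the query are illustrative placeholders, not the actual ones used in our code.
\begin{verbatim}
import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;

public class JenaSketch {
    public static void main(String[] args) {
        // Load an RDF file (placeholder name) into an in-memory Model.
        Model model = RDFDataMgr.loadModel("data.ttl");
        // Count the triples per predicate.
        String q = "SELECT ?p (COUNT(*) AS ?n) "
                 + "WHERE { ?s ?p ?o } GROUP BY ?p";
        try (QueryExecution qe = QueryExecutionFactory.create(q, model)) {
            ResultSetFormatter.out(qe.execSelect());
        }
    }
}
\end{verbatim}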
\subsection{Fetching Knowledge Graphs}
Since a \emph{knowledge graph} can be very large and contain many nodes, we concentrate on a subgraph of it. This is what fetching a \emph{knowledge graph} means: we extract the relevant subgraph from the full \emph{knowledge graph}.
We take our initial \emph{knowledge graphs} from the \emph{CommonCrawl} datasets and import them as static files.
\todo{Explain query magic that fetches the graph here}\\
\Valerian{Missing: Subsection about generating subgraph (with limit), starting from a certain type of node.} \Kristina{Wouldn't this be part of Generating constraints? I feel like that doesn't really fit into technology stack}
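While the exact fetch query is still to be documented here, the following sketch shows the general shape such a fetch could take: a \emph{SPARQL} \texttt{CONSTRUCT} query, executed via \emph{Jena}, selects a limited set of start nodes of a given type and extracts their outgoing triples. The type IRI and the \texttt{LIMIT} value are illustrative assumptions, not our actual parameters.
\begin{verbatim}
import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;

public class FetchSketch {
    // Extract the subgraph around a limited set of start nodes.
    public static Model fetchSubgraph(Model full) {
        String q =
            "PREFIX schema: <http://schema.org/> "
          + "CONSTRUCT { ?s ?p ?o } WHERE { "
          + "  { SELECT ?s WHERE { ?s a schema:Service } LIMIT 50 } "
          + "  ?s ?p ?o }";
        try (QueryExecution qe = QueryExecutionFactory.create(q, full)) {
            return qe.execConstruct();
        }
    }
}
\end{verbatim}
Note that this sketch only follows outgoing edges one step from the start nodes; the actual fetch may traverse connected nodes more deeply.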
\subsection{Generating Constraints}
\label{generatingconstraints}
To shape a \emph{knowledge graph}, we need to infer constraints on the graph fetched in the previous section.
For the generation of constraints, we used Werkmeister's fork\cite{werkmeister2018}\cite{werkmeister_rdf2graph_git} of the tool \emph{RDF2Graph}\cite{original_rdf2graph_git} and adapted it for our purposes. As input, \emph{RDF2Graph} takes a \emph{knowledge graph} from \emph{CommonCrawl} \Jamie{Why don't we have the fetched KG as input?}. The properties of the graph are read out with several \emph{SPARQL} queries and saved in a new \emph{RDF} graph. As output, we receive a graph containing constraints for the initial input data. \emph{RDF2Graph} also offers a tool to export these constraints in \emph{ShEx} syntax, which we use.
\missingfigure{add query to graph (chosen by Philipp), e.g. multiplicity of argument etc.}\\
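As a hedged illustration of the kind of \emph{SPARQL} query involved, the following snippet computes, for every property, the minimum and maximum number of occurrences per subject, from which a cardinality constraint such as \texttt{MIN 1 MAX 1} could be derived. It is our own illustrative reconstruction, not a query taken from \emph{RDF2Graph}.
\begin{verbatim}
import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;

public class CardinalitySketch {
    // Per-property min/max occurrence counts over all subjects.
    // Subjects lacking a property entirely are not counted here,
    // so detecting an optional property (MIN 0) needs an extra check.
    public static void printCardinalities(Model model) {
        String q =
            "SELECT ?p (MIN(?n) AS ?min) (MAX(?n) AS ?max) WHERE { "
          + "  { SELECT ?s ?p (COUNT(?o) AS ?n) "
          + "    WHERE { ?s ?p ?o } GROUP BY ?s ?p } "
          + "} GROUP BY ?p";
        try (QueryExecution qe = QueryExecutionFactory.create(q, model)) {
            ResultSetFormatter.out(qe.execSelect());
        }
    }
}
\end{verbatim}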
%\subsubsection{Integrating RDF2Graph with our framework}
We implemented the following steps in order to integrate \emph{RDF2Graph} into our project. We added \emph{RDF2Graph} to our framework so that the two could be compiled together, and in the process updated it to be compatible with our versions of Java and Jena. In addition, we changed some of the initial parameters of \emph{RDF2Graph}, since it was originally intended as a stand-alone application. As our software operates on \emph{Models} \todo{Add explanation for Model? Maybe in glossary?}, we changed the input of \emph{RDF2Graph} to a \emph{Model}. In our application, \emph{RDF2Graph} does not use any storage other than the \emph{Model} data structure. Previously, such a \emph{Model} was created by \emph{RDF2Graph} itself; now it is provided by our framework. We did this to have full control over the files handled by \emph{RDF2Graph}. \emph{RDF2Graph} supports multithreaded execution, which requires a thread pool. This thread pool was originally created by \emph{RDF2Graph}; in our framework, it is provided by our application. Furthermore, resources used by \emph{RDF2Graph} had to be provided in a different way so that they remain available when running in a server environment. We also changed some of the queries. \emph{RDF2Graph} supports multiple output graphs; however, this did not work \todo{should we explain this in more detail?}. As we only work on one \emph{Model} at a time, we use a single output graph.
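A minimal sketch of this inversion of control looks as follows; it uses a stand-in interface for the adapted \emph{RDF2Graph} entry point, since the real signature may differ. The point is that our framework owns the input \emph{Model}, the single output \emph{Model}, and the thread pool.
\begin{verbatim}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class IntegrationSketch {
    // Stand-in for the adapted RDF2Graph entry point
    // (hypothetical signature).
    public interface ConstraintExtractor {
        void run(Model input, Model output, ExecutorService pool);
    }

    public static Model inferConstraints(Model input,
                                         ConstraintExtractor rdf2graph) {
        // The framework, not RDF2Graph, creates the thread pool...
        ExecutorService pool = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors());
        // ...and the single output Model.
        Model output = ModelFactory.createDefaultModel();
        try {
            rdf2graph.run(input, output, pool);
        } finally {
            pool.shutdown();
        }
        return output;
    }
}
\end{verbatim}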
\todo{Add explanation of limit to this section?}
\label{fig:sequence}
\end{figure}
\section{Evaluation} \label{section:evaluation}
\todo{add benchmarks here}
\todo{ensure that images are well-placed}
\todo{check what Elwin said concerning Evaluation on meeting 20.01.2022}
\subsection{Methodology}
\Philipp{I labeled all the included graphics with h!; when we have finished the report we might want to arrange it so that one image is at the top and one at the bottom if two images end up on the same page, for example}
For taking measurements, the application was started locally on our hardware.
This was done to minimise side-effects of other applications running on the virtual machine where the live instance is deployed. Additionally, the JVM was set up to use up to 16 GB of main memory for its heap. This allows parallel queries without the runtime of the executions being compromised by extensive swap usage.
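In practice, this setup corresponds to starting the application with an explicit maximum heap size, for example as below (the jar name is a placeholder):
\begin{verbatim}
java -Xmx16g -jar framework.jar
\end{verbatim}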
\subsection{Runtime}
Figures~\ref{fig:exec_times_per_limit} and \ref{fig:exec_times_per_triples} show the measurements we obtained by changing the \emph{LIMIT} input parameter. This parameter limits the size of the start-node subset, from which connected nodes are queried. All the measurements are shown in Tables~\ref{table:runtimes_wo_limit} and \ref{table:runtimes_w_limit}.
\begin{figure}[h!]
\begin{subfigure}{\textwidth}
\subsubsection{ShEx Validation}
\todo{No subsection for just 2 sentences}
The generated \emph{ShEx} constraints for small subgraphs (Canal with \emph{LIMIT} 50, Service with \emph{LIMIT} 50) were cross-validated using the online tool \emph{RDFShape}\cite{RDFShape}. The validation results were the same as in our tool.
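Such a validation can also be reproduced programmatically. The following sketch assumes \emph{Jena}'s \texttt{jena-shex} module (the exact API may differ between Jena versions); the file names are placeholders.
\begin{verbatim}
import org.apache.jena.graph.Graph;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.shex.*;

public class ShexValidationSketch {
    public static void main(String[] args) {
        // Data graph and generated ShEx schema (placeholder names).
        Graph data = RDFDataMgr.loadGraph("subgraph.ttl");
        ShexSchema schema = Shex.readSchema("constraints.shex");
        ShapeMap shapeMap = Shex.readShapeMap("map.smap");
        // Validate the graph and report conformance.
        ShexReport report =
            ShexValidator.get().validate(data, schema, shapeMap);
        System.out.println(report.conforms() ? "conforms" : "violations");
    }
}
\end{verbatim}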
\begin{figure}
\label{code:blank_nodes}
\end{figure}
\section{Results} \label{section:results}
Our framework automatically infers constraints and validates the given data based on those constraints. This can be done on two different \emph{CommonCrawl} datasets. Using the front-end, the user can choose one of these datasets and a limit \todo{explain this limit in more depth, maybe in front-end?}. The user can also edit constraints.
\missingfigure{Maybe add small figure that shows workflow of project here? Something similar like we did in presentation but more professional?}
\todo{describe results of benchmark tests here}
\section{Future work}
\todo{Possible future work could be: more data sets, more possibilities for user inputs}
Our application currently handles only two different datasets. For future work, the framework could be extended to handle more and bigger datasets. At present, the size of the datasets that can be handled is limited by the RAM of the virtual machine. One possible solution would be to work only on parts of the graph at a time.
One problem we encountered when handling datasets from \emph{CommonCrawl} was their quality. Many datasets include \emph{non-Unicode} characters, which Jena replaces with \emph{Unicode} characters; this takes a lot of computing time. In addition, many files contain invalid \emph{RDF} syntax or are otherwise damaged. This means that in order to handle additional datasets, some form of preprocessing would have to be implemented, e.g. filtering for broken files and invalid syntax and repairing them before the dataset is handled by the framework.
\todo{Should we add proper SPARQL endpoints here? Might not be possible?}
In addition, more possibilities for user interaction could be offered. For instance, users could upload their own datasets and have them validated.
\section{Conclusion} \label{section:conclusion}
\todo{Which challenges did we face during the implementation? (Maybe depth of SPARQL query, outdated RDF2Graph?)}