Commit 94366804 authored by Kristina Magnussen

fixed things annotated by Elwin
This is where \emph{knowledge graphs} become important.
Working with such datasets can be greatly facilitated by defining a consistent shape for the data, based on the type of entity it represents.
This shaping is done by inferring constraints over the data and validating all nodes in the graph against these constraints. This can give important insight into the structure of the data. For instance, when all nodes of a type conform to the given constraints, it may be useful to define these as required attributes for all future nodes to ensure uniformity in the data. Non-conforming nodes may also deliver important insight into where information is missing. For example, if 99\% of nodes of a given type conform to some constraints, it may be worthwhile to investigate the remaining 1\% to see whether they are missing necessary information or are otherwise corrupt. \\
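The conformance check sketched above can be illustrated with a toy example. The following is not our framework's code, merely a minimal sketch: nodes of one type are modeled as attribute maps, and we compute the share that carries all candidate required attributes.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy illustration (not our framework's code): given nodes of one type,
// compute the share that carries all candidate required attributes.
// Nodes below the threshold are the ones worth investigating by hand.
public class ConformanceCheck {
    static double conformanceRate(List<Map<String, String>> nodes,
                                  Set<String> requiredAttributes) {
        long conforming = nodes.stream()
                .filter(n -> n.keySet().containsAll(requiredAttributes))
                .count();
        // An empty node set vacuously conforms.
        return nodes.isEmpty() ? 1.0 : (double) conforming / nodes.size();
    }

    public static void main(String[] args) {
        List<Map<String, String>> canals = List.of(
                Map.of("name", "Canal A", "url", "http://example.org/a"),
                Map.of("name", "Canal B", "url", "http://example.org/b"),
                Map.of("name", "Canal C")); // missing "url"
        System.out.println(conformanceRate(canals, Set.of("name", "url")));
    }
}
```

Here two of three nodes conform, so the third would be flagged for inspection.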
For this reason, we implemented a framework for shaping \emph{knowledge graphs}. This consisted of three major steps: fetching a subset of a \emph{knowledge graph}, inferring constraints, and validating the \emph{knowledge graph} against these constraints. We also provide a user interface for this purpose. These steps are described in Section~\ref{section:approach}. We then evaluated our framework with regard to runtime and correctness, which is outlined in Section~\ref{section:evaluation}. Results of our project are shown in Section~\ref{section:results}. A conclusion of our work is provided in Section~\ref{section:conclusion}. Finally, an outlook on future work is given in Section~\ref{section:future_work}.
\section{Related Work}
The need for automatic tools that are able to infer meta information on the structure of \emph{knowledge graphs} has already been recognized by different researchers. This stems from the fact that manual constraint inference becomes infeasible for large datasets.
One tool which can be used to automatically infer constraints over a \emph{knowledge graph} is \emph{RDF2Graph} \cite{vanDam2015,original_rdf2graph_git}. Our framework makes use of an adapted version of this tool by Werkmeister \cite{werkmeister2018,werkmeister_rdf2graph_git}, which in a first phase uses several \emph{SPARQL} queries to gather the structural information of each node in the underlying graph. Subsequently, the queried information is aggregated and simplified by merging the constraint information of classes belonging to the same type and predicates. While Van Dam et al. used \emph{RDF2Graph} on the \emph{UniProt RDF} resource, Werkmeister made adaptations to also infer Wikidata constraints.
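The merging step described above can be pictured as folding per-instance observations into a single range per class and predicate. The following is a simplified sketch with assumed semantics, not \emph{RDF2Graph}'s actual code:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch (assumed semantics, not RDF2Graph's actual code):
// per predicate, the cardinality counts observed on individual instances
// of one class are merged into a single min/max range.
public class CardinalityMerge {
    record Range(int min, int max) {
        Range merge(int observed) {
            return new Range(Math.min(min, observed), Math.max(max, observed));
        }
    }

    static Map<String, Range> merge(Map<String, int[]> observations) {
        Map<String, Range> merged = new HashMap<>();
        observations.forEach((predicate, counts) -> {
            for (int c : counts) {
                // First observation seeds the range; later ones widen it.
                merged.merge(predicate, new Range(c, c),
                        (old, ignored) -> old.merge(c));
            }
        });
        return merged;
    }
}
```

A predicate observed with counts 0, 1 and 2 on three instances would thus end up with the merged range [0, 2], i.e. an optional, possibly repeated property.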
Fernández-Álvarez et al. have taken a different approach with their tool \emph{Shexer} \cite{Shexer}. In contrast to the aforementioned tool, they avoid querying the whole underlying graph by using an iterative approach, determining whether the currently iterated (sub-)set of triples is relevant for the constraint generation process. Given a target shape, the preselected triples are used to decorate each target instance with its constraints.
Another constraint generator has been introduced by Spahiu et al. with \emph{ABSTAT} \cite{Spahiu2016ABSTATOL}. This tool uses an approach similar to that of \emph{RDF2Graph}, collecting structural information using \emph{SPARQL} queries and summarizing the resulting constraints afterwards.
\section{Approach} \label{section:approach}
The repository also includes a README file describing how to set up and install the framework.
\subsection{Technology Stack}
In this section, we briefly enumerate the main technologies that we used in this project.
We used \emph{Maven} as a project management tool. The framework was implemented in \emph{Java}. Here, we also used the \emph{Java} framework \emph{Jena} \cite{Jena}, which offers an \emph{RDF} API as well as support for \emph{SPARQL} queries and the \emph{ShEx} language. The front-end was implemented using \emph{Vue3} \cite{Vue} as a front-end framework and \emph{PrimeVue} as a library for the different UI components. For the deployment of our application, we used a single virtual machine. Access to the front-end is provided via a single \emph{Apache} server. The front-end accesses the back-end via an internal \emph{REST API}.
\FloatBarrier
Figure~\ref{fig:query_construct_subgraph} shows the query we used to create a subgraph.
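A query of the kind shown in Figure~\ref{fig:query_construct_subgraph} might look roughly like the following. The class IRI and the \emph{LIMIT} value are placeholders, not the exact query from the figure:

```sparql
# Hedged sketch, not the exact query from the figure: select a bounded
# number of instances of one type, then copy all triples about those
# instances into a new subgraph.
CONSTRUCT { ?s ?p ?o }
WHERE {
  {
    SELECT ?s WHERE { ?s a <http://example.org/Canal> } LIMIT 50
  }
  ?s ?p ?o .
}
```

The inner \texttt{SELECT} bounds the number of subjects, while the outer pattern gathers every triple attached to them, which is what makes the result a self-contained subgraph.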
\subsection{Generating Constraints}
\label{generatingconstraints}
To shape a \emph{knowledge graph}, we need to infer constraints on the previously fetched subgraph.
For the generation of constraints, we used Werkmeister's adaptation \cite{werkmeister_rdf2graph_git} of the tool \emph{RDF2Graph} \cite{original_rdf2graph_git} and adapted it further for our purposes. As input, \emph{RDF2Graph} takes a constructed subgraph as described in Section~\ref{section:fetchingKG}.
The properties of the graph are extracted with several \emph{SPARQL} queries and saved in a new \emph{RDF} graph. As output, we receive a graph containing constraints for the initial input data. We use a tool offered by \emph{RDF2Graph} to extract the constraints in \emph{ShEx} syntax.
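For illustration, a shape extracted in \emph{ShEx} syntax might look like the following. The shape name, property names, and cardinalities are hypothetical, not taken from an actual run:

```shex
# Hypothetical sketch of extracted output (names and cardinalities assumed):
PREFIX schema: <http://schema.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

<CanalShape> {
  schema:name xsd:string ;   # required exactly once
  schema:url IRI ?           # optional (0 or 1 occurrences)
}
```

Each node validated against this shape must carry exactly one \texttt{schema:name} literal, while \texttt{schema:url} may be absent.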
%\subsubsection{Integrating RDF2Graph with our framework}
Secondly, optional properties are not always inferred and are therefore missing from the generated constraints.
\subsubsection{ShEx Validation}
The generated \emph{ShEx} constraints for small subgraphs (Canal with \emph{LIMIT} 50, Service with \emph{LIMIT} 50) were cross-validated using the online tool \emph{RDFShape} \cite{RDFShape}. The validation results matched those of our tool.
Our framework automatically infers constraints and validates the given data against those constraints. This can be done on two different \emph{CommonCrawl} datasets. The user can choose one of those datasets and a limit using the front-end. In addition, the user can also edit the constraints. Our evaluation in Section~\ref{section:evaluation} showed that the validation of a subgraph works as expected. However, the generated constraints are prone to missing an optional \emph{url} attribute and would benefit from more tests and tuning. We also observed that the runtime of the tool is rather slow. Possible improvements to our framework are discussed in more depth in Section~\ref{section:future_work}. Generally, our tool works as intended, even though there is still some room for improvement.
\section{Conclusion}
\label{section:conclusion}
The validation of the generated constraints, however, works as expected and without performance issues.
In the end, we succeeded in developing a working prototype that could form the base of a more powerful, flexible tool for easily gaining insight into any \emph{knowledge graph}.
\section{Future work}
\label{section:future_work}
Our application currently only handles two different datasets. For future work, the framework could be extended to handle additional and larger datasets. Currently, the size of the datasets that can be handled is limited by the RAM of the virtual machine. One possible solution could be to work on only parts of the graph at a time.
One problem we encountered when handling datasets from \emph{CommonCrawl} was their quality. Many datasets include invalid, \emph{non-Unicode} characters, which \emph{Jena} replaces with \emph{Unicode} replacement characters; this substitution costs a lot of computing time. In addition, many files contain invalid \emph{RDF} syntax or are otherwise damaged. To handle additional datasets, some form of preprocessing would therefore have to be implemented, for example filtering out broken files and fixing invalid syntax before the dataset enters the framework.
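One conceivable pre-filtering step, sketched here under the assumption that rejecting malformed input early is acceptable, is to check encodings up front instead of letting the parser substitute replacement characters during parsing:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Sketch of a possible preprocessing step (not part of the framework):
// detect malformed UTF-8 before parsing, so broken files can be filtered
// out or repaired instead of being silently patched during parsing.
public class Utf8Filter {
    static boolean isWellFormedUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data)); // throws on bad bytes
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Files failing this check could be routed to a repair or quarantine step before the dataset is handed to the framework.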
In addition, more possibilities for user interaction could be added. For instance, a feature could be added that lets users upload their own dataset and have it validated.
% ------------------------------------------------------------------------
% Bibliography
series = {Synthesis Lectures on the Semantic Web: Theory and Technology}
}
@misc{Vue,
title = {{Vue.js}, Documentation},
year = 2021,
howpublished = {\url{https://v3.vuejs.org/}},
note = {Accessed: 2022-02-01}
}
@misc{Primeue,
title = {{PrimeVue}, Documentation},
year = 2021,
howpublished = {\url{https://www.primefaces.org/primevue/}},
note = {Accessed: 2022-02-01}
}
@book{validatingKG,
series = {Synthesis Lectures on the Semantic Web: Theory and Technology}
}
@misc{Jena,
title = {{Apache Jena}, Documentation},
year = 2021,
howpublished = {\url{https://jena.apache.org/index.html}},
note = {Accessed: 2022-02-01}
}
@misc{werkmeister_rdf2graph_git,
author = {WerkMeister, Lucas},