...
\section{Introduction}
\label{introduction}
With the massive amount of data available on the internet, which is growing every day, a convenient, flexible, and efficient way of storing data becomes more and more important. In addition, concrete objects and abstract ideas, as well as the connections and relationships between entities, have to be represented.
This is where \emph{knowledge graphs} become important. A knowledge graph structures data in the form of a graph that can contain types, entities, literals, and relationships. It allows for flexible data structures and can make it easier to find and process relevant data. However, the datasets stored in such a graph are often inconsistent and prone to containing errors.
Working with such datasets can be greatly facilitated by defining a consistent shape for the data, based on the type of entity it represents.
This shaping is done by inferring constraints over the data and validating all nodes in the graph based on these constraints. This can give important insight into the structure of the data. For instance, when all nodes of a type conform to the given constraints, it may be useful to define these as required attributes for all future nodes to ensure uniformity in the data. Non-conforming nodes may also deliver important insight into where information is missing. For example, if 99\% of nodes of a given type conform to some constraints, it may be worthwhile to investigate the remaining 1\% to see if they are missing necessary information or are otherwise corrupt. \\
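Such inferred constraints can be expressed in \emph{ShEx}. As a minimal, purely illustrative sketch (the shape and property names below are hypothetical, not taken from our generated output), a shape requiring exactly one name and allowing at most one \emph{url} could look like:

```shex
PREFIX schema: <http://schema.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Hypothetical shape: every conforming node needs exactly one schema:name
# (the default cardinality) and may have at most one schema:url ("?").
<CanalShape> {
  schema:name xsd:string ;   # required: exactly one string-valued name
  schema:url  IRI ?          # optional: zero or one IRI-valued url
}
```

A node with two names, or with a literal where an IRI is expected, would then be reported as non-conforming during validation.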
...
\label{section:related_work}
The need for automatic tools that are able to infer meta information on the structure of \emph{knowledge graphs} has already been recognized by different researchers. This stems from the fact that manual constraint inference becomes infeasible for large datasets.
One tool which can be used to automatically infer constraints over a \emph{knowledge graph} is \emph{RDF2Graph}~\cite{vanDam2015,original_rdf2graph_git}. Our framework uses an adapted version of this tool by Werkmeister~\cite{werkmeister2018,werkmeister_rdf2graph_git}, which in a first phase runs several \emph{SPARQL} queries to gather the structural information of each node in the underlying graph. Subsequently, the queried information is aggregated and simplified by merging the constraint information of classes that share the same type and predicates. While Van Dam et al. used \emph{RDF2Graph} on the \emph{UniProt RDF} resource, Werkmeister made adaptations to also infer Wikidata constraints.
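The kind of per-type structural information that such queries collect can be sketched as follows; this is an illustrative aggregation query, not the actual query used by \emph{RDF2Graph}:

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# For every class, list which predicates its instances use and how often --
# the raw structural information that is later merged into constraints.
SELECT ?class ?predicate (COUNT(*) AS ?uses)
WHERE {
  ?node rdf:type ?class ;
        ?predicate ?value .
}
GROUP BY ?class ?predicate
ORDER BY ?class DESC(?uses)
```

From such counts one can, for example, derive whether a predicate occurs on all instances of a class (a required property) or only on some (an optional one).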
...
\caption{The frontend, showing the selection of dataset, RDFType, and \emph{LIMIT} of starting nodes.}
\label{fig:frontend}
\end{figure}
...
Firstly, if the dataset consists of only stand-alone blank nodes, as seen in Figure~\ref{code:blank_nodes}, then \emph{RDF2Graph} does not infer any \emph{ShEx} constraints. This was the case for the subgraph generated using RiverBodyOfWater with a \emph{LIMIT} of 50; the resulting \emph{ShEx} can be seen in Figure~\ref{fig:shex_rbow_50}.
Secondly, optional properties are not always inferred and are therefore missing from the generated \emph{ShEx} constraints. This also happens for unlimited subgraphs (see Figures~\ref{fig:shex_canal_wo_limit}, \ref{fig:shex_rbow_wo_limit} and \ref{fig:shex_service_wo_limit}), with the exception of the RiverBodyOfWater RDFType, where the constraints appear to be complete. However, due to the size of the graph, manually checking for correctness is infeasible. We did not observe a correlation between missing constraint properties and the shape of the graph.
\subsubsection{ShEx Validation}
The generated \emph{ShEx} constraints for small subgraphs (Canal with \emph{LIMIT} 50, Service with \emph{LIMIT} 50) were cross-validated using the online tool \emph{RDFShape}~\cite{RDFShape}. The validation results were the same as in our tool.
...
\end{figure}
\section{Results}\label{section:results}
Our framework automatically infers constraints and validates the given data based on those constraints. This can be done on two different \emph{CommonCrawl} datasets. The user can choose one of those datasets and a limit using the frontend, and can also edit the constraints. Our evaluation in Section~\ref{section:evaluation} showed that the validation of a subgraph works as expected. However, the generated constraints are prone to missing an optional \emph{url} attribute and would benefit from more tests and tuning. In addition, the runtime of the tool is rather slow. Possible improvements to our framework are discussed in more depth in Section~\ref{section:future_work}. Generally, our tool works as intended, even though there is still some room for improvement.
\section{Future work}
...
\section{Conclusion}
\label{section:conclusion}
Overall, we created a functional interface that allows a user to view and edit automatically generated constraints for a given graph.
A highlight of our tool is certainly its ease of use: the user is simply presented with a drop-down list of \emph{RDF} types that they can evaluate.
Although the selection is small at the moment, it could trivially be expanded to allow a greater selection of \emph{RDF} types from the \emph{CommonCrawl} dataset.
One of the persistent flaws of the tool remains its poor performance on larger graphs; however, to a certain extent, this is inevitable when working with such large amounts of data.
Such a complex \emph{SPARQL} query, which fetches an ``infinite depth'' of related objects, is bound to have a relatively slow runtime.
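The shape of such an unbounded query can be sketched with a recursive property path; this is an illustrative example with an assumed type (schema:Canal), not our exact implementation:

```sparql
PREFIX schema: <http://schema.org/>

# Starting from nodes of one RDFType, follow any predicate to any depth --
# (<>|!<>)* matches an arbitrary-length path over any property -- and
# return every reachable triple. The unbounded traversal is what makes
# this kind of query slow on large graphs.
SELECT ?s ?p ?o
WHERE {
  ?root a schema:Canal ;
        (<>|!<>)* ?s .
  ?s ?p ?o .
}
```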
In addition, although the vast majority of the constraints present on a graph are generated, further, automated testing would be required to increase confidence in the completeness and correctness of these constraints.
The validation of these constraints, though, works as expected and without performance issues.
In the end, we succeeded in developing a working prototype that could form the base of a more powerful, flexible tool for easily gaining insight into any \emph{knowledge graph}.
...
Each member of the group contributed in an enthusiastic and equal manner, leveraging their individual skills to contribute to the parts of the project where they could make the most impact. Additionally, everyone was eager to learn new skills and teach others what they knew. We are all satisfied with how much everyone contributed and how well we worked together. The following presents a brief summary of each group member's contribution to the project:
\begin{itemize}
\item Danielle's programming and organizational skills were a great asset to the team. She ensured that meetings had structure, with clear goals, responsibilities, and deadlines defined. She worked mostly on implementing the \emph{ShEx} validation and developing the backend app, leading several pair programming sessions with her team members. She also participated in the final presentation and parts of the report.
\item Jamie contributed most with her teamwork skills and adaptability. Although she did not have as much experience programming in Java and JavaScript as other team members, she readily made an effort to learn and thrived with the pair programming method we implemented, working largely on the webapp and parts of the \emph{RDF2Graph} implementation. Additionally, she was heavily involved in designing the presentations and reviewing the report.
\item Kristina contributed most to the research and planning of the project. Her research skills were heavily utilized in the initial phase of the project, which greatly helped the others when it came to choosing libraries and overcoming technical difficulties in the implementation. She worked most on the parts of the project requiring knowledge of \emph{ShEx}. Her extra research also proved useful in delivering thorough and well-thought-out presentations and drafting the report.
\item Philipp's technical skills were highly useful in the programming part of the project. He advised on the selection of the tech stack and led many pair programming sessions, readily sharing his technical knowledge with the other team members. This also came in handy when contributing to the evaluation and ``Related Work'' sections of the report.
\item Valerian, similarly to Philipp, had strong technical skills that he applied in various areas of the project. He worked on the \emph{SPARQL} parts of the programming and on creating the frontend app, often leading pair programming sessions. He also contributed to the evaluation and to writing up its results.