\section{Introduction}
\label{introduction}
With the massive and ever-growing amount of data available on the internet, a convenient, flexible, and efficient way of storing data becomes increasingly important. In addition, concrete objects and abstract ideas, as well as connections and relationships between entities, have to be represented.
This is where \emph{knowledge graphs} become important. Knowledge graphs structure data in the form of a graph that can contain types, entities, literals, and relationships. A knowledge graph allows for flexible data structures and can make it easier to find and process relevant data. However, the datasets stored in such a graph are often inconsistent and prone to containing errors.
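To make this concrete, consider a minimal, hypothetical example of such a graph in \emph{Turtle} syntax (the prefixes and entities are purely illustrative):
\begin{verbatim}
@prefix schema: <http://schema.org/> .
@prefix ex:     <http://example.org/> .

# An entity with a type, a literal value, and a
# relationship to another entity.
ex:rhine a schema:RiverBodyOfWater ;
    schema:name "Rhine" ;
    schema:containedInPlace ex:europe .
\end{verbatim}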
Working with such datasets can be greatly facilitated by defining a consistent shape for the data, based on the type of entity it represents.
This shaping is done by inferring constraints over the data and validating all nodes in the graph based on these constraints. This can give important insight into the structure of the data. For instance, when all nodes of a type conform to the given constraints, it may be useful to define these as required attributes for all future nodes to ensure uniformity in the data. Non-conforming nodes may also deliver important insight into where information is missing. For example, if 99\% of nodes of a given type conform to some constraints, it may be worthwhile to investigate the remaining 1\% to see if they are missing necessary information or are otherwise corrupt.
Our task was to implement a framework for shaping \emph{knowledge graphs}. This consisted of three major steps: fetching a subset of a \emph{knowledge graph}, inferring constraints, and validating the \emph{knowledge graph} against these constraints. We also provide a user interface for this purpose. These steps are described in Section~\ref{section:approach}. We then evaluated our framework with respect to runtime and correctness, as outlined in Section~\ref{section:evaluation}. The results of our project are shown in Section~\ref{section:results}. A future outlook is given in Section~\ref{section:future_work}. Finally, a conclusion of our work is provided in Section~\ref{section:conclusion}.
\section{Related Work}
...
The need for automatic tools that are able to infer meta information on the structure of knowledge graphs has already been recognized by different researchers. This stems from the fact that manual constraint inference becomes infeasible for large datasets.
One tool which can be used to automatically infer constraints over a \emph{knowledge graph} is \emph{RDF2Graph}~\cite{vanDam2015,original_rdf2graph_git}. Our framework makes use of an adapted version of this tool by Werkmeister~\cite{werkmeister2018,werkmeister_rdf2graph_git}, which, in a first phase, uses several \emph{SPARQL} queries to gather the structural information of each node in the underlying graph. Subsequently, the queried information is aggregated and simplified by merging the constraint information of classes belonging to the same type and predicates. While Van Dam et al. used the \emph{RDF2Graph} tool on the \emph{UniProt RDF} resource, Werkmeister made adaptations to also infer Wikidata constraints.
Fernández-Álvarez et al. have taken a different approach with their tool \emph{Shexer}~\cite{Shexer}. In contrast to the aforementioned tool, they avoid querying the whole underlying graph by using an iterative approach, determining whether the currently iterated (sub)set of triples is relevant for the constraint generation process. Given a target shape, the preselected triples are used to decorate each target instance with its constraints.
Another constraint generator has been introduced by Spahiu et al. with \emph{ABSTAT}~\cite{Spahiu2016ABSTATOL}. This tool uses an approach similar to that of \emph{RDF2Graph}, collecting structural information using \emph{SPARQL} queries and summarizing those constraints afterwards.
\section{Approach}\label{section:approach}
...
\subsection{Technology Stack}
In this section, we briefly enumerate the main technologies that we used in this project.
The framework was implemented in \emph{Java}, using \emph{Maven} as a project management tool. We also used the \emph{Java} framework \emph{Jena}~\cite{Jena}, which offers an \emph{RDF} API as well as support for \emph{SPARQL} queries and the \emph{ShEx} language. The front-end was implemented using \emph{Vue3}~\cite{Vue} as a front-end framework and \emph{PrimeVue} as a library for the different UI components. For the deployment of our application, we used a single virtual machine. Access to the front-end is provided by a single \emph{Apache} server. The front-end accesses the back-end via an internal \emph{REST} API.
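As a brief, hypothetical illustration of how \emph{Jena} is typically used (the file name and query are placeholders, not taken from our code base), the following sketch loads an \emph{RDF} file into a \emph{Model} and runs a \emph{SPARQL} query against it:
\begin{verbatim}
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSet;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;

public class JenaSketch {
    public static void main(String[] args) {
        // Load an RDF file into an in-memory Model
        // (file name is illustrative).
        Model model = RDFDataMgr.loadModel("data.ttl");

        // Run a simple SPARQL SELECT query against the Model.
        String query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10";
        try (QueryExecution qexec =
                QueryExecutionFactory.create(query, model)) {
            ResultSet results = qexec.execSelect();
            ResultSetFormatter.out(System.out, results);
        }
    }
}
\end{verbatim}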
\FloatBarrier
\subsection{Constructing a Subgraph}
\label{section:fetchingKG}
Because \emph{knowledge graphs} can be very large and contain many nodes, we concentrated on querying smaller subgraphs and only working on those. With this method, the relevant subgraph gets extracted from a \emph{knowledge graph} and can be worked upon in isolation.
We take our initial \emph{knowledge graphs} from the \emph{CommonCrawl} datasets and import them as static files.
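The general pattern for extracting such a subgraph is a \emph{SPARQL CONSTRUCT} query that copies all triples of nodes of a given type into a new graph. The following is an illustrative sketch only (type and limit are placeholders; the exact query we used is shown in Figure~\ref{fig:query_construct_subgraph}):
\begin{verbatim}
PREFIX schema: <http://schema.org/>

CONSTRUCT { ?node ?p ?o }
WHERE {
  ?node a schema:Canal ;  # select nodes of the desired type
        ?p ?o .           # and copy all of their outgoing triples
}
LIMIT 200
\end{verbatim}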
...
\subsection{Generating Constraints}
\label{generatingconstraints}
To shape a \emph{knowledge graph}, we need to infer constraints on the previously fetched subgraph.
For the generation of constraints, we used the adaptation of the tool \emph{RDF2Graph}~\cite{original_rdf2graph_git} by Werkmeister~\cite{werkmeister2018,werkmeister_rdf2graph_git} and adapted it further for our purposes. As input, \emph{RDF2Graph} takes a constructed subgraph as described in Section~\ref{section:fetchingKG}.
The properties of the graph are read out with several \emph{SPARQL} queries. These properties are saved in a new \emph{RDF} graph. As output, we receive a graph containing constraints for the initial input data. We use a tool offered by \emph{RDF2Graph} to extract the constraints in \emph{ShEx} syntax.
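For illustration, an inferred shape could look like the following \emph{ShEx} sketch (hypothetical; the shape name and properties are placeholders, not actual output of our tool):
\begin{verbatim}
PREFIX schema: <http://schema.org/>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>

<CanalShape> {
  schema:name xsd:string ;  # required literal-valued property
  schema:url  IRI ?         # optional IRI-valued property
}
\end{verbatim}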
%\subsubsection{Integrating RDF2Graph with our framework}
We implemented the following steps in order to integrate \emph{RDF2Graph} into our project.
We added \emph{RDF2Graph} to our framework so that they could be compiled together, and in the process updated it as much as was needed to be compatible with our versions of Java and Jena. In addition, we changed some of the initial parameters of \emph{RDF2Graph}, since it was originally intended as a stand-alone application.
As we are handling \emph{Models} in our software, we changed the input of \emph{RDF2Graph} to a \emph{Model}. In our application, \emph{RDF2Graph} does not use any other storage apart from the \emph{Model} data structure. Previously, such a \emph{Model} needed to be created by \emph{RDF2Graph}; now it is provided by our framework. We did this so we could have full control over the files handled by \emph{RDF2Graph}. \emph{RDF2Graph} allows for multithreaded execution, which requires a thread pool. This thread pool was initially created by \emph{RDF2Graph}; in our framework, it is provided by our application. In addition, resources used by \emph{RDF2Graph} had to be provided in a different way so that they are still available when running in a server environment. We also changed some of the queries. \emph{RDF2Graph} supports multiple output graphs; however, this functionality did not work. As we only work on one \emph{Model} at a time, we only use one output graph.
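Conceptually, the integration boundary now looks like the following sketch (class and method names are simplified placeholders, not the actual \emph{RDF2Graph} API):
\begin{verbatim}
import java.util.concurrent.ExecutorService;
import org.apache.jena.rdf.model.Model;

/**
 * Sketch of the integration boundary (names are placeholders).
 * The subgraph Model and the thread pool are provided by our
 * framework; RDF2Graph no longer creates files or threads itself.
 */
public interface ConstraintGenerator {
    // Returns a single output Model containing the
    // inferred constraints.
    Model generateConstraints(Model subgraph,
                              ExecutorService threadPool);
}
\end{verbatim}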
...
\label{validatingconstraints}
Given an \emph{RDF} graph and a set of constraints, the validation consists of verifying that every node in the graph fulfils the requirements given in the constraints. A graph may contain nodes with different types. Each of those types must conform to its corresponding definition outlined in the constraints. The result of the validation is a multidimensional list containing every node's id, a boolean flag, and an optional 'reason' entry. The boolean flag indicates whether or not the node conforms to its type's constraints. In case of nonconformity, a reason is given.
For the implementation of this process, an \emph{RDF} subgraph and \emph{ShEx} constraints are required as input. From these, we generate a \emph{shape map}, which contains all of the types that need to be validated. For the actual validation, the \emph{ShExValidator} provided by the \emph{Jena} library was used. The validator requires a set of constraints defined in valid \emph{ShEx} syntax and a \emph{shape map}. We query the subgraph for its types of nodes (see Figure~\ref{fig:shape_map_query}) and construct the \emph{shape map} from that. Figure~\ref{fig:shape_map} shows an example.
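A minimal sketch of this validation step, assuming the \emph{ShEx} API of recent \emph{Jena} versions (class and method names may differ between versions; file names are illustrative):
\begin{verbatim}
import org.apache.jena.graph.Graph;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.shex.ShapeMap;
import org.apache.jena.shex.Shex;
import org.apache.jena.shex.ShexLib;
import org.apache.jena.shex.ShexReport;
import org.apache.jena.shex.ShexSchema;
import org.apache.jena.shex.ShexValidator;

public class ValidationSketch {
    public static void main(String[] args) {
        // Load the subgraph to validate (file names illustrative).
        Graph dataGraph = RDFDataMgr.loadGraph("subgraph.ttl");
        // Parse the generated ShEx constraints and the shape map.
        ShexSchema schema = Shex.readSchema("constraints.shex");
        ShapeMap shapeMap = Shex.readShapeMap("shape-map.smap");
        // Validate the nodes named in the shape map
        // against their shapes.
        ShexReport report =
            ShexValidator.get().validate(dataGraph, schema, shapeMap);
        ShexLib.printReport(report);
    }
}
\end{verbatim}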
\begin{figure}
\centering
...
\label{fig:shape_map}
\end{figure}
The class \emph{ShexValidationRecord} stores the result of the validation for each node of the graph. Additionally, the percentage of nodes that conform to their constraints is calculated and stored.
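A simplified reconstruction of such a record could look as follows (a sketch, not the actual class):
\begin{verbatim}
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Simplified reconstruction of the validation record (a sketch). */
public class ShexValidationRecord {
    /** Per-node result: conformance flag and an optional reason. */
    public record NodeResult(boolean conforms,
                             Optional<String> reason) {}

    private final Map<String, NodeResult> results =
        new LinkedHashMap<>();

    public void add(String nodeId, boolean conforms,
                    String reasonOrNull) {
        results.put(nodeId,
            new NodeResult(conforms, Optional.ofNullable(reasonOrNull)));
    }

    /** Percentage of nodes that conform to their constraints. */
    public double conformancePercentage() {
        if (results.isEmpty()) {
            return 100.0;
        }
        long conforming = results.values().stream()
            .filter(NodeResult::conforms).count();
        return 100.0 * conforming / results.size();
    }
}
\end{verbatim}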
\subsection{Front-end}
\label{frontend}
...
\end{figure}
\section{Evaluation}\label{section:evaluation}
In this section, we evaluate our tool. We first explain the methodology in Section~\ref{section:methodology}. In Section~\ref{section:runtime}, we measure the runtime of our tool with different input parameters. Furthermore, we test the correctness of the generated \emph{ShEx} constraints and cross-validate them in Section~\ref{section:correctness}.
\subsection{Methodology}
\label{section:methodology}
For taking measurements, the application was started locally on our hardware. This was done to minimise side-effects of other applications running on the virtual machine where the live instance is deployed. We used a machine with a \emph{Ryzen 9 3900X} CPU (12 cores at 3.8 GHz), DDR4 RAM, and an SSD.
Additionally, the JVM was set up to use up to 16 GB of main memory for its heap. This allows running parallel queries without extensive swap usage compromising the runtime of the executions.
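This corresponds to starting the JVM with an explicit heap limit, for example (the jar name is illustrative):
\begin{verbatim}
java -Xmx16g -jar shaping-framework.jar
\end{verbatim}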
\subsection{Runtime}
\label{section:runtime}
...
The results shown in Figure~\ref{fig:exec_times_per_limit} were to be expected. First of all, the runtime of constructing the desired subset of the graph is considerably larger than the time needed to create the \emph{ShEx} constraints, or to validate the constraints on the graph.
Secondly, the runtime of constructing the subgraph scales with the \emph{LIMIT}. This becomes especially evident in Figures~\ref{fig:limit_canal} and \ref{fig:limit_service}.
To understand the behaviour shown in Figure~\ref{fig:limit_rbow}, we want to look at Figure~\ref{fig:exec_times_per_triples}, which shows the same runtimes, but grouped by the number of triples in the subgraph on which the constraints are created. As opposed to Figures~\ref{fig:triple_canal} and \ref{fig:triple_service}, the maximum number of triples (shown on the x-axis in Figure~\ref{fig:triple_rbow}) is 1769.
This is also the number of triples contained in the subgraph that we get without providing any limit. Therefore, providing a limit larger than 200 will not enrich the constructed graph, keeping the runtime almost constant with regard to the \emph{LIMIT} parameter.
\begin{figure}[h!]
...
\subsection{Correctness}
\label{section:correctness}
\subsubsection{ShEx Generation}
We considered \emph{Shexer} (see Section~\ref{section:related_work}) a good fit for cross-validating our \emph{ShEx} generation. However, due to our limited knowledge of operating this tool, we did not manage to generate proper constraints for our \emph{RiverBodyOfWater} dataset. Our attempt at using this tool is shown in Figure~\ref{code:shexer}; it generated only the trivial, non-restrictive constraints shown in Figure~\ref{fig:shexer_output}.
Therefore, we checked the generated constraints manually for small subgraphs (see Figures~\ref{fig:shex_canal_50}, \ref{fig:shex_rbow_50} and \ref{fig:shex_service_50}) and identified two issues with our tool.
...
\subsubsection{ShEx Validation}
The generated \emph{ShEx} constraints for small subgraphs (Canal with \emph{LIMIT} 50, Service with \emph{LIMIT} 50) were cross-validated using the online tool \emph{RDFShape}~\cite{RDFShape}. The validation results matched those of our tool.
...
\end{figure}
\section{Results}\label{section:results}
Our framework automatically infers constraints and validates the given data based on those constraints. This can be done on two different \emph{CommonCrawl} datasets. The user can choose one of those datasets and a limit using the front-end. In addition, the user can edit the constraints. Our evaluation in Section~\ref{section:evaluation} showed that the validation of a subgraph works as expected. However, the generated constraints are prone to missing an optional \emph{url} attribute and would benefit from more tests and tuning. In addition, we observed that the runtime of the tool is rather high. Possible improvements to our framework are discussed in more depth in Section~\ref{section:future_work}.