CN111639254A - System and method for generating SPARQL query statement in medical field - Google Patents


Info

Publication number: CN111639254A
Application number: CN202010472760.8A
Authority: CN (China)
Legal status: Pending
Prior art keywords: Chinese, natural language, query, word segmentation, template
Other languages: Chinese (zh)
Inventors: 李瑞轩, 辜希武, 胡仁, 李玉华
Current assignee: Huazhong University of Science and Technology
Original assignee: Huazhong University of Science and Technology
Application filed by Huazhong University of Science and Technology
Priority to CN202010472760.8A
Publication of CN111639254A

Classifications

    • G06F 16/9532 — Query formulation (retrieval from the web; querying by web search engines)
    • G06F 16/3329 — Natural language query formulation or dialogue systems
    • G06F 16/3344 — Query execution using natural language analysis
    • G06N 3/08 — Neural networks; learning methods


Abstract

The invention discloses a system and a method for generating SPARQL query statements in the medical field, belonging to the field of machine translation. The system comprises: a generator that takes a query template library and a knowledge base as input, extracts entities and attributes from the knowledge base, and fills them into Chinese question templates and SPARQL query templates to generate a training set; a word segmentation module that segments the Chinese questions in the training set and forwards the results to the learner, and that segments a target Chinese question and forwards the result to the interpreter; a learner that trains a neural network model on the segmented Chinese training set to obtain a trained model; and an interpreter that uses the trained neural network model to predict on the segmented target Chinese question and obtain a predicted SPARQL query. In this way, a Chinese medical-health question is converted directly into a SPARQL query statement without using complex statistical or hand-crafted models.

Description

System and method for generating SPARQL query statement in medical field
Technical Field
The invention belongs to the fields of deep learning and machine translation, and particularly relates to a system and a method for generating SPARQL query statements in the medical field.
Background
In the Internet era, the amount of information grows rapidly with each iteration of technology. For ordinary users, accurately obtaining the information they need from massive network data has become a problem. Search engines emerged to address it and are now one of the most important ways for users to obtain information from the Internet. Search engines can be summarized into two implementation modes: the traditional keyword-matching mode, and a mode that can be classified as based on semantic query graphs. Whatever the technology used, the purpose of a search engine is to collect and consolidate data on the Internet so as to provide users with search services. Search engines have held a dominant position since their advent, but as new technologies keep appearing, traditional search engines show bottlenecks and defects in some respects. For example, a search engine cannot directly understand a sentence input by the user; it can only return a group of web links ranked from high to low relevance to the keywords the user provides, so the user must search for the desired knowledge among a large number of links rather than have the search engine understand the question posed. In 1998, Tim Berners-Lee, the father of the Web, proposed the Semantic Web to handle these problems: it organizes structured data into a format that computers can recognize and reason over, thereby giving computers the ability to understand questions and screen answers for users. Semantic Web technology comprises RDF, SPARQL, JSON-LD, OWL, RDFS, RIF and other standards, among which SPARQL is the officially recommended query language for the Semantic Web.
With ever-increasing Internet information, there is already a huge amount of semantic information on the network, and millions of websites support Semantic Web technology. However, it is still very difficult for users to search directly for the information they need in such a huge amount of data; indeed, it is almost impossible for ordinary users, because they do not know the syntactic structure of Semantic Web query languages such as SPARQL.
Automatic question-answering systems have a long history of development. Early systems relied mainly on search-engine technology: first query relevant documents from a text source, then extract from the retrieved documents the answers most relevant to the question. Later came collaboration-based intelligent question-answering systems, which maintain a data set of questions and answers in the background; the system finds the stored question that best matches the user's question and returns its corresponding answer. The mainstream technology of current automatic question-answering systems is structured query over a knowledge base: after understanding the problem posed by an ordinary user, the system converts the natural-language question into a structured query statement, such as a common SQL query or, later, a SPARQL query oriented to the Semantic Web, executes it accurately against the knowledge base, and returns the result.
The first step of a knowledge question-answering system is to understand the question posed by the user. This step plays a crucial role: only if the semantic information of the question is correctly understood can the system accurately generate the corresponding query statement and return results to the user. Because natural-language expression is diverse and ambiguous, incorrect semantic understanding of a question causes the returned result to fail the user's requirements. When browsing the Internet, people sometimes want to obtain medical and health information, even to perform self-diagnosis. However, according to statistics from several related medical-health websites in China, the automatic question-answering services these websites provide are still based on traditional search-engine technology: they index related web pages by extracting the user's keywords, so users still need to spend a lot of time screening information. For the field of health care, which is closely bound to people's daily lives, domestic intelligent question-answering systems based on medical knowledge bases are not yet mature enough.
Disclosure of Invention
Aiming at the defects and improvement requirements of the prior art, the invention provides a system and a method for generating SPARQL query statements in the medical field, with the aim of directly converting a received Chinese natural-language medical-health query into a SPARQL query statement without using complex statistical or hand-crafted models.
To achieve the above object, according to a first aspect of the present invention, there is provided a system for generating a SPARQL query statement in the medical field, the system including:
the knowledge base is used for storing knowledge in the medical field, and the knowledge comprises answers of Chinese natural language question sentences;
the query template library is used for storing Chinese natural language question templates and SPARQL query templates, the two kinds of templates corresponding one to one;
the generator takes the query template base and the knowledge base as input, is used for extracting entities and attributes from the knowledge base, and respectively fills the entities and the attributes into a Chinese natural language question template and a SPARQL query template so as to generate a training data set;
the word segmentation module is used for carrying out word segmentation processing on Chinese natural language question sentences in the training data set and forwarding word segmentation results to the learner; performing word segmentation processing on an input target Chinese natural language question sentence, and forwarding a word segmentation result to an interpreter;
the learner is used for training the neural network model according to the Chinese training set after word segmentation processing to obtain a trained neural network model;
and the interpreter is used for predicting the target Chinese natural language question sentence subjected to word segmentation by utilizing the trained neural network model to obtain a predicted SPARQL query sentence.
Preferably, the query template library includes various types of query templates corresponding to various medical knowledge about diseases, where the various medical knowledge includes: causes of diseases, prevention of diseases, symptoms of diseases, examination of diseases, treatment of diseases, and care of diseases.
Preferably, the function of the generator is implemented by randomly sampling from the entity and attribute-value lists and filling the IRI of the entity and the corresponding Chinese label into the placeholders of the two templates, thereby obtaining a paired Chinese question and SPARQL query sentence.
Preferably, the neural network model is a Seq2Seq model.
Preferably, the system further comprises: a pre-training module for generating a word-vector matrix from an additional corpus and pre-training the embedding layer of the neural network model.
To achieve the above object, according to a second aspect of the present invention, there is provided a method for generating a SPARQL query statement in the medical field, the method including the steps of:
s1, constructing a knowledge base for storing knowledge in the medical field and a query template base for storing a Chinese natural language question template and a SPARQL query template, wherein the knowledge comprises answers of the Chinese natural language question, and the two templates correspond to each other one by one;
s2, extracting entities and attributes from the knowledge base, and respectively filling the entities and the attributes into a Chinese natural language question template and a SPARQL query template so as to generate a training data set;
s3, performing word segmentation processing on Chinese natural language question sentences in the training data set;
s4, training the neural network model according to the Chinese training set after word segmentation processing to obtain a trained neural network model;
and S5, performing word segmentation on the input target Chinese natural language question, and predicting the word-segmented target Chinese natural language question by using the trained neural network model to obtain a predicted SPARQL query sentence.
Preferably, the query template library includes various types of query templates corresponding to various medical knowledge about diseases, where the various medical knowledge includes: causes of diseases, prevention of diseases, symptoms of diseases, examination of diseases, treatment of diseases, and care of diseases.
Preferably, step S2 specifically comprises: randomly sampling from the entity and attribute-value lists and filling the IRI of the entity and the corresponding Chinese label into the placeholders of the two templates, thereby obtaining a paired Chinese question and SPARQL query sentence.
Preferably, the neural network model is a Seq2Seq model.
Preferably, the method further comprises, between step S3 and step S4: generating a word-vector matrix from an additional corpus and pre-training the embedding layer of the neural network model.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
(1) The method generates SPARQL statements from natural language based on deep learning, combining the semantic characteristics of Chinese with a deep-learning algorithm. The algorithm takes segmented Chinese questions and their corresponding SPARQL translations as input data and generates predicted SPARQL statements; from the input data it learns the conversion from Chinese question to SPARQL statement, the word vectors of all words, and the mapping between the word vectors of the Chinese dictionary and the SPARQL dictionary. The Chinese natural language received by the model is thus converted directly into SPARQL without complex statistical or hand-crafted models; in other words, the whole natural-language expression is converted into the final query in an end-to-end manner.
(2) A question-analysis module is constructed from the query templates and the word segmentation module. The query templates incorporate the prior knowledge in the knowledge base, this knowledge is learned by the deep-learning-based natural-language-to-SPARQL translation model, and the problems of named-entity recognition, information extraction, and logical-form conversion are thereby solved jointly.
(3) A word-vector matrix is generated from an additional corpus to pre-train the embedding layer of the neural network model. The pre-trained embedding layer then enters the deep-learning-based SPARQL translation model for formal supervised training, during which the vectors of common words are further retrained, alleviating the word-mismatch problem.
Drawings
Fig. 1 is a diagram of a system for generating SPARQL query statements in the medical field according to the present invention;
FIG. 2 is a diagram of a learner model provided in accordance with the present invention;
FIG. 3 is a diagram of an interpreter model provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in fig. 1, the present invention provides a system for generating SPARQL query statements in the medical field, the system including:
and the knowledge base is used for storing knowledge in the medical field, and the knowledge comprises answers of Chinese natural language question sentences.
And the query template library is used for storing a Chinese natural language question template and a SPARQL query template, and the two templates correspond to each other one by one.
And the generator takes the query template base and the knowledge base as input, is used for extracting entities and attributes from the knowledge base, and respectively fills the entities and the attributes into the Chinese natural language question template and the SPARQL query template so as to generate a training data set.
The word segmentation module is used for carrying out word segmentation processing on Chinese natural language question sentences in the training data set and forwarding word segmentation results to the learner; and performing word segmentation processing on the input target Chinese natural language question, and forwarding a word segmentation result to an interpreter.
And the learner is used for training the neural network model according to the Chinese training set after word segmentation processing to obtain the trained neural network model.
And the interpreter is used for predicting the target Chinese natural language question sentence subjected to word segmentation by utilizing the trained neural network model to obtain a predicted SPARQL query sentence.
The whole system mainly comprises three components: a generator, a learner, and an interpreter. The three areas in fig. 1 distinguish the workflows of these components; the knowledge base provides data to both the generator and the interpreter, and the model the learner learns from the training data is used by the interpreter for translating Chinese questions into SPARQL.
Knowledge base
In this embodiment, Protégé is selected as the ontology-editing tool. Based on the Chinese symptom library provided on the OpenKG website, plus partial data crawled from domestic health and medical websites such as a medical consultation website and the A+ Medical Encyclopedia, a medical-field knowledge base is generated.
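The knowledge base ultimately stores medical facts as subject–predicate–object triples that SPARQL patterns can match against. As a minimal sketch of this idea (the entity and property names below are illustrative, not taken from the actual ontology), an in-memory triple store and a single-pattern lookup might look like:

```python
# Minimal in-memory triple store sketch; entity/property names are illustrative.
triples = {
    ("pneumonia", "diseaseSymptom", "cough"),
    ("pneumonia", "diseaseSymptom", "fever"),
    ("pneumonia", "diseaseTreatment", "antibiotics"),
}

def query(subject, predicate):
    """Return all objects matching a (subject, predicate, ?x) pattern,
    analogous to SELECT ?x WHERE { <subject> :predicate ?x }."""
    return sorted(o for (s, p, o) in triples if s == subject and p == predicate)

print(query("pneumonia", "diseaseSymptom"))
```

A real deployment would use an RDF store queried over SPARQL; this sketch only shows the triple structure the templates below are written against.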
Query template library
The query template library contains query templates corresponding to various kinds of medical knowledge about diseases, including but not limited to: causes of a disease, prevention of a disease, symptoms of a disease, examination of a disease, treatment of a disease, and care of a disease.
A query template consists of two parts: a Chinese natural-language question template, representing input that the neural network model can understand and receive; and a SPARQL query template, serving directly as the expected output of the neural network model. Both templates contain placeholders. This embodiment defines 37 query templates, each involving one or two entities and several attributes.
Writing a query template combines the prior knowledge in the knowledge base: a natural-language query closely tied to the knowledge base and the SPARQL query corresponding to it, with the entities involved replaced by placeholders. For example, for a template querying what symptoms a disease has, "What symptoms does < > have?" and "SELECT ?x WHERE { < > :symptomsAssociatedWithDisease ?x }" serve as the contents of the Chinese question template and the SPARQL query template respectively, where < > is the entity placeholder.
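Such a template pair can be sketched as a pair of format strings sharing one placeholder (the placeholder syntax and the `:diseaseSymptom` property name are assumptions for illustration, not the patent's actual notation):

```python
# A query-template pair: a Chinese question template and its SPARQL counterpart,
# both containing the same placeholder for the disease entity.
QUESTION_TEMPLATE = "{disease}有什么症状"  # "What symptoms does {disease} have?"
SPARQL_TEMPLATE = "SELECT ?x WHERE {{ <{disease}> :diseaseSymptom ?x }}"

def fill(disease: str) -> tuple[str, str]:
    """Fill both templates with the same entity to obtain one training pair."""
    return (QUESTION_TEMPLATE.format(disease=disease),
            SPARQL_TEMPLATE.format(disease=disease))

q, s = fill("肺炎")  # 肺炎 = "pneumonia"
print(q)
print(s)
```

Filling both sides with the same entity is what keeps the question and its SPARQL translation aligned in the training pair.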
Generator
The generator is the preparatory component ahead of the deep neural network model; its main function is to connect the model with the knowledge base and help the model understand and learn from it. The generator takes the knowledge base and the query templates as input and outputs Chinese natural-language questions and SPARQL token sequences as the data set for the learner to learn from.
For each template, a certain number of entities and attribute values satisfying the corresponding SPARQL graph pattern need to be retrieved from the domain knowledge base. To speed up data generation, these values come from entity and attribute-value lists extracted and classified in advance. Random sampling is performed on the lists, and the IRIs (Internationalized Resource Identifiers) of the entities and the corresponding Chinese labels (entities/attributes) are filled into the placeholders of the two templates, yielding a paired Chinese question and SPARQL sentence. Entities of medical-domain knowledge include disease names, drug names, symptoms, and so on; attributes include treatment modality, disease-care modality, treatment cycle, and so on.
In this example, a value such as "pneumonia" is randomly selected from the list of disease entities. Filling "pneumonia" into the placeholders of the two templates yields the Chinese question "What symptoms does pneumonia have?" and the SPARQL sentence "SELECT ?x WHERE { <pneumonia> :symptomsAssociatedWithDisease ?x }".
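The sampling-and-filling loop of the generator can be sketched as follows (the entity list, template strings, and property names are illustrative stand-ins for the pre-extracted lists described above):

```python
import random

DISEASES = ["pneumonia", "influenza", "gastritis"]  # illustrative entity list
TEMPLATES = [  # (question template, SPARQL template) pairs
    ("What symptoms does {e} have?", "SELECT ?x WHERE {{ <{e}> :diseaseSymptom ?x }}"),
    ("How is {e} treated?",          "SELECT ?x WHERE {{ <{e}> :diseaseTreatment ?x }}"),
]

def generate_pairs(n: int, seed: int = 0) -> list[tuple[str, str]]:
    """Randomly sample an entity and a template, fill both sides, and
    collect n (question, SPARQL) training pairs."""
    rng = random.Random(seed)  # seeded so the data set is reproducible
    pairs = []
    for _ in range(n):
        q_tmpl, s_tmpl = rng.choice(TEMPLATES)
        e = rng.choice(DISEASES)
        pairs.append((q_tmpl.format(e=e), s_tmpl.format(e=e)))
    return pairs

for q, s in generate_pairs(3):
    print(q, "->", s)
```

In the actual system the sampled value would be the entity's IRI plus its Chinese label, and each question would then pass through the word segmentation module before training.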
Learner
The learner understands and absorbs the prior knowledge in the knowledge base, the questions related to the medical knowledge base, and the conversion rules between those questions and SPARQL sentences, and then encodes the understood result in a deep neural network model.
The learner adopts a deep neural network of the Seq2Seq type, which takes segmented Chinese questions and their corresponding SPARQL translations as input, generates predicted SPARQL sentences, and learns with a gradient-descent algorithm. The network is divided into four layers: an embedding layer, an encoding layer, a decoding layer, and an output layer.

The embedding layer converts Chinese words into word vectors; its number of neurons is the length of a Chinese word vector. The output layer is a classical Softmax layer; its number of neurons is the size of the SPARQL dictionary, each neuron corresponding to one word in that dictionary.
The encoding layer and decoding layer respectively understand the Chinese question and generate the SPARQL sentence; since they operate as mirror images of each other, they can be combined into a single hidden layer. The hidden layer receives the word-vector sequence from the embedding layer,

x_hidden = {u^(1), ..., u^(M), v^(1), ..., v^(N)},

as input, and outputs a set of N real-valued vectors of length |D_s|, where D_s is the SPARQL dictionary; each real-valued vector represents the activation values of all the words in the SPARQL dictionary at that output position.

The activation values pass through the output layer, which emits a translation sequence y = {y^(1), ..., y^(N)}; this is compared with the expected sequence ŷ = {ŷ^(1), ..., ŷ^(N)} to produce the learner's training error.

As equation (1-1) shows, the training goal of the model is to maximize the generation probability of ŷ conditioned on the input x_input:

P(ŷ^(1), ..., ŷ^(N) | x_input) = ∏_{t=1}^{N} p(ŷ^(t) | ŷ^(1), ..., ŷ^(t-1), x_input)   (1-1)
The encoding and decoding layers that make up the hidden layer are actually assembled from an encoder, a decoder, and the intermediate context vectors. The encoder L_encoder is a unidirectional LSTM (a bidirectional LSTM may be substituted); its number of layers determines the model's ability to capture complex structures and whether higher-level features can be extracted more accurately from the text window. With the attention mechanism introduced, the encoder reads in time order the first half {u^(1), ..., u^(M)} of the input word-vector sequence x_hidden and then generates M hidden states as its output instead of a single context vector, as shown in equations (1-2) and (1-3):

h^(t) = f(u^(t), h^(t-1))   (1-2)

L_encoder(u^(1), ..., u^(M)) = {h^(1), ..., h^(M)}   (1-3)

Here h^(t) is the state of the encoder's hidden layer at time t (t = 1, 2, 3, ...), computed from the previous state h^(t-1) and the word vector u^(t) at the current time; h^(0) is initialized to the all-zero state. f is a nonlinear transformation function, which in the LSTM is a composition of a series of functions, described in detail below.
To control how the hidden state h^(t) is updated, the LSTM model keeps an internal state in the hidden layer, commonly called the memory cell and denoted C^(t). The hidden layer also has three gate structures: a forget gate, an input gate, and an output gate. The three gates have different functions and cooperate to control the output of the next hidden state. At any time t, all three gates receive the current model input u^(t) and the previous hidden state h^(t-1) as input and, after passing through an activation function, produce three vectors f^(t), i^(t), o^(t) of the same length as the hidden state. The gate output vectors are given in equations (1-4), (1-5), and (1-6):

f^(t) = σ(W_f · h^(t-1) + U_f · u^(t))   (1-4)

i^(t) = σ(W_i · h^(t-1) + U_i · u^(t))   (1-5)

o^(t) = σ(W_o · h^(t-1) + U_o · u^(t))   (1-6)

The forget gate uses the forget vector f^(t) to control what proportion of the previous memory cell is retained. Here σ is the Sigmoid function; each element of the computed vector lies in [0, 1] and represents the retained proportion of the memory-cell content from the previous time step.

The input gate uses the input vector i^(t) to control the proportion contributed by the candidate value at the current time, which is computed from the current model input u^(t) and the previous hidden state h^(t-1); the input-gate vector determines, with a certain probability, how much of the candidate value is added to the memory cell, which is then updated. The candidate value and the memory-cell update are given in equations (1-7) and (1-8):

a^(t) = tanh(W_a · h^(t-1) + U_a · u^(t))   (1-7)

C^(t) = f^(t) ⊙ C^(t-1) + i^(t) ⊙ a^(t)   (1-8)

where ⊙ denotes the Hadamard product of two matrices: for two matrices or vectors of identical shape, corresponding elements are multiplied one by one, and the result is a matrix or vector of unchanged shape. Like h^(0), C^(0) is initialized to the all-zero state.

Finally, the output gate uses the output vector o^(t) to control what proportion of the memory cell is converted into the hidden state at the current time. The hidden-state update is given in equation (1-9):

h^(t) = o^(t) ⊙ tanh(C^(t))   (1-9)
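Equations (1-4) through (1-9) can be sketched directly in code. This toy step uses fixed identity weight matrices and, like the equations as written, omits bias terms; it is a minimal illustration, not the trained model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):  # W: list of rows, v: vector
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def had(a, b):  # Hadamard product ⊙: elementwise multiplication
    return [x * y for x, y in zip(a, b)]

def lstm_step(u, h_prev, C_prev, Wf, Uf, Wi, Ui, Wo, Uo, Wa, Ua):
    """One LSTM time step implementing equations (1-4)..(1-9)."""
    f = [sigmoid(x) for x in add(matvec(Wf, h_prev), matvec(Uf, u))]    # (1-4)
    i = [sigmoid(x) for x in add(matvec(Wi, h_prev), matvec(Ui, u))]    # (1-5)
    o = [sigmoid(x) for x in add(matvec(Wo, h_prev), matvec(Uo, u))]    # (1-6)
    a = [math.tanh(x) for x in add(matvec(Wa, h_prev), matvec(Ua, u))]  # (1-7)
    C = add(had(f, C_prev), had(i, a))                                  # (1-8)
    h = had(o, [math.tanh(c) for c in C])                               # (1-9)
    return h, C

# Toy 2-dimensional example with identity weight matrices and zero initial states.
I2 = [[1.0, 0.0], [0.0, 1.0]]
h, C = lstm_step([0.5, -0.5], [0.0, 0.0], [0.0, 0.0], *([I2] * 8))
print(h)
```

Running the step repeatedly over the word-vector sequence {u^(1), ..., u^(M)} yields the M encoder hidden states of equation (1-3).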
The decoder L_decoder mirrors the encoder. It receives the hidden-state sequence output by the encoder and reads in time order the second half {v^(1), ..., v^(N)} of the input word-vector sequence x_hidden, generating a set of real-valued vectors o = {o^(1), ..., o^(N)} as output, where o^(t) is determined by the decoder hidden state H^(t) at that time, as shown in equations (1-10), (1-11), and (1-12):

H^(t) = f(v^(t-1), H^(t-1), Σ_{j=1}^{M} F(h^(j), H^(t-1)) · h^(j))   (1-10)

o^(t) = σ(V · H^(t))   (1-11)

L_decoder(v^(1), ..., v^(N), h^(1), ..., h^(M)) = {o^(1), ..., o^(N)}   (1-12)

H^(t) is the state of the decoder's hidden layer at time t, computed from the hidden-state sequence output by the encoder, the previous state H^(t-1), and the word vector v^(t-1) of the expected result from the previous output step. F is a vector-normalizing comparison function responsible for computing the attention proportion allocated to each encoder hidden state. H^(0) is the all-zero vector, and v^(0) is the end-of-sentence symbol read from the natural-language question, indicating that the encoder has finished and the decoder starts. In the learning phase the decoding-layer output vector o^(t) is used only for error computation and does not participate in computing the hidden state at the next time step; the word vector v^(t) of the expected result participates instead, the goal being to adjust the parameters quickly so that the output approaches the expected result sooner.
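The attention step inside equation (1-10), allocating a normalized proportion to each encoder hidden state, can be sketched as follows. The text does not pin down the comparison function F, so dot-product scoring followed by a softmax normalization is an assumption here:

```python
import math

def attention_context(H_prev, encoder_states):
    """Score each encoder hidden state h^(j) against the previous decoder
    state, normalize the scores to proportions summing to 1, and form the
    weighted-sum context vector."""
    scores = [sum(a * b for a, b in zip(H_prev, h)) for h in encoder_states]
    exp_s = [math.exp(s) for s in scores]
    total = sum(exp_s)
    weights = [e / total for e in exp_s]  # attention proportions, sum to 1
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

w, c = attention_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
print(w)  # three proportions summing to 1; the first state scores highest
```

The resulting context vector is what the decoder combines with v^(t-1) and H^(t-1) to produce the next state.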
The output layer of the learner adopts a classical Softmax layer, and the activation function used by the nodes of the layer is a popularization of logic functions and is called a Softmax function. The output layer receives the output vector sequence o of the decoding layer(1),...,o(N)H, a real-valued vector o of length d(i)Comparing with all words in SPARQL dictionary, and converting the comparison result into length | DsReal-valued vector of (o)(i)) Normalized to y by a Softmax function(i). Equations (1-13) are definitions of the output layers.
Figure BDA0002513461500000112
The significance of this function is to decode each set | D of layer outputssThe real value vector of | dimension (o)(i)) The compression mapping is the same dimension, with a value of [0, 1 ]]And a real-valued vector σ of 1 (o)(i)) That is, ensure that the value range of the nodes of all output layers is limited to [0, 1 ]]And the sum of all node output values is 1. The layer can ensure that a series of real values output are a legal probability distribution, and the next result output is carried out according to the probability distribution. Each node in the output layer corresponds to a word in the SPARQL dictionary, the output value of the output layer is the activation probability obtained by comparison and normalization calculation according to the real value vector received by the output layer, and finally, the word with the highest activation probability is selected by using a greedy method, and the word is the primary output result of the output layer. As shown in equations (1-14) and (1-15), according to the output of the decoding layer { o }(1),...,o(N)The output layer finally outputs a group of coherent word sequences { w ] conforming to the semanticss (1),...,ws (N)}。
L_output(o^(1), ..., o^(N)) = {w_s^(1), ..., w_s^(N)}   (1-14)

w_s^(i) = greedy(y^(i))   (1-15)
where greedy(·) denotes the greedy algorithm. The word sequence output by the learner serves no further purpose; the training error is ultimately computed from {y^(1), ..., y^(N)}.
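The output-layer computation of equations (1-13) to (1-15) can be sketched in a few lines; the toy SPARQL dictionary and score values below are illustrative, not from the patent.

```python
import math

def softmax(o):
    """Equation (1-13): map a real-valued score vector to a probability
    distribution over the SPARQL dictionary (values in [0, 1], summing to 1)."""
    exps = [math.exp(v) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(y, dictionary):
    """Equation (1-15): pick the dictionary word with the highest
    activation probability."""
    best = max(range(len(y)), key=lambda j: y[j])
    return dictionary[best]

# Toy SPARQL dictionary and one decoder output vector (illustrative values).
sparql_dict = ["Select", "?x", "where", "{", "}"]
o_i = [2.0, 0.5, -1.0, 0.1, 0.1]
y_i = softmax(o_i)
print(greedy(y_i, sparql_dict))  # highest score wins -> "Select"
```

The assertion that the probabilities form a legal distribution corresponds to `sum(y_i)` being 1.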
Fig. 2 shows the correspondence the model learns between the Chinese question "What are the symptoms of pneumonia" and the SPARQL statement "Select ?x where { <pneumonia> :diseaseRelatedSymptom ?x }". In Fig. 2, the left half of the hidden layer performs the encoding process and the right half the decoding process.
Interpreter
The interpreter in the invention reuses the neural network model learned by the learner: it receives a Chinese natural-language question, translates it into a SPARQL query through the neural network model, executes the query, and obtains the question-answering result.
The interpreter corresponds to the inference, or question-answering, process of the model. The forward propagation of the interpreter's model through the embedding layer and the encoding layer is the same as the learner's; only the decoding layer and the output layer differ slightly. Specifically, as shown in fig. 3, the interpreter takes the result predicted by the decoding layer at the previous step as the input of the decoding layer at the current step, and its output is a set of results with different probabilities. Mirroring the learner's example, fig. 3 shows the model converting the Chinese question "What are the symptoms of pneumonia" into "Select ?x where { <pneumonia> :diseaseRelatedSymptom ?x }".
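The interpreter's feedback loop — feeding the word predicted at the previous decoding step back in as the current input — can be sketched schematically. The `toy_step` function below is a deterministic stand-in for one decoder forward pass, and the predicate name `:diseaseSymptom` is illustrative; a real model would return probability vectors.

```python
def autoregressive_decode(step, start_token, end_token, max_len=20):
    """Greedy autoregressive decoding as used by the interpreter: the word
    predicted at step t-1 is fed back as the decoder input at step t.
    `step` takes the previous token and a state and returns
    (next_token, next_state)."""
    tokens, state = [], None
    prev = start_token
    for _ in range(max_len):
        prev, state = step(prev, state)
        if prev == end_token:
            break
        tokens.append(prev)
    return tokens

# Toy "decoder" that walks through a fixed SPARQL skeleton, purely to show
# the feedback loop.
skeleton = ["Select", "?x", "where", "{", "<pneumonia>",
            ":diseaseSymptom", "?x", "}", "<eos>"]
def toy_step(prev, state):
    i = 0 if state is None else state + 1
    return skeleton[i], i

print(" ".join(autoregressive_decode(toy_step, "<bos>", "<eos>")))
```

During learning, by contrast, the expected word v^(t) replaces the predicted word at each step (teacher forcing), which is the difference noted above between the learner and the interpreter.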
The output layer of the interpreter is similar to the learner's: it likewise compression-maps each output vector o^(i) of the decoding layer into a probability vector σ(o^(i)), but it may select one or several words from the SPARQL dictionary as output according to the probability vector.
The selection algorithms fall into three types, from simple to complex. The first is identical to the learner's output layer: the dictionary word at the index of the maximum value in the probability vector is selected by a greedy method. This method is simple and efficient, and each selection is optimal given the word sequence output so far, but it easily falls into a local optimum. The second method samples according to the probability distribution and is controlled by a temperature parameter T: after the probability vector is computed, a word is drawn at random, and the higher a word's activation probability, the more likely it is to be sampled. The Softmax formula with the temperature parameter is given in equation (1-16).
σ_T(o^(i))_j = exp(o_j^(i) / T) / Σ_{k=1}^{|D_s|} exp(o_k^(i) / T)   (1-16)
where T is the temperature parameter. As T increases, the differences between the components of the output vector o^(i) shrink proportionally, the resulting probabilities tend toward one another, and the method degenerates into uniform random sampling; as T decreases, the differences between the components are magnified, the gaps between the resulting probabilities grow, and the method degenerates into the greedy method. Compared with the greedy method, this adds randomness, so the output is no longer fixed to one locally optimal solution, which increases the chance of reaching the global optimum when the greedy choice is not optimal. The third method, beam search, is a heuristic graph-search algorithm. If the word output at each step is taken as a node and all possibilities are considered, then with a fixed output length N the first step generates |D'| nodes and routes, the second step |D'|^2, and so on, until the final step generates |D'|^N routes; among these routes there must be one whose nodes form the global optimum. Because the solution space is huge, a breadth-first strategy is generally used to build the search tree in order to reduce the space and time of the search: at each depth expansion the nodes are sorted by heuristic cost or activation probability, a predetermined number of the highest-quality nodes is kept according to the beam-width parameter, the poorer remaining nodes are pruned, and only the retained nodes are expanded at the next level. This reduces space consumption and improves time efficiency, at the cost that the optimal solution may be discarded.
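Equation (1-16) and its two limiting behaviors can be checked directly; the score values below are illustrative.

```python
import math

def softmax_with_temperature(o, T=1.0):
    """Equation (1-16): divide each score by the temperature T before the
    exponential. Large T flattens the distribution toward uniform (random
    sampling); small T sharpens it toward the argmax (greedy)."""
    exps = [math.exp(v / T) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

scores = [3.0, 1.0, 0.5]
hot = softmax_with_temperature(scores, T=10.0)   # nearly uniform
cold = softmax_with_temperature(scores, T=0.1)   # nearly one-hot
print(max(hot), max(cold))  # the cold distribution is far more peaked
```

Sampling from the resulting distribution (e.g. with `random.choices(dictionary, weights=probs)`) gives the second selection method; the greedy and beam-search variants differ only in how many candidates they keep per step.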
Word segmentation module
From the training data, the model learns not only the conversion from Chinese natural-language questions to SPARQL statements, but also the word vectors of all words and the mapping relation between the word vectors of the Chinese dictionary and those of the SPARQL dictionary. Chinese sentences cannot be used as input to the neural network model without processing. Therefore, whether it is a Chinese natural-language question generated by the generator from a query template or a question received by the interpreter, it must be processed by the word segmentation module and converted into a space-separated Chinese word sequence before being input to the neural network model for training or inference.
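The patent does not name a specific segmentation algorithm; a minimal dictionary-based forward-maximum-matching sketch illustrates how a Chinese question is turned into the space-separated word sequence the model expects. The vocabulary and sentence are illustrative.

```python
def segment(sentence, vocab, max_word_len=4):
    """Dictionary-based forward maximum matching: at each position, take the
    longest vocabulary word that matches, falling back to a single character.
    Returns a space-separated word sequence for the neural model."""
    words, i = [], 0
    while i < len(sentence):
        for L in range(min(max_word_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + L]
            if L == 1 or piece in vocab:
                words.append(piece)
                i += L
                break
    return " ".join(words)

vocab = {"肺炎", "症状", "什么"}
print(segment("肺炎有什么症状", vocab))  # → "肺炎 有 什么 症状"
```

In practice an off-the-shelf segmenter (jieba is a common choice for Chinese) would replace this sketch, but the interface — question in, space-separated words out — is the same.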
The system further comprises: a pre-training module for generating a word vector matrix from an additional corpus and pre-training the embedding layer of the neural network model, so as to alleviate the word-mismatch problem.
The method adopts a negative-sampling algorithm to improve the training of uncommon words; after pre-training, the embedding layer enters the deep-learning-based SPARQL translation model for formal supervised training, during which the vectors of common words are trained more frequently. During embedding-layer pre-training, the system randomly selects a sequence of 2s+1 consecutive words from the corpus (s is the number of context words on each side of the target word), separates the middle word from the remaining words, and uses one as the model's training input and the other as the expected output. The embedding layer of the model converts the input word or words into the corresponding word vectors; the probability of the expected output is then computed through the hidden layer and the output layer in turn, the negative log-likelihood of that probability is taken as the model error, and back-propagation is used to correct the parameters of the hidden layer and, chiefly, the embedding layer, until the posterior probability of the model on the corpus is maximal, i.e. the error is minimal. At that point the word vector matrix of the embedding layer can be said to have learned the semantic knowledge in the corpus. After training, the hidden layer and output layer are discarded; the parameters of the embedding layer, namely the word vector matrix, are the result of pre-training.
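The 2s+1 sliding-window pair extraction described above can be sketched as follows; the corpus sentence is invented for illustration.

```python
import random

def cbow_training_pairs(tokens, s=2):
    """Slide a window of 2s+1 words over the corpus: the 2s surrounding words
    become the model input and the middle word the expected output."""
    pairs = []
    for i in range(s, len(tokens) - s):
        context = tokens[i - s:i] + tokens[i + 1:i + s + 1]
        pairs.append((context, tokens[i]))
    return pairs

def sample_window(tokens, s=2):
    """Randomly pick one 2s+1 window, as the system does during pre-training."""
    i = random.randrange(s, len(tokens) - s)
    return tokens[i - s:i] + tokens[i + 1:i + s + 1], tokens[i]

corpus = "pneumonia is an inflammation of the lung tissue".split()
pairs = cbow_training_pairs(corpus, s=2)
print(pairs[0])  # (['pneumonia', 'is', 'inflammation', 'of'], 'an')
```

Each (context, target) pair is then fed through embedding, hidden, and output layers, with the negative log-likelihood of the target driving back-propagation as described.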
The pre-training corpus differs from the training corpus of the deep-learning-based natural-language-to-SPARQL translation model: it can be any Chinese text and is very easy to obtain. However, the pre-training process also determines the Chinese dictionary, so after an arbitrary Chinese text has been selected and pre-training completed, the dictionary must be supplemented, based on the translation model's corpus, with Chinese words that appear only in the knowledge base and not in the pre-training corpus. Embedding-layer pre-training not only achieves the same purpose as expanding the query templates, but, because it directly provides a trained word vector matrix, also shortens the subsequent training of the natural-language-to-SPARQL translation model.
In this embodiment, a generic word2vec model is adopted and trained without supervision on a data set of about 800,000 Chinese medical dialogues, finally yielding the word vector of each word.
Correspondingly, the invention provides a method for generating SPARQL query statements in the medical field, which comprises:
s1, constructing a knowledge base for storing knowledge in the medical field and a query template base for storing a Chinese natural language question template and a SPARQL query template, wherein the knowledge comprises answers of the Chinese natural language question, and the two templates are in one-to-one correspondence.
And S2, extracting entities and attributes from the knowledge base, and respectively filling the entities and the attributes into a Chinese natural language question template and a SPARQL query template so as to generate a training data set.
And S3, performing word segmentation processing on the Chinese natural language question in the training data set.
And S4, training the neural network model according to the Chinese training set after word segmentation processing to obtain the trained neural network model.
And S5, performing word segmentation on the input target Chinese natural language question, and predicting the word-segmented target Chinese natural language question by using the trained neural network model to obtain a predicted SPARQL query sentence.
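Steps S1-S2 — pairing a Chinese question template with a SPARQL query template and filling both from the same knowledge-base entity — can be sketched as follows. The placeholder syntax, predicate name, and IRI are illustrative; the patent specifies only that the entity's Chinese label fills the question template and its IRI fills the query template.

```python
def fill_templates(entity_label, entity_iri, question_tpl, sparql_tpl):
    """Fill a Chinese question template and its paired SPARQL template with
    the same knowledge-base entity, yielding one training pair."""
    question = question_tpl.format(label=entity_label)
    sparql = sparql_tpl.format(iri=entity_iri)
    return question, sparql

# One template pair from a hypothetical "symptoms of a disease" query type.
question_tpl = "{label}有什么症状"
sparql_tpl = "Select ?x where {{ <{iri}> :diseaseSymptom ?x }}"
q, s = fill_templates("肺炎", "http://example.org/disease/pneumonia",
                      question_tpl, sparql_tpl)
print(q)
print(s)
```

Randomly sampling entities from the knowledge base and repeating this filling for every template pair in the query template library produces the training data set of step S2.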
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A system for generating SPARQL query sentences in the medical field, the system comprising:
the knowledge base is used for storing knowledge in the medical field, and the knowledge comprises answers of Chinese natural language question sentences;
the query template library is used for storing a Chinese natural language question template and an SPARQL query template, and the two templates correspond to each other one by one;
the generator takes the query template base and the knowledge base as input, is used for extracting entities and attributes from the knowledge base, and respectively fills the entities and the attributes into a Chinese natural language question template and a SPARQL query template so as to generate a training data set;
the word segmentation module is used for carrying out word segmentation processing on Chinese natural language question sentences in the training data set and forwarding word segmentation results to the learner; performing word segmentation processing on an input target Chinese natural language question sentence, and forwarding a word segmentation result to an interpreter;
the learner is used for training the neural network model according to the Chinese training set after word segmentation processing to obtain a trained neural network model;
and the interpreter is used for predicting the target Chinese natural language question sentence subjected to word segmentation by utilizing the trained neural network model to obtain a predicted SPARQL query sentence.
2. The system of claim 1, wherein the query template library comprises various types of query templates corresponding to various medical knowledge of diseases, the various medical knowledge comprising: causes of diseases, prevention of diseases, symptoms of diseases, examination of diseases, treatment of diseases, and care of diseases.
3. The system of claim 1 or 2, wherein the function of the generator is implemented by: and randomly sampling in the entity and attribute value lists, and respectively filling the IRI of the entity and the corresponding Chinese label into the placeholders of the two templates to obtain a pair of Chinese question sentences and SPARQL template sentences.
4. The system of any one of claims 1 to 3, wherein the neural network model is a Seq2Seq model.
5. The system of any one of claims 1 to 4, further comprising: and the pre-training module is used for generating a word vector matrix by using the extra corpus and pre-training the embedded layer of the neural network model.
6. A SPARQL query statement generation method in the medical field is characterized by comprising the following steps:
s1, constructing a knowledge base for storing knowledge in the medical field and a query template base for storing a Chinese natural language question template and a SPARQL query template, wherein the knowledge comprises answers of the Chinese natural language question, and the two templates correspond to each other one by one;
s2, extracting entities and attributes from the knowledge base, and respectively filling the entities and the attributes into a Chinese natural language question template and a SPARQL query template so as to generate a training data set;
s3, performing word segmentation processing on Chinese natural language question sentences in the training data set;
s4, training the neural network model according to the Chinese training set after word segmentation processing to obtain a trained neural network model;
and S5, performing word segmentation on the input target Chinese natural language question, and predicting the word-segmented target Chinese natural language question by using the trained neural network model to obtain a predicted SPARQL query sentence.
7. The method of claim 6, wherein the query template library comprises various types of query templates corresponding to various medical knowledge of diseases, and the various medical knowledge comprises: causes of diseases, prevention of diseases, symptoms of diseases, examination of diseases, treatment of diseases, and care of diseases.
8. The method according to claim 6 or 7, wherein step S2 is specifically: and randomly sampling in the entity and attribute value lists, and respectively filling the IRI of the entity and the corresponding Chinese label into the placeholders of the two templates to obtain a pair of Chinese question sentences and SPARQL template sentences.
9. The method of any one of claims 6 to 8, wherein the neural network model is a Seq2Seq model.
10. The method of any one of claims 6 to 9, further comprising, between step S3 and step S4: and generating a word vector matrix by using an additional corpus, and pre-training an embedded layer of the neural network model.
CN202010472760.8A 2020-05-28 2020-05-28 System and method for generating SPARQL query statement in medical field Pending CN111639254A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010472760.8A CN111639254A (en) 2020-05-28 2020-05-28 System and method for generating SPARQL query statement in medical field

Publications (1)

Publication Number Publication Date
CN111639254A true CN111639254A (en) 2020-09-08

Family

ID=72331249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010472760.8A Pending CN111639254A (en) 2020-05-28 2020-05-28 System and method for generating SPARQL query statement in medical field

Country Status (1)

Country Link
CN (1) CN111639254A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111813802A (en) * 2020-09-11 2020-10-23 杭州量之智能科技有限公司 Method for generating structured query statement based on natural language
CN112287093A (en) * 2020-12-02 2021-01-29 上海交通大学 Automatic question-answering system based on semi-supervised learning and Text-to-SQL model
CN112447300A (en) * 2020-11-27 2021-03-05 平安科技(深圳)有限公司 Medical query method and device based on graph neural network, computer equipment and storage medium
CN113537346A (en) * 2021-07-15 2021-10-22 思必驰科技股份有限公司 Medical field data labeling model training method and medical field data labeling method
CN114064820A (en) * 2021-11-29 2022-02-18 上证所信息网络有限公司 Table semantic query rough arrangement method based on hybrid architecture

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010137940A1 (en) * 2009-05-25 2010-12-02 Mimos Berhad A method and system for extendable semantic query interpretation
CN107748757A (en) * 2017-09-21 2018-03-02 北京航空航天大学 A kind of answering method of knowledge based collection of illustrative plates
CN109522393A (en) * 2018-10-11 2019-03-26 平安科技(深圳)有限公司 Intelligent answer method, apparatus, computer equipment and storage medium
US20200004831A1 (en) * 2018-06-27 2020-01-02 Bitdefender IPR Management Ltd. Systems And Methods For Translating Natural Language Sentences Into Database Queries

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SHENG ZHANG ET AL.: "Chinese Medical Question Answer Matching Using End-to-End Character-Level Multi-Scale CNNs" *
TOMMASO SORU ET AL.: "SPARQL as a Foreign Language" *
李东潮: "基于深度学习算法的中文文本与SPARQL的转换方法研究" *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111813802A (en) * 2020-09-11 2020-10-23 杭州量之智能科技有限公司 Method for generating structured query statement based on natural language
CN111813802B (en) * 2020-09-11 2021-06-29 杭州量之智能科技有限公司 Method for generating structured query statement based on natural language
CN112447300A (en) * 2020-11-27 2021-03-05 平安科技(深圳)有限公司 Medical query method and device based on graph neural network, computer equipment and storage medium
CN112447300B (en) * 2020-11-27 2024-02-09 平安科技(深圳)有限公司 Medical query method and device based on graph neural network, computer equipment and storage medium
CN112287093A (en) * 2020-12-02 2021-01-29 上海交通大学 Automatic question-answering system based on semi-supervised learning and Text-to-SQL model
CN113537346A (en) * 2021-07-15 2021-10-22 思必驰科技股份有限公司 Medical field data labeling model training method and medical field data labeling method
CN113537346B (en) * 2021-07-15 2023-08-15 思必驰科技股份有限公司 Medical field data labeling model training method and medical field data labeling method
CN114064820A (en) * 2021-11-29 2022-02-18 上证所信息网络有限公司 Table semantic query rough arrangement method based on hybrid architecture
CN114064820B (en) * 2021-11-29 2023-11-24 上证所信息网络有限公司 Mixed architecture-based table semantic query coarse arrangement method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200908