CN111639254A - System and method for generating SPARQL query statement in medical field - Google Patents


Info

Publication number: CN111639254A
Application number: CN202010472760.8A
Authority: CN (China)
Legal status: Pending
Prior art keywords: Chinese, natural language, query, word segmentation, template
Other languages: Chinese (zh)
Inventors: 李瑞轩, 辜希武, 胡仁, 李玉华
Current assignee: Huazhong University of Science and Technology
Original assignee: Huazhong University of Science and Technology
Application filed by Huazhong University of Science and Technology
Priority to CN202010472760.8A
Publication of CN111639254A

Classifications

    • G06F 16/9532 — Query formulation (retrieval from the web; querying by web search engines)
    • G06F 16/3329 — Natural language query formulation or dialogue systems
    • G06F 16/3344 — Query execution using natural language analysis
    • G06N 3/08 — Neural networks; learning methods


Abstract

The invention discloses a system and a method for generating SPARQL query statements in the medical field, belonging to the field of machine translation. The system comprises: a generator that takes a query template library and a knowledge base as input, extracts entities and attributes from the knowledge base, and fills them into Chinese question templates and SPARQL query templates to generate a training set; a word segmentation module that segments the Chinese questions in the training set and forwards the results to the learner, and that segments a target Chinese question and forwards the result to the interpreter; a learner that trains a neural network model on the segmented Chinese training set to obtain a trained model; and an interpreter that uses the trained neural network model to predict on the segmented target Chinese question and obtain a predicted SPARQL query. In this way, a Chinese medical-health question is converted directly into a SPARQL query statement without using complex statistical or hand-crafted models.

Description

System and method for generating SPARQL query statement in medical field
Technical Field
The invention belongs to the fields of deep learning and machine translation, and particularly relates to a system and a method for generating SPARQL query statements in the medical field.
Background
In the Internet era, the amount of information grows rapidly with each iteration of technology. For ordinary users, accurately obtaining the information they need from massive network data has become a problem. Search engines emerged to address it and are now one of the most important ways for users to obtain information from the Internet. Search engines can be summarized into two implementation modes: the traditional keyword-matching mode, and a mode that can be classified as based on semantic query graphs. Whatever the technology used, the purpose of a search engine is to collect and consolidate data on the Internet so as to provide users with search services. Search engines have held a dominant position since their advent, but as new technologies keep appearing, traditional search engines show bottlenecks and defects in some respects. For example, a search engine cannot directly understand a sentence input by the user; it can only return a group of web links ranked from high to low relevance to the keywords the user provides, so the user must search for the desired knowledge among a large number of links rather than have the search engine understand the question posed. In 1998, Tim Berners-Lee, the father of the Web, proposed the Semantic Web to handle these problems: it organizes structured data into a format that computers can recognize and reason over, thereby giving computers the ability to understand questions and screen answers for users. Semantic Web technology comprises RDF, SPARQL, JSON-LD, OWL, RDFS, RIF and other standards, among which SPARQL is the officially recommended query language for the Semantic Web.
With ever-increasing Internet information, there is already a huge amount of semantic information on the network, and millions of websites support Semantic Web technology. However, it is still very difficult for users to search directly for the information they need in such a huge amount of data; indeed, it is almost impossible for ordinary users, because they do not know the syntactic structure of Semantic Web query languages such as SPARQL.
Automatic question-answering systems have a long history of development. Early systems relied mainly on search-engine technology: first query relevant documents from a text source, then extract from the retrieved documents the answers most relevant to the question. Later came collaboration-based intelligent question-answering systems, which maintain a data set of questions and answers in the background; the system finds the stored question that best matches the user's question and returns its corresponding answer. The mainstream technology of current automatic question-answering systems is structured query over a knowledge base: after understanding the problem posed by an ordinary user, the system converts the natural-language question into a structured query statement, such as a common SQL query or, later, a SPARQL query oriented to the Semantic Web, executes it accurately against the knowledge base, and returns the result.
The first step of a knowledge question-answering system is to understand the question posed by the user. This step plays a crucial role: only if the semantic information of the question is correctly understood can the system accurately generate the corresponding query statement and return results to the user. Because natural-language expression is diverse and ambiguous, incorrect semantic understanding of a question causes the returned result to fail the user's requirements. When browsing the Internet, people sometimes want to obtain medical and health information, even to perform self-diagnosis. However, according to statistics from several related medical-health websites in China, the automatic question-answering services these websites provide are still based on traditional search-engine technology: they index related web pages by extracting the user's keywords, so users still need to spend a lot of time screening information. For the field of health care, which is closely bound to people's daily lives, domestic intelligent question-answering systems based on medical knowledge bases are not yet mature enough.
Disclosure of Invention
Aiming at the defects and improvement requirements of the prior art, the invention provides a system and a method for generating SPARQL query statements in the medical field, with the aim of directly converting a received Chinese natural-language medical-health query into a SPARQL query statement without using complex statistical or hand-crafted models.
To achieve the above object, according to a first aspect of the present invention, there is provided a system for generating a SPARQL query statement in the medical field, the system including:
the knowledge base is used for storing knowledge in the medical field, and the knowledge comprises answers of Chinese natural language question sentences;
the query template library is used for storing Chinese natural language question templates and SPARQL query templates, the two kinds of templates corresponding one to one;
the generator takes the query template base and the knowledge base as input, is used for extracting entities and attributes from the knowledge base, and respectively fills the entities and the attributes into a Chinese natural language question template and a SPARQL query template so as to generate a training data set;
the word segmentation module is used for carrying out word segmentation processing on Chinese natural language question sentences in the training data set and forwarding word segmentation results to the learner; performing word segmentation processing on an input target Chinese natural language question sentence, and forwarding a word segmentation result to an interpreter;
the learner is used for training the neural network model according to the Chinese training set after word segmentation processing to obtain a trained neural network model;
and the interpreter is used for predicting the target Chinese natural language question sentence subjected to word segmentation by utilizing the trained neural network model to obtain a predicted SPARQL query sentence.
Preferably, the query template library includes various types of query templates corresponding to various medical knowledge about diseases, where the various medical knowledge includes: causes of diseases, prevention of diseases, symptoms of diseases, examination of diseases, treatment of diseases, and care of diseases.
Preferably, the function of the generator is implemented by randomly sampling from the entity and attribute-value lists and filling the IRI of the entity and the corresponding Chinese label into the placeholders of the two templates, thereby obtaining a paired Chinese question and SPARQL query sentence.
Preferably, the neural network model is a Seq2Seq model.
Preferably, the system further comprises: a pre-training module for generating a word-vector matrix from an additional corpus and pre-training the embedding layer of the neural network model.
To achieve the above object, according to a second aspect of the present invention, there is provided a method for generating a SPARQL query statement in the medical field, the method including the steps of:
s1, constructing a knowledge base for storing knowledge in the medical field and a query template base for storing a Chinese natural language question template and a SPARQL query template, wherein the knowledge comprises answers of the Chinese natural language question, and the two templates correspond to each other one by one;
s2, extracting entities and attributes from the knowledge base, and respectively filling the entities and the attributes into a Chinese natural language question template and a SPARQL query template so as to generate a training data set;
s3, performing word segmentation processing on Chinese natural language question sentences in the training data set;
s4, training the neural network model according to the Chinese training set after word segmentation processing to obtain a trained neural network model;
and S5, performing word segmentation on the input target Chinese natural language question, and predicting the word-segmented target Chinese natural language question by using the trained neural network model to obtain a predicted SPARQL query sentence.
Preferably, the query template library includes various types of query templates corresponding to various medical knowledge about diseases, where the various medical knowledge includes: causes of diseases, prevention of diseases, symptoms of diseases, examination of diseases, treatment of diseases, and care of diseases.
Preferably, step S2 specifically comprises: randomly sampling from the entity and attribute-value lists and filling the IRI of the entity and the corresponding Chinese label into the placeholders of the two templates, thereby obtaining a paired Chinese question and SPARQL query sentence.
Preferably, the neural network model is a Seq2Seq model.
Preferably, the method further comprises, between step S3 and step S4: generating a word-vector matrix from an additional corpus and pre-training the embedding layer of the neural network model.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
(1) The method generates SPARQL statements from natural language based on deep learning, combining the semantic characteristics of Chinese with a deep-learning algorithm. The algorithm takes segmented Chinese questions and their corresponding SPARQL translations as input data and generates predicted SPARQL statements; from the input data it learns the conversion from Chinese question to SPARQL statement, the word vectors of all words, and the mapping between the word vectors of the Chinese dictionary and the SPARQL dictionary. The Chinese natural language received by the model is thus converted directly into SPARQL without complex statistical or hand-crafted models; in other words, the whole natural-language expression is converted into the final query in an end-to-end manner.
(2) A question-analysis module is constructed from the query templates and the word segmentation module. The query templates incorporate the prior knowledge in the knowledge base, this knowledge is learned by the deep-learning-based natural-language-to-SPARQL translation model, and the problems of named-entity recognition, information extraction, and logical-form conversion are thereby solved jointly.
(3) A word-vector matrix is generated from an additional corpus to pre-train the embedding layer of the neural network model. The pre-trained embedding layer then enters the deep-learning-based SPARQL translation model for formal supervised training, during which the vectors of common words are further retrained, alleviating the word-mismatch problem.
Drawings
Fig. 1 is a diagram of a system for generating SPARQL query statements in the medical field according to the present invention;
FIG. 2 is a diagram of a learner model provided in accordance with the present invention;
FIG. 3 is a diagram of an interpreter model provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in fig. 1, the present invention provides a system for generating SPARQL query statements in the medical field, the system including:
and the knowledge base is used for storing knowledge in the medical field, and the knowledge comprises answers of Chinese natural language question sentences.
And the query template library is used for storing a Chinese natural language question template and a SPARQL query template, and the two templates correspond to each other one by one.
And the generator takes the query template base and the knowledge base as input, is used for extracting entities and attributes from the knowledge base, and respectively fills the entities and the attributes into the Chinese natural language question template and the SPARQL query template so as to generate a training data set.
The word segmentation module is used for carrying out word segmentation processing on Chinese natural language question sentences in the training data set and forwarding word segmentation results to the learner; and performing word segmentation processing on the input target Chinese natural language question, and forwarding a word segmentation result to an interpreter.
And the learner is used for training the neural network model according to the Chinese training set after word segmentation processing to obtain the trained neural network model.
And the interpreter is used for predicting the target Chinese natural language question sentence subjected to word segmentation by utilizing the trained neural network model to obtain a predicted SPARQL query sentence.
The whole system mainly comprises three components: a generator, a learner, and an interpreter. The three areas in fig. 1 distinguish the workflows of these components; the knowledge base provides data to both the generator and the interpreter, and the model the learner learns from the training data is used by the interpreter for translating Chinese questions into SPARQL.
Knowledge base
In this embodiment, Protégé is selected as the ontology-editing tool. Based on the Chinese symptom library provided on the OpenKG website, plus partial data crawled from domestic health and medical websites such as a medical consultation website and the A+ Medical Encyclopedia, a medical-field knowledge base is generated.
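The knowledge base ultimately stores medical facts as subject–predicate–object triples that SPARQL patterns can match against. As a minimal sketch of this idea (the entity and property names below are illustrative, not taken from the actual ontology), an in-memory triple store and a single-pattern lookup might look like:

```python
# Minimal in-memory triple store sketch; entity/property names are illustrative.
triples = {
    ("pneumonia", "diseaseSymptom", "cough"),
    ("pneumonia", "diseaseSymptom", "fever"),
    ("pneumonia", "diseaseTreatment", "antibiotics"),
}

def query(subject, predicate):
    """Return all objects matching a (subject, predicate, ?x) pattern,
    analogous to SELECT ?x WHERE { <subject> :predicate ?x }."""
    return sorted(o for (s, p, o) in triples if s == subject and p == predicate)

print(query("pneumonia", "diseaseSymptom"))
```

A real deployment would use an RDF store queried over SPARQL; this sketch only shows the triple structure the templates below are written against.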
Query template library
The query template library contains query templates corresponding to various kinds of medical knowledge about diseases, including but not limited to: causes of a disease, prevention of a disease, symptoms of a disease, examination of a disease, treatment of a disease, and care of a disease.
A query template consists of two parts: a Chinese natural-language question template, representing input that the neural network model can understand and receive; and a SPARQL query template, serving directly as the expected output of the neural network model. Both templates contain placeholders. This embodiment defines 37 query templates, each involving one or two entities and several attributes.
Writing a query template combines the prior knowledge in the knowledge base: a natural-language query closely tied to the knowledge base and the SPARQL query corresponding to it, with the entities involved replaced by placeholders. For example, for a template querying what symptoms a disease has, "What symptoms does < > have?" and "SELECT ?x WHERE { < > :symptomsAssociatedWithDisease ?x }" serve as the contents of the Chinese question template and the SPARQL query template respectively, where < > is the entity placeholder.
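Such a template pair can be sketched as a pair of format strings sharing one placeholder (the placeholder syntax and the `:diseaseSymptom` property name are assumptions for illustration, not the patent's actual notation):

```python
# A query-template pair: a Chinese question template and its SPARQL counterpart,
# both containing the same placeholder for the disease entity.
QUESTION_TEMPLATE = "{disease}有什么症状"  # "What symptoms does {disease} have?"
SPARQL_TEMPLATE = "SELECT ?x WHERE {{ <{disease}> :diseaseSymptom ?x }}"

def fill(disease: str) -> tuple[str, str]:
    """Fill both templates with the same entity to obtain one training pair."""
    return (QUESTION_TEMPLATE.format(disease=disease),
            SPARQL_TEMPLATE.format(disease=disease))

q, s = fill("肺炎")  # 肺炎 = "pneumonia"
print(q)
print(s)
```

Filling both sides with the same entity is what keeps the question and its SPARQL translation aligned in the training pair.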
Generator
The generator is the preparatory component ahead of the deep neural network model; its main function is to connect the model with the knowledge base and help the model understand and learn from it. The generator takes the knowledge base and the query templates as input and outputs Chinese natural-language questions and SPARQL token sequences as the data set for the learner to learn from.
For each template, a certain number of entities and attribute values satisfying the corresponding SPARQL graph pattern need to be retrieved from the domain knowledge base. To speed up data generation, these values come from entity and attribute-value lists extracted and classified in advance. Random sampling is performed on the lists, and the IRIs (Internationalized Resource Identifiers) of the entities and the corresponding Chinese labels (entities/attributes) are filled into the placeholders of the two templates, yielding a paired Chinese question and SPARQL sentence. Entities of medical-domain knowledge include disease names, drug names, symptoms, and so on; attributes include treatment modality, disease-care modality, treatment cycle, and so on.
In this example, a value such as "pneumonia" is randomly selected from the list of disease entities. Filling "pneumonia" into the placeholders of the two templates yields the Chinese question "What symptoms does pneumonia have?" and the SPARQL sentence "SELECT ?x WHERE { <pneumonia> :symptomsAssociatedWithDisease ?x }".
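The sampling-and-filling loop of the generator can be sketched as follows (the entity list, template strings, and property names are illustrative stand-ins for the pre-extracted lists described above):

```python
import random

DISEASES = ["pneumonia", "influenza", "gastritis"]  # illustrative entity list
TEMPLATES = [  # (question template, SPARQL template) pairs
    ("What symptoms does {e} have?", "SELECT ?x WHERE {{ <{e}> :diseaseSymptom ?x }}"),
    ("How is {e} treated?",          "SELECT ?x WHERE {{ <{e}> :diseaseTreatment ?x }}"),
]

def generate_pairs(n: int, seed: int = 0) -> list[tuple[str, str]]:
    """Randomly sample an entity and a template, fill both sides, and
    collect n (question, SPARQL) training pairs."""
    rng = random.Random(seed)  # seeded so the data set is reproducible
    pairs = []
    for _ in range(n):
        q_tmpl, s_tmpl = rng.choice(TEMPLATES)
        e = rng.choice(DISEASES)
        pairs.append((q_tmpl.format(e=e), s_tmpl.format(e=e)))
    return pairs

for q, s in generate_pairs(3):
    print(q, "->", s)
```

In the actual system the sampled value would be the entity's IRI plus its Chinese label, and each question would then pass through the word segmentation module before training.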
Learner
The learner understands and absorbs the prior knowledge in the knowledge base, the questions related to the medical knowledge base, and the conversion rules between those questions and SPARQL sentences, and then encodes the understood result in a deep neural network model.
The learner adopts a deep neural network of the Seq2Seq type, which takes segmented Chinese questions and their corresponding SPARQL translations as input, generates predicted SPARQL sentences, and learns with a gradient-descent algorithm. The network is divided into four layers: an embedding layer, an encoding layer, a decoding layer, and an output layer.

The embedding layer converts Chinese words into word vectors; its number of neurons is the length of a Chinese word vector. The output layer is a classical Softmax layer; its number of neurons is the size of the SPARQL dictionary, each neuron corresponding to one word in that dictionary.
The encoding layer and decoding layer respectively understand the Chinese question and generate the SPARQL sentence; since they operate as mirror images of each other, they can be combined into a single hidden layer. The hidden layer receives the word-vector sequence from the embedding layer,

x_hidden = {u^(1), ..., u^(M), v^(1), ..., v^(N)},

as input, and outputs a set of N real-valued vectors of length |D_s|, where D_s is the SPARQL dictionary; each real-valued vector represents the activation values of all the words in the SPARQL dictionary at that output position.

The activation values pass through the output layer, which emits a translation sequence y = {y^(1), ..., y^(N)}; this is compared with the expected sequence ŷ = {ŷ^(1), ..., ŷ^(N)} to produce the learner's training error.

As equation (1-1) shows, the training goal of the model is to maximize the generation probability of ŷ conditioned on the input x_input:

P(ŷ^(1), ..., ŷ^(N) | x_input) = ∏_{t=1}^{N} p(ŷ^(t) | ŷ^(1), ..., ŷ^(t-1), x_input)   (1-1)
The encoding and decoding layers that make up the hidden layer are actually assembled from an encoder, a decoder, and the intermediate context vectors. The encoder L_encoder is a unidirectional LSTM (a bidirectional LSTM may be substituted); its number of layers determines the model's ability to capture complex structures and whether higher-level features can be extracted more accurately from the text window. With the attention mechanism introduced, the encoder reads in time order the first half {u^(1), ..., u^(M)} of the input word-vector sequence x_hidden and then generates M hidden states as its output instead of a single context vector, as shown in equations (1-2) and (1-3):

h^(t) = f(u^(t), h^(t-1))   (1-2)

L_encoder(u^(1), ..., u^(M)) = {h^(1), ..., h^(M)}   (1-3)

Here h^(t) is the state of the encoder's hidden layer at time t (t = 1, 2, 3, ...), computed from the previous state h^(t-1) and the word vector u^(t) at the current time; h^(0) is initialized to the all-zero state. f is a nonlinear transformation function, which in the LSTM is a composition of a series of functions, described in detail below.
To control how the hidden state h^(t) is updated, the LSTM model keeps an internal state in the hidden layer, commonly called the memory cell and denoted C^(t). The hidden layer also has three gate structures: a forget gate, an input gate, and an output gate. The three gates have different functions and cooperate to control the output of the next hidden state. At any time t, all three gates receive the current model input u^(t) and the previous hidden state h^(t-1) as input and, after passing through an activation function, produce three vectors f^(t), i^(t), o^(t) of the same length as the hidden state. The gate output vectors are given in equations (1-4), (1-5), and (1-6):

f^(t) = σ(W_f · h^(t-1) + U_f · u^(t))   (1-4)

i^(t) = σ(W_i · h^(t-1) + U_i · u^(t))   (1-5)

o^(t) = σ(W_o · h^(t-1) + U_o · u^(t))   (1-6)

The forget gate uses the forget vector f^(t) to control what proportion of the previous memory cell is retained. Here σ is the Sigmoid function; each element of the computed vector lies in [0, 1] and represents the retained proportion of the memory-cell content from the previous time step.

The input gate uses the input vector i^(t) to control the proportion contributed by the candidate value at the current time, which is computed from the current model input u^(t) and the previous hidden state h^(t-1); the input-gate vector determines, with a certain probability, how much of the candidate value is added to the memory cell, which is then updated. The candidate value and the memory-cell update are given in equations (1-7) and (1-8):

a^(t) = tanh(W_a · h^(t-1) + U_a · u^(t))   (1-7)

C^(t) = f^(t) ⊙ C^(t-1) + i^(t) ⊙ a^(t)   (1-8)

where ⊙ denotes the Hadamard product of two matrices: for two matrices or vectors of identical shape, corresponding elements are multiplied one by one, and the result is a matrix or vector of unchanged shape. Like h^(0), C^(0) is initialized to the all-zero state.

Finally, the output gate uses the output vector o^(t) to control what proportion of the memory cell is converted into the hidden state at the current time. The hidden-state update is given in equation (1-9):

h^(t) = o^(t) ⊙ tanh(C^(t))   (1-9)
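Equations (1-4) through (1-9) can be sketched directly in code. This toy step uses fixed identity weight matrices and, like the equations as written, omits bias terms; it is a minimal illustration, not the trained model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):  # W: list of rows, v: vector
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def had(a, b):  # Hadamard product ⊙: elementwise multiplication
    return [x * y for x, y in zip(a, b)]

def lstm_step(u, h_prev, C_prev, Wf, Uf, Wi, Ui, Wo, Uo, Wa, Ua):
    """One LSTM time step implementing equations (1-4)..(1-9)."""
    f = [sigmoid(x) for x in add(matvec(Wf, h_prev), matvec(Uf, u))]    # (1-4)
    i = [sigmoid(x) for x in add(matvec(Wi, h_prev), matvec(Ui, u))]    # (1-5)
    o = [sigmoid(x) for x in add(matvec(Wo, h_prev), matvec(Uo, u))]    # (1-6)
    a = [math.tanh(x) for x in add(matvec(Wa, h_prev), matvec(Ua, u))]  # (1-7)
    C = add(had(f, C_prev), had(i, a))                                  # (1-8)
    h = had(o, [math.tanh(c) for c in C])                               # (1-9)
    return h, C

# Toy 2-dimensional example with identity weight matrices and zero initial states.
I2 = [[1.0, 0.0], [0.0, 1.0]]
h, C = lstm_step([0.5, -0.5], [0.0, 0.0], [0.0, 0.0], *([I2] * 8))
print(h)
```

Running the step repeatedly over the word-vector sequence {u^(1), ..., u^(M)} yields the M encoder hidden states of equation (1-3).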
The decoder L_decoder mirrors the encoder. It receives the hidden-state sequence output by the encoder and reads in time order the second half {v^(1), ..., v^(N)} of the input word-vector sequence x_hidden, generating a set of real-valued vectors o = {o^(1), ..., o^(N)} as output, where o^(t) is determined by the decoder hidden state H^(t) at that time, as shown in equations (1-10), (1-11), and (1-12):

H^(t) = f(v^(t-1), H^(t-1), Σ_{j=1}^{M} F(h^(j), H^(t-1)) · h^(j))   (1-10)

o^(t) = σ(V · H^(t))   (1-11)

L_decoder(v^(1), ..., v^(N), h^(1), ..., h^(M)) = {o^(1), ..., o^(N)}   (1-12)

H^(t) is the state of the decoder's hidden layer at time t, computed from the hidden-state sequence output by the encoder, the previous state H^(t-1), and the word vector v^(t-1) of the expected result from the previous output step. F is a vector-normalizing comparison function responsible for computing the attention proportion allocated to each encoder hidden state. H^(0) is the all-zero vector, and v^(0) is the end-of-sentence symbol read from the natural-language question, indicating that the encoder has finished and the decoder starts. In the learning phase the decoding-layer output vector o^(t) is used only for error computation and does not participate in computing the hidden state at the next time step; the word vector v^(t) of the expected result participates instead, the goal being to adjust the parameters quickly so that the output approaches the expected result sooner.
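The attention step inside equation (1-10), allocating a normalized proportion to each encoder hidden state, can be sketched as follows. The text does not pin down the comparison function F, so dot-product scoring followed by a softmax normalization is an assumption here:

```python
import math

def attention_context(H_prev, encoder_states):
    """Score each encoder hidden state h^(j) against the previous decoder
    state, normalize the scores to proportions summing to 1, and form the
    weighted-sum context vector."""
    scores = [sum(a * b for a, b in zip(H_prev, h)) for h in encoder_states]
    exp_s = [math.exp(s) for s in scores]
    total = sum(exp_s)
    weights = [e / total for e in exp_s]  # attention proportions, sum to 1
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

w, c = attention_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
print(w)  # three proportions summing to 1; the first state scores highest
```

The resulting context vector is what the decoder combines with v^(t-1) and H^(t-1) to produce the next state.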
The output layer of the learner adopts a classical Softmax layer, and the activation function used by the nodes of the layer is a popularization of logic functions and is called a Softmax function. The output layer receives the output vector sequence o of the decoding layer(1),...,o(N)H, a real-valued vector o of length d(i)Comparing with all words in SPARQL dictionary, and converting the comparison result into length | DsReal-valued vector of (o)(i)) Normalized to y by a Softmax function(i). Equations (1-13) are definitions of the output layers.
Figure BDA0002513461500000112
The significance of this function is to decode each set | D of layer outputssThe real value vector of | dimension (o)(i)) The compression mapping is the same dimension, with a value of [0, 1 ]]And a real-valued vector σ of 1 (o)(i)) That is, ensure that the value range of the nodes of all output layers is limited to [0, 1 ]]And the sum of all node output values is 1. The layer can ensure that a series of real values output are a legal probability distribution, and the next result output is carried out according to the probability distribution. Each node in the output layer corresponds to a word in the SPARQL dictionary, the output value of the output layer is the activation probability obtained by comparison and normalization calculation according to the real value vector received by the output layer, and finally, the word with the highest activation probability is selected by using a greedy method, and the word is the primary output result of the output layer. As shown in equations (1-14) and (1-15), according to the output of the decoding layer { o }(1),...,o(N)The output layer finally outputs a group of coherent word sequences { w ] conforming to the semanticss (1),...,ws (N)}。
L_output(o^(1), ..., o^(N)) = {w_s^(1), ..., w_s^(N)}   (1-14)

w_s^(i) = greedy(y^(i))   (1-15)
where greedy(·) denotes the greedy algorithm. The word sequence output by the learner serves no further purpose; the training error is ultimately computed from {y^(1), ..., y^(N)}.
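The output-layer computation of equations (1-13) to (1-15) can be sketched in a few lines; the toy SPARQL dictionary and score values below are illustrative, not from the patent.

```python
import math

def softmax(o):
    """Equation (1-13): map a real-valued score vector to a probability
    distribution over the SPARQL dictionary (values in [0, 1], summing to 1)."""
    exps = [math.exp(v) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(y, dictionary):
    """Equation (1-15): pick the dictionary word with the highest
    activation probability."""
    best = max(range(len(y)), key=lambda j: y[j])
    return dictionary[best]

# Toy SPARQL dictionary and one decoder output vector (illustrative values).
sparql_dict = ["Select", "?x", "where", "{", "}"]
o_i = [2.0, 0.5, -1.0, 0.1, 0.1]
y_i = softmax(o_i)
print(greedy(y_i, sparql_dict))  # highest score wins -> "Select"
```

The assertion that the probabilities form a legal distribution corresponds to `sum(y_i)` being 1.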
Fig. 2 shows the correspondence the model learns between the Chinese question "What are the symptoms of pneumonia" and the SPARQL statement "Select ?x where { <pneumonia> :diseaseRelatedSymptom ?x }". In Fig. 2, the left half of the hidden layer performs the encoding process and the right half the decoding process.
Interpreter
The interpreter in the invention reuses the neural network model learned by the learner: it receives a Chinese natural-language question, translates it into a SPARQL query through the neural network model, executes the query, and obtains the question-answering result.
The interpreter corresponds to the inference, or question-answering, process of the model. The forward propagation of the interpreter's model through the embedding layer and the encoding layer is the same as the learner's; only the decoding layer and the output layer differ slightly. Specifically, as shown in fig. 3, the interpreter takes the result predicted by the decoding layer at the previous step as the input of the decoding layer at the current step, and its output is a set of results with different probabilities. Mirroring the learner's example, fig. 3 shows the model converting the Chinese question "What are the symptoms of pneumonia" into "Select ?x where { <pneumonia> :diseaseRelatedSymptom ?x }".
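The interpreter's feedback loop — feeding the word predicted at the previous decoding step back in as the current input — can be sketched schematically. The `toy_step` function below is a deterministic stand-in for one decoder forward pass, and the predicate name `:diseaseSymptom` is illustrative; a real model would return probability vectors.

```python
def autoregressive_decode(step, start_token, end_token, max_len=20):
    """Greedy autoregressive decoding as used by the interpreter: the word
    predicted at step t-1 is fed back as the decoder input at step t.
    `step` takes the previous token and a state and returns
    (next_token, next_state)."""
    tokens, state = [], None
    prev = start_token
    for _ in range(max_len):
        prev, state = step(prev, state)
        if prev == end_token:
            break
        tokens.append(prev)
    return tokens

# Toy "decoder" that walks through a fixed SPARQL skeleton, purely to show
# the feedback loop.
skeleton = ["Select", "?x", "where", "{", "<pneumonia>",
            ":diseaseSymptom", "?x", "}", "<eos>"]
def toy_step(prev, state):
    i = 0 if state is None else state + 1
    return skeleton[i], i

print(" ".join(autoregressive_decode(toy_step, "<bos>", "<eos>")))
```

During learning, by contrast, the expected word v^(t) replaces the predicted word at each step (teacher forcing), which is the difference noted above between the learner and the interpreter.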
The output layer of the interpreter is similar to the learner's: it likewise compression-maps each output vector o^(i) of the decoding layer into a probability vector σ(o^(i)), but it may select one or several words from the SPARQL dictionary as output according to the probability vector.
The selection algorithms fall into three types, from simple to complex. The first is identical to the learner's output layer: the dictionary word at the index of the maximum value in the probability vector is selected by a greedy method. This method is simple and efficient, and each selection is optimal given the word sequence output so far, but it easily falls into a local optimum. The second method samples according to the probability distribution and is controlled by a temperature parameter T: after the probability vector is computed, a word is drawn at random, and the higher a word's activation probability, the more likely it is to be sampled. The Softmax formula with the temperature parameter is given in equation (1-16).
σ_T(o^(i))_j = exp(o_j^(i) / T) / Σ_{k=1}^{|D_s|} exp(o_k^(i) / T)   (1-16)
where T is the temperature parameter. As T increases, the differences between the components of the output vector o^(i) shrink proportionally, the resulting probabilities tend toward one another, and the method degenerates into uniform random sampling; as T decreases, the differences between the components are magnified, the gaps between the resulting probabilities grow, and the method degenerates into the greedy method. Compared with the greedy method, this adds randomness, so the output is no longer fixed to one locally optimal solution, which increases the chance of reaching the global optimum when the greedy choice is not optimal. The third method, beam search, is a heuristic graph-search algorithm. If the word output at each step is taken as a node and all possibilities are considered, then with a fixed output length N the first step generates |D'| nodes and routes, the second step |D'|^2, and so on, until the final step generates |D'|^N routes; among these routes there must be one whose nodes form the global optimum. Because the solution space is huge, a breadth-first strategy is generally used to build the search tree in order to reduce the space and time of the search: at each depth expansion the nodes are sorted by heuristic cost or activation probability, a predetermined number of the highest-quality nodes is kept according to the beam-width parameter, the poorer remaining nodes are pruned, and only the retained nodes are expanded at the next level. This reduces space consumption and improves time efficiency, at the cost that the optimal solution may be discarded.
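Equation (1-16) and its two limiting behaviors can be checked directly; the score values below are illustrative.

```python
import math

def softmax_with_temperature(o, T=1.0):
    """Equation (1-16): divide each score by the temperature T before the
    exponential. Large T flattens the distribution toward uniform (random
    sampling); small T sharpens it toward the argmax (greedy)."""
    exps = [math.exp(v / T) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

scores = [3.0, 1.0, 0.5]
hot = softmax_with_temperature(scores, T=10.0)   # nearly uniform
cold = softmax_with_temperature(scores, T=0.1)   # nearly one-hot
print(max(hot), max(cold))  # the cold distribution is far more peaked
```

Sampling from the resulting distribution (e.g. with `random.choices(dictionary, weights=probs)`) gives the second selection method; the greedy and beam-search variants differ only in how many candidates they keep per step.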
Word segmentation module
From the training data, the model learns not only the conversion from Chinese natural-language questions to SPARQL statements, but also the word vectors of all words and the mapping relation between the word vectors of the Chinese dictionary and those of the SPARQL dictionary. Chinese sentences cannot be used as input to the neural network model without processing. Therefore, whether it is a Chinese natural-language question generated by the generator from a query template or a question received by the interpreter, it must be processed by the word segmentation module and converted into a space-separated Chinese word sequence before being input to the neural network model for training or inference.
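The patent does not name a specific segmentation algorithm; a minimal dictionary-based forward-maximum-matching sketch illustrates how a Chinese question is turned into the space-separated word sequence the model expects. The vocabulary and sentence are illustrative.

```python
def segment(sentence, vocab, max_word_len=4):
    """Dictionary-based forward maximum matching: at each position, take the
    longest vocabulary word that matches, falling back to a single character.
    Returns a space-separated word sequence for the neural model."""
    words, i = [], 0
    while i < len(sentence):
        for L in range(min(max_word_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + L]
            if L == 1 or piece in vocab:
                words.append(piece)
                i += L
                break
    return " ".join(words)

vocab = {"肺炎", "症状", "什么"}
print(segment("肺炎有什么症状", vocab))  # → "肺炎 有 什么 症状"
```

In practice an off-the-shelf segmenter (jieba is a common choice for Chinese) would replace this sketch, but the interface — question in, space-separated words out — is the same.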
The system further comprises: a pre-training module for generating a word vector matrix from an additional corpus and pre-training the embedding layer of the neural network model, so as to alleviate the word-mismatch problem.
The method adopts a negative-sampling algorithm to improve the training of uncommon words; after pre-training, the embedding layer enters the deep-learning-based SPARQL translation model for formal supervised training, during which the vectors of common words are trained more frequently. During embedding-layer pre-training, the system randomly selects a sequence of 2s+1 consecutive words from the corpus (s is the number of context words on each side of the target word), separates the middle word from the remaining words, and uses one as the model's training input and the other as the expected output. The embedding layer of the model converts the input word or words into the corresponding word vectors; the probability of the expected output is then computed through the hidden layer and the output layer in turn, the negative log-likelihood of that probability is taken as the model error, and back-propagation is used to correct the parameters of the hidden layer and, chiefly, the embedding layer, until the posterior probability of the model on the corpus is maximal, i.e. the error is minimal. At that point the word vector matrix of the embedding layer can be said to have learned the semantic knowledge in the corpus. After training, the hidden layer and output layer are discarded; the parameters of the embedding layer, namely the word vector matrix, are the result of pre-training.
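The 2s+1 sliding-window pair extraction described above can be sketched as follows; the corpus sentence is invented for illustration.

```python
import random

def cbow_training_pairs(tokens, s=2):
    """Slide a window of 2s+1 words over the corpus: the 2s surrounding words
    become the model input and the middle word the expected output."""
    pairs = []
    for i in range(s, len(tokens) - s):
        context = tokens[i - s:i] + tokens[i + 1:i + s + 1]
        pairs.append((context, tokens[i]))
    return pairs

def sample_window(tokens, s=2):
    """Randomly pick one 2s+1 window, as the system does during pre-training."""
    i = random.randrange(s, len(tokens) - s)
    return tokens[i - s:i] + tokens[i + 1:i + s + 1], tokens[i]

corpus = "pneumonia is an inflammation of the lung tissue".split()
pairs = cbow_training_pairs(corpus, s=2)
print(pairs[0])  # (['pneumonia', 'is', 'inflammation', 'of'], 'an')
```

Each (context, target) pair is then fed through embedding, hidden, and output layers, with the negative log-likelihood of the target driving back-propagation as described.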
The pre-training corpus differs from the training corpus of the deep-learning-based natural-language-to-SPARQL translation model: it can be any Chinese text and is very easy to obtain. However, the pre-training process also determines the Chinese dictionary, so after an arbitrary Chinese text has been selected and pre-training completed, the dictionary must be supplemented, based on the translation model's corpus, with Chinese words that appear only in the knowledge base and not in the pre-training corpus. Embedding-layer pre-training not only achieves the same purpose as expanding the query templates, but, because it directly provides a trained word vector matrix, also shortens the subsequent training of the natural-language-to-SPARQL translation model.
In this embodiment, a generic word2vec model is adopted and trained without supervision on a data set of about 800,000 Chinese medical dialogues, finally yielding the word vector of each word.
Correspondingly, the invention provides a method for generating SPARQL query statements in the medical field, which comprises:
s1, constructing a knowledge base for storing knowledge in the medical field and a query template base for storing a Chinese natural language question template and a SPARQL query template, wherein the knowledge comprises answers of the Chinese natural language question, and the two templates are in one-to-one correspondence.
And S2, extracting entities and attributes from the knowledge base, and respectively filling the entities and the attributes into a Chinese natural language question template and a SPARQL query template so as to generate a training data set.
And S3, performing word segmentation processing on the Chinese natural language question in the training data set.
And S4, training the neural network model according to the Chinese training set after word segmentation processing to obtain the trained neural network model.
And S5, performing word segmentation on the input target Chinese natural language question, and predicting the word-segmented target Chinese natural language question by using the trained neural network model to obtain a predicted SPARQL query sentence.
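Steps S1-S2 — pairing a Chinese question template with a SPARQL query template and filling both from the same knowledge-base entity — can be sketched as follows. The placeholder syntax, predicate name, and IRI are illustrative; the patent specifies only that the entity's Chinese label fills the question template and its IRI fills the query template.

```python
def fill_templates(entity_label, entity_iri, question_tpl, sparql_tpl):
    """Fill a Chinese question template and its paired SPARQL template with
    the same knowledge-base entity, yielding one training pair."""
    question = question_tpl.format(label=entity_label)
    sparql = sparql_tpl.format(iri=entity_iri)
    return question, sparql

# One template pair from a hypothetical "symptoms of a disease" query type.
question_tpl = "{label}有什么症状"
sparql_tpl = "Select ?x where {{ <{iri}> :diseaseSymptom ?x }}"
q, s = fill_templates("肺炎", "http://example.org/disease/pneumonia",
                      question_tpl, sparql_tpl)
print(q)
print(s)
```

Randomly sampling entities from the knowledge base and repeating this filling for every template pair in the query template library produces the training data set of step S2.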
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A system for generating SPARQL query sentences in the medical field, the system comprising:
the knowledge base is used for storing knowledge in the medical field, and the knowledge comprises answers of Chinese natural language question sentences;
the query template library is used for storing a Chinese natural language question template and an SPARQL query template, and the two templates correspond to each other one by one;
the generator takes the query template base and the knowledge base as input, is used for extracting entities and attributes from the knowledge base, and respectively fills the entities and the attributes into a Chinese natural language question template and a SPARQL query template so as to generate a training data set;
the word segmentation module is used for carrying out word segmentation processing on Chinese natural language question sentences in the training data set and forwarding word segmentation results to the learner; performing word segmentation processing on an input target Chinese natural language question sentence, and forwarding a word segmentation result to an interpreter;
the learner is used for training the neural network model according to the Chinese training set after word segmentation processing to obtain a trained neural network model;
and the interpreter is used for predicting the target Chinese natural language question sentence subjected to word segmentation by utilizing the trained neural network model to obtain a predicted SPARQL query sentence.
2. The system of claim 1, wherein the query template library comprises various types of query templates corresponding to various medical knowledge of diseases, the various medical knowledge comprising: causes of diseases, prevention of diseases, symptoms of diseases, examination of diseases, treatment of diseases, and care of diseases.
3. The system of claim 1 or 2, wherein the function of the generator is implemented by: and randomly sampling in the entity and attribute value lists, and respectively filling the IRI of the entity and the corresponding Chinese label into the placeholders of the two templates to obtain a pair of Chinese question sentences and SPARQL template sentences.
4. The system of any one of claims 1 to 3, wherein the neural network model is a Seq2Seq model.
5. The system of any one of claims 1 to 4, further comprising: and the pre-training module is used for generating a word vector matrix by using the extra corpus and pre-training the embedded layer of the neural network model.
6. A SPARQL query statement generation method in the medical field is characterized by comprising the following steps:
s1, constructing a knowledge base for storing knowledge in the medical field and a query template base for storing a Chinese natural language question template and a SPARQL query template, wherein the knowledge comprises answers of the Chinese natural language question, and the two templates correspond to each other one by one;
s2, extracting entities and attributes from the knowledge base, and respectively filling the entities and the attributes into a Chinese natural language question template and a SPARQL query template so as to generate a training data set;
s3, performing word segmentation processing on Chinese natural language question sentences in the training data set;
s4, training the neural network model according to the Chinese training set after word segmentation processing to obtain a trained neural network model;
and S5, performing word segmentation on the input target Chinese natural language question, and predicting the word-segmented target Chinese natural language question by using the trained neural network model to obtain a predicted SPARQL query sentence.
7. The method of claim 6, wherein the query template library comprises various types of query templates corresponding to various medical knowledge of diseases, and the various medical knowledge comprises: causes of diseases, prevention of diseases, symptoms of diseases, examination of diseases, treatment of diseases, and care of diseases.
8. The method according to claim 6 or 7, wherein step S2 is specifically: and randomly sampling in the entity and attribute value lists, and respectively filling the IRI of the entity and the corresponding Chinese label into the placeholders of the two templates to obtain a pair of Chinese question sentences and SPARQL template sentences.
9. The method of any one of claims 6 to 8, wherein the neural network model is a Seq2Seq model.
10. The method of any one of claims 6 to 9, further comprising, between step S3 and step S4: and generating a word vector matrix by using an additional corpus, and pre-training an embedded layer of the neural network model.
CN202010472760.8A 2020-05-28 2020-05-28 System and method for generating SPARQL query statement in medical field Pending CN111639254A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010472760.8A CN111639254A (en) 2020-05-28 2020-05-28 System and method for generating SPARQL query statement in medical field

Publications (1)

Publication Number Publication Date
CN111639254A true CN111639254A (en) 2020-09-08

Family

ID=72331249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010472760.8A Pending CN111639254A (en) 2020-05-28 2020-05-28 System and method for generating SPARQL query statement in medical field

Country Status (1)

Country Link
CN (1) CN111639254A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111813802A (en) * 2020-09-11 2020-10-23 杭州量之智能科技有限公司 Method for generating structured query statement based on natural language
CN112287093A (en) * 2020-12-02 2021-01-29 上海交通大学 Automatic question-answering system based on semi-supervised learning and Text-to-SQL model
CN112447300A (en) * 2020-11-27 2021-03-05 平安科技(深圳)有限公司 Medical query method and device based on graph neural network, computer equipment and storage medium
CN113537346A (en) * 2021-07-15 2021-10-22 思必驰科技股份有限公司 Medical field data labeling model training method and medical field data labeling method
CN114064820A (en) * 2021-11-29 2022-02-18 上证所信息网络有限公司 Table semantic query rough arrangement method based on hybrid architecture

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010137940A1 (en) * 2009-05-25 2010-12-02 Mimos Berhad A method and system for extendable semantic query interpretation
CN107748757A (en) * 2017-09-21 2018-03-02 北京航空航天大学 A kind of answering method of knowledge based collection of illustrative plates
CN109522393A (en) * 2018-10-11 2019-03-26 平安科技(深圳)有限公司 Intelligent answer method, apparatus, computer equipment and storage medium
US20200004831A1 (en) * 2018-06-27 2020-01-02 Bitdefender IPR Management Ltd. Systems And Methods For Translating Natural Language Sentences Into Database Queries

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SHENG ZHANG ET AL.: "Chinese Medical Question Answer Matching Using End-to-End Character-Level Multi-Scale CNNs" *
TOMMASO SORU ET AL.: "SPARQL as a Foreign Language" *
李东潮: "基于深度学习算法的中文文本与SPARQL的转换方法研究" *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111813802A (en) * 2020-09-11 2020-10-23 杭州量之智能科技有限公司 Method for generating structured query statement based on natural language
CN111813802B (en) * 2020-09-11 2021-06-29 杭州量之智能科技有限公司 Method for generating structured query statement based on natural language
CN112447300A (en) * 2020-11-27 2021-03-05 平安科技(深圳)有限公司 Medical query method and device based on graph neural network, computer equipment and storage medium
CN112447300B (en) * 2020-11-27 2024-02-09 平安科技(深圳)有限公司 Medical query method and device based on graph neural network, computer equipment and storage medium
CN112287093A (en) * 2020-12-02 2021-01-29 上海交通大学 Automatic question-answering system based on semi-supervised learning and Text-to-SQL model
CN113537346A (en) * 2021-07-15 2021-10-22 思必驰科技股份有限公司 Medical field data labeling model training method and medical field data labeling method
CN113537346B (en) * 2021-07-15 2023-08-15 思必驰科技股份有限公司 Medical field data labeling model training method and medical field data labeling method
CN114064820A (en) * 2021-11-29 2022-02-18 上证所信息网络有限公司 Table semantic query rough arrangement method based on hybrid architecture
CN114064820B (en) * 2021-11-29 2023-11-24 上证所信息网络有限公司 Mixed architecture-based table semantic query coarse arrangement method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200908