CN117349423A

CN117349423A - Template matching type knowledge question-answering model in water conservancy field

Info

Publication number: CN117349423A
Application number: CN202311438107.XA
Authority: CN
Inventors: 凌飞
Original assignee: Shanghai Xieji Technology Co ltd
Current assignee: Shanghai Xieji Technology Co ltd
Priority date: 2023-11-01
Filing date: 2023-11-01
Publication date: 2024-01-05

Abstract

The invention provides a template matching type knowledge question-answering model in the water conservancy field, which comprises a question acquisition module, a question analysis module, a knowledge graph query module and an answer generation module. The question acquisition module is used for receiving and outputting a question request input by a user. The question analysis module is used for receiving the question request output by the question acquisition module, carrying out semantic analysis on the question request, and obtaining and outputting a question template corresponding to the question request. The knowledge graph query module is used for receiving the question template output by the question analysis module, searching in a knowledge graph, and obtaining and outputting an answer corresponding to the question template. The answer generation module is used for receiving the answer output by the knowledge graph query module, processing the answer, and finally feeding the processed answer back to the user.

Description

Template matching type knowledge question-answering model in water conservancy field

Technical Field

The invention relates to the technical field of water conservancy knowledge question and answer, in particular to a template matching type knowledge question and answer model in the water conservancy field.

Background

In knowledge question-answering technology for the water conservancy field, an analysis of a large number of common questions collected from water conservancy websites and forums shows that the questions have obvious domain characteristics and raise the following problems.

(1) The water conservancy field involves many professional terms, such as 'gate station' and 'super warning water level', which can be identified only with a professional water conservancy domain dictionary.

(2) Place names and river names are common in the water conservancy field, such as 'double American bridge', 'Zhang Gubang' and 'Changshan gate'. These proper nouns carry no semantics of their own, which makes it difficult to recognize the intention of a question.

(3) Colloquial expressions such as "where" and "please ask" appear and add noise to the processing of the question.

(4) Synonymy, namely different words expressing the same meaning, can cause matches to be missed.

To address these problems, the prior art provides a sentence similarity algorithm tailored to the water conservancy field. It considers the surface-layer similarity of sentences based on word information and syntactic information, calculates sentence similarity at the grammatical and semantic levels by combining HowNet, comprehensively considers multiple kinds of information such as keywords, sentence length and semantics, and provides a fusion algorithm to improve the accuracy of sentence semantic recognition.

The sentence similarity algorithm specifically comprises the following steps:

Step 1: Question preprocessing. In light of the question features of the water conservancy question-answering system, the question is preprocessed by filtering colloquial words and removing interrogative words, stop words, polite words and the like.

Step 2: Keyword extraction. Keyword extraction is the word segmentation process. In English, a space separates words, whereas Chinese has no obvious separator between words; only paragraphs and sentences have simple boundaries.

Word segmentation algorithms fall roughly into three categories: algorithms based on string matching, algorithms based on understanding, and algorithms based on statistics. This scheme uses a segmentation algorithm based on string matching, also called mechanical word segmentation. The string to be analyzed is matched, according to a certain strategy, against the entries of a sufficiently large machine dictionary; if a string is found in the dictionary, the match succeeds and a word is identified.
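The string-matching segmentation described above can be illustrated with a minimal sketch. It assumes forward maximum matching as the "certain strategy", which the text does not specify, and it uses a hypothetical three-word dictionary with illustrative Chinese strings in place of a sufficiently large machine dictionary.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment text by greedily taking the longest dictionary entry at each position."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)  # unmatched single characters are the fallback
                i += size
                break
    return tokens


# Hypothetical toy dictionary: "polder region", "water surface rate", "greater than".
toy_dictionary = {"圩区", "水面率", "大于"}
print(forward_max_match("圩区水面率大于0.2", toy_dictionary))
```

Characters that match no entry, such as the digits here, fall through as single-character tokens; a practical system would handle numbers and punctuation separately.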

Step 3: Establishing a proper noun dictionary. In a question-answering system for the water conservancy field, many proper nouns appear; they are difficult to identify and increase the difficulty of word segmentation. Therefore, a water conservancy professional word segmentation dictionary is established. Following the characteristics of the GB2312 encoding, the dictionary stores the first character of each word in a two-dimensional array with 6763 units in total, and stores the one or more remaining parts of the words that share that first character in a HashTable. A one-to-many relation between proper nouns and the objects they refer to is built into the dictionary, and a data structure is defined to store this relation. When a word is identified as a proper noun, it is immediately converted into its referred objects, which are stored in the data structure.
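A minimal sketch of such a dictionary follows. As a simplifying assumption it replaces the GB2312-indexed two-dimensional array with a plain Python dict keyed by the first character; the class name, method names and example referents are hypothetical.

```python
class DomainDictionary:
    """Toy lexicon: words indexed by first character, plus proper-noun referents."""

    def __init__(self):
        self.tails_by_first_char = {}  # first character -> set of remaining characters
        self.referents = {}            # proper noun -> list of referred objects (one-to-many)

    def add_word(self, word, referents=()):
        self.tails_by_first_char.setdefault(word[0], set()).add(word[1:])
        if referents:
            self.referents.setdefault(word, []).extend(referents)

    def contains(self, word):
        return bool(word) and word[1:] in self.tails_by_first_char.get(word[0], set())

    def resolve(self, word):
        """Return the referred objects of a proper noun, or the word itself otherwise."""
        return list(self.referents.get(word, [word]))


lexicon = DomainDictionary()
lexicon.add_word("闸站")  # "gate station", an ordinary professional term
# Hypothetical referents for a proper noun ("Changshan gate").
lexicon.add_word("长山闸", referents=["Changshan gate sluice", "Changshan gate hydrological station"])
print(lexicon.contains("闸站"), lexicon.resolve("长山闸"))
```

Grouping by first character keeps each lookup to one index access plus one hash probe, which is the property the array-plus-HashTable layout in the text is aiming for.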

Step 4: Surface-layer similarity calculation. Taking the features of the water conservancy field into account, the surface-layer similarity is calculated from aspects such as the word forms, sentence structure and word order of the sentences.

Part-of-speech similarity expression:

where $WordPro(Q_1)$ and $WordPro(Q_2)$ are the weights of the keywords in sentences Q1 and Q2, respectively, and $SameWordPro(Q_1, Q_2)$ is the weight of the keywords shared by sentences Q1 and Q2.

The keyword set is obtained after word segmentation. In terms of vocabulary attributes, it may contain water conservancy professional terms, proper names, keywords and colloquial expressions. For example, in "the water surface rate of the polder region is greater than 0.2", the professional terms "polder region" and "water surface rate" obviously carry more information than the comparison word "greater than", so professional terms are given a larger weight. In terms of part of speech, the keywords may include nouns, verbs, adjectives, numerals and so on. Extensive practice shows that nouns and verbs account for the largest share of the information in a sentence, that is, the central information of the sentence is organized around nouns and verbs, and nouns are more important than verbs. Therefore, the keywords obtained after segmentation are weighted from two aspects, part of speech and vocabulary attribute, with different weights assigned to different parts of speech to increase accuracy.
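The weighting idea can be sketched as follows. The expression itself is not reproduced in this text, so the sketch assumes a Dice-style ratio, twice the shared weight over the sum of the two sentence weights, which is one form consistent with the three quantities defined above. The category names and weight values are hypothetical.

```python
# Hypothetical weights: professional terms count most, colloquial words least.
CATEGORY_WEIGHT = {"professional": 3.0, "proper_noun": 2.0, "comparison": 1.0, "colloquial": 0.2}


def weighted_keyword_similarity(keywords1, keywords2):
    """keywords1 and keywords2 map each keyword to its category.

    Assumed form: 2 * SameWordPro / (WordPro(Q1) + WordPro(Q2)).
    """
    def total_weight(keywords):
        return sum(CATEGORY_WEIGHT[category] for category in keywords.values())

    shared_weight = sum(CATEGORY_WEIGHT[keywords1[k]] for k in keywords1 if k in keywords2)
    denominator = total_weight(keywords1) + total_weight(keywords2)
    return 2 * shared_weight / denominator if denominator else 0.0


q1 = {"polder region": "professional", "water surface rate": "professional", "greater than": "comparison"}
q2 = {"polder region": "professional", "water surface rate": "professional", "exceeds": "comparison"}
print(weighted_keyword_similarity(q1, q2))
```

Because the two professional terms dominate the weights, the sentences score as highly similar even though their comparison words differ.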

In addition, three common rule-based measures are used. They focus on surface-form information of words, such as the order of keywords, the distance between keywords and the length of sentences, and determine the similarity between sentences through different combinations of word-level similarities.

Sentence length similarity:

where $Len(Q_1)$ and $Len(Q_2)$ are the numbers of keywords in the two sentences, respectively. Sentence length reflects the similarity of two sentences to a certain extent: the smaller the difference in length, the greater the similarity.

Word order similarity:

where $Rev(Q_1, Q_2)$ is the inversion number of the natural-number sequence formed by the positions in Q2 of the keywords of Q1, and $MaxRev(Q_1, Q_2)$ is the maximum inversion number of a natural-number sequence whose length n is the number of keywords shared by Q1 and Q2. The maximum inversion number is $MaxRev(Q_1, Q_2) = n(n-1)/2$.
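The inversion count and its maximum n(n-1)/2 can be made concrete with a short sketch. The word order expression itself is not reproduced in this text, so the sketch assumes the form 1 - Rev/MaxRev; the handling of fewer than two shared keywords is likewise an assumption.

```python
def inversion_count(sequence):
    """Count pairs (i, j) with i < j and sequence[i] > sequence[j]."""
    count = 0
    for i in range(len(sequence)):
        for j in range(i + 1, len(sequence)):
            if sequence[i] > sequence[j]:
                count += 1
    return count


def order_similarity(keywords1, keywords2):
    """Assumed form: 1 - Rev / MaxRev over the keywords shared by both sentences."""
    shared = [k for k in keywords1 if k in keywords2]
    n = len(shared)
    if n < 2:
        # Assumption: one shared keyword is trivially ordered, none means no similarity.
        return 1.0 if n == 1 else 0.0
    positions = [keywords2.index(k) for k in shared]  # positions in Q2, in Q1 order
    max_rev = n * (n - 1) // 2
    return 1.0 - inversion_count(positions) / max_rev


print(order_similarity(["a", "b", "c"], ["a", "c", "b"]))
```

With three shared keywords the maximum inversion number is 3, so one swapped pair gives 1 - 1/3.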

Distance similarity:

where $SameDis(Q_i)$ is the distance in $Q_i$ between the keywords shared by Q1 and Q2; if a keyword occurs repeatedly, the maximum distance is used. $Dis(Q_i)$ (i = 1, 2) is the distance between the leftmost and rightmost keywords of the sentence; if a keyword occurs repeatedly, the minimum distance is used.

The three rule-based surface-layer similarities (sentence length similarity, word order similarity and distance similarity) and the part-of-speech similarity, which takes the parts of speech of the keywords into account, are fused linearly to obtain the surface-layer similarity:

$$SynSim = \lambda_1 LenSim(Q_1, Q_2) + \lambda_2 OrdSim(Q_1, Q_2) + \lambda_3 DisSim(Q_1, Q_2) + \lambda_4 WordProSim(Q_1, Q_2) \tag{5}$$

where $\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 = 1$ and $\lambda_1 \ge 0.5 \ge \lambda_2 \ge \lambda_3 \ge \lambda_4$.
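As a small worked illustration of formula (5), the sketch below fuses four already-computed component similarities. The default weights are the parameter values that this document reports from its experiments (0.4, 0.2, 0.3, 0.1); the function name and the input values are hypothetical.

```python
def surface_similarity(len_sim, ord_sim, dis_sim, wordpro_sim, weights=(0.4, 0.2, 0.3, 0.1)):
    """Linear fusion of the four surface-layer similarities, as in formula (5)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    l1, l2, l3, l4 = weights
    return l1 * len_sim + l2 * ord_sim + l3 * dis_sim + l4 * wordpro_sim


# Hypothetical component scores for a pair of questions.
print(surface_similarity(len_sim=1.0, ord_sim=0.5, dis_sim=0.0, wordpro_sim=1.0))
```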

Step 5: Semantic similarity calculation. HowNet is adopted as the semantic knowledge resource for calculating semantic similarity. HowNet is a common-sense knowledge base that describes the concepts represented by Chinese and English words, and its basic content is the relationships between concepts and between the properties of concepts. To calculate the semantic similarity of sentences, the semantic similarity of words is calculated first, using the calculation tool provided by HowNet; the question is then analyzed, and the similarity between the question and the candidate questions in the question-answering system is calculated.

Let sentence Q1 contain m keywords $(K_{11}, K_{12}, K_{13}, \ldots, K_{1m})$ and sentence Q2 contain n keywords $(K_{21}, K_{22}, K_{23}, \ldots, K_{2n})$.

The vocabulary semantic similarity is calculated first. Each keyword of Q1 is selected in turn and its vocabulary similarity with each of the n keywords of Q2 is calculated, until all keywords of Q1 have been processed. This yields the m × n vocabulary semantic similarity matrix, whose entry in row i and column j is the similarity between $K_{1i}$ and $K_{2j}$.

The average maximum similarity between the first keyword set and the keywords of the second keyword set is then calculated (formula (7)).

The average maximum similarity between the second keyword set and the keywords of the first keyword set is calculated in the same way (formula (8)).

The results of formulas (7) and (8) are averaged to obtain the sentence semantic similarity.
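The matrix construction and the two average-maximum steps can be sketched as below. The word-level similarity is passed in as a function; the HowNet calculation tool is not called here, and the exact-match stand-in used in the example is an assumption for illustration only.

```python
def sentence_semantic_similarity(keywords1, keywords2, word_similarity):
    """Average of the row-wise and column-wise mean maximum word similarities."""
    if not keywords1 or not keywords2:
        return 0.0
    # m x n matrix: row i holds the similarities of keyword i of Q1 to every keyword of Q2.
    matrix = [[word_similarity(a, b) for b in keywords2] for a in keywords1]
    avg_row_max = sum(max(row) for row in matrix) / len(keywords1)           # Q1 -> Q2
    avg_col_max = sum(max(col) for col in zip(*matrix)) / len(keywords2)     # Q2 -> Q1
    return (avg_row_max + avg_col_max) / 2


def exact_match(a, b):
    """Stand-in word similarity; a real system would query a semantic resource."""
    return 1.0 if a == b else 0.0


print(sentence_semantic_similarity(["polder region", "greater than"],
                                   ["polder region", "exceeds"], exact_match))
```

Taking the maximum per row and per column lets each keyword find its best counterpart, and averaging both directions keeps the measure symmetric in the two sentences.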

Step 6: Algorithm fusion. Multiple kinds of information are considered comprehensively, including the sentence surface-layer similarity and the sentence semantic similarity. In view of the characteristics of the water conservancy field, the similarity of two sentences is defined as:

$$Sim = (1 - \eta)\,SynSim + \eta\,SemSim, \quad 0 \le \eta \le 1$$

Experimental verification was performed using the "Jiaxing river basin 500 questions" as the test question set. Through a large number of tests, the parameter values of the algorithm were determined as λ1 = 0.4, λ2 = 0.2, λ3 = 0.3, λ4 = 0.1 and η = 0.7, and the threshold was set to 0.65. When the calculated sentence similarity is greater than or equal to the threshold, the candidate question with the maximum similarity is regarded as the same question as the user's target question; otherwise, no candidate is regarded as similar to the target question.

Take the question "What are the polder regions with water surface rates greater than 0.2?" as an example,

with the candidate question set { "Which of the water surface rates in the polder regions are greater than 0.2?", "Find the polder regions with a water surface rate greater than 0.2.", "What are the polder regions with water surface rates exceeding 0.2?", "Which water surface rate of the polder regions is greater than 0.2?" }. The experimental results are shown in the table.

With this method, the similarity of sentences Q1 and Q2 reaches the threshold and meets the standard for an answer, which is basically consistent with human subjective judgment, and the accuracy in the water conservancy field is high compared with other similarity algorithms.
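The fusion and threshold decision can be sketched as follows, using the reported values η = 0.7 and threshold 0.65. The function names and the candidate scores are hypothetical; the component similarities are assumed to have been computed beforehand.

```python
def sentence_similarity(syn_sim, sem_sim, eta=0.7):
    """Sim = (1 - eta) * SynSim + eta * SemSim, with 0 <= eta <= 1."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    return (1 - eta) * syn_sim + eta * sem_sim


def best_match(scored_candidates, threshold=0.65):
    """scored_candidates: iterable of (question, syn_sim, sem_sim).

    Returns (question, score) for the most similar candidate, or (None, score)
    when even the best candidate falls below the threshold.
    """
    best_question, best_score = None, -1.0
    for question, syn_sim, sem_sim in scored_candidates:
        score = sentence_similarity(syn_sim, sem_sim)
        if score > best_score:
            best_question, best_score = question, score
    if best_score < threshold:
        return None, best_score
    return best_question, best_score


# Hypothetical scores for two candidate questions.
print(best_match([("candidate A", 0.5, 0.9), ("candidate B", 0.2, 0.3)]))
```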

However, errors remain: the semantics are still difficult to recognize, and the intention of a question may be misjudged.

Therefore, it is necessary to provide a template matching type knowledge question-answering model in the water conservancy field to solve the technical problems.

Disclosure of Invention

In order to solve the technical problems, the invention provides a template matching type knowledge question-answering model in the water conservancy field.

The invention provides a template matching type knowledge question-answering model in the water conservancy field, comprising:

the question acquisition module is used for receiving and outputting a question request input by a user, and the question request is presented in natural language form;

the question analysis module is used for receiving the question request output by the question acquisition module, carrying out semantic analysis on the question request, and obtaining and outputting a question template corresponding to the question request;

the knowledge graph query module is used for receiving the question template output by the question analysis module, searching in the knowledge graph, and obtaining and outputting an answer corresponding to the question template;

and the answer generation module is used for receiving the answer output by the knowledge graph query module, processing the answer and finally feeding the processed answer back to the user.

Preferably, the question analysis module includes:

the topic entity recognition module is used for extracting the topic entity and attribute information in the question request;

the question classification module is used for classifying the question request after the topic entity and the attribute information are extracted, and obtaining a classification result;

and the template matching module is used for matching a question template according to the classification result.

Preferably, the topic entity recognition module and the question classification module take a BERT pre-training model as an embedding model, identify the topic entity and attribute information of the question request through the BERT pre-training model, and judge the type of the question request.

Preferably, the question classification module classifies the question request by adopting a naive Bayesian classification model.

Preferably, the topic entity recognition module performs recognition based on a constructed entity dictionary and attribute dictionary, and recognizes the topic entity and attribute information of the question request by combining the entity dictionary and the attribute dictionary.

Preferably, the template matching module comprises a rule-based matching submodule and a TextCNN network-based matching submodule;

if the question request is a canonical sentence, the rule-based matching submodule is used;

if the question request is a non-canonical sentence, the TextCNN network-based matching submodule is used.

Preferably, the question analysis module further includes:

and the entity correction module is used for correcting an erroneous question request input by the user based on a constructed inverted index dictionary.

Preferably, the knowledge graph query module comprises a knowledge graph based on a Neo4j graph database and a Cypher query engine, so that the answer corresponding to the question request is retrieved from the Neo4j-based knowledge graph through the Cypher query engine.

Preferably, the question analysis module is further configured to:

combine the question template, the topic entity and the attribute information, and construct a Cypher query statement supported by the Neo4j graph database so as to meet the requirements of the Cypher query engine.

Preferably, the processing of the answer by the answer generation module includes performing string splicing, deduplication, sorting and fusion on the fine-grained information retrieved from the Neo4j graph database, and returning it to the user in the form of a natural language short text.

Compared with the related art, the template matching type knowledge question-answering model in the water conservancy field has the following beneficial effects:

1. The invention accurately identifies each character by means of part-of-speech tagging, and a professional domain dictionary and tagging strategy can be created for professional terms or proper names, thereby filling knowledge blind spots in the professional field.

2. The invention performs synonym conversion based on template matching to avoid ambiguity, which effectively increases the accuracy of natural language analysis to the level of accurately matching each character.

3. The invention splits the question sentence of the question request into a plurality of prompt words, so that the intention of the question is accurately identified.

4. The invention can accurately match each character and, for colloquial expressions, can effectively remove noise, eliminate ambiguity and accurately identify the intention of the question.

Drawings

FIG. 1 is a schematic diagram of the structure of the present invention;

FIG. 2 is a schematic diagram of a question-answering flow according to a first embodiment of the present invention;

FIG. 3 is a schematic diagram of a template matching knowledge graph question-answering model with shared BERT encoding according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a classification flow according to a first embodiment of the invention;

FIG. 5 is a graph showing the comparison of the change of the evaluation index under the problem of different number scales according to the first embodiment of the present invention;

FIG. 6 is a flow chart of a second embodiment of the present invention;

FIG. 7 is an example of part-of-speech tagging in accordance with a second embodiment of the present invention;

FIG. 8 is a problem template matching example according to a second embodiment of the present invention;

FIG. 9 is a template matching flow chart based on TextCNN according to a second embodiment of the present invention;

FIG. 10 is a flowchart of an answer search method according to a second embodiment of the invention.

Detailed Description

The invention will be further described with reference to the drawings and embodiments.

The water conservancy domain knowledge graph question-answering system constructed by this technical scheme serves flood-prevention staff. In the question-answering interaction scene, the question types are relatively clear and the question expressions relatively uniform, so the task can be regarded as knowledge graph question answering over a restricted water conservancy domain; a template matching method is therefore adopted.

Example 1

Referring to FIG. 1 and FIG. 2, the template matching type knowledge question-answering model in the water conservancy field provided by the invention mainly comprises a question acquisition module, a question analysis module, a knowledge graph query module and an answer generation module. The question acquisition module is used for receiving and outputting a question request input by a user, and the question request is presented in natural language form. The question analysis module is used for receiving the question request output by the question acquisition module, carrying out semantic analysis on the question request, and obtaining and outputting a question template corresponding to the question request. The knowledge graph query module is used for receiving the question template output by the question analysis module, searching in the knowledge graph, and obtaining and outputting an answer corresponding to the question template. The answer generation module is used for receiving the answer output by the knowledge graph query module, processing the answer and finally feeding the processed answer back to the user.

In this embodiment, the question analysis module takes a natural language question as input. On the basis of word segmentation and topic entity extraction, it understands the intention of the question through question classification and then performs question template matching to generate a Cypher query statement natively supported by the Neo4j graph database.

In this embodiment, the question analysis module mainly comprises a topic entity recognition module, a question classification module and a template matching module. The topic entity recognition module is used for extracting the topic entity and attribute information in a question request; the question classification module is used for classifying the question request after the topic entity and attribute information are extracted, and obtaining a classification result; the template matching module is used for matching a question template according to the classification result.

Specifically, in the topic entity recognition module, accurately parsing the question request and accurately extracting its topic entity is the prerequisite for knowledge question answering; the task is essentially named entity recognition on short question texts in the water conservancy field. For example, taking the water conservancy knowledge of the Jiaxing river basin (used as the example throughout the following embodiments), the user asks "What are the backbone river channels of the sea salt pond?". After recognition, the output structure ["sea salt pond", "backbone river"] is generated, where "sea salt pond" is the topic entity of the question and "backbone river" is the attribute information corresponding to the topic entity.
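The input and output of this step can be illustrated with a minimal sketch. It deliberately swaps in plain dictionary lookup with longest-match preference in place of the neural recognizer, so it shows only the shape of the result; the function name and the two tiny dictionaries are hypothetical.

```python
def extract_topic(question, entity_dictionary, attribute_dictionary):
    """Return [topic_entity, attribute] found in the question, or None for a missing slot."""
    def longest_hit(vocabulary):
        hits = [term for term in vocabulary if term in question]
        return max(hits, key=len) if hits else None  # prefer the longest matching term

    return [longest_hit(entity_dictionary), longest_hit(attribute_dictionary)]


entities = {"sea salt pond"}
attributes = {"backbone river", "embankment elevation"}
print(extract_topic("What are the backbone river channels of the sea salt pond?", entities, attributes))
```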

Specifically, in the question classification module, the common question types of the Jiaxing river basin are designed around five knowledge topics, namely Nantaitou, sea salt pond, polder regions, water conservancy specifications and urban defense engineering, by combining the instance triple information covered in the water conservancy knowledge graph with the business objects of the Jiaxing river basin, as shown in Table 1:

Table 1 Question classification example

Specifically, in the template matching module, a corresponding question template is designed for each type of Jiaxing river basin question. Some of the question templates are shown in Table 2, which lists question templates for the five topics. The topic entity tag ntt represents a Nantaitou entity, hyt represents a sea salt pond entity, wq represents a polder region entity, slgf represents a water conservancy specification entity, and cfgc represents an urban defense engineering entity.

Even for the same topic, the expressions input by users may differ, so matching based on character strings copes poorly with the knowledge question-answering scenario. For example, the question "What are the backbone river channels of the sea salt pond?" may also be expressed as "What are the main river channels of the sea salt pond?" or "What are the major branches of the sea salt pond?". All of these questions match the question template "What are the backbone river channels of river?". Therefore, for a Jiaxing river basin question whose topic entity has been extracted, the matching process is essentially the task of calculating the similarity between the question and the question templates and classifying the question.

Table 2 Question template example

In this embodiment, the knowledge graph query module includes a knowledge graph based on a Neo4j graph database and a Cypher query engine, so as to retrieve, by the Cypher query engine, an answer corresponding to the question request from the knowledge graph based on the Neo4j graph database.

Specifically, the knowledge graph query module realizes the information interaction between a Jiaxing river basin question and the Neo4j graph database: the question text is abstracted into a Cypher query statement for the Neo4j graph database, and the answer is retrieved from the Jiaxing river basin knowledge graph through the Cypher query engine. The generation of the Cypher query statement depends on the topic entity detected in the question analysis stage, the exact classification of the question, and the finally matched question template. The topic entity replaces the tagged part of the question template, and the Cypher language is then used to construct the structured query statement of the corresponding template.

Thus, mapping Jiaxing river basin question text to Cypher statements requires a one-to-one correspondence between question templates and Cypher queries. Table 3 below lists the Cypher structured query statements corresponding to some simple question templates; queries for complex questions can be accomplished by combining multiple Cypher query statements.

Table 3 Cypher query statement example

In this embodiment, the answer generation module is configured to receive the answer output by the knowledge graph query module, process the answer, and finally feed back the processed answer to the user, where the processing process is to perform string splicing, deduplication, ranking and fusion on the fine granularity information retrieved in the Neo4j graph database, and return the fine granularity information to the user in a form of natural language short text, so as to ensure consistency of the question and answer types.

For example, for the factoid question "What are the backbone river channels of the sea salt pond?", the question analysis and knowledge query stages produce the Cypher query "match (m1:river)-[r:relay]-(m2:river) where m1.Name = 'sea salt pond' return m2.Name" for the natural language question.

The query result in the Neo4j graph database is: "(suburban river, sixian bridge harbor, yu Fengtang, rumex harbor, momordica grosvenori, zhu Guqiao harbor, guozhong river, pickles harbor, tujinggang, sungbang harbor, baiyaojing harbor, changcheng harbor, baiyaogong harbor, lv Zhong harbor, rich Hong Tang, cannabis Jing harbor, wutong harbor, baiyaohe)".

The answer returned to the user is: the left-bank backbone river channels of the sea salt pond include suburban river, sixian bridge harbor, yu Fengtang, hai Qing Jiu harbor, momordica grosvenori pond, zhu Guqiao harbor, gu Cheng river, jiang Yuan harbor, tung Ji, su Ping harbor and Bai Data river; the right-bank backbone river channels of the sea salt pond include suburban, long-falling, pengcheng, baitengting, lv Zhong, hong Tang, heme, martial arts, and white ocean.
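The answer processing described above (deduplication, sorting and string splicing into a short natural language text) can be sketched as follows. The sentence pattern, the alphabetical sort order and the function name are assumptions for illustration; the river names in the example are placeholders.

```python
def generate_answer(entity, attribute, values):
    """Deduplicate, sort and splice retrieved values into one short sentence."""
    cleaned = (value.strip() for value in values)
    unique = list(dict.fromkeys(value for value in cleaned if value))  # order-preserving dedupe
    unique.sort()  # assumed ordering: alphabetical
    if not unique:
        return f"No {attribute} were found for the {entity}."
    return f"The {attribute} of the {entity} include: {', '.join(unique)}."


# Placeholder query results containing a duplicate and stray whitespace.
rows = ["B river", "A river ", "B river"]
print(generate_answer("sea salt pond", "backbone river channels", rows))
```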

According to the question-answering flow and task definition, the knowledge graph template matching question-answering task in the water conservancy field is decomposed into two key subtasks, topic entity extraction and question classification, and a knowledge graph question-answering model with shared BERT is provided. The model accurately identifies the topic entity in a water conservancy question, judges the type of the question, maps it to the corresponding question template and Cypher query, and replaces the corresponding concept words in the Cypher query with the identified topic entity to generate an executable Cypher statement.
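The last step, turning a matched template plus a topic entity into an executable query, can be sketched as below. The template key and function name are hypothetical. As a design choice that differs from literal text replacement, the sketch binds the entity through a Cypher parameter ($name), which avoids quoting problems when entity names contain special characters.

```python
# Hypothetical mapping from a question template key to a parameterized Cypher statement.
CYPHER_BY_TEMPLATE = {
    "river_backbone_channels":
        "MATCH (m1:river)-[r:relay]-(m2:river) WHERE m1.Name = $name RETURN m2.Name",
}


def build_query(template_key, topic_entity):
    """Return the Cypher text for the template and the parameters carrying the entity."""
    query = CYPHER_BY_TEMPLATE[template_key]
    return query, {"name": topic_entity}


query, params = build_query("river_backbone_channels", "sea salt pond")
print(query)
print(params)
```

The resulting pair can be handed to whichever Neo4j client the system uses; no client call is shown here because the text does not name one.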

Specifically, for the fine-grained knowledge graph question-answering task in the water conservancy field, a BERT pre-trained language model is used as the embedding model of the question request, and a template matching knowledge graph question-answering model with shared BERT encoding is provided. The overall architecture of the model is shown in FIG. 3 and comprises a topic entity extraction part and a question classification part.

The topic entity extraction task of water conservancy question answering is to correctly identify, from a question request, the keywords that carry the user's intention as the topic entity of the question; it is essentially a named entity recognition task on short question texts. Therefore, the same BERT-BiLSTM-CRF neural network model as in the information extraction task is adopted, and on this basis an external domain knowledge dictionary is introduced: water conservancy questions are segmented and part-of-speech tagged with the HanLP word segmentation tool to further identify the topic entities in the question. This avoids errors and omissions of BERT-BiLSTM-CRF when a short question contains multiple topic entities and improves the accuracy of topic entity recognition. The dictionary contains a large number of water conservancy professional term entities and labels their parts of speech; vocabulary examples and the part-of-speech tagging strategy of the water conservancy domain dictionary are shown in Table 4 below.

Table 4 Domain dictionary part-of-speech tagging strategy

Part-of-speech type	Part-of-speech tag	Vocabulary example
Entity	hyt	Sea salt pond
Attribute	attr	Embankment elevation
Date	date	September 1, 2023
Region	district	Bai Yangcun, Shen Dangzhen
Number	number	0.1, 0.2, 2.3
Comparison	compare	Greater than, less than

Because the HanLP word segmentation tool lacks the professional vocabulary of the Jiaxing river basin business scene, a water conservancy domain knowledge dictionary is introduced in light of the characteristics of the field, which improves the HanLP segmentation result. Table 5 below gives examples of word segmentation and part-of-speech tagging of Jiaxing river basin questions with the HanLP segmentation tool.

Table 5 Examples of word segmentation and part-of-speech tagging for Jiaxing river basin questions

On the basis of correctly acquiring the topic entity of a water conservancy question, a water conservancy question classification model is provided in order to further understand the actual user intention of the question request. The model takes a BERT pre-trained language model as the embedding layer of the question and then adopts a naive Bayesian classification model to classify the question.

Classifying water conservancy questions with a naive Bayesian classification model narrows the search range for a question. The overall question classification flow is shown in FIG. 4 and mainly comprises three stages: data preparation, model training, and question classification prediction.

The data preparation stage mainly divides the questions into types. With the assistance of water conservancy domain experts, as many different expressions as possible are listed for each type of question, and a question training corpus is constructed. The model training stage vectorizes the training corpus with the BERT pre-trained language model and then trains the classifier on the questions with a naive Bayesian network, finally obtaining a trained question classification model. The question classification prediction stage vectorizes the input question with the BERT pre-trained language model and then predicts the type of the Jiaxing river basin question with the trained naive Bayesian classification model.

Naive Bayes is a linear classifier based on Bayes' theorem that assumes the features are mutually independent, and it works well on small-sample classification tasks. Bayes' theorem, its theoretical basis, is defined by the following formula:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

The objective of question classification with a naive Bayes classifier is, given any related question, to predict the knowledge topic to which the question belongs. Assume that the Jiaxing river basin questions cover n topics in total, i.e., there are n question category labels $Y = \{y_1, y_2, y_3, \ldots, y_n\}$. For a Jiaxing river basin question sequence $X$, the probability that its topic is predicted to be $y_i$ is $P(y_i \mid X)$, defined by the following formula:

$$P(y_i \mid X) = \frac{P(y_i)\,P(X \mid y_i)}{P(X)}$$

The predicted topic $y_i$ is the one for which $P(y_i \mid X) > P(y_j \mid X)$, with $1 \le i, j \le n$ and $i \ne j$; that is, $P(y_i \mid X)$ is maximized. Since $P(X)$ is a constant, $P(y_i \mid X)$ is maximal exactly when $P(y_i)\,P(X \mid y_i)$ is maximal. When the specific probability $P(y_i)$ is unknown, it is assumed that $P(y_1) = P(y_2) = \cdots = P(y_n)$, because $P(y_i)$ has little influence on the calculation result. According to the naive Bayes assumption, i.e., the features are independent of each other, $P(X \mid y_i)$ is as shown in equation (3), where $x_1, \ldots, x_m$ denote the features of $X$:

$$P(X \mid y_i) = \prod_{k=1}^{m} P(x_k \mid y_i) \tag{3}$$

Thus, the naive Bayes classifier $f(X)$ is fully expressed as follows:

$$f(X) = \arg\max_{y_i \in Y} P(y_i) \prod_{k=1}^{m} P(x_k \mid y_i)$$
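The decision rule can be illustrated with a minimal sketch. It assumes bag-of-words token features with add-one smoothing in place of the BERT vectors used in the scheme, and the equal class priors stated above; the class name, the toy corpus and the topic labels are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter, defaultdict


class UniformPriorNaiveBayes:
    """Naive Bayes over tokens with equal class priors (illustrative only)."""

    def fit(self, samples):
        # samples: iterable of (token_list, label) pairs
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in samples:
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_likelihood(self, tokens, label):
        counts = self.token_counts[label]
        total = sum(counts.values())
        size = len(self.vocab)
        # log of prod_k P(x_k | y_i), with add-one (Laplace) smoothing
        return sum(math.log((counts[t] + 1) / (total + size)) for t in tokens)

    def predict(self, tokens):
        # argmax over labels; the equal priors cancel out of the comparison
        labels = list(self.token_counts)
        return max(labels, key=lambda y: self.log_likelihood(tokens, y))


# Hypothetical toy corpus: two topics, two questions each.
corpus = [
    (["sluice", "water", "level"], "sluice"),
    (["sluice", "gate", "location"], "sluice"),
    (["polder", "area", "manager"], "polder"),
    (["polder", "drainage", "capacity"], "polder"),
]
model = UniformPriorNaiveBayes().fit(corpus)
prediction = model.predict(["polder", "manager"])
```

Working in log space avoids numerical underflow when many per-feature probabilities are multiplied.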

According to the topic entity extraction result, the question request is abstracted, and the question is matched to the corresponding question template through question classification. Table 6 shows the template matching results, after question classification, for the example Jiaxing river basin questions of Table 5. For the question "Where is the geographic location of the South Platform Head project?", the concept label ntt replaces the topic entity "South Platform Head project" and attr replaces the attribute "geographic location", giving the abstract question description "ntt attr". After question classification, this abstract question is matched to the corresponding question template "[entity] + [attribute]", which realizes the conversion from a Jiaxing river basin natural language question to a question template.

Table 6 Question template matching examples

Experimental analysis

For model evaluation, three commonly used classification metrics, the precision P, the recall R and the F1 value, are selected to test the performance of the template matching type knowledge question-answering method. The evaluation indexes are calculated as follows, where TP, FP and FN denote the numbers of true positives, false positives and false negatives:

$$P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \cdot P \cdot R}{P + R}$$
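As a worked illustration of the three indexes, the following sketch computes them from raw counts. The function names and the sample counts are hypothetical; the second helper recomputes F1 from a precision and recall pair that are already expressed in percent.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (P, R, F1) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def f1_from(precision, recall):
    """F1 from precomputed precision and recall (same unit in and out)."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct predictions, 2 spurious, 2 missed.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

For example, `f1_from(88.98, 85.32)` rounds to 87.11, which matches the BiLSTM-CRF row of Table 7.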

The experimental results of the topic entity extraction task are shown in Table 7. Compared with the BERT-CRF model, the topic entity extraction performance (F1) of the model is improved by 1.47 percentage points.

Table 7 Topic entity extraction experiment results

Model	P(%)	R(%)	F1(%)
CRF	77.26	76.58	76.92
BiLSTM-CRF	88.98	85.32	87.11
BERT-CRF	90.37	88.23	89.29
BERT-BiLSTM-CRF	91.73	89.82	90.76

For the template matching task, the industry currently has no knowledge graph question-answer corpus for the Jiaxing river basin scenario. The scheme therefore combines the existing question templates, adopts the precision P, the recall R and the F1 value as evaluation indexes, and selects 3628 Jiaxing river basin question-answer pairs from the initially constructed Jiaxing river basin question-answer corpus for the question-answering experiments. Given the specific task of template matching type question answering and the complexity of reasoning-type and calculation-type questions, the question patterns of 2839 questions were adapted under the guidance of water conservancy domain experts: 223 questions on the South Platform Head knowledge topic, 178 on Sea Salt Pond, 1395 on water conservancy specifications, 875 on urban protection engineering, and 168 on the polder area. To avoid the influence of template type matching errors on the question-answering results, 10 overall experiments were carried out on the 2839 questions, with the questions randomly reordered in each experiment, and the mean of the 10 results is taken as the comprehensive evaluation index. The experimental results are shown in Table 8.

Table 8 Jiaxing river basin question template matching experiment results

Template type	P(%)	R(%)	F1(%)
South Platform Head	88.26	79.43	83.61
Sea Salt Pond	90.34	78.94	83.34
Polder area	86.32	81.75	83.98
Water conservancy specification	87.16	76.70	81.60
Urban protection engineering	89.32	83.21	85.15

To verify the influence of the scale of the Jiaxing river basin question-answer data set on the experimental results, the change of the three evaluation indexes under different numbers of Jiaxing river basin questions was further tested. The experimental results are shown in fig. 5.

The experimental results in Table 8 show that the current knowledge question-answering system answers South Platform Head, Sea Salt Pond, water conservancy specification and urban protection engineering questions well, whereas the accuracy on polder area questions still needs to be improved and is the subject of follow-up optimization of the scheme. The analysis is that South Platform Head, Sea Salt Pond, water conservancy specification and urban protection engineering questions are mainly simple factual questions, while some polder area questions involve multi-hop calculation over multi-level knowledge and therefore have more complex question scenarios. The experimental results of fig. 5 further show that the performance of the model improves as the scale of the question-answer corpus increases, so constructing a large-scale Jiaxing river basin question-answer corpus helps to advance research on deep neural network models for knowledge graph question answering in this field.

The scheme takes question classification as the key task and introduces the work related to template matching type water conservancy knowledge graph question answering. First, the overall flow of template matching type question answering based on question classification is summarized, and the task definition of each subtask is given. Second, a template matching type knowledge graph question-answering model sharing BERT is provided, in which an external domain dictionary is added to the topic entity recognition model and a naive Bayes classifier is adopted for the question classification task. Finally, question-answering experiments are carried out, verifying that the template matching type knowledge graph question-answering model has clear advantages in answering Jiaxing river basin questions.

Example two

In this embodiment, referring to fig. 1, the model is also mainly composed of four parts, namely a question acquisition module, a question analysis module, a knowledge graph query module and an answer generation module, where the question acquisition module is configured to receive and output a question request input by a user, and the question request is presented in a natural language form; the problem analysis module is used for receiving the problem request output by the problem acquisition module, carrying out semantic analysis on the problem request, and obtaining and outputting a problem template corresponding to the problem request; the knowledge graph query module is used for receiving the question template output by the question analysis module, searching in the knowledge graph, and obtaining and outputting an answer corresponding to the question template; and the answer generation module is used for receiving the answer output by the knowledge graph query module, processing the answer and finally feeding the processed answer back to the user.

The question analysis module of this embodiment likewise comprises a topic entity recognition module, a question classification module and a template matching module. The topic entity recognition module extracts the topic entity and attribute information in a question request; the question classification module classifies the question request after the topic entity and attribute information have been extracted, and obtains a classification result; the template matching module matches a question template according to the classification result.

In this embodiment, the data source is an established knowledge graph of the water conservancy domain, the knowledge graph is stored by using a Neo4j graph database, and the data is stored in the form of nodes, relationships and attributes, so that a user can query the knowledge graph by using the Cypher language.

Because the types of questions in the water conservancy field are limited and the entity names in the knowledge base are proper nouns, the system builds the knowledge base question-answering system by semantic parsing. The core idea is to parse the natural language question posed by the user, map it to a specific question template, convert it into a query statement, and then retrieve the answer from the graph database.

Referring to fig. 6, in this embodiment the user first inputs a natural language question. A Chinese word segmentation tool then segments the question request with the help of an auxiliary dictionary, while the topic entity recognition module recognizes the topic entities and attribute information contained in the dictionary. Next, it is determined whether a topic entity exists in the question sentence of the question request. If no entity exists, entity correction based on the inverted index dictionary is performed, and if the correction fails, the user is asked to re-enter the question. Otherwise the flow proceeds to question matching through the question classification module and the template matching module: it is determined whether attribute information exists in the question sentence; if so, rule-based question template matching is used directly, and if not, question template matching based on the TextCNN network is used. After matching is completed, the question template is converted into a knowledge base query statement, the answer is retrieved, and finally the answer is output.
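The branching of this flow can be sketched as follows. This is only an outline under stated assumptions: every processing step is passed in as a callable, and the function name, parameter names and stub return values are hypothetical stand-ins rather than the modules of the embodiment.

```python
def answer_question(question, recognize, correct_entity,
                    rule_match, model_match, run_query):
    """Route one question through the flow; all callables are injected stand-ins."""
    entity, attribute = recognize(question)
    if entity is None:
        # No topic entity found: try inverted-index correction first.
        entity = correct_entity(question)
        if entity is None:
            return "please re-enter the question"
    if attribute is not None:
        # Attribute present: canonical sentence, rule-based matching.
        template = rule_match(entity, attribute)
    else:
        # Attribute missing: non-canonical sentence, TextCNN-based matching.
        template = model_match(question, entity)
    answer = run_query(template, entity, attribute)
    return "query failure" if answer is None else answer


# Stub implementations used only to exercise the three branches.
stubs = dict(
    correct_entity=lambda q: None,
    rule_match=lambda e, a: "rule:[entity]+[attribute]",
    model_match=lambda q, e: "textcnn:[entity]+[attribute]",
    run_query=lambda t, e, a: f"{t} for {e}",
)
canonical = answer_question("q1", recognize=lambda q: ("gate A", "water level"), **stubs)
colloquial = answer_question("q2", recognize=lambda q: ("gate A", None), **stubs)
unknown = answer_question("q3", recognize=lambda q: (None, None), **stubs)
```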

In the topic entity and attribute information recognition process, an entity dictionary is built from all topic entities in the knowledge base and their alias attributes, an attribute dictionary is built from the attribute names of all topic entities, and expansion words are added. For example, expansion words such as "gate" and "water gate" are added for "gate station", and expansion words such as "super guaranteed water level" are added for the attribute "water level". The dictionary also sets custom parts of speech; for example, the part of speech of "gate station" is entity and the part of speech of "water level" is attr.

When entity recognition is carried out, the word segmentation dictionary is imported into the word segmentation tool jieba, and the natural language question input by the user is segmented and part-of-speech tagged. For the question "What is the water level of the {xxx} gate on August 7, 2023?", the processing result is shown in fig. 7.
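The effect of a custom dictionary with user-defined parts of speech can be illustrated without the segmentation tool itself. The sketch below is a stand-in, not jieba: it assumes whitespace-separated tokens and greedy longest-match lookup, and the function name, the lexicon contents and the tag "x" for unmatched tokens are hypothetical.

```python
def tag_question(question, dictionary, max_len=4):
    """Greedy longest-match tagging over whitespace tokens.

    dictionary maps a lowercase phrase to (canonical_name, tag).
    Tokens not covered by any dictionary phrase get the tag "x".
    """
    tokens = question.lower().replace("?", " ").split()
    result, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + size])
            if phrase in dictionary:
                result.append(dictionary[phrase])
                i += size
                break
        else:
            result.append((tokens[i], "x"))
            i += 1
    return result


# Hypothetical lexicon: expansion words map back to the canonical name.
lexicon = {
    "gate station": ("gate station", "entity"),
    "water gate": ("gate station", "entity"),
    "water level": ("water level", "attr"),
    "super guaranteed water level": ("water level", "attr"),
}
tagged = tag_question("What is the water level of the gate station?", lexicon)
```

Mapping each expansion word to its canonical name means later steps only ever see one spelling per entity or attribute.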

The entity correction module corrects erroneous question requests input by the user based on a constructed inverted index dictionary. Entities in the water conservancy domain knowledge base include many place names and river names containing rarely used characters, such as "double bridge polder zone", so erroneous input of entity names by the user must be taken into account. To enhance the robustness of dictionary matching, a fuzzy matching mechanism is provided: an inverted index dictionary that maps each character to the entity names containing it is built from all entity names in the knowledge base. For example, the inverted index of the dictionary {double bridge irrigation zone, double bridge surrounding zone, double bridge polder zone} is shown in Table 9.

Table 9 Inverted index example

Character	Inverted index dictionary
Double	{double bridge irrigation zone, double bridge surrounding zone, double bridge polder zone}
Beauty device	{double bridge polder zone}
Bridge	{double bridge irrigation zone, double bridge surrounding zone, double bridge polder zone}
Water-repellent device	{double bridge polder zone}
Zone	{double bridge irrigation zone, double bridge surrounding zone, double bridge polder zone}

When no entity or attribute is matched in the input question, the inverted index dictionary is used to obtain all entities or attributes corresponding to each character in the question, and the accumulated number of hits of each entity or attribute is counted. The three entities or attributes with the most hits are added to a candidate list, and the entity to be queried is determined through user interaction.
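The counting procedure can be sketched as follows, assuming a character-level index over the English renderings of the entity names and alphabetical tie-breaking. The function names and the misspelled query are hypothetical, and the final user-interaction step is left out.

```python
from collections import Counter, defaultdict


def build_inverted_index(names):
    """Map each non-space character to the set of names that contain it."""
    index = defaultdict(set)
    for name in names:
        for ch in name:
            if not ch.isspace():
                index[ch].add(name)
    return index


def suggest(question, index, top_k=3):
    """Return up to top_k names ranked by accumulated character hits."""
    hits = Counter()
    for ch in question:
        for name in index.get(ch, ()):
            hits[name] += 1
    # Sort by descending hit count, then alphabetically for stable ties.
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_k]]


entities = [
    "double bridge irrigation zone",
    "double bridge surrounding zone",
    "double bridge polder zone",
    "sea salt pond",
]
index = build_inverted_index(entities)
# "dubble" is a deliberate misspelling, so exact dictionary matching would fail.
candidates = suggest("dubble bridge polder zone", index)
```

Here the intended name collects 22 hits, the two sibling names 21 each, and the unrelated name 13, so the candidate list shown to the user starts with the intended entity.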

Because the forms of questions in the water conservancy field are limited, semantic parsing of the question is carried out by question template matching. Specifically, question templates are designed manually, questions are classified and matched by rules or by a neural network model, and each question matched to a template corresponds to a query statement. To cope with the complexity of questioning styles in the Chinese context, rule-based matching and matching based on the TextCNN network model are combined, so the template matching module includes a rule-based matching sub-module and a TextCNN network-based matching sub-module.

If the question request is a canonical sentence (equivalent to containing attribute information), the rule-based matching sub-module is used.

If the question request is a non-canonical sentence (equivalent to no attribute information being found), the TextCNN network-based matching sub-module is used.

For questions phrased in a relatively standard way, rule-based question template matching is simple and efficient. After processing by the word segmentation tool, the entities, attributes, attribute values, comparison words, maximum/minimum markers and so on in the question sentence are expressed through their parts of speech, and the corresponding query statement can be generated by matching the question template directly. The rule-based question templates can be divided into four types: single-entity single-attribute questions, single-entity multi-attribute questions, single-entity attribute maximum questions and single-entity attribute interval questions. The specific templates and example question sentences are shown in fig. 8.

Although rule-based question template matching is simple and efficient, it has difficulty parsing different phrasings of the same meaning in the Chinese context. For example, the questions "What are the backbone channels of Sea Salt Pond?" and "What are the main branches of Sea Salt Pond?" have the same meaning, and the corresponding answers should be the same. If no valid attribute name can be identified in such a question, rule-based matching fails. To cope with this situation, a well-performing TextCNN neural network model is introduced to perform question template matching.
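A rule-based matcher over part-of-speech tags might look like the sketch below. The concrete tag patterns are hypothetical, since the actual templates are given only in fig. 8; the tags "max", "cmp" and "value" are assumed names for maximum markers, comparison words and attribute values.

```python
from collections import Counter

# Hypothetical tag patterns for the four rule-based template types.
RULES = [
    (("entity", "attr"), "single-entity single-attribute"),
    (("entity", "attr", "attr"), "single-entity multi-attribute"),
    (("entity", "attr", "max"), "single-entity attribute maximum"),
    (("entity", "attr", "cmp", "value"), "single-entity attribute interval"),
]


def match_template(tags):
    """Return the template whose tag multiset equals the question's, else None."""
    # Ignore filler tokens tagged "x"; word order is deliberately not compared.
    slots = Counter(tag for tag in tags if tag != "x")
    for pattern, name in RULES:
        if slots == Counter(pattern):
            return name
    return None


single = match_template(["x", "x", "attr", "x", "entity"])
multi = match_template(["entity", "attr", "x", "attr"])
no_match = match_template(["x", "entity"])
```

Comparing tag multisets rather than sequences keeps the rules independent of word order, at the cost of not distinguishing questions that differ only in ordering.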

Specifically, the TextCNN network model is designed for text classification; it has a relatively simple structure and high accuracy, and is widely applied in natural language processing and recommendation. Let the input of the model be the text $T = (x_1, x_2, \ldots, x_n)$ containing n words, each word being a k-dimensional distributed representation. The input on which the convolution operates is formed as:

$$x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$$

where $\oplus$ is the concatenation operator and $x_{i:i+j}$ denotes the concatenation of $x_i, x_{i+1}, \ldots, x_{i+j}$, over which the convolution is performed.

Assume that the convolution kernel $w$ has length k and width h. The feature $c_i$ computed from the i-th word to the (i+h-1)-th word is then expressed as:

$$c_i = f(w \cdot x_{i:i+h-1} + b)$$

where $f$ is a nonlinear function and $b$ is a bias term. After the convolution operation of the convolution layer, the feature map $c$ of the sentence for this window size is obtained, as shown in the following formula:

$$c = [c_1, c_2, \ldots, c_{n-h+1}]$$

A max pooling operation is then carried out in the pooling layer, as shown in the following formula:

$$\hat{c} = \max\{c\}$$

The maximum value $\hat{c}$ obtained is the feature value corresponding to the convolution kernel, and classification can then be completed through the fully connected layer and the softmax layer.
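The convolution and max pooling steps can be traced by hand with a small sketch in plain Python. The word vectors, kernel weights and the choice of ReLU as the nonlinear function f are hypothetical values chosen for illustration.

```python
def conv_feature_map(x, w, b, f):
    """x: n word vectors of dimension k; w: h-by-k kernel.

    Returns the feature map [c_1, ..., c_{n-h+1}].
    """
    n, h = len(x), len(w)
    feature_map = []
    for i in range(n - h + 1):
        window = x[i:i + h]  # the h consecutive word vectors x_{i:i+h-1}
        s = sum(w[r][d] * window[r][d]
                for r in range(h) for d in range(len(w[r])))
        feature_map.append(f(s + b))
    return feature_map


def relu(v):
    return max(0.0, v)


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # n = 4 words, k = 2
w = [[1.0, -1.0], [0.5, 0.5]]                          # one kernel, h = 2
c = conv_feature_map(x, w, b=0.0, f=relu)              # n - h + 1 = 3 features
c_hat = max(c)                                         # max pooling over the map
```

With these values the three windows give 1.5, 0.0 and 0.0, so the pooled feature for this kernel is 1.5.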

In the TextCNN-based question template matching method, the question sentence after entity recognition is converted into word vectors by a word2vec model and then classified by the TextCNN network model. The question sentence is classified to a certain template, which corresponds to a knowledge base query statement, as shown in fig. 9.

The data set used for neural network training is constructed manually: different phrasings are collected for each question template, and synonym replacement is used to augment the data set.

In the training phase, the text is converted into an n × k word vector matrix using the word2vec model in the Python library gensim, where k is the dimension of the word vectors and n is the maximum sentence length. A TextCNN network model is then constructed with the Python deep learning library Keras and trained on the data set. In the TextCNN model, the convolution kernels have the same width as the word vectors but different heights; kernels with heights 3, 4 and 5 are selected. In the pooling layer, the maximum value of each feature map is extracted to represent that feature, and the scalars produced by the kernels of the same height are combined to form the feature vector. Finally, in the fully connected layer, ReLU is used as the activation function and the softmax function gives the probability of each class. L2 regularization and dropout are used throughout, and gradient descent is adopted to update the parameters and optimize the model.

In the application stage, after entity recognition is performed on the question sentence, the entity is replaced by its part of speech to form the text input. The text is converted into a word vector matrix by the word2vec model and input into the trained TextCNN network model, which outputs the corresponding question template; the template is then converted into a Cypher query statement to query the graph database for the answer.

The advantage of the TextCNN-based question template matching method is that it understands different phrasings of the same meaning more accurately. For example, for the questions "What are the backbone channels of [entity]?", "What are the main tributaries of [entity]?" and "What are the main channels of [entity]?" obtained after entity recognition, the standard attribute name cannot be extracted directly through attribute recognition, but TextCNN can effectively classify all three questions into the template "[entity] + [attribute: main river course]".
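Synonym replacement for data augmentation can be sketched as a simple substitution over a synonym table. The function name and the table contents are hypothetical; the phrases are taken from the example questions in this description.

```python
def augment_with_synonyms(question, synonyms):
    """Return the question plus one variant per applicable synonym substitution."""
    variants = [question]
    for phrase, alternatives in synonyms.items():
        if phrase in question:
            variants.extend(question.replace(phrase, alt) for alt in alternatives)
    return variants


# Hypothetical synonym table for one attribute.
synonym_table = {"backbone channels": ["main branches", "main channels"]}
variants = augment_with_synonyms(
    "what are the backbone channels of [entity]", synonym_table
)
```

All variants keep the label of the original question, so each substitution adds a training sample for the same template.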

The knowledge graph query module combines the question template with the specific entity name and attribute name to construct a Cypher query statement, and obtains the answer by querying the graph database. Cypher is a declarative graph database query language defined and implemented by Neo4j.

The Neo4j database is accessed using the Python toolkit py2neo. Each template corresponds to a Cypher statement that is constructed manually in advance. The recognized entity name and attribute name are passed into the Cypher statement as variables, so that the graph database can be queried and the answer obtained. For example, for the question template "[entity] + [attribute]", the corresponding Cypher query statement is "MATCH (n) WHERE n.name = {entity} RETURN n.attr", and the answer is obtained by passing in the entity name entity and the attribute name attr. If no answer can be found, "query failure" is returned.
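Filling a pre-built Cypher statement can be sketched as below, without contacting a database. Two assumptions are made: the entity name is passed as a query parameter written `$entity`, which is the parameter form used by current Neo4j versions in place of the `{entity}` placeholder, and the attribute name is checked against an identifier pattern because Cypher parameters cannot stand in for property names. The template table and function name are hypothetical.

```python
import re

# Hypothetical template table: one pre-built Cypher statement per template.
TEMPLATES = {
    "[entity]+[attribute]":
        "MATCH (n) WHERE n.name = $entity RETURN n.{attr} AS answer",
}
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_query(template, entity, attr):
    """Return (cypher_text, parameters) for a matched template."""
    if template not in TEMPLATES:
        raise KeyError(f"unknown template: {template}")
    # The property name is spliced into the text, so it must be a plain identifier.
    if not _IDENTIFIER.match(attr):
        raise ValueError(f"invalid attribute name: {attr!r}")
    return TEMPLATES[template].format(attr=attr), {"entity": entity}


query, params = build_query("[entity]+[attribute]", "gate A", "water_level")
```

The resulting text and parameter dictionary could then be handed to a Neo4j client; if the query returns no rows, the caller would return "query failure" as described above.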

The template matching type knowledge question-answering model in the water conservancy field has the following advantages.

(1) Template matching. Each character is accurately identified by means of part-of-speech tagging. For professional words or proper names, a professional domain dictionary and a tagging strategy can be created to fill knowledge gaps in the professional domain.

(2) Synonym transformation. For example, in the question "{polder region} administrator/manager", "administrator" and "manager" are synonyms referring to the same object, so the answer is the same. Synonym conversion of professional words or proper nouns is also supported, for example "overstock" and "overstock water level". Synonym conversion is carried out on the basis of template matching to avoid ambiguity, which effectively improves the accuracy of natural language parsing, down to accurate matching of each character.

(3) Keyword splitting. The question sentence can be split into a plurality of keywords so that the intention of the question is accurately identified. At present, the system splits a sentence into at most 5 keywords: an entity, attribute 1, attribute 2, a date (yyyy-mm-dd) and a floating-point number. The entity is generally the most critical word in a sentence and directly indicates which object the knowledge concerns. There are currently two attributes, and the number can be increased to three or four in order to distinguish sentences uniquely. The date and the number act as two further entity objects; they are related through the preceding keywords, and together they form a meaningful knowledge point.

(4) Support for colloquial expression. Because each character can be accurately matched, noise in spoken-style expressions can be effectively removed, ambiguity is eliminated, and the intention of the question is accurately identified.
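The keyword splitting described in item (3) can be sketched with regular expressions for the date and the number. The slot names, the `yyyy-mm-dd` date format and the input of already tagged (text, tag) pairs are assumptions made for illustration.

```python
import re

DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def extract_slots(question, tagged):
    """Fill the five slots from the raw question and its (text, tag) pairs."""
    entities = [text for text, tag in tagged if tag == "entity"]
    attrs = [text for text, tag in tagged if tag == "attr"]
    date = DATE_RE.search(question)
    # Remove the date first so its digits are not mistaken for the number slot.
    rest = DATE_RE.sub(" ", question)
    number = NUMBER_RE.search(rest)
    return {
        "entity": entities[0] if entities else None,
        "attr1": attrs[0] if len(attrs) > 0 else None,
        "attr2": attrs[1] if len(attrs) > 1 else None,
        "date": date.group(0) if date else None,
        "number": float(number.group(0)) if number else None,
    }


# Hypothetical question and tagging result.
slots = extract_slots(
    "Did the water level of gate A exceed 3.5 on 2023-08-07?",
    [("gate A", "entity"), ("water level", "attr")],
)
```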

The application designs a question-answering system for a knowledge graph in the water conservancy field. The system uses an auxiliary dictionary to perform entity recognition and attribute recognition on the natural language question input by the user, matches question templates by combining rules with a TextCNN network model, and finally generates a graph database query statement to retrieve the answer. It has reference value for the practical deployment of knowledge graphs in the water conservancy field.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flowchart and/or block of the flowchart illustrations and/or block diagrams, and combinations of flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Those of ordinary skill in the art will appreciate that all or part of the steps of the methods of the above embodiments may be implemented by a program instructing related hardware. The program may be stored in a computer-readable storage medium, including read-only memory (ROM), random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

Claims

1. A template matching type water conservancy domain knowledge question-answering model, characterized by comprising:

2. The template matching water conservancy domain knowledge question-answering model according to claim 1, wherein the question analysis module comprises:

3. The template matching type water conservancy domain knowledge question-answering model according to claim 2, wherein the topic entity identification module and the question classification module take a BERT pre-training model as an embedded model, identify topic entities and attribute information of a question request through the BERT pre-training model, and judge the type of the question request.

4. The template matching type water conservancy domain knowledge question-answering model according to claim 3, wherein the question classification module classifies the question request using a naive Bayes classification model.

5. The template matching type water conservancy domain knowledge question-answering model according to claim 2, wherein the topic entity recognition module recognizes the topic entity and attribute information of a question request based on a constructed entity dictionary and a constructed attribute dictionary.

6. The template matching type water conservancy domain knowledge question-answering model according to claim 5, wherein the template matching module comprises a rule-based matching module and a TextCNN network-based matching model;

7. The template matching water conservancy domain knowledge question-answering model according to claim 6, wherein the question analysis module further comprises:

8. The template matching type water conservancy domain knowledge question and answer model of claim 1, wherein the knowledge graph query module comprises a knowledge graph based on a Neo4j graph database and a Cypher query engine, so that answers corresponding to the question requests are searched in the knowledge graph based on the Neo4j graph database by the Cypher query engine.

9. The template matching water conservancy domain knowledge question-answering model according to claim 1, wherein the question analysis module is further configured to:

10. The template matching type water conservancy domain knowledge question-answer model of claim 8, wherein the answer generation module processes answers including string splicing, de-duplication, ordering and fusion of fine granularity information retrieved in Neo4j graph database, and returns to the user in the form of natural language short text.