CN113282729A - Question-answering method and device based on knowledge graph - Google Patents

Question-answering method and device based on knowledge graph

Info

Publication number
CN113282729A
CN113282729A (application CN202110632872.XA; granted publication CN113282729B)
Authority
CN
China
Prior art keywords
subject
text
word
question
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110632872.XA
Other languages
Chinese (zh)
Other versions
CN113282729B (en)
Inventor
潘璋
李长亮
李小龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Software Co Ltd
Original Assignee
Beijing Kingsoft Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Software Co Ltd
Priority to CN202110632872.XA
Publication of CN113282729A
Application granted
Publication of CN113282729B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 — Information retrieval of unstructured textual data
    • G06F 16/33 — Querying
    • G06F 16/332 — Query formulation
    • G06F 16/3329 — Natural language query formulation or dialogue systems
    • G06F 16/3331 — Query processing
    • G06F 16/334 — Query execution
    • G06F 16/3344 — Query execution using natural language analysis
    • G06F 16/36 — Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 — Ontology
    • G06F 40/00 — Handling natural language data
    • G06F 40/20 — Natural language analysis
    • G06F 40/237 — Lexical tools
    • G06F 40/242 — Dictionaries
    • G06F 40/30 — Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a question-answering method and apparatus based on a knowledge graph. The knowledge-graph-based question-answering method comprises the following steps: obtaining and analyzing a problem to be processed, and determining a problem subject term in the problem to be processed; obtaining a problem semantic feature vector and at least two reference semantic feature vectors; determining a target subject term from at least two reference subject terms based on the problem semantic feature vector and the at least two reference semantic feature vectors; and determining, based on the target subject term, a target text related to the target subject term from a pre-created knowledge graph, and determining an answer to the problem to be processed from the target text, wherein the knowledge graph comprises an association relation between reference subject terms and text titles. The target subject term determined by this scheme is more accurate, so the target text determined based on the target subject term, and the answer obtained from that text, are correspondingly more accurate.

Description

Question-answering method and device based on knowledge graph
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a knowledge graph-based question answering method and apparatus, a computing device, and a computer-readable storage medium.
Background
The knowledge-graph-based question-answering system mainly comprises two processes: knowledge graph construction and question-answer matching. The knowledge graph is usually constructed in advance, and question-answer matching is performed when a problem to be processed is received; the accuracy with which a target text related to the problem is determined during question-answer matching is closely tied to the accuracy of the answer obtained for the problem to be processed.
In the prior art, keywords are generally extracted from the problem to be processed and matched against the text keywords of a target text; the successfully matched text keyword replaces the problem keyword, and the problem to be processed containing the text keyword is then converted into a logical expression that can be queried in the knowledge graph. Because the knowledge graph includes the text keyword, the answer to the problem to be processed can be queried in the knowledge graph.
However, with the above method, the result produced by simple matching may be inaccurate: the successfully matched text keyword may not actually be related to the problem keyword, so the target text queried in the knowledge graph based on that irrelevant text keyword may be unrelated to the problem to be processed, which lowers the accuracy of the determined answer.
Disclosure of Invention
In view of this, embodiments of the present application provide a question-answering method and apparatus based on a knowledge graph, a computing device, and a computer-readable storage medium, so as to solve technical defects in the prior art.
According to a first aspect of the embodiments of the present application, there is provided a knowledge-graph-based question-answering method, including:
obtaining and analyzing a problem to be processed, and determining a problem subject term in the problem to be processed;
acquiring a problem semantic feature vector and at least two reference semantic feature vectors, wherein the problem semantic feature vector is a feature vector of the problem subject word, the reference semantic feature vectors are feature vectors of reference subject words in a pre-constructed reference dictionary, and the reference dictionary comprises at least two reference subject words;
determining a target subject term from the at least two reference subject terms based on the question semantic feature vector and at least two reference semantic feature vectors;
and determining a target text related to the target subject word from a pre-established knowledge graph based on the target subject word, and determining an answer to the to-be-processed question from the target text, wherein the knowledge graph comprises an association relation between a reference subject word and a text title.
According to a second aspect of embodiments of the present application, there is provided a knowledge-graph-based question-answering apparatus, including:
the first acquisition module is configured to acquire and analyze a problem to be processed and determine a problem subject term in the problem to be processed;
a second obtaining module, configured to obtain a problem semantic feature vector and at least two reference semantic feature vectors, wherein the problem semantic feature vector is a feature vector of the problem topic word, the reference semantic feature vectors are feature vectors of reference topic words in a pre-constructed reference dictionary, and the reference dictionary comprises at least two reference topic words;
a first determination module configured to determine a target subject term from the at least two reference subject terms based on the question semantic feature vector and at least two reference semantic feature vectors;
the second determination module is configured to determine a target text related to the target subject word from a pre-created knowledge graph based on the target subject word, and determine an answer of the to-be-processed question from the target text, wherein the knowledge graph comprises an association relation between a reference subject word and a text title.
According to a third aspect of embodiments herein, there is provided a computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the knowledge-graph based question-answering method when executing the instructions.
According to a fourth aspect of embodiments of the present application, there is provided a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the knowledge-graph based question-answering method.
According to a fifth aspect of embodiments of the present application, there is provided a chip storing computer instructions that, when executed by the chip, implement the steps of the knowledge-graph based question-answering method.
In the embodiment of the application, a problem to be processed is obtained and analyzed, and a problem subject term in the problem to be processed is determined; a problem semantic feature vector and at least two reference semantic feature vectors are acquired, wherein the problem semantic feature vector is a feature vector of the problem subject word, the reference semantic feature vectors are feature vectors of reference subject words in a pre-constructed reference dictionary, and the reference dictionary comprises at least two reference subject words; a target subject term is determined from the at least two reference subject terms based on the problem semantic feature vector and the at least two reference semantic feature vectors; and a target text related to the target subject word is determined from a pre-established knowledge graph based on the target subject word, and an answer to the to-be-processed question is determined from the target text, wherein the knowledge graph comprises an association relation between a reference subject word and a text title. According to the scheme, the problem subject word and the reference subject words are converted into vector representations: the problem semantic feature vector can accurately represent the semantics of the problem subject word, and the reference semantic feature vectors can accurately represent the semantics of the reference subject words. The target subject word determined based on these vectors is therefore more accurate, and the target text and the answer determined based on the target subject word are correspondingly more accurate.
Drawings
FIG. 1 is a block diagram of a computing device according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of a knowledge-graph based question-answering method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a knowledge-graph provided by an embodiment of the present application;
FIG. 4 is a flowchart illustrating a method for knowledge-graph based question answering applied to a policy question answering task according to an embodiment of the present application;
FIG. 5 is a schematic illustration of another knowledge-graph provided by an embodiment of the present application;
fig. 6 is a schematic structural diagram of a knowledge-graph-based question answering device according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The present application can, however, be implemented in many ways other than those described herein, and those skilled in the art can make similar modifications without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments of the present application to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application. The word "if," as used herein, may be interpreted as "responsive to a determination," depending on the context.
First, the terms involved in one or more embodiments of the present application are explained.
BERT (Bidirectional Encoder Representations from Transformers): a Transformer-based bidirectional encoder representation.
LAC (Lexical Analysis of Chinese): a lexical analysis tool that provides Chinese word segmentation, part-of-speech tagging, proper-name recognition, and similar functions.
Neo4j: a graph database.
KBQA: knowledge base interrogation.
K-means: and (3) based on the Euclidean distance clustering algorithm, the closer the distance between two targets is, the greater the similarity is.
FuzzyWuzzy: a toolkit for fuzzy string matching.
TF-IDF (term frequency-inverse document frequency): a statistical method for evaluating how important a word unit is to a text within a text collection or corpus. The importance of a word unit increases in proportion to the number of times it appears in a single text, but decreases in inverse proportion to the frequency with which it appears across the corpus.
Word unit: before any actual processing of the input text, it needs to be segmented into language units such as words, punctuation marks, numbers, or letters; these units are called word units. For English text, a word unit can be a word, a punctuation mark, a number, etc.; for Chinese text, the smallest word unit can be a single character, a punctuation mark, a number, etc.
Part-of-speech tagging model: performs sequence tagging on the word units of an input text to determine a part-of-speech tag for each word unit.
Encoding unit: encodes the input text to obtain a vector representation of the text.
Decoding unit: the decoding unit in this specification performs sequence labeling on an input vector sequence.
GRU (Gated Recurrent Unit): an improved recurrent neural network model that uses an update gate and a reset gate. These two gating mechanisms allow information in long sequences to be retained rather than cleared over time, which alleviates the long-term dependency problem of recurrent neural networks.
CRF (Conditional Random Field) network: a probabilistic graphical model that combines characteristics of the maximum entropy model and the hidden Markov model.
Next, an application scenario of the knowledge-graph-based question-answering method provided in the embodiment of the present application is explained.
Existing knowledge-graph based question-answering methods can be divided into two categories.
One is the symbolic-representation approach, implemented mainly through rule matching. When analyzing a problem to be processed, the knowledge base can apply semantic analysis: it understands the problem by identifying entities, distinguishing relations, and disambiguating entities, classifies the problem through keyword matching, converts it into a logical expression that can be queried in a knowledge graph, and finally queries the knowledge graph for the corresponding answer. However, this approach relies mainly on direct matching against predefined rule templates, or on feature extraction with custom templates followed by a machine learning algorithm that learns the problem to be processed and the knowledge-graph information; its applicable domains are therefore narrow, and it does not use the knowledge-graph information for semantic analysis.
The other is a distributed semantic-representation approach: during question-answer matching, the problem to be processed is analyzed as a distributed vector, the information in the knowledge graph is likewise vectorized, and the most suitable answer is found in the knowledge graph by comparing the similarity of the vectors. However, this approach requires substantial effort to label data, has poor interpretability, and has insufficient generalization and transfer capability.
In order to solve the above technical problems, the present specification provides a question-answering method based on a knowledge graph, which can obtain and analyze a problem to be processed and determine a problem subject word in the problem to be processed; acquire a problem semantic feature vector and at least two reference semantic feature vectors, wherein the problem semantic feature vector is a feature vector of the problem subject word, the reference semantic feature vectors are feature vectors of reference subject words in a pre-constructed reference dictionary, and the reference dictionary comprises at least two reference subject words; determine a target subject term from the at least two reference subject terms based on the problem semantic feature vector and the at least two reference semantic feature vectors; and determine a target text related to the target subject word from a pre-established knowledge graph based on the target subject word, and determine an answer to the to-be-processed question from the target text, wherein the knowledge graph comprises an association relation between a reference subject word and a text title. According to the scheme, the problem subject word and the reference subject words are converted into vector representations: the problem semantic feature vector can accurately represent the semantics of the problem subject word, and the reference semantic feature vectors can accurately represent the semantics of the reference subject words. The target subject word determined based on these vectors is therefore more accurate, and the target text and the answer determined based on the target subject word are correspondingly more accurate.
In the present application, a knowledge-graph-based question answering method and apparatus, a computing device, and a computer-readable storage medium are provided, which are described in detail in the following embodiments one by one.
FIG. 1 shows a block diagram of a computing device 100 according to an embodiment of the present application. The components of the computing device 100 include, but are not limited to, memory 110 and processor 120. The processor 120 is coupled to the memory 110 via a bus 130 and a database 150 is used to store data.
Computing device 100 also includes access device 140, access device 140 enabling computing device 100 to communicate via one or more networks 160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. Access device 140 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)) whether wired or wireless, such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present application, the above-mentioned components of the computing device 100 and other components not shown in fig. 1 may also be connected to each other, for example, by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 1 is for purposes of example only and is not limiting as to the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 100 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.
Wherein processor 120 may perform the steps of the knowledge-graph based question-answering method shown in fig. 2. Fig. 2 is a flowchart illustrating a method for knowledge-graph based question answering according to an embodiment of the present application, including steps 202 to 208.
Step 202: and acquiring and analyzing a problem to be processed, and determining a problem subject term in the problem to be processed.
In implementation, the obtained problem to be processed may include stop words such as question words and words indicating time, place, or organization, whose relevance to the topic of the problem is relatively low. To determine the answer to the problem to be processed, the problem needs to be analyzed to determine its topic, that is, to determine the problem subject word in the problem to be processed; the text related to the problem to be processed is then determined according to that problem subject word.
It should be noted that the stop word may refer to a word unrelated to the subject of the to-be-processed question, or a word that is less helpful or even not helpful in determining the answer to the to-be-processed question.
In an embodiment of the present application, a specific implementation of obtaining and analyzing a problem to be processed and determining the problem subject word in the problem to be processed may include: obtaining the problem to be processed, inputting the problem to be processed into a part-of-speech tagging model, and determining a part-of-speech tag for each word unit in the problem to be processed; and determining the problem subject word in the problem to be processed based on the part-of-speech tag of each word unit.
In particular implementations, the problem to be processed may be a question entered by a user, typically a sentence. Such a sentence may include words unrelated to its topic; these words do not help in determining the answer to the problem to be processed and only increase the processing burden on the device. Therefore, after the problem to be processed is obtained, part-of-speech tagging can be performed on each of its word units through the part-of-speech tagging model, that is, the part-of-speech tag of each word unit is determined.
For example, assuming that the part-of-speech tag corresponding to the question topic word may be a topic word, taking the example that the to-be-processed question is "what the policy of XX mechanism about a certain item" as an example, the to-be-processed question is input into the part-of-speech tagging model, and if the part-of-speech tag of "a certain item" is a topic word, it may be determined that the question topic word of the to-be-processed question is "a certain item".
The part-of-speech tagging model is obtained by training a plurality of sample texts with part-of-speech tags in advance, and can tag the part of speech of a word in an input text. The specific implementation process of the part-of-speech tagging performed by the part-of-speech tagging model can be referred to the following related description.
In the embodiment of the application, after the problem to be processed is obtained, it can be analyzed through the part-of-speech tagging model to determine the problem subject word, that is, the key content of the problem to be processed. Subsequent processing can then operate on the problem subject word alone rather than on the whole problem to be processed, so the device processes less data and its processing burden is reduced.
In some embodiments, the part-of-speech tagging model includes an encoding unit, a gated recurrent unit, and a decoding unit. In that case, inputting the to-be-processed question into the part-of-speech tagging model and determining the part-of-speech tag of each word unit in the to-be-processed question may include: inputting the problem to be processed into the encoding unit to obtain a word vector sequence of the problem to be processed; inputting the word vector sequence into the gated recurrent unit to obtain a problem vector sequence of the problem to be processed; and inputting the problem vector sequence into the decoding unit to determine the part-of-speech tag of each word unit.
The encoding unit encodes the input problem to be processed; the gated recurrent unit extracts a feature representation from the word vector sequence output by the encoding unit; and the decoding unit performs sequence labeling, using the tag sequence as the supervision signal.
In a specific implementation, a problem to be processed may be input to an encoding unit, word segmentation processing may be performed on the problem to be processed to obtain a plurality of word units of the problem to be processed, then one-hot encoding may be performed on each word unit to obtain a vector representation of each word unit, a sequence formed by vector representations of the plurality of word units is referred to as a vector sequence of the problem to be processed, and then the vector sequence is converted into a word vector sequence.
In a specific implementation, the gated recurrent unit may be a GRU. The word vector sequence is input to the gated recurrent unit, which performs forward and backward propagation over each word vector in the sequence to obtain, for each word vector, a new representation combined with context information, thereby producing the problem vector sequence.
In a specific implementation, the decoding unit may be a serialization tagging network, and the serialization tagging network may be a CRF network, and the problem vector sequence is input into the decoding unit, so that the problem vector sequence may be subjected to sequence tagging, and further, part-of-speech tags of each problem vector may be determined.
As an example, an IOB tagging scheme may be employed, i.e., with X-B as the beginning of a word of type X, X-I as the continuation of a word of type X, and O as the word of no interest. Moreover, a correspondence table between the labels of the word units and the part-of-speech tags may be created in advance, and after the label of each word unit is determined, the part-of-speech tag of each word unit may be determined from the correspondence table according to the label.
For example, assuming that the to-be-processed question is "what is the policy of the XX mechanism about a certain item", inputting the to-be-processed question into the part-of-speech tagging model may determine that the four characters making up the name of that item are tagged M-B, M-I, M-I and M-I respectively; since the part-of-speech tag corresponding to the label M is "subject word", the problem subject word of the to-be-processed question can be determined to be that item.
In the embodiment of the application, the problem to be processed can be subjected to sequence tagging through the part-of-speech tagging model so as to determine the part of speech of each word unit in the problem to be processed and the part-of-speech tag of each word unit, and further determine the problem subject word of the problem to be processed. Because the part-of-speech tagging model is a trained model with better performance, the problem subject word in the problem to be processed can be accurately determined, so that the text corresponding to the problem subject word can be conveniently determined.
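The encoding-unit / gated-recurrent-unit / decoding-unit pipeline described above can be sketched as follows. This is a minimal illustration under assumed dimensions, with a plain linear classifier standing in for the CRF decoding layer the embodiment describes; it is not the patent's actual model.

```python
import torch
import torch.nn as nn

class PosTaggingModel(nn.Module):
    """Minimal sketch of the described part-of-speech tagging model:
    an encoding unit (embedding), a bidirectional gated recurrent unit,
    and a decoding unit that assigns a label to every word unit.
    A full implementation would use a CRF decoding layer as described."""

    def __init__(self, vocab_size: int, num_labels: int,
                 embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)      # encoding unit
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True,
                          bidirectional=True)                      # gated recurrent unit
        self.decoder = nn.Linear(2 * hidden_dim, num_labels)       # decoding unit (CRF omitted)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)   # word vector sequence
        context, _ = self.gru(embedded)        # problem vector sequence with context
        return self.decoder(context)           # per-word-unit label scores

# Hypothetical usage: token ids for the word units of a problem to be processed.
model = PosTaggingModel(vocab_size=5000, num_labels=7)
scores = model(torch.randint(0, 5000, (1, 12)))   # (batch, seq_len, num_labels)
labels = scores.argmax(dim=-1)                    # indices of tags such as M-B, M-I, O
```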
It should be noted that the above describes the process of determining the problem subject word by taking the part-of-speech tagging model as an example; in another embodiment of the present application, the problem subject word may instead be determined from the problem to be processed using the LAC tool, which is not limited in this application.
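A sketch of this LAC-based alternative, assuming Baidu's open-source LAC package and its run interface; the tag filter used to pick candidate subject words is illustrative, not the patent's rule.

```python
# pip install lac  -- Baidu's Lexical Analysis of Chinese toolkit
from LAC import LAC

lac = LAC(mode='lac')  # word segmentation + part-of-speech / proper-name tagging

question = "XX机构关于某事项的政策是什么"   # a problem to be processed (illustrative)
words, tags = lac.run(question)             # assumed to return parallel word and tag lists

# Keep the word units whose tag marks them as candidate subject words;
# the tag set below is an illustrative assumption.
subject_words = [w for w, t in zip(words, tags) if t in ("n", "nz", "nw")]
print(subject_words)
```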
In the embodiment of the present application, the question-answering process may be implemented based on a knowledge graph, and the knowledge graph may be created in advance. Therefore, before obtaining and analyzing the problem to be processed, the method further comprises: constructing the reference dictionary based on the obtained plurality of texts; creating the knowledge-graph based on text titles of the plurality of texts and reference subject words in the reference dictionary.
A knowledge graph is a graph that contains pieces of information and displays the association relations between them through its structure.
In a specific implementation, a plurality of texts can be obtained from a text library, a reference dictionary is constructed based on the plurality of texts, and the association relations between the reference dictionary and the text titles of the plurality of texts are represented by a knowledge graph.
As an example, a subject word in a text title of each text may be extracted according to a preset rule, and a reference dictionary may be constructed based on a plurality of subject words. That is, the subject words extracted from the text titles may be grouped together to form a reference dictionary. Also, the reference dictionary may be related to a certain field, and stored therein are subject words that may be used in the text of the field.
For example, assuming that the plurality of texts are policy-related texts, the content inside quotation marks in text titles such as "Opinions on XX" can be extracted by rules, and the extracted content is used as a subject word of the policy field; subject words for a plurality of policy areas can be extracted in this way, and a reference dictionary of the policy field can be constructed based on them. For example, from the title of the "XX agency" guiding opinions on accelerating the work of pushing forward "internet + government affairs service", the phrase "internet + government affairs service" is extracted as a subject term.
In some embodiments, before constructing the reference dictionary based on the obtained plurality of texts, the method further includes:
acquiring a pre-constructed initial dictionary;
accordingly, a specific implementation of constructing the reference dictionary based on the obtained plurality of texts may include: matching the text titles of the texts with initial subject words in the initial dictionary, determining a first text title which does not contain the initial subject words, and determining a reference subject word of the first text title; and adding the reference subject word of the first text title to the initial dictionary to obtain the reference dictionary.
In a specific implementation, an initial dictionary may be constructed in advance by means of manual sorting or machine extraction, and the initial dictionary includes a plurality of initial subject words. On the basis of obtaining the initial dictionary, if a plurality of texts are obtained, the text titles of the plurality of texts can be matched with the initial subject words in the initial dictionary, if the text title of a certain text does not include the initial subject words in the initial dictionary, the text title is called a first text title for convenience of description, the reference subject words of the first text title can be determined, the reference subject words are added to the initial dictionary, and the initial dictionary to which new reference subject words are added is called a reference dictionary.
It should be noted that the initial dictionary and the reference dictionary in the present application are dictionaries corresponding to the same domain, and the domain is the same as the domain to which the problem to be processed belongs.
As an example, for a reference text in a plurality of texts, matching a text title of the reference text with an initial subject word may be considered as performing similarity calculation between the text subject word in the text title of the reference text and the initial subject word, and if the similarity is not 1, it may be considered that the initial subject word is not included in the text title of the reference text, and the text title of the reference text may be called a first text title.
In the embodiment of the application, an initial dictionary is pre-constructed through self-defined rules, a plurality of texts are then obtained, and the text titles are matched with the initial subject words in the initial dictionary. For a first text title that does not include any subject word of the initial dictionary, the reference subject word of that first text title can be determined; since this reference subject word does not yet exist in the initial dictionary, it is added to the initial dictionary, and the initial dictionary with the new reference subject words added is called the reference dictionary. In this way, the subject words in the dictionary can be enriched and the dictionary made more complete, covering as many subject words as possible, so that when various problems to be processed are received, the subject words related to their problem subject words can be determined quickly.
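A minimal sketch of this dictionary-extension step, under the assumption that "contains an initial subject word" is tested by simple substring matching; determine_reference_subject_word is a placeholder for the clustering / TF-IDF procedure described in the later embodiments.

```python
def determine_reference_subject_word(title: str) -> str:
    # Placeholder: in the embodiments this comes from clustering + TF-IDF
    # or directly from the title's own text subject word.
    return title

def extend_dictionary(initial_dictionary: set[str], titles: list[str]):
    """Split titles into second text titles (already covered by the initial
    dictionary) and first text titles (not covered), then extend the
    dictionary with reference subject words for the uncovered ones."""
    first_titles, second_titles = [], []
    for title in titles:
        if any(word in title for word in initial_dictionary):
            second_titles.append(title)   # second text title: contains an initial subject word
        else:
            first_titles.append(title)    # first text title: no initial subject word

    reference_dictionary = set(initial_dictionary)
    for title in first_titles:
        reference_dictionary.add(determine_reference_subject_word(title))
    return reference_dictionary, first_titles, second_titles
```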
As an example, the reference subject word of the first text title may be the text subject word of the first text title or may be a newly determined subject word, and determining the reference subject word of the first text title may include the following two implementations according to the number of the first text titles.
In one possible implementation, determining a specific implementation of the reference subject word for the first text title may include: if the number of the first text titles is at least two, clustering the at least two first text titles, and dividing the first text titles into N categories, wherein N is a positive integer greater than 0, and each category comprises at least one first text title; acquiring a text subject term of at least one first text title in each category; and determining the subject term of the category based on at least one text subject term of the same category, and determining the subject term of the category as a reference subject term of each first text title in the category.
In a specific implementation, when the number of the first text titles is at least two, the at least two first text titles may be clustered by a clustering method, the at least two first text titles may be divided into N categories, a text subject word of each first text title in each category is determined by a TF-IDF method, a subject word corresponding to the category is determined from the text subject words of the same category, and the subject word corresponding to the category is determined as a reference subject word of each first text title in the category.
In implementation, the first text titles may be divided according to the similarity of the text titles, and the similarity of the first text titles in the same category is higher, which indicates that the topics of text expressions corresponding to the first text titles are relatively similar, even the topics of the text expressions may be the same, that is, the first text titles in the same category correspond to the same subject word.
In some embodiments, the at least two first text titles may be clustered by a K-means clustering algorithm. As an example, the at least two first text titles may be input into a Bert model for semantic encoding to obtain a semantic feature vector of each first text title. N first text titles are then randomly selected as category text titles; the similarity between each of the other first text titles and the N category text titles is determined based on the semantic feature vectors, and each of the other first text titles is assigned to the category of the category text title with the greatest similarity. The title vector of each category is then re-determined, and the similarity between the vectors of the first text titles not yet assigned and each title vector is determined so as to assign them to one of the N categories; by repeating this process, the at least two first text titles can be divided into the N categories.
In some embodiments, for any reference category, if the reference category includes at least two first text titles, a word segmentation process may be performed on each first text title in the reference category through the LAC tool, and a text subject word of each first text title is obtained, and then a subject word corresponding to the category is determined from the at least two text subject words through a TF-IDF method, and the subject word corresponding to the category is determined as the reference subject word of each first text title in the category.
As an example, the inverse document word frequency of each text subject word over the texts corresponding to the at least two first text titles may be determined, and the document word frequency of each text subject word within the text corresponding to its own title may be determined. The product of the document word frequency and the inverse document word frequency of each text subject word is taken as the probability that it is the subject word of the category; the text subject word with the greatest probability is determined as the subject word corresponding to the category, and that subject word is determined as the reference subject word of each first text title in the category. For example, assuming that the reference category includes 3 first text titles, and the subject word of first text title A is "XXX", the subject word of first text title B is "XXXX", and the subject word of first text title C is "XX", the subject word of first text title A may be determined by the TF-IDF method as the subject word corresponding to the reference category, and "XXX" is determined as the reference subject word of first text title A, first text title B, and first text title C.
In this implementation manner, if the number of the first text titles is at least two, the first text titles may be clustered through an unsupervised method, and the reference subject word of the first text title is determined through the TF-IDF method, and the reference subject word is not required to be determined through a training model, so that data does not need to be labeled to obtain training data, and the data labeling cost is reduced.
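A minimal sketch of this clustering-plus-TF-IDF selection, using scikit-learn's KMeans and TfidfVectorizer as stand-ins for the procedure in the embodiment; the choice of the highest-scoring TF-IDF term per cluster is a simplification, and for Chinese titles a word-segmentation tokenizer (e.g. LAC) would need to be supplied.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_and_pick_subject_words(first_titles, title_vectors, n_clusters):
    """Cluster first text titles on their semantic feature vectors, then pick
    one reference subject word per cluster via TF-IDF over that cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(title_vectors)

    cluster_subject_words = {}
    for c in range(n_clusters):
        members = [t for t, l in zip(first_titles, labels) if l == c]
        # TF-IDF over the titles of one cluster; the highest-scoring term is
        # taken as that cluster's subject word (a simplification of the
        # document-frequency / inverse-document-frequency product above).
        vectorizer = TfidfVectorizer()   # pass tokenizer=... for Chinese text
        tfidf = vectorizer.fit_transform(members)
        scores = np.asarray(tfidf.sum(axis=0)).ravel()
        terms = vectorizer.get_feature_names_out()
        cluster_subject_words[c] = terms[int(scores.argmax())]

    # Every first text title in a cluster shares that cluster's subject word.
    return {t: cluster_subject_words[l] for t, l in zip(first_titles, labels)}
```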
In another possible implementation manner, determining a specific implementation of the reference subject word of the first text title may include: if the number of the first text titles is one, acquiring the text subject terms of the first text titles, and determining the text subject terms as the reference subject terms of the first text titles.
In a specific implementation, when the number of the first text titles is one, the text subject word of the first text title may be obtained, and the text subject word of the first text title may be determined as the reference subject word of the first text title. For example, assuming that the subject term of the first text title is "XXX", the subject term "XXX" may be determined as the reference subject term of the first text title.
In this implementation, if the number of the first text titles is one, it is described that the reference dictionary may lack the subject term corresponding to the first text title, so that the text subject term corresponding to the first text title may be determined, the text subject term is determined as the reference subject term and added to the initial dictionary to obtain the reference dictionary, which may enrich the subject terms in the dictionary, make the dictionary more complete, cover as many subject terms as possible, and facilitate to quickly determine the subject term related to the problem subject term of the problem to be processed when receiving various problems to be processed.
Further, after matching the text titles of the plurality of texts with the initial subject words in the initial dictionary, the method further includes: determining a second text title with the initial subject word, and determining the initial subject word in the second text title as a reference subject word of the second text title;
accordingly, creating the knowledge graph based on the text titles of the plurality of texts and the reference subject words in the reference dictionary comprises: taking the text titles of the texts and the reference subject words in the reference dictionary as nodes, and connecting each text title to the reference subject word with which it has an association relation, to obtain the knowledge graph.
In a specific implementation, if a text title including an initial subject word exists in text titles of a plurality of texts, for convenience of description, the text title including the initial subject word may be referred to as a second text title, and the initial subject word included in each second text title may be used as a reference subject word of the second text title. Thus, each reference subject word in the reference dictionary can be ensured to have a corresponding text title.
In this embodiment of the present application, an initial dictionary may be pre-constructed, and first text titles and second text titles may then be determined from a plurality of text titles. For a second text title, the initial subject word included in it may be used as the reference subject word of that second text title; for a first text title, the reference subject word may be determined by a clustering method, and the determined reference subject word is added to the initial dictionary to obtain the reference dictionary. In this manner, a dictionary including a relatively large number of subject words can be constructed.
In some embodiments, the relationship between the subject word and the text title may also be represented in the form of a knowledge graph. Therefore, the knowledge graph can be obtained by taking the text titles of a plurality of texts and the reference subject words in the reference dictionary as nodes and connecting each text title with the reference subject word.
As an example, it is assumed that the number of the plurality of texts is 5, and the subject words of the text titles a and B are the reference subject words 1, and the subject words of the text titles C and D are the reference subject words 2, and the subject word of the text title E is the reference subject word 3. The method comprises the steps of respectively taking a text title A, a text title B, a text title C, a text title D, a text title E, a reference subject word 1, a reference subject word 2 and a reference subject word 3 as nodes, respectively connecting the reference subject word 1 with the text title A and the text title B, respectively connecting the reference subject word 2 with the text title C and the text title D, and connecting the reference subject word 3 with the text title E, so that the knowledge graph shown in the figure 3 can be obtained.
As an example, a Neo4j graph database may be used to store reference subject words and text titles, and the text titles may include attribute information such as time, place, and organization. Taking the reference dictionary as an example of a dictionary in the policy field, the text title is the title name of the policy file, and the text title may include information such as the policy issuing time and issuing organization.
In the embodiment of the application, an initial dictionary can be constructed in advance, then a reference dictionary is constructed according to a plurality of texts and the initial dictionary, then a knowledge graph is created based on a reference subject word in the reference dictionary and a plurality of text titles corresponding to the reference subject word, a large amount of labeled data is not needed for supervised model training, the knowledge graph can be created through an unsupervised clustering algorithm, and the data labeling cost is reduced. Moreover, for different fields, different knowledge maps can be constructed through the method, namely the scheme can be suitable for any field, and the universality of the scheme is improved.
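A minimal sketch of storing the subject-word-to-title associations in a Neo4j graph, assuming the official neo4j Python driver (v5 API; older drivers use session.write_transaction). The node labels, relationship type, property names, and connection details are illustrative choices, not ones mandated by the patent.

```python
from neo4j import GraphDatabase

# Connection details are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_association(tx, subject_word: str, title: str, issue_time: str, organization: str):
    # MERGE keeps nodes unique; the relationship links a reference subject word
    # node to a text title node, matching the graph structure described above.
    tx.run(
        """
        MERGE (w:ReferenceSubjectWord {name: $subject_word})
        MERGE (t:TextTitle {name: $title, issue_time: $issue_time, organization: $organization})
        MERGE (w)-[:RELATED_TO]->(t)
        """,
        subject_word=subject_word, title=title,
        issue_time=issue_time, organization=organization,
    )

with driver.session() as session:
    session.execute_write(
        add_association,
        "internet + government affairs service",
        "Guiding opinions on accelerating 'internet + government affairs service'",
        "2021-06-07", "XX agency",
    )
```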
Step 204: the method comprises the steps of obtaining a problem semantic feature vector and at least two reference semantic feature vectors, wherein the problem semantic feature vector is a feature vector of a problem subject word, the reference semantic feature vectors are feature vectors of reference subject words in a pre-constructed reference dictionary, and the reference dictionary comprises at least two reference subject words.
In implementation, to determine a text related to the problem to be processed in the knowledge graph, the problem topic word needs to be matched with the reference topic word, the problem topic word and the reference topic word are subjected to feature extraction, a problem semantic feature vector capable of representing the problem topic word and a reference semantic feature vector capable of representing the reference topic word are obtained, the problem topic word and the reference topic word are matched through the similarity of the feature vectors, and the matching accuracy can be improved.
In an embodiment of the present application, the specific implementation of this step may include: inputting the problem topic words and the at least two reference topic words into a feature extraction layer of a feature extraction model to obtain a problem feature vector group of the problem topic words and a reference feature vector group of each reference topic word; and inputting the question feature vector group and at least two reference feature vector groups into a self-attention layer of the feature extraction model to obtain the question semantic feature vector of the question topic word and the reference semantic feature vector of each reference topic word.
As an example, a feature extraction model may be used to extract features of the input text, which may be any model that can implement a feature extraction function. For example, the feature extraction model may be a Bert model.
In a specific implementation, the problem topic word and the at least two reference topic words are input into the feature extraction layer, and word segmentation processing can be performed on the problem topic word and the at least two reference topic words respectively to obtain a plurality of word units of the problem topic word and a plurality of word units of each reference topic word. And respectively coding a plurality of word units of the problem topic words to obtain vector representation of each word unit, and splicing the vector representations to obtain a problem feature vector group of the problem topic words. Similarly, a plurality of word units of each reference subject term are encoded to obtain a vector representation of each word unit, and a plurality of vector representations of the same reference subject term are spliced to obtain a reference feature vector group of the reference subject term.
In specific implementation, the problem feature vector group is input into a self-attention layer of a feature extraction model, and self-attention calculation is carried out on each problem feature vector and other problem feature vectors to obtain a problem semantic feature vector of a problem subject word combined with context information. Similarly, at least two reference feature vector groups are input into the self-attention layer of the feature extraction model, and self-attention calculation is carried out on each reference feature vector in the same reference feature vector group and other reference feature vectors to obtain the reference semantic feature vector of the reference subject term in combination with the context information.
The semantic feature vector of the reference subject term obtained by the method can represent the semantics of the reference subject term on the whole, and the semantic feature vector of the problem subject term can represent the semantics of the problem subject term on the whole, namely the obtained semantic feature vector is more accurate.
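A minimal sketch of extracting semantic feature vectors with a pretrained BERT encoder via the Hugging Face transformers library; the checkpoint name and the mean-pooling step are assumptions standing in for the feature extraction model described above.

```python
import torch
from transformers import BertModel, BertTokenizer

MODEL_NAME = "bert-base-chinese"   # illustrative pretrained checkpoint
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
encoder = BertModel.from_pretrained(MODEL_NAME)
encoder.eval()

@torch.no_grad()
def semantic_feature_vector(subject_word: str) -> torch.Tensor:
    """Encode a subject word; the self-attention layers combine the word-unit
    vectors with context, and mean pooling yields one semantic feature vector
    for the whole subject word (a common simplification)."""
    inputs = tokenizer(subject_word, return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, hidden_size)
    return hidden.mean(dim=1).squeeze(0)           # (hidden_size,)

question_vec = semantic_feature_vector("农民工增收")   # problem subject word (illustrative)
reference_vecs = [semantic_feature_vector(w) for w in ("农民工增收入", "农民工就业")]
```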
Further, in order to reduce the calculation amount for determining the semantic feature vector and the calculation amount for determining the similarity based on the semantic feature vector, at least two reference subject words in the reference dictionary may be filtered once before extracting the semantic feature vector, so as to determine related subject words that are relatively related to the problem subject words. Therefore, before executing this step, the question topic word may be matched with the at least two reference topic words, and the similarity between the question topic word and each reference topic word in the reference dictionary may be determined; determining the reference subject term corresponding to the similarity greater than the similarity threshold as a related subject term;
accordingly, in another embodiment of the present application, if the number of the related topic words is at least two, the specific implementation of obtaining the question semantic feature vector and the at least two reference semantic feature vectors may include: and acquiring a problem semantic feature vector and related semantic feature vectors of at least two related subject words.
In a specific implementation, a fuzzy matching algorithm may be used to perform fuzzy matching on the problem topic word and at least two reference topic words, determine the similarity between the problem topic word and each reference topic word, to obtain at least two similarities, and if the similarity corresponding to a certain reference topic word is greater than a similarity threshold, it may be determined that the correlation degree between the reference topic word and the problem topic word is higher, so that the reference topic word corresponding to the similarity greater than the similarity threshold may be determined as the related topic word.
For example, if the problem to be processed is "what is the policy of XX house about the income increase of the agricultural workers", the problem subject word extracted by the rule is "the income increase of the agricultural workers". This problem subject word is fuzzy-matched against the reference subject words in the reference dictionary, and the reference subject words whose matching result is greater than the similarity threshold, such as "the income increase of the agricultural workers" and "the employment of the agricultural workers", can be regarded as related subject words.
As an example, determining the similarity of the question topic word to each topic word in the reference dictionary may include two implementations:
the first implementation mode comprises the following steps: and determining an editing distance between the problem topic word and each reference topic word, and determining similarity based on the editing distance, wherein the editing distance is used for representing the minimum editing times required for converting the reference topic word into the problem topic word, and the smaller the editing distance is, the greater the similarity is.
The second implementation mode comprises the following steps: determining a Hamming distance between the problem subject word and each reference subject word, and determining similarity based on the Hamming distance, wherein the Hamming distance is used for representing the number of different characters at corresponding positions between the reference subject word and the problem subject word, and the smaller the Hamming distance is, the greater the similarity is.
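The two distance-based similarity implementations above can be sketched as follows; the normalisation of each distance into a 0-1 similarity score, and the handling of unequal-length strings for the Hamming case, are illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character edits
    needed to convert b into a."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def edit_similarity(a: str, b: str) -> float:
    # Smaller edit distance -> greater similarity (normalised to [0, 1]).
    return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)

def hamming_similarity(a: str, b: str) -> float:
    # Count positions whose characters differ; any length difference is
    # counted as additional mismatches (a simplifying assumption).
    diff = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return 1.0 - diff / max(len(a), len(b), 1)
```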
In this case, the Bert model may be used to extract the question semantic feature vector of the question topic word and the associated semantic feature vector of each of the at least two associated topic words.
It should be noted that, if at least two similarities include a similarity with a value of 1, the reference subject term corresponding to the similarity with the value of 1 may be determined as the related subject term. In this case, the problem topic word is included in the reference topic word of the reference dictionary, and the reference topic word may be determined as the target topic word.
In the embodiment of the application, the at least two reference subject words in the reference dictionary can first be screened once through a fuzzy matching algorithm to obtain related subject words; the related semantic feature vectors of the related subject words and the problem semantic feature vector of the problem subject word are then extracted. Combining fuzzy matching with similarity calculation based on vector representations improves the screening accuracy and, in turn, the accuracy of the determined answer.
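A minimal sketch of this fuzzy pre-filtering step using the FuzzyWuzzy toolkit named in the terminology section; the threshold value and the example strings are illustrative assumptions.

```python
# pip install fuzzywuzzy python-Levenshtein
from fuzzywuzzy import fuzz

SIMILARITY_THRESHOLD = 60   # illustrative threshold on FuzzyWuzzy's 0-100 scale

def related_subject_words(question_subject_word: str, reference_dictionary: list[str]) -> list[str]:
    """Keep the reference subject words whose fuzzy-match score against the
    problem subject word exceeds the threshold; these related subject words
    are passed on to the vector-based comparison."""
    related = []
    for reference_word in reference_dictionary:
        score = fuzz.ratio(question_subject_word, reference_word)
        if score == 100:
            # Exact match: the reference word can be taken directly as the target subject word.
            return [reference_word]
        if score > SIMILARITY_THRESHOLD:
            related.append(reference_word)
    return related

print(related_subject_words("农民工增收", ["农民工增收入", "农民工就业", "科技创新"]))
```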
Step 206: and determining a target subject term from the at least two reference subject terms based on the question semantic feature vector and the at least two reference semantic feature vectors.
In a possible implementation manner, the specific implementation of this step may include: determining the similarity between the problem topic word and each reference topic word based on the problem semantic feature vector of the problem topic word and the at least two reference semantic feature vectors; and taking the reference subject term with the maximum similarity as the target subject term.
In a specific implementation, the similarity between the problem semantic feature vector and each reference semantic feature vector can be determined through a cosine similarity function to obtain at least two similarities, and the at least two similarities are ranked.
Referring to formula (1), formula (1) is the cosine similarity function, which determines the cosine similarity between two vectors:
\cos(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\ \sqrt{\sum_{i=1}^{n} B_i^{2}}}    (1)
wherein A represents the question semantic feature vector of the question to be processed, B represents the reference semantic feature vector of the reference subject word, n represents the dimension of the vectors, A_i represents the value of the i-th dimension of the question semantic feature vector, and B_i represents the value of the i-th dimension of the reference semantic feature vector.
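Formula (1) expressed as code; a minimal sketch with NumPy, assuming the vectors are one-dimensional arrays of equal length:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of the vector norms, as in formula (1).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_similarity(question_vec, reference_vecs):
    # Returns (index, similarity) pairs sorted from most to least similar.
    scores = [cosine_similarity(question_vec, r) for r in reference_vecs]
    return sorted(enumerate(scores), key=lambda item: item[1], reverse=True)
```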
It should be noted that the above description takes cosine similarity only as an example; in other implementations of the present application, the similarity between the question topic word and each reference topic word may also be determined by algorithms such as the Euclidean distance, the Manhattan distance, or the Pearson correlation coefficient, which is not limited in the embodiments of the present application.
In this implementation, the similarity between the question subject word and each reference subject word is determined directly from the vector representations, and the reference subject word with the maximum similarity is taken as the target subject word; that is, the reference subject words in the reference dictionary are screened only once to obtain the target subject word, which reduces the number of operation steps.
In another possible implementation manner, at least two reference subject terms may be screened once to obtain related subject terms related to the problem subject term. In this case, the specific implementation of this step may include: determining the similarity of the problem topic word and each related topic word based on the problem semantic feature vector and at least two related semantic feature vectors of the problem topic word; and taking the related subject term with the maximum similarity as the target subject term.
In a specific implementation, the similarity between the problem semantic feature vector and each related semantic feature vector can be determined through a cosine similarity function to obtain at least two similarities, and the at least two similarities are ranked.
As an example, the question semantic feature vector and the related semantic feature vector may be input into formula (1), and the similarity between the question topic word and the related topic word may be obtained.
It should be noted that the above description takes cosine similarity only as an example; in other implementations of the present application, the similarity between the question topic word and each related topic word may also be determined by algorithms such as the Euclidean distance, the Manhattan distance, or the Pearson correlation coefficient, which is not limited in the embodiments of the present application.
In this implementation, the reference subject words in the reference dictionary are first preliminarily screened by fuzzy matching to determine the related subject words that are relevant to the question subject word; the similarity between the question subject word and each related subject word is then determined from the vector representations, and the related subject word with the maximum similarity is taken as the target subject word. In this way, the fuzzy matching mode is combined with the vector-based similarity matching mode, so that the determined target subject word is highly similar to the question subject word under both measures, and the obtained target subject word is more accurate.
Further, after determining the reference subject term corresponding to the similarity greater than the similarity threshold as the related subject term, the method further includes: and if the number of the related subject words is one, determining the related subject words as the target subject words.
That is, in the case where the number of related subject words is one, secondary screening is not required, and the related subject word can be determined as the target subject word.
In this embodiment of the application, when the number of the related topic words is at least two, at least two similarities may be determined according to the related semantic feature vectors of the at least two related topic words and the question semantic feature vector of the question topic word, and the related topic word corresponding to the maximum similarity is determined as the target topic word. By combining fuzzy matching and a vector representation-based method, the accuracy of determining the target subject term can be improved, and the accuracy of the extracted answer is further improved.
Step 208: and determining a target text related to the target subject word from a pre-established knowledge graph based on the target subject word, and determining an answer of the to-be-processed question from the target text, wherein the knowledge graph comprises an incidence relation between a reference subject word and a text title.
In specific implementation, a text title connected with a target subject word can be determined from a knowledge graph, a target text corresponding to the text title is obtained, and then an answer of a to-be-processed question is obtained from the target text.
As an example, referring to fig. 3, fig. 3 is a schematic diagram of a knowledge graph provided in an embodiment of the present application. Assuming that the target subject term is the reference subject term 1, and it can be determined that the text title a and the text title B are connected to the reference subject term 1, the text a corresponding to the text title a and the text B corresponding to the text title B can be obtained, and the answer to the question to be processed can be obtained from the text a and the text B.
Further, after obtaining and analyzing the problem to be processed and determining the problem subject term in the problem to be processed, the method further includes: determining a question condition word in the question to be processed;
correspondingly, the nodes of the knowledge graph are reference subject words and text titles, and the target text related to the target subject word is determined from the pre-created knowledge graph based on the target subject word, and the method comprises the following steps: determining a text title connected with the target subject term from the knowledge graph as a candidate text title; determining a text title including the question condition word as a target text title from the candidate text titles; and acquiring a text corresponding to the target text title as the target text.
In a specific implementation, the question to be processed may include qualifiers such as a time, a place, or an organization name, which further constrain the question. When the answer to the question is determined, the target subject word alone can already represent the subject of the question, but the number of texts determined only by the target subject word is large: some of these texts may contain the answer, while others may not. Adding the qualifiers narrows the range of texts related to the question, and considering both the target subject word and the qualifiers when determining the texts makes the determined texts more relevant to the question, which improves the accuracy of the answer. Therefore, when the question to be processed is parsed, the question condition words can be determined through the part-of-speech tags. In this case, the text titles connected to the target subject word may be determined from the knowledge graph as candidate text titles, a text title including the question condition word may be determined from the candidate text titles as the target text title, and the text corresponding to the target text title may be obtained as the target text.
As an example, after the question to be processed is parsed, the question subject word, time, organization name, and the like in the question may be determined, this information may be converted into the form of a Neo4j logical-expression query, and the corresponding answer may then be queried from the knowledge graph based on the obtained query. For example, assuming the question to be processed is "what is the policy of the XX institution about a certain item", the obtained logical expression may be [['XX institution', 'a certain item'], ['ORG', 'TAG']], where the organization name (ORG) corresponds to "XX institution" and the question subject word (TAG) corresponds to "a certain item". The text titles connected to "a certain item" may first be queried from the knowledge graph as candidate text titles, the candidate text titles including "XX institution" may then be taken as target text titles, the target texts corresponding to the target text titles may be obtained, and the answer to the question to be processed may be obtained from the target texts.
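A hedged sketch of this lookup with the official neo4j Python driver; the connection settings, the node labels (SubjectWord, Title), the RELATED_TO relationship type and the issuing_org property are illustrative assumptions, not details prescribed by this embodiment:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
MATCH (s:SubjectWord {name: $topic})-[:RELATED_TO]->(t:Title)
WHERE t.issuing_org = $org
RETURN t.name AS title
"""

def find_target_titles(topic: str, org: str):
    # e.g. find_target_titles("a certain item", "XX institution")
    with driver.session() as session:
        return [record.data()["title"]
                for record in session.run(QUERY, topic=topic, org=org)]
```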
In the embodiment of the application, after the subject term of the problem to be processed is determined, the problem condition term in the problem to be processed can be determined, the target text is determined according to the target subject term and the problem condition term, the accuracy of the determined target text can be improved, the correlation degree between the target text and the problem to be processed is higher, and the accuracy of determining the answer of the problem to be processed is further improved.
In some embodiments, the knowledge graph may include a storage address of a text corresponding to each text title, where each storage address is connected to a corresponding text title, and in this case, the specific implementation of obtaining the text corresponding to the target text title as the target text may include: determining a storage address connected with the target text title as a storage address of the text corresponding to the target text title; and acquiring the target text from the storage address.
In specific implementation, each target text title in the knowledge graph is connected with the storage address of the corresponding target text, so that after the target text title is determined, the storage address connected with the target text title can be obtained, and the target text is obtained from the storage address. Therefore, the storage address is also added to the knowledge graph, the target text can be acquired from the knowledge graph through the storage address, the text acquisition efficiency is improved, and the answer acquisition efficiency is further improved.
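A sketch of fetching the target text through the storage address attached to the target text title. Modelling the address as an Address node linked to the title by a STORED_AT edge, and treating the address value as a local file path or URL, are illustrative assumptions:

```python
from urllib.request import urlopen
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

ADDRESS_QUERY = """
MATCH (t:Title {name: $title})-[:STORED_AT]->(a:Address)
RETURN a.value AS address
"""

def load_target_text(title: str) -> str:
    with driver.session() as session:
        address = session.run(ADDRESS_QUERY, title=title).single()["address"]
    if address.startswith(("http://", "https://")):
        with urlopen(address) as resp:           # address stored as a URL
            return resp.read().decode("utf-8")
    with open(address, encoding="utf-8") as f:   # address stored as a file path
        return f.read()
```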
In the embodiment of the application, a problem to be processed is obtained and analyzed, and a problem subject term in the problem to be processed is determined; acquiring a problem semantic feature vector and at least two reference semantic feature vectors, wherein the problem semantic feature vector is a feature vector of the problem subject word, the reference semantic feature vector is a feature vector of a reference subject word in a pre-constructed reference dictionary, and the reference dictionary comprises at least two reference subject words; determining a target subject term from the at least two reference subject terms based on the question semantic feature vector and at least two reference semantic feature vectors; and determining a target text related to the target subject word from a pre-established knowledge graph based on the target subject word, and determining an answer of the to-be-processed question from the target text, wherein the knowledge graph comprises an incidence relation between a reference subject word and a text title. According to the scheme, the problem subject words and the reference subject words are converted into vector representation forms, the problem semantic feature vectors can accurately represent the semantics of the problem subject words, the reference semantic feature vectors can accurately represent the semantics of the reference subject words, the determination of the target subject words based on the problem semantic feature vectors and the reference semantic feature vectors is more accurate, and the accuracy of the target texts and the obtained answers determined based on the target subject words is higher.
The knowledge-graph-based question-answering method provided in the present specification is further described below with reference to fig. 4, taking the application of the knowledge-graph-based question-answering method in a policy question-answering task as an example. Fig. 4 shows a processing flow chart of a knowledge-graph-based question-answering method applied to a policy question-answering task according to an embodiment of the present specification, which may specifically include the following steps:
step 402: an initial dictionary of a pre-constructed policy field and a plurality of policy texts are obtained.
Illustratively, the policy titles of the policy texts may be analyzed: the content inside quotation marks in a policy title, phrases such as "opinions on XX", and the like are extracted through rules, and part of the policy subject words are preliminarily sorted manually to construct the initial dictionary of the policy field, which includes subject words that may be used in texts of the policy field. For example, if the policy title of a policy text is "Guiding opinions of the XX institution on accelerating the work of 'Internet + government services'", then "Internet + government services" may be extracted as the policy topic word.
In addition, since the initial dictionary is preliminarily constructed by manual sorting and may not be complete, a plurality of policy texts may be obtained, and subject words are obtained from the policy texts and added to the initial dictionary, so as to enrich the initial dictionary for subsequent use.
Step 404: And matching the policy title names of the policy texts with the initial subject words in the initial dictionary, and determining the first policy title names that do not contain any initial subject word.
For example, the policy title names of the policy texts are filtered through the initial dictionary by comparing each policy title name with the initial subject words in the initial dictionary; if a policy title name does not include any initial subject word of the initial dictionary, that policy title name may be determined as a first policy title name.
Step 406: if the number of the first policy title names is at least two, clustering the at least two first policy title names, and dividing the first policy title names into N categories, wherein N is a positive integer greater than 0, and each category comprises at least one first policy title name.
Step 408: and obtaining policy subject words in at least one first policy subject name in each category.
Step 410: and determining the subject term of the category based on at least one policy subject term of the same category, and determining the subject term of the category as a reference subject term of each first policy subject name in the category.
For example, if there are at least two first policy title names, the at least two first policy title names may be semantically encoded by the pre-trained language model Bert to obtain a semantic feature vector of each policy title name, and then clustered with the K-means clustering algorithm to form N different clusters. For the policy title names in each cluster, the LAC word segmentation tool is first used to segment the titles, the subject words in the cluster are then obtained by the TF-IDF statistical method, and the subject word of the cluster is determined as the reference subject word of each first policy title name in the cluster.
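A sketch of steps 406 to 410 under stated assumptions: the number of clusters and the use of scikit-learn are illustrative choices, and title_vectors would come from a Bert encoder such as the one sketched earlier; LAC is used only as the named segmentation tool:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from LAC import LAC

lac = LAC(mode="seg")  # LAC word segmentation tool

def cluster_subject_words(first_titles, title_vectors, n_clusters=5):
    # Cluster the title vectors, then pick the top TF-IDF term per cluster.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.asarray(title_vectors))
    cluster_words = {}
    for k in range(n_clusters):
        titles = [t for t, label in zip(first_titles, labels) if label == k]
        if not titles:
            continue
        segmented = [" ".join(lac.run(t)) for t in titles]   # space-joined tokens
        tfidf = TfidfVectorizer()
        scores = np.asarray(tfidf.fit_transform(segmented).sum(axis=0)).ravel()
        terms = tfidf.get_feature_names_out()
        cluster_words[k] = terms[scores.argmax()]            # the cluster's subject word
    return labels, cluster_words
```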
Step 412: if the number of the first policy title is one, obtaining the policy subject term of the first policy title, and determining the policy subject term as the reference subject term of the first policy title.
For example, if the number of the first policy title is one, the subject term of the first policy title may be extracted, the policy subject term of the first policy title may be determined, and the policy subject term may be used as the reference subject term of the first policy title.
Step 414: and adding the reference subject word of the first policy subject name to the initial dictionary to obtain a reference dictionary.
For example, assuming that the reference subject word of the first policy subject name is "a certain item" and this subject word is not included in the initial dictionary, "a certain item" may be added to the initial dictionary, and the initial dictionary to which the new subject word is added is referred to as a reference dictionary.
Step 416: and determining the second policy topic name with the initial topic words, and determining the initial topic words in the second policy topic name as the reference topic words of the second policy topic name.
For example, if the policy title includes an initial subject word in the initial dictionary, the policy title is referred to as a second policy title, and for the second policy title, the initial subject word corresponding to the second policy title may be directly determined as the reference subject word of the second policy title.
Step 418: and connecting the policy subject names with the incidence relation with the reference subject words by taking the policy subject names of the texts and the reference subject words in the reference dictionary as nodes to obtain the knowledge graph.
For example, the knowledge graph is constructed from the policy title names of the plurality of texts, the reference subject words in the reference dictionary, and the relationships between the policy title names and the reference subject words, and a Neo4j database can be used to store the policy information in the knowledge graph. The policy information may include the policy subject words and the policy title names, and may further include the address links of the policy texts; a policy title name may carry various attributes, such as the policy release time and the issuing authority. Referring to fig. 5, fig. 5 is a schematic diagram of another knowledge graph provided in the embodiment of the present application; the knowledge graph corresponds to the subject word "income increase of migrant workers", node 1 represents the policy subject word, and nodes 2 and 3 represent policy title names.
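A sketch of step 418 written against Neo4j with the official Python driver; the node labels, the RELATED_TO and STORED_AT relationship types and the property names are illustrative assumptions consistent with the query sketches above, and MERGE is used so repeated writes do not duplicate nodes:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

BUILD_CYPHER = """
MERGE (s:SubjectWord {name: $subject_word})
MERGE (t:Title {name: $title})
  SET t.release_time = $release_time, t.issuing_org = $issuing_org
MERGE (a:Address {value: $storage_address})
MERGE (s)-[:RELATED_TO]->(t)
MERGE (t)-[:STORED_AT]->(a)
"""

def add_policy(subject_word, title, release_time, issuing_org, storage_address):
    # Writes one subject word, one policy title and its storage address into the graph.
    with driver.session() as session:
        session.run(BUILD_CYPHER, subject_word=subject_word, title=title,
                    release_time=release_time, issuing_org=issuing_org,
                    storage_address=storage_address)
```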
Step 420: inputting the to-be-processed problem into a part-of-speech tagging model, and determining a problem subject word and a problem condition word in the to-be-processed problem.
For example, assuming that the question to be processed is "what is the policy of the XX organization about the income increase of migrant workers", the question subject word "income increase of migrant workers" and the question condition word "XX organization" can be extracted by the part-of-speech tagging model.
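A sketch of step 420 using the off-the-shelf LAC tagger as an illustrative stand-in for the encoder / gated-recurrent-unit / decoder tagging model described in this embodiment; the tag filters below are assumptions, and LAC itself expects the original Chinese question text rather than the translated strings used in this description:

```python
from LAC import LAC

lac = LAC(mode="lac")  # joint word segmentation and part-of-speech/entity tagging

def parse_question(question: str):
    words, tags = lac.run(question)
    # Organization, time and location tags are taken as question condition words.
    condition_words = [w for w, t in zip(words, tags) if t in ("ORG", "TIME", "LOC")]
    # Remaining noun-like tags are taken as candidate question subject words.
    subject_words = [w for w, t in zip(words, tags) if t.startswith("n")]
    return subject_words, condition_words
```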
It should be noted that steps 402 to 420 are a more detailed description of step 202; the implementation process is the same as that of step 202, and for the specific implementation reference may be made to the related description of step 202, which is not repeated here.
Step 422: and matching the problem topic word with at least two reference topic words in a reference dictionary, and determining the similarity between the problem topic word and each reference topic word in the reference dictionary.
Step 424: and determining the reference subject term corresponding to the similarity greater than the similarity threshold as the related subject term.
For example, assuming that the at least two reference subject words include "income of migrant workers", "employment of migrant workers", "wages of migrant workers" and "rural issues", the question subject word "income increase of migrant workers" may be fuzzy-matched with each of the above reference subject words using the FuzzyWuzzy algorithm, the similarity between the question subject word and each reference subject word may be determined, and a similarity threshold may be set. If the reference subject words whose similarities are greater than the similarity threshold are "income of migrant workers" and "employment of migrant workers", then "income of migrant workers" and "employment of migrant workers" may be determined as the related subject words.
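A sketch of steps 422 to 424 using the FuzzyWuzzy library: score the question subject word against every reference subject word and keep those above the threshold as related subject words. The threshold value and the translated English strings are illustrative assumptions:

```python
from fuzzywuzzy import fuzz

reference_words = ["income of migrant workers", "employment of migrant workers",
                   "wages of migrant workers", "rural issues"]
question_word = "income increase of migrant workers"
SIMILARITY_THRESHOLD = 70  # fuzz.ratio returns a score in [0, 100]

related_words = [w for w in reference_words
                 if fuzz.ratio(question_word, w) > SIMILARITY_THRESHOLD]
```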
Step 426: and if the number of the related subject words is at least two, extracting the problem semantic feature vector of the problem subject word and the related semantic feature vectors of the at least two related subject words.
Step 428: and determining the similarity of the problem subject word and each related subject word based on the problem semantic feature vector and at least two related semantic feature vectors of the problem subject word.
Step 430: and taking the related subject word with the maximum similarity as a target subject word.
For example, taking the case where the question topic word is "income increase of migrant workers" and the related topic words are "income of migrant workers" and "employment of migrant workers", the question semantic feature vector of "income increase of migrant workers" and the related semantic feature vectors of "income of migrant workers" and "employment of migrant workers" can be extracted through the Bert model. Assuming that the similarity between "income increase of migrant workers" and "income of migrant workers" calculated from the question semantic feature vector and the at least two related semantic feature vectors is 0.9, and the similarity between "income increase of migrant workers" and "employment of migrant workers" is 0.8, then "income of migrant workers" may be taken as the target topic word.
Step 432: and if the number of the related subject words is one, determining the related subject words as the target subject words.
For example, taking the case where the question topic word is "income increase of migrant workers" and the only related topic word is "income of migrant workers", the target topic word is "income of migrant workers".
It should be noted that steps 422 to 432 are a more detailed description of steps 204 and 206; the implementation process is the same as that of steps 204 and 206, and for the specific implementation reference may be made to the related descriptions of steps 204 and 206, which are not repeated here.
Step 434: and determining the text titles connected with the target subject word from the knowledge graph as candidate text titles.
Step 436: and determining a text title comprising the question condition word from the candidate text titles as a target text title.
Step 438: the knowledge map comprises storage addresses of texts corresponding to the policy title names, each storage address is connected with the corresponding policy title name, and the storage address connected with the target text title is determined as the storage address of the text corresponding to the target text title.
Step 440: and acquiring the target text from the storage address.
Step 442: an answer to the question to be processed is determined from the target text.
It should be noted that steps 434 to 442 are a more detailed description of step 208; the implementation process is the same as that of step 208, and for the specific implementation reference may be made to the related description of step 208, which is not repeated here.
In the above technical solution, a knowledge graph is constructed mainly based on policy texts, the pre-trained language model Bert is used as the tool for extracting semantic features, and the policy subject words are extracted by combining rules, a policy-field reference dictionary and a clustering method. In the process of matching subject words in the knowledge graph, the target subject word is found by combining fuzzy matching and vector-based similarity matching; the target text is then determined based on the target subject word, and the answer to the question to be processed is determined from the target text. Because the question semantic feature vector can accurately represent the semantics of the question subject word and the reference semantic feature vector can accurately represent the semantics of the reference subject word, determining the target subject word based on these vectors is more accurate, and the accuracy of the target text determined based on the target subject word and of the obtained answer is higher.
Corresponding to the above method embodiments, the present application further provides an embodiment of a knowledge-graph-based question answering device, and fig. 6 shows a schematic structural diagram of a knowledge-graph-based question answering device according to an embodiment of the present application. As shown in fig. 6, the apparatus 600 may include:
a first obtaining module 602, configured to obtain and analyze a problem to be processed, and determine a problem topic word in the problem to be processed;
a second obtaining module 604, configured to obtain a question semantic feature vector and at least two reference semantic feature vectors, wherein the question semantic feature vector is a feature vector of the question subject word, the reference semantic feature vectors are feature vectors of reference subject words in a pre-constructed reference dictionary, and the reference dictionary comprises at least two reference subject words;
a first determining module 606 configured to determine a target subject term from the at least two reference subject terms based on the question semantic feature vector and the at least two reference semantic feature vectors;
a second determining module 608, configured to determine, based on the target subject word, a target text related to the target subject word from a pre-created knowledge graph, and determine an answer to the question to be processed from the target text, where the knowledge graph includes an association relationship between a reference subject word and a text title.
Optionally, the second obtaining module 604 is configured to:
inputting the problem topic words and the at least two reference topic words into a feature extraction layer of a feature extraction model to obtain a problem feature vector group of the problem topic words and a reference feature vector group of each reference topic word;
and inputting the question feature vector group and at least two reference feature vector groups into a self-attention layer of the feature extraction model to obtain the question semantic feature vector of the question topic word and the reference semantic feature vector of each reference topic word.
Optionally, the first determining module 606 is configured to:
determining the similarity between the problem topic word and each reference topic word based on the problem semantic feature vector of the problem topic word and the at least two reference semantic feature vectors;
and taking the reference subject term with the maximum similarity as the target subject term.
Optionally, the second obtaining module 604 is further configured to:
matching the problem topic word with the at least two reference topic words, and determining the similarity between the problem topic word and each reference topic word in the reference dictionary;
determining the reference subject term corresponding to the similarity greater than the similarity threshold as a related subject term;
and if the number of the related subject words is at least two, acquiring the problem semantic feature vector and the related semantic feature vectors of the at least two related subject words.
Optionally, the first determining module 606 is configured to:
determining the similarity of the problem topic word and each related topic word based on the problem semantic feature vector and at least two related semantic feature vectors of the problem topic word;
and taking the related subject term with the maximum similarity as the target subject term.
Optionally, the first determining module 606 is further configured to:
and if the number of the related subject words is one, determining the related subject words as the target subject words.
Optionally, the first obtaining module 602 is further configured to:
determining a question condition word in the question to be processed;
the nodes of the knowledge graph are reference subject words and text titles, and the text titles connected with the target subject words are determined from the knowledge graph to serve as candidate text titles;
determining a text title including the question condition word as a target text title from the candidate text titles;
and acquiring a text corresponding to the target text title as the target text.
Optionally, the first obtaining module 602 is further configured to:
the knowledge graph comprises a storage address of a text corresponding to each text title, each storage address is connected with the corresponding text title, and the storage address connected with the target text title is determined as the storage address of the text corresponding to the target text title;
and acquiring the target text from the storage address.
Optionally, the first obtaining module 602 is configured to:
the method comprises the steps of obtaining a problem to be processed, inputting the problem to be processed into a part-of-speech tagging model, and determining a part-of-speech tag of each word unit in the problem to be processed;
and determining a problem subject word in the problem to be processed based on the part-of-speech label of each word unit.
Optionally, the first obtaining module 602 is configured to:
the part-of-speech tagging model comprises an encoding unit, a gated recurrent unit and a decoding unit, and the problem to be processed is input into the encoding unit to obtain a word vector sequence of the problem to be processed;
inputting the word vector sequence into the gated recurrent unit to obtain a problem vector sequence of the problem to be processed;
and inputting the question vector sequence into the decoding unit, and determining the part of speech label of each word unit.
Optionally, the first obtaining module 602 is further configured to:
constructing the reference dictionary based on the obtained plurality of texts;
creating the knowledge-graph based on text titles of the plurality of texts and reference subject words in the reference dictionary.
Optionally, the first obtaining module 602 is further configured to:
acquiring a pre-constructed initial dictionary;
matching the text titles of the texts with initial subject words in the initial dictionary, determining a first text title which does not contain the initial subject words, and determining a reference subject word of the first text title;
and adding the reference subject word of the first text title to the initial dictionary to obtain the reference dictionary.
Optionally, the first obtaining module 602 is further configured to:
if the number of the first text titles is at least two, clustering the at least two first text titles, and dividing the first text titles into N categories, wherein N is a positive integer greater than 0, and each category comprises at least one first text title;
acquiring a text subject term of at least one first text title in each category;
and determining the subject term of the category based on at least one text subject term of the same category, and determining the subject term of the category as a reference subject term of each first text title in the category.
Optionally, the first obtaining module 602 is further configured to:
if the number of the first text titles is one, acquiring the text subject terms of the first text titles, and determining the text subject terms as the reference subject terms of the first text titles.
Optionally, the first obtaining module 602 is further configured to:
determining a second text title with the initial subject word, and determining the initial subject word in the second text title as a reference subject word of the second text title;
and connecting the text titles with the associated relation with the reference subject words by taking the text titles of the texts and the reference subject words in the reference dictionary as nodes to obtain the knowledge graph.
The question answering device based on the knowledge graph provided by the embodiment of the application obtains and analyzes the problem to be processed, and determines the problem subject words in the problem to be processed; acquiring a problem semantic feature vector and at least two reference semantic feature vectors, wherein the problem semantic feature vector is a feature vector of the problem subject word, the reference semantic feature vector is a feature vector of a reference subject word in a pre-constructed reference dictionary, and the reference dictionary comprises at least two reference subject words; determining a target subject term from the at least two reference subject terms based on the question semantic feature vector and at least two reference semantic feature vectors; and determining a target text related to the target subject word from a pre-established knowledge graph based on the target subject word, and determining an answer of the to-be-processed question from the target text, wherein the knowledge graph comprises an incidence relation between a reference subject word and a text title. According to the scheme, the problem subject words and the reference subject words are converted into vector representation forms, the problem semantic feature vectors can accurately represent the semantics of the problem subject words, the reference semantic feature vectors can accurately represent the semantics of the reference subject words, the determination of the target subject words based on the problem semantic feature vectors and the reference semantic feature vectors is more accurate, and the accuracy of the target texts and the obtained answers determined based on the target subject words is higher.
The above is an illustrative scheme of a knowledge-graph-based question answering device according to the embodiment. It should be noted that the technical solution of the knowledge-graph-based question answering device and the technical solution of the knowledge-graph-based question answering method belong to the same concept, and details of the technical solution of the knowledge-graph-based question answering device, which are not described in detail, can be referred to the description of the technical solution of the knowledge-graph-based question answering method.
It should be noted that the components in the device claims should be understood as functional blocks which are necessary to implement the steps of the program flow or the steps of the method, and each functional block is not actually defined by functional division or separation. The device claims defined by such a set of functional modules are to be understood as a functional module framework for implementing the solution mainly by means of a computer program as described in the specification, and not as a physical device for implementing the solution mainly by means of hardware.
There is also provided in an embodiment of the present application a computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the knowledge-graph based question-answering method when executing the instructions.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above-mentioned knowledge-graph-based question-answering method belong to the same concept, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solution of the above-mentioned knowledge-graph-based question-answering method.
An embodiment of the present application also provides a computer readable storage medium storing computer instructions, which when executed by a processor, implement the steps of the knowledge-graph based question-answering method as described above.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the above-mentioned knowledge-graph-based question-answering method belong to the same concept, and details of the technical solution of the storage medium, which are not described in detail, can be referred to the description of the technical solution of the above-mentioned knowledge-graph-based question-answering method.
The embodiment of the application discloses a chip, which stores computer instructions, and the instructions are executed by a processor to realize the steps of the knowledge-graph-based question answering method.
The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code which may be in the form of source code, object code, an executable file or some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and its practical applications, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims (18)

1. A question-answering method based on a knowledge graph is characterized by comprising the following steps:
obtaining and analyzing a problem to be processed, and determining a problem subject term in the problem to be processed;
acquiring a problem semantic feature vector and at least two reference semantic feature vectors, wherein the problem semantic feature vector is a feature vector of the problem subject word, the reference semantic feature vectors are feature vectors of reference subject words in a pre-constructed reference dictionary, and the reference dictionary comprises at least two reference subject words;
determining a target subject term from the at least two reference subject terms based on the question semantic feature vector and at least two reference semantic feature vectors;
and determining a target text related to the target subject word from a pre-established knowledge graph based on the target subject word, and determining an answer of the to-be-processed question from the target text, wherein the knowledge graph comprises an incidence relation between a reference subject word and a text title.
2. The knowledge-graph-based question-answering method according to claim 1, wherein obtaining a question semantic feature vector and at least two reference semantic feature vectors comprises:
inputting the problem topic words and the at least two reference topic words into a feature extraction layer of a feature extraction model to obtain a problem feature vector group of the problem topic words and a reference feature vector group of each reference topic word;
and inputting the question feature vector group and at least two reference feature vector groups into a self-attention layer of the feature extraction model to obtain the question semantic feature vector of the question topic word and the reference semantic feature vector of each reference topic word.
3. The knowledge-graph-based question answering method according to claim 1, wherein determining a target subject word from the at least two reference subject words based on the question semantic feature vector and at least two reference semantic feature vectors comprises:
determining the similarity between the problem topic word and each reference topic word based on the problem semantic feature vector of the problem topic word and the at least two reference semantic feature vectors;
and taking the reference subject term with the maximum similarity as the target subject term.
4. The knowledge-graph-based question-answering method according to claim 1, wherein before obtaining the question semantic feature vector and the at least two reference semantic feature vectors, further comprising:
matching the problem topic word with the at least two reference topic words, and determining the similarity between the problem topic word and each reference topic word in the reference dictionary;
determining the reference subject term corresponding to the similarity greater than the similarity threshold as a related subject term;
correspondingly, if the number of the related subject terms is at least two, obtaining a problem semantic feature vector and at least two reference semantic feature vectors, including:
and acquiring the problem semantic feature vector and the related semantic feature vectors of at least two related subject terms.
5. The knowledge-graph-based question answering method according to claim 4, wherein determining a target subject word from the at least two reference subject words based on the question semantic feature vector and at least two reference semantic feature vectors comprises:
determining the similarity of the problem topic word and each related topic word based on the problem semantic feature vector and at least two related semantic feature vectors of the problem topic word;
and taking the related subject term with the maximum similarity as the target subject term.
6. The knowledge-graph-based question-answering method according to claim 4, wherein after determining the reference subject words corresponding to the similarity greater than the similarity threshold as the related subject words, further comprising:
and if the number of the related subject words is one, determining the related subject words as the target subject words.
7. The knowledge-graph-based question-answering method according to claim 1, wherein after a question to be processed is obtained and analyzed, and a question subject word in the question to be processed is determined, the method further comprises:
determining a question condition word in the question to be processed;
correspondingly, the nodes of the knowledge graph are reference subject words and text titles, and the target text related to the target subject word is determined from the pre-created knowledge graph based on the target subject word, and the method comprises the following steps:
determining a text title connected with the target subject term from the knowledge graph as a candidate text title;
determining a text title including the question condition word as a target text title from the candidate text titles;
and acquiring a text corresponding to the target text title as the target text.
8. The knowledge-graph-based question-answering method according to claim 7, wherein the knowledge graph includes a storage address of a text corresponding to each text title, each storage address is connected with a corresponding text title, and the text corresponding to the target text title is acquired as the target text, including:
determining a storage address connected with the target text title as a storage address of the text corresponding to the target text title;
and acquiring the target text from the storage address.
9. The knowledge-graph-based question-answering method according to claim 1, wherein the obtaining and analyzing of the to-be-processed question and the determination of the question subject words in the to-be-processed question comprise:
the method comprises the steps of obtaining a problem to be processed, inputting the problem to be processed into a part-of-speech tagging model, and determining a part-of-speech tag of each word unit in the problem to be processed;
and determining a problem subject word in the problem to be processed based on the part-of-speech label of each word unit.
10. The knowledge-graph-based question-answering method according to claim 9, wherein the part-of-speech tagging model comprises an encoding unit, a gated recurrent unit and a decoding unit, the to-be-processed question is input into the part-of-speech tagging model, and part-of-speech tags of each word unit in the to-be-processed question are determined, and the method comprises the following steps:
inputting the problem to be processed into the coding unit to obtain a word vector sequence of the problem to be processed;
inputting the word vector sequence into the gated recurrent unit to obtain a problem vector sequence of the problem to be processed;
and inputting the question vector sequence into the decoding unit, and determining the part of speech label of each word unit.
11. The knowledge-graph-based question-answering method according to any one of claims 1 to 10, wherein before the question to be processed is obtained and resolved, it further comprises:
constructing the reference dictionary based on the obtained plurality of texts;
creating the knowledge-graph based on text titles of the plurality of texts and reference subject words in the reference dictionary.
12. The knowledge-graph-based question-answering method according to claim 11, wherein before constructing the reference dictionary based on the acquired plurality of texts, further comprising:
acquiring a pre-constructed initial dictionary;
accordingly, constructing the reference dictionary based on the obtained plurality of texts comprises:
matching the text titles of the texts with initial subject words in the initial dictionary, determining a first text title which does not contain the initial subject words, and determining a reference subject word of the first text title;
and adding the reference subject word of the first text title to the initial dictionary to obtain the reference dictionary.
13. The knowledge-graph-based question answering method according to claim 12, wherein determining the reference subject word of the first text title comprises:
if the number of the first text titles is at least two, clustering the at least two first text titles, and dividing the first text titles into N categories, wherein N is a positive integer greater than 0, and each category comprises at least one first text title;
acquiring a text subject term of at least one first text title in each category;
and determining the subject term of the category based on at least one text subject term of the same category, and determining the subject term of the category as a reference subject term of each first text title in the category.
14. The knowledge-graph-based question answering method according to claim 12, wherein determining the reference subject word of the first text title comprises:
if the number of the first text titles is one, acquiring the text subject terms of the first text titles, and determining the text subject terms as the reference subject terms of the first text titles.
15. The knowledge-graph-based question answering method according to claim 12, wherein after matching the text titles of the plurality of texts with the initial subject words in the initial dictionary, further comprising:
determining a second text title with the initial subject word, and determining the initial subject word in the second text title as a reference subject word of the second text title;
accordingly, creating the knowledge-graph based on the text titles of the plurality of texts and the reference subject words in the reference dictionary comprises:
and connecting the text titles with the associated relation with the reference subject words by taking the text titles of the texts and the reference subject words in the reference dictionary as nodes to obtain the knowledge graph.
16. A knowledge-graph-based question answering device, comprising:
the first acquisition module is configured to acquire and analyze a problem to be processed and determine a problem subject term in the problem to be processed;
a second obtaining module, configured to obtain a problem semantic feature vector and at least two reference semantic feature vectors, wherein the problem semantic feature vector is a feature vector of the problem topic word, the reference semantic feature vectors are feature vectors of reference topic words in a pre-constructed reference dictionary, and the reference dictionary comprises at least two reference topic words;
a first determination module configured to determine a target subject term from the at least two reference subject terms based on the question semantic feature vector and at least two reference semantic feature vectors;
the second determination module is configured to determine a target text related to the target subject word from a pre-created knowledge graph based on the target subject word, and determine an answer of the to-be-processed question from the target text, wherein the knowledge graph comprises an association relation between a reference subject word and a text title.
17. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor when executing the instructions performs the steps of the knowledge-graph based question-answering method of any one of claims 1 to 15.
18. A computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the knowledge-graph based question-answering method according to any one of claims 1 to 15.
CN202110632872.XA 2021-06-07 2021-06-07 Knowledge graph-based question and answer method and device Active CN113282729B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110632872.XA CN113282729B (en) 2021-06-07 2021-06-07 Knowledge graph-based question and answer method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110632872.XA CN113282729B (en) 2021-06-07 2021-06-07 Knowledge graph-based question and answer method and device

Publications (2)

Publication Number Publication Date
CN113282729A true CN113282729A (en) 2021-08-20
CN113282729B CN113282729B (en) 2024-06-18

Family

ID=77283599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110632872.XA Active CN113282729B (en) 2021-06-07 2021-06-07 Knowledge graph-based question and answer method and device

Country Status (1)

Country Link
CN (1) CN113282729B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160328467A1 (en) * 2014-03-10 2016-11-10 Tencent Technology (Shenzhen) Company Limited Natural language question answering method and apparatus
US20180075359A1 (en) * 2016-09-15 2018-03-15 International Business Machines Corporation Expanding Knowledge Graphs Based on Candidate Missing Edges to Optimize Hypothesis Set Adjudication
US20180240008A1 (en) * 2017-02-22 2018-08-23 International Business Machines Corporation Soft temporal matching in a synonym-sensitive framework for question answering
WO2018149326A1 (en) * 2017-02-16 2018-08-23 阿里巴巴集团控股有限公司 Natural language question answering method and apparatus, and server
CN109885660A (en) * 2019-02-22 2019-06-14 上海乐言信息科技有限公司 A kind of question answering system and method based on information retrieval that knowledge mapping is energized
US20190243900A1 (en) * 2017-03-03 2019-08-08 Tencent Technology (Shenzhen) Company Limited Automatic questioning and answering processing method and automatic questioning and answering system
CN110209787A (en) * 2019-05-29 2019-09-06 袁琦 A kind of intelligent answer method and system based on pet knowledge mapping
CN110390003A (en) * 2019-06-19 2019-10-29 北京百度网讯科技有限公司 Question and answer processing method and system, computer equipment and readable medium based on medical treatment
CN110874399A (en) * 2018-08-31 2020-03-10 阿里巴巴集团控股有限公司 Information processing method and device, computing equipment and terminal
US20200242444A1 (en) * 2019-01-30 2020-07-30 Baidu Usa Llc Knowledge-graph-embedding-based question answering
CN111782816A (en) * 2020-04-20 2020-10-16 北京沃东天骏信息技术有限公司 Method and device for generating knowledge graph, searching method, engine and system
CN112765312A (en) * 2020-12-31 2021-05-07 湖南大学 Knowledge graph question-answering method and system based on graph neural network embedding matching

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160328467A1 (en) * 2014-03-10 2016-11-10 Tencent Technology (Shenzhen) Company Limited Natural language question answering method and apparatus
US20180075359A1 (en) * 2016-09-15 2018-03-15 International Business Machines Corporation Expanding Knowledge Graphs Based on Candidate Missing Edges to Optimize Hypothesis Set Adjudication
WO2018149326A1 (en) * 2017-02-16 2018-08-23 阿里巴巴集团控股有限公司 Natural language question answering method and apparatus, and server
US20180240008A1 (en) * 2017-02-22 2018-08-23 International Business Machines Corporation Soft temporal matching in a synonym-sensitive framework for question answering
US20190243900A1 (en) * 2017-03-03 2019-08-08 Tencent Technology (Shenzhen) Company Limited Automatic questioning and answering processing method and automatic questioning and answering system
CN110874399A (en) * 2018-08-31 2020-03-10 阿里巴巴集团控股有限公司 Information processing method and device, computing equipment and terminal
US20200242444A1 (en) * 2019-01-30 2020-07-30 Baidu Usa Llc Knowledge-graph-embedding-based question answering
CN109885660A (en) * 2019-02-22 2019-06-14 上海乐言信息科技有限公司 Knowledge-graph-empowered question answering system and method based on information retrieval
CN110209787A (en) * 2019-05-29 2019-09-06 袁琦 Intelligent question answering method and system based on a pet knowledge graph
CN110390003A (en) * 2019-06-19 2019-10-29 北京百度网讯科技有限公司 Medical question-and-answer processing method and system, computer device, and readable medium
CN111782816A (en) * 2020-04-20 2020-10-16 北京沃东天骏信息技术有限公司 Method and device for generating knowledge graph, searching method, engine and system
CN112765312A (en) * 2020-12-31 2021-05-07 湖南大学 Knowledge graph question-answering method and system based on graph neural network embedding matching

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116089598A (en) * 2023-02-13 2023-05-09 合肥工业大学 Green knowledge recommendation method based on feature similarity and user demand
CN116089598B (en) * 2023-02-13 2024-03-19 合肥工业大学 Green knowledge recommendation method based on feature similarity and user demand
CN117195913A (en) * 2023-11-08 2023-12-08 腾讯科技(深圳)有限公司 Text processing method, text processing device, electronic equipment, storage medium and program product
CN117195913B (en) * 2023-11-08 2024-02-27 腾讯科技(深圳)有限公司 Text processing method, text processing device, electronic equipment, storage medium and program product
CN117725189A (en) * 2024-02-18 2024-03-19 国家超级计算天津中心 Method for generating questions and answers in professional field and electronic equipment
CN117725189B (en) * 2024-02-18 2024-04-16 国家超级计算天津中心 Method for generating questions and answers in professional field and electronic equipment

Also Published As

Publication number Publication date
CN113282729B (en) 2024-06-18

Similar Documents

Publication Publication Date Title
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
CN111475623B (en) Case Information Semantic Retrieval Method and Device Based on Knowledge Graph
CN109885692B (en) Knowledge data storage method, apparatus, computer device and storage medium
CN111581354A (en) FAQ question similarity calculation method and system
CN113282729B (en) Knowledge graph-based question and answer method and device
CN109271514B (en) Generation method, classification method, device and storage medium of short text classification model
CN113961685A (en) Information extraction method and device
Kejriwal et al. Information extraction in illicit web domains
Zu et al. Resume information extraction with a novel text block segmentation algorithm
CN116628173B (en) Intelligent customer service information generation system and method based on keyword extraction
CN110941958B (en) Text category labeling method and device, electronic equipment and storage medium
CN115017303A (en) Method, computing device and medium for enterprise risk assessment based on news text
CN114090776A (en) Document analysis method, system and device
Gharibi et al. FoodKG: A tool to enrich knowledge graphs using machine learning techniques
US20220027748A1 (en) Systems and methods for document similarity matching
Khan et al. Text mining challenges and applications—a comprehensive review
CN115269816A Core personnel mining method and device based on an information processing method, and storage medium
CN117349420A (en) Reply method and device based on local knowledge base and large language model
Dabade Sentiment analysis of Twitter data by using deep learning and machine learning
CN117291192B (en) Government affair text semantic understanding analysis method and system
CN114626367A (en) Sentiment analysis method, system, equipment and medium based on news article content
CN114118082A (en) Resume retrieval method and device
CN114842982B (en) Knowledge expression method, device and system for medical information system
Vit Compression-based similarity
CN112949287B (en) Hot word mining method, system, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant