CN115658845A - Intelligent question-answering method and device suitable for open-source software supply chain

Info

Publication number: CN115658845A
Application number: CN202211212061.5A
Authority: CN
Inventors: 吴敬征; 崔星; 罗天悦; 武延军; 邱志国
Original assignee: Institute of Software of CAS
Current assignee: Institute of Software of CAS
Priority date: 2022-09-30
Filing date: 2022-09-30
Publication date: 2023-01-31

Abstract

The invention provides an intelligent question-answering method and device suitable for an open-source software supply chain. The method comprises the following steps: 1) Question similarity measurement: calculating the similarity between the question input by the user and the questions in a predefined FAQ library, and, if the similarity exceeds a threshold, directly returning the answer to the corresponding question in the FAQ library to the user; 2) Question parsing: performing semantic analysis on the user question to obtain information such as the question type and the entity objects it contains; 3) Answer generation: generating candidate path subgraphs according to the semantic parsing result, scoring and ranking the candidate subgraphs, and returning the corresponding information to the user; 4) FAQ library update: prompting the user to evaluate the answer, and updating the FAQ library if the question was answered correctly. The method and device can improve the accuracy of intelligent question answering over the knowledge graph of the open-source software supply chain.

Description

Intelligent question-answering method and device suitable for open-source software supply chain

Technical Field

The invention relates to the technical field of computers, in particular to an intelligent question answering method and device suitable for an open source software supply chain.

Background

In recent years, information technology has been rapidly developed, and the internet and the mobile internet have been rapidly popularized. The internet has become an important way for knowledge acquisition and information dissemination by virtue of high efficiency and convenience of information dissemination. However, the explosive increase of information also increases the difficulty of people for acquiring correct information, and users are disturbed by a large amount of redundant information in the actual use process, so that the efficiency of acquiring information becomes low.

The open source software has the characteristics of openness, sharing, freedom and the like, and plays an increasingly important role in software development. Since the code of the open source software is freely accessible and the open source community is very active, more and more developers apply it to actual development work. It is becoming increasingly important to be able to accurately and quickly obtain information about software including source code, intellectual property, and maintenance information. In the past, the main way for people to acquire information is through a search engine, and a traditional search engine returns matched webpages according to certain rules through analyzing keywords input by a user, but the most direct answers cannot be directly given to the user. The intelligent question-answering system is a high-level form of an information retrieval system, and can answer the questions of the user by analyzing the question of the user and using a simple and accurate natural language, so that the efficiency of the user for acquiring knowledge is improved. In order to improve the accuracy of the question and answer result, the intelligent question and answer technology is usually combined with the knowledge graph technology at present, and the accurate answer of the user question is found through the knowledge graph in the related field. However, the knowledge graph usually contains massive information, so when a user asks a question, how to find the most relevant information to the user question and generate a final answer is a question faced by the knowledge graph-based intelligent question-answering system.

The input of the intelligent question-answering system is a user question, and the output is an answer. Current solutions can be divided into the following two categories:

1) Based on semantic parsing: the question is first subjected to deep syntactic analysis, the analysis results are then combined into an executable logical expression (such as SPARQL or Cypher), and the answer is queried directly from a graph database.

2) Based on information extraction: the main entity of the question is first identified, then one or more triples related to that entity are queried from the knowledge graph to form subgraph paths, the question and the subgraph paths are then encoded separately and ranked, and the highest-scoring path is returned as the answer.

The semantic-parsing approach offers stronger interpretability but requires a large labeling effort, because a large number of natural-language-to-logical-expression pairs must be annotated manually. The information-extraction approach is closer to an end-to-end scheme and performs better on complex questions and with few samples, but the computation slows down noticeably if the subgraph is too large.

Disclosure of Invention

In view of the above background and the current state of the art, the invention provides an intelligent question-answering method that can quickly query information related to user questions from the knowledge graph of an open-source software supply chain and generate the corresponding replies.

To achieve this purpose, the invention adopts the following technical scheme:

An intelligent question-answering method suitable for an open-source software supply chain comprises the following steps:

question similarity measurement: calculating the similarity between the questions input by the user and the questions in a predefined FAQ library, and if the similarity exceeds a certain threshold, directly returning answers of the corresponding questions in the FAQ library to the user;

question parsing: if the similarity between the question input by the user and the questions in the FAQ library does not exceed the threshold, parsing the question input by the user through a pre-trained language model and extracting the semantic information in it;

answer generation: generating an answer corresponding to the question input by the user based on that question and the extracted semantic information, in combination with the knowledge graph.

Further, the question similarity measurement includes: to improve question-answering quality, an FAQ module that measures question similarity is introduced first. For an input question, if a similar question exists in a predefined FAQ (Frequently Asked Questions) library and the similarity exceeds a threshold, the answer to the corresponding question in the library can be returned directly to the user. Question similarity is computed with a BERT pre-trained model: the two questions are concatenated with [SEP], and [CLS] is added at the head. The question with the highest similarity whose score exceeds the threshold is taken as the similar question, and its answer is returned to the user.
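The FAQ matching step can be illustrated with a minimal Python sketch. It shows how the question pair is laid out with [CLS] and [SEP] and how the threshold is applied. The scorer `token_overlap_score` is only a stand-in (Jaccard word overlap) so that the example runs without a model; in the described method the score would come from a BERT pre-trained model. The function names, the threshold value 0.8, and the sample FAQ entry are all hypothetical.

```python
from typing import Callable, Dict, Optional


def build_pair_input(question: str, faq_question: str) -> str:
    """Lay out the pair as described: [CLS] at the head, [SEP] between questions."""
    return f"[CLS] {question} [SEP] {faq_question}"


def token_overlap_score(pair_input: str) -> float:
    """Stand-in scorer (assumption): Jaccard overlap of words, NOT a BERT model."""
    body = pair_input.replace("[CLS]", "").strip()
    left, right = [part.split() for part in body.split("[SEP]")[:2]]
    a = {w.lower() for w in left}
    b = {w.lower() for w in right}
    return len(a & b) / len(a | b) if (a | b) else 0.0


def faq_lookup(
    question: str,
    faq: Dict[str, str],
    score_fn: Callable[[str], float] = token_overlap_score,
    threshold: float = 0.8,  # hypothetical value
) -> Optional[str]:
    """Return the answer of the most similar FAQ question if it beats the threshold."""
    best_answer, best_score = None, 0.0
    for faq_question, answer in faq.items():
        score = score_fn(build_pair_input(question, faq_question))
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer if best_score > threshold else None
```

When `faq_lookup` returns `None`, the question falls through to the parsing step described next.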

Further, the question parsing includes: parsing the user question through the pre-trained language model and determining the key information involved in the question. The semantic information in the user question is obtained mainly through the pre-trained language model and includes user question classification, named entity recognition, entity linking and the like.

Further, the answer generation includes: based on the user question and the extracted semantic information, and in combination with the knowledge graph, outputting the candidate path subgraphs (each consisting of several triples) in the knowledge graph that are related to the user question; scoring and ranking the candidate path subgraphs against the user question; and, if the score of a candidate path subgraph meets a threshold, generating the answer from the information of that subgraph, that is, taking the candidate path subgraph that meets the threshold as the answer to the question input by the user.

Further, the method also comprises a step of updating the FAQ library: after the conversation is finished, the user is prompted to evaluate whether the question has been solved, and if so, the corresponding question and answer are added to the FAQ library.

An intelligent question answering device suitable for an open source software supply chain, comprising:

the question similarity measurement module is used for calculating the similarity between the questions input by the user and the questions in the predefined FAQ library, and directly returning answers to the corresponding questions in the FAQ library to the user if the similarity exceeds a certain threshold;

the question parsing module is used for parsing the questions input by the user through the pre-trained language model and extracting semantic information in the questions input by the user when the similarity between the questions input by the user and the questions in the FAQ library does not exceed the threshold;

and the answer generating module is used for generating an answer corresponding to the question input by the user based on the question input by the user and the extracted semantic information by combining the knowledge graph.

Furthermore, the device also comprises an FAQ library updating module, which is used for prompting the user to evaluate whether the question has been solved after the conversation is completed, and for adding the corresponding question and answer to the FAQ library if it has.

The beneficial effects of the invention are:

The intelligent question-answering method can quickly query information related to a user question from the knowledge graph of the open-source software supply chain and generate a corresponding reply, improving the accuracy of intelligent question answering over that knowledge graph. In addition, by means of the semantic parsing model, the invention obtains the various semantic elements of the question, which removes the need to define templates manually in advance and allows the method to adapt to complex and varied question-answering scenarios.

Drawings

FIG. 1 is a flow diagram of an intelligent question-answering method suitable for use in an open source software supply chain.

FIG. 2 is a schematic diagram of a semantic parsing model structure.

FIG. 3 is a schematic diagram of the BERT-based path subgraph ranking model.

Detailed Description

The technical scheme provided by the invention is suitable for various intelligent question-answering scenarios. With it, user questions can be understood and the corresponding answers generated in combination with the knowledge graph of the open-source software supply chain, so that the questions raised by users are answered accurately and efficiently.

Referring to fig. 1, a flow chart of steps of the intelligent question answering method of the present invention is shown, and detailed steps are described as follows:

1. Question similarity measurement

The question input by the user is read and compared for similarity with the existing questions in the FAQ library. If the similarity to an existing question exceeds the threshold, the user question and the existing question are considered to be the same question, and the answer to the corresponding question in the library is returned directly to the user as the correct answer. If the similarity to the existing questions does not exceed the threshold, step 2 is performed.

2. User question parsing

A pre-trained semantic parsing model performs semantic analysis on the user question and determines the semantic element information it contains, including the question type, the entity information involved in the question, entity links, and so on.

User question understanding is a core module of the intelligent question-answering system. It is responsible for fine-grained semantic understanding of each component of the question, and comprises user question classification, entity recognition and extraction, entity linking, and the like.

User question classification is designed according to the actual requirements of the system, and different processing strategies are adopted for different question types. Common question types are entity query questions and relation query questions, where relation queries can be further divided into single-relation questions, multi-relation questions, and so on.

An entity query question asks for information such as the attributes related to a specific entity. For example, if the user question is "nano" and nano is the name of a software package, the question is an entity query question, and all information related to nano in the graph is returned as the answer. A single-relation question generally asks for one attribute of an entity, such as the user question "What is the size of the nano software package?". Here not only must the entity object nano be found, but it must also be understood that the target of the query is the size attribute of that object. Multi-relation questions, which ask for the value of an attribute of another entity related to the given entity, are often more common, for example "What is the description of the latest defect of the nano software package?".

After the question type is determined, the semantics of the sentence must be analyzed. The first step is named entity recognition, an important step in parsing the user question, which is carried out with a sequence labeling model. The entity type may be a software name, a bug name, a time, a size, and the like. Specifically, entity recognition is implemented in the invention as follows.

First, the data are labeled with "BIO" tags, where "B" denotes the beginning of an entity mention, "I" denotes the middle or end of the entity, and "O" denotes a non-entity word. In addition, to handle nested entities, the entity recognition model in the invention can output coarse-grained and fine-grained results at the same time, which addresses the Out-of-Vocabulary (OOV) problem and ensures that the subsequent modules fully understand the user question.
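As an illustration of the BIO scheme, the following Python sketch decodes a tagged token sequence into entity spans. It assumes tags of the form `B-TYPE`, `I-TYPE` and `O`; the type names used in the usage example (`SOFTWARE`, `ATTR`) are hypothetical and are not taken from the invention.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Turn parallel token/tag lists into (entity text, entity type) pairs."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity still open.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            # Continuation of the entity that is currently open.
            current.append(token)
        else:
            # "O" (or an inconsistent "I-") ends the open entity.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities
```

For example, the tokens `["what", "is", "the", "size", "of", "nano"]` with tags `["O", "O", "O", "B-ATTR", "O", "B-SOFTWARE"]` decode to the two entities `("size", "ATTR")` and `("nano", "SOFTWARE")`.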

To reduce labeling cost, the three-stage NEEDLE framework is used: a small amount of strongly labeled data is combined with a large amount of weakly labeled data, and the model is trained in stages, so that the volume of weakly labeled data improves named entity recognition. The framework has three main steps: first, the model is pre-trained on unlabeled data; then a completion operation is applied to the large amount of weakly labeled data to obtain more complete weak labels; finally, the model is fine-tuned on the small amount of strongly labeled data. The specific process is as follows:

(1) Pre-training is performed on unlabeled data using a Masked Language Model (MLM) objective, with training data drawn from the relevant domain.

(2) The unlabeled data are converted into weakly labeled data using domain knowledge of the open-source software supply chain. Starting from the model pre-trained on in-domain data, and using the strongly labeled data, the MLM head is replaced with a CRF classifier to obtain the Initial BERT-CRF model.

(3) Weak Label Completion is then carried out with this model. This step combines the predictions of the model with the existing weak label data, as follows:

$$
\tilde{y}^{c}_{i} =
\begin{cases}
\hat{y}_{i}, & \text{if } \tilde{y}^{w}_{i} = O \\
\tilde{y}^{w}_{i}, & \text{otherwise}
\end{cases}
$$

where $\tilde{y}^{c}_{i}$ denotes the completed label obtained by combining $\hat{y}_{i}$ and $\tilde{y}^{w}_{i}$ according to this rule, $\hat{y}_{i}$ denotes the label predicted by the model, $\tilde{y}^{w}_{i}$ denotes the weak label, and $O$ denotes the non-entity label.

This step remedies the incompleteness of the weak labels.
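A minimal Python sketch of weak label completion follows. It assumes the rule used in the NEEDLE framework: where the weak label is the non-entity tag, the model prediction is used; everywhere else the weak label is kept. The function name and the tag strings in the usage example are hypothetical.

```python
from typing import List


def complete_weak_labels(
    weak_tags: List[str],
    predicted_tags: List[str],
    non_entity: str = "O",
) -> List[str]:
    """Fill non-entity positions of the weak labels with the model prediction."""
    if len(weak_tags) != len(predicted_tags):
        raise ValueError("weak and predicted tag sequences must be aligned")
    return [
        predicted if weak == non_entity else weak
        for weak, predicted in zip(weak_tags, predicted_tags)
    ]
```

For instance, weak tags `["O", "B-SOFTWARE", "O"]` combined with predictions `["B-ATTR", "O", "O"]` give `["B-ATTR", "B-SOFTWARE", "O"]`: the existing weak entity label is kept even though the model predicted a non-entity there.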

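(4) Starting from the Initial BERT-CRF, the model is trained on the completed weak label data to obtain the Pre-trained BERT-CRF. Because using the negative log-likelihood directly on weak label data would cause the model to overfit the noise in the weak labels, the invention uses a confidence-estimation loss function based on the corrected weak labels to alleviate this problem. The loss function is expressed as follows:

$$
\begin{aligned}
\ell_{NA}\big(\tilde{Y}^{c}, f(\tilde{X};\theta)\big)
&= \mathbb{E}_{\tilde{Y}_m = \tilde{Y}^{c}_m \mid \tilde{X}_m}\, \mathcal{L}\big(\tilde{Y}^{c}_m, f(\tilde{X}_m;\theta), \mathbb{1}(\tilde{Y}_m = \tilde{Y}^{c}_m)\big) \\
&= \hat{P}(\tilde{Y}^{c} = \tilde{Y} \mid \tilde{X})\, \ell\big(\tilde{Y}^{c}, f(\tilde{X};\theta)\big) + \hat{P}(\tilde{Y}^{c} \neq \tilde{Y} \mid \tilde{X})\, \ell^{-}\big(\tilde{Y}^{c}, f(\tilde{X};\theta)\big)
\end{aligned}
$$

where $\ell_{NA}$ denotes the confidence-estimation loss function, $\tilde{Y}^{c}$ denotes the corrected sequence labels, $f(\tilde{X};\theta)$ denotes the model prediction, $\tilde{X}$ denotes the input data, $\theta$ denotes the model parameters, $\mathbb{E}$ denotes the expectation, $\tilde{Y}_m$ denotes the true label of the m-th entity, $\tilde{Y}^{c}_m$ denotes the corrected label of the m-th entity, $\tilde{X}_m$ denotes the m-th entity data, and $\tilde{Y}$ denotes the true sequence labels.

$\mathbb{1}(\cdot)$ is the indicator function, equal to 1 if the Boolean expression is true and 0 otherwise. $\hat{P}(\tilde{Y}^{c} = \tilde{Y} \mid \tilde{X})$ is the confidence estimate, that is, the probability that the weak label predicted by the model equals its actual label. $\ell(\cdot)$ is the negative log-likelihood. $\hat{P}(\tilde{Y}^{c} \neq \tilde{Y} \mid \tilde{X})$ is the probability that the weak label predicted by the model does not equal its actual label, and is equivalent to $1 - \hat{P}(\tilde{Y}^{c} = \tilde{Y} \mid \tilde{X})$.

The formula above amounts to taking an expectation over two cases:

when $\tilde{Y}_m = \tilde{Y}^{c}_m$, $\ell$ is taken as the loss;

when $\tilde{Y}_m \neq \tilde{Y}^{c}_m$, $\ell^{-}$ (the negative log-unlikelihood in the NEEDLE framework) is taken as the loss.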
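The weighting behind this loss can be illustrated numerically with a short Python sketch. It treats the loss for one sequence as a confidence-weighted mix of two terms: the negative log-likelihood when the corrected label is trusted, and a complementary term otherwise. The complementary term used here, -log(1 - p), is the negative log-unlikelihood of the NEEDLE framework and is an assumption rather than something stated explicitly in the text; the function and parameter names are hypothetical.

```python
import math


def noise_aware_loss(p_label: float, confidence: float) -> float:
    """Confidence-weighted loss for one corrected label sequence.

    p_label:    probability the model assigns to the corrected label sequence.
    confidence: estimated probability that the corrected label is the true label.
    """
    eps = 1e-12  # guards against log(0)
    nll = -math.log(max(p_label, eps))          # used when the label is trusted
    nlu = -math.log(max(1.0 - p_label, eps))    # assumed term for the untrusted case
    return confidence * nll + (1.0 - confidence) * nlu
```

With full confidence the loss reduces to the ordinary negative log-likelihood; with low confidence, assigning high probability to the (probably wrong) weak label is penalized instead of rewarded, which is what limits overfitting to label noise.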

(5) The above steps prevent the model from overfitting the noise in the weak labels, but they may also suppress the fit to the strongly labeled data. The model therefore needs to be fine-tuned once more on the strongly labeled data.

After the named entity objects are obtained through the above steps, the entity objects in the user question must be matched to the relevant nodes in the knowledge graph through entity linking. Entity linking is the key technique for accurately linking text in natural language to the concepts and entities of a knowledge base. In the invention, in order to cover the multiple entities involved in a complex question and to reduce the error rate of the entity linking module as far as possible, the entity linking module retains as many entity objects highly relevant to the question as possible. The specific procedure is as follows: assume that the result set of the entity recognition module is S = {m_1, m_2, …, m_n}. For each item m_i in the set, the candidate entity set E_i = {e_1, e_2, …, e_T} corresponding to m_i is found in the knowledge graph according to an existing alias dictionary of entity objects. The entire set of candidate entities in the user question can then be expressed as E = E_1 ∪ E_2 ∪ … ∪ E_n. For each entity object e, the similarity features between e and the user question and the intrinsic features of e are extracted to construct a feature vector, from which a similarity score between e and the user question is calculated. All candidate entities are sorted by similarity score, and the top-n entity object set E_r is retained as the result of the entity linking module.

In order to accomplish the entity link more accurately, the following features are used in the present invention:

1) Relevance between the entity object and the user question: the invention mainly extracts the semantic similarity between the entity object and the user question. The semantic similarity is computed by a binary classification model based on the BERT pre-trained language model, and the predicted similarity probability between the entity object and the user question is used as the final score.

2) Relevance between the context of the entity object and the user question: entities with the same mention in the knowledge graph are distinguished mainly by their different attribute values. The invention therefore screens entity objects by comparing the degree of correlation between the triples associated with each entity and the context of its mention in the question.

3) Intrinsic properties of the entity object: mainly the in-degree and out-degree of the entity object in the knowledge graph. An entity with higher in-degree and out-degree is more closely associated with other entities in the knowledge graph and carries more relevant information.
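The candidate generation and top-n selection of the entity linking module can be sketched in Python as follows. The sketch assumes that the three features above have already been computed for every candidate and are supplied as a tuple (question similarity, context relevance, degree score). Combining them with a weighted sum is an illustrative assumption, since the text does not specify how the feature vector is turned into a score; the weights and all names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def link_entities(
    mentions: Sequence[str],
    alias_dict: Dict[str, List[str]],
    features: Dict[str, Tuple[float, float, float]],
    weights: Tuple[float, float, float] = (0.5, 0.3, 0.2),  # hypothetical weights
    top_n: int = 3,
) -> List[str]:
    """Return the top-n candidate entities E_r for the recognized mentions."""
    # E = E_1 ∪ E_2 ∪ … ∪ E_n: union of alias-dictionary candidates per mention.
    candidates = set()
    for mention in mentions:
        candidates |= set(alias_dict.get(mention, []))

    def score(entity: str) -> float:
        return sum(w * f for w, f in zip(weights, features[entity]))

    # Sort by descending score; ties are broken by name for determinism.
    ranked = sorted(candidates, key=lambda e: (-score(e), e))
    return ranked[:top_n]
```

The default `top_n=3` follows the value that the pruning discussion gives as the usual choice of n.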

3. Candidate query path generation

For each entity object e in the set E_r obtained by the entity linking module, the candidate query path set Q is obtained by dynamic expansion. Because complex questions may produce excessively large subgraphs that noticeably slow down computation, pruning strategies are applied during expansion to reduce the size of the candidate path set Q, balancing effectiveness and efficiency. The specific expansion procedure is as follows:

1) Extract from the knowledge graph all triples (e_r, r_1, x) that contain e_r (e_r ∈ E_r), forming the set Q_onehop of single-hop query paths e_r r_1 x, where x is the answer corresponding to the entity object e_r in each single-hop path and r_1 is the relation path between the nodes.

2) For each (e_r, r_1, x) ∈ Q_onehop, expand the corresponding query path from the entity object x to obtain the triple combination (e_r, r_1, x), (x, r_2, e_2). According to pruning strategy a, generate the multi-hop path {e_r r_1 x, x r_2 e_2} and take the entity object x as the answer corresponding to the query path. According to pruning strategy b, generate the multi-hop path {e_r r_1 x, x r_2 y} and take the entity object y as the answer corresponding to the query path. This yields the two-hop query path set Q_twohop.

3) For each {e_r r_1 x, x r_2 e_2} ∈ Q_twohop, expand the query path from the entity object x to further obtain the triple combination (e_r, r_1, x), (x, r_2, e_2), (x, r_3, e_3). According to pruning strategy a, generate the multi-hop path {e_r r_1 x, x r_2 e_2, x r_3 e_3} and take the entity object x as the answer corresponding to the query path. According to pruning strategy b, generate the multi-hop path {e_r r_1 x, x r_2 e_2, x r_3 y} and take the entity object y as the answer corresponding to the query path. This yields the multi-hop query path set Q_threehop.

4) Merge the candidate query path sets obtained above to obtain all possible query paths Q: Q = Q_onehop ∪ Q_twohop ∪ Q_threehop.

The two pruning strategies used in the expansion phase are:

a) The entity linking module retains the top-n entities, where n can be chosen according to actual conditions and is usually set to 3. If all the entity objects on a query path appear in the entity linking result set, the query path is highly relevant to the question and is retained.

b) For a multi-hop path of a single entity, a multi-hop attribute is retained only if it overlaps with the user question at the character level. For example, for the query path {e_r r_1 x, x r_2 y}, if the user question q and the attribute r_2 share a common subsequence, the path is added to the two-hop query path set Q_twohop; otherwise the path is discarded.

Based on these two strategies, a large number of multi-hop paths irrelevant to the user question can be eliminated, which reduces the query scale of the path subgraphs.
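The expansion and the two pruning strategies can be sketched on a toy triple list in Python. The sketch covers only one-hop and two-hop paths; the three-hop step follows the same pattern. The character-level test is implemented as "the relation name and the question share a substring of at least `min_len` characters", which is one possible reading of the common-subsequence condition; `min_len`, the function names, and the sample triples are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
Path = Tuple[List[Triple], str]  # (triples on the path, answer node)


def has_common_substring(question: str, relation: str, min_len: int = 4) -> bool:
    """True if some min_len-character slice of the relation occurs in the question."""
    q, r = question.lower(), relation.lower()
    return any(r[i:i + min_len] in q for i in range(len(r) - min_len + 1))


def expand_paths(
    triples: List[Triple], linked_entities: Set[str], question: str
) -> Tuple[List[Path], List[Path]]:
    """Build Q_onehop and Q_twohop for the linked entities E_r."""
    by_head: Dict[str, List[Triple]] = {}
    for triple in triples:
        by_head.setdefault(triple[0], []).append(triple)

    one_hop: List[Path] = []
    two_hop: List[Path] = []
    for e in sorted(linked_entities):
        for first in by_head.get(e, []):
            x = first[2]
            one_hop.append(([first], x))
            for second in by_head.get(x, []):
                _, r2, tail = second
                if tail in linked_entities and tail != e:
                    # Strategy a: every entity on the path was linked; answer is x.
                    two_hop.append(([first, second], x))
                elif has_common_substring(question, r2):
                    # Strategy b: relation overlaps the question; answer is y (tail).
                    two_hop.append(([first, second], tail))
    return one_hop, two_hop
```

For the question "what is the description of the latest defect of nano" and a graph in which nano has a defect node with `description` and `severity` attributes, only the `description` path survives strategy b, because `severity` shares no four-character substring with the question.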

Finally, the generated path subgraph candidates are expanded into a text sequence p in a rule-based mode.

4. Query path ranking

In this step, the semantic matching degree S(q, p) between the user question q and the text sequence p of each candidate subgraph is calculated. The semantic similarity is computed with the BERT pre-trained language model. The candidate path subgraph whose score satisfies the following formula is selected as the final answer to the user question:

$$
\hat{p} = \arg\max_{p \in P} S(q, p)
$$

where S(q, p) denotes the semantic similarity between the user question q and the path subgraph text sequence p, $\hat{p}$ denotes the optimal path among the candidate path subgraphs, and P denotes all path subgraph sequences.

In the invention, the BERT pre-trained language model is adopted as the encoder. As shown in FIG. 3, the BERT model encodes the question side and the path side separately. The output vectors h_q and h_p at the [CLS] position at the beginning of each sentence are used as the encoded representations of the user question and of the path subgraph text sequence, and the similarity score is then calculated with cosine similarity: S(q, p) = cos(h_q, h_p). The loss function is a margin loss:

$$
\text{loss} = \max\big(0,\ \gamma + S(q, p^{-}) - S(q, p^{+})\big)
$$

where γ is a hyperparameter of the model, and p^+ and p^- are a correct path subgraph and an incorrect path subgraph, respectively.
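The scoring and the margin loss can be checked on small vectors with the following Python sketch. The vectors stand in for the [CLS] encodings that BERT would produce; no model is involved, and the margin value 0.1 is a hypothetical choice.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def margin_loss(
    h_q: Sequence[float],
    h_pos: Sequence[float],
    h_neg: Sequence[float],
    gamma: float = 0.1,  # hypothetical margin
) -> float:
    """max(0, gamma + S(q, p-) - S(q, p+)) with S = cosine similarity."""
    return max(0.0, gamma + cosine(h_q, h_neg) - cosine(h_q, h_pos))
```

The loss is zero once the correct path scores at least γ higher than the incorrect one, so training effort concentrates on pairs the model still confuses.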

During model training, hundreds or even thousands of candidate path subgraphs may be generated for a single user question. Training on them directly leads to an imbalance between positive and negative samples, as well as a lack of diversity among the negative samples. To address the imbalance, the invention limits the number of negative samples and controls the ratio of positive to negative samples at k:1. To alleviate the diversity problem, a training method based on dynamic negative sampling is introduced. Negative sampling selects, by some strategy, a subset of all negative samples as training data; not all negative samples are used, mainly in order to reduce the training complexity of the model. In dynamic negative sampling, the samples are scored during training using the results of the previous round, and the sampling probability of each negative sample is adjusted according to its score: the higher the score, the higher the probability of being sampled. This strategy keeps the training data for user questions balanced and improves the performance of the trained model in practical application scenarios.
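One way to realize score-dependent sampling is sketched below in Python. Negatives are drawn without replacement, with probability proportional to the score each received in the previous round. Proportional weighting is an assumption, since the text only states that a higher score means a higher sampling probability; the function name and the small floor value are hypothetical.

```python
import random
from typing import List, Sequence


def sample_negatives(
    negatives: Sequence[str],
    scores: Sequence[float],
    k: int,
    rng: random.Random,
) -> List[str]:
    """Draw up to k distinct negatives, favoring those with higher scores."""
    pool = list(zip(negatives, scores))
    chosen: List[str] = []
    for _ in range(min(k, len(pool))):
        # Floor the weights so zero or negative scores can still be drawn.
        weights = [max(score, 1e-6) for _, score in pool]
        index = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(index)[0])
    return chosen
```

Passing a seeded `random.Random` instance keeps training runs reproducible. After each round, the scores would be recomputed with the updated model before the next call.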

5. FAQ library update

After the question has been answered, the user is asked whether the question was effectively solved. If the user answers in the affirmative, the corresponding user question and answer are added to the FAQ library, so that when the same question is encountered later, the answer can be given directly by searching the FAQ library.

For user questions from which no valid semantic element information can be extracted, a unified fallback reply can be returned, for example, "Sorry, I cannot answer your question yet".

As can be seen from the above, the intelligent question-answering scheme provided by the invention comprises, in order, user question similarity calculation, entity extraction, entity linking, candidate path subgraph extraction, candidate path subgraph ranking, answer generation, and so on.
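The overall control flow, including the FAQ update and the fallback reply, can be summarized in a short Python sketch. The three callables stand in for the FAQ matching, question parsing and answer generation modules and are supplied by the caller; their names and signatures are hypothetical. Returning the fallback when answer generation yields nothing is an added assumption, since the text specifies the fallback only for failed semantic extraction.

```python
from typing import Callable, Dict, Optional

FALLBACK_REPLY = "Sorry, I cannot answer your question yet."


def answer_question(
    question: str,
    faq: Dict[str, str],
    lookup_fn: Callable[[str, Dict[str, str]], Optional[str]],
    parse_fn: Callable[[str], dict],
    generate_fn: Callable[[str, dict], Optional[str]],
) -> str:
    """FAQ match first, then parsing and answer generation, else fallback."""
    cached = lookup_fn(question, faq)
    if cached is not None:
        return cached
    semantics = parse_fn(question)
    if not semantics:
        # No usable semantic elements were extracted from the question.
        return FALLBACK_REPLY
    answer = generate_fn(question, semantics)
    return answer if answer is not None else FALLBACK_REPLY


def record_feedback(faq: Dict[str, str], question: str, answer: str, solved: bool) -> None:
    """Add the pair to the FAQ library only when the user confirms it was solved."""
    if solved and answer != FALLBACK_REPLY:
        faq[question] = answer
```

Once a confirmed pair is recorded, a later identical question is served from the FAQ library without running the parsing and ranking models again.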

The semantic parsing model used in the invention is shown in FIG. 2. The user question is input into the BERT pre-trained language model to obtain a semantic encoding, which is then fed into the question type classification model and the entity recognition model to obtain the user question category and the entity recognition result. The user question classification and named entity recognition modules are thus not independent tasks: both are derived from the same semantic encoding, and there is semantic association between them. The modules therefore constrain and inform one another, which can further improve the accuracy of each module's task.

Based on the same inventive concept, another embodiment of the present invention provides an intelligent question answering device suitable for an open source software supply chain, including:

the question similarity measurement module is used for calculating the similarity between the questions input by the user and the questions in the predefined FAQ library, and if the similarity exceeds a certain threshold value, the answers of the corresponding questions in the FAQ library are directly returned to the user;

the question parsing module is used for parsing the questions input by the user through a pre-trained language model and extracting semantic information in the questions input by the user when the similarity between the questions input by the user and the questions in the FAQ library does not exceed the threshold;

Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smartphone, etc.) comprising a memory storing a computer program configured to be executed by the processor, and a processor, the computer program comprising instructions for performing the steps of the inventive method.

Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program, which when executed by a computer, implements the steps of the inventive method.

The particular embodiments of the present invention disclosed above are illustrative only and are not intended to be limiting, since various alternatives, modifications, and variations will be apparent to those skilled in the art without departing from the spirit and scope of the invention. The invention should not be limited to the disclosure of the embodiments in the present specification, but the scope of the invention is defined by the appended claims.

Claims

1. An intelligent question-answering method suitable for an open source software supply chain comprises the following steps:

calculating the similarity between the questions input by the user and the questions in a predefined FAQ library, and if the similarity exceeds a certain threshold, directly returning answers of the corresponding questions in the FAQ library to the user;

if the similarity between the questions input by the user and the questions in the FAQ library does not exceed the threshold, analyzing the questions input by the user through a pre-trained language model, and extracting semantic information in the questions input by the user;

and generating an answer corresponding to the question input by the user by combining the knowledge graph based on the question input by the user and the extracted semantic information.

2. The method of claim 1, wherein calculating the similarity of the user-entered question to questions in a predefined FAQ library comprises:

calculating the similarity through a BERT pre-trained model, wherein the two questions are concatenated with [SEP] and [CLS] is added at the head to complete the question similarity calculation.

3. The method of claim 1, wherein parsing the user-entered question through the pre-trained language model to extract semantic information from the user-entered question comprises:

classifying the question input by the user and determining the category of the question;

performing named entity recognition and extracting the entity objects contained in the question input by the user;

and performing entity linking, linking the extracted entity objects to the corresponding nodes in the knowledge graph of the open-source software supply chain.

4. The method of claim 1, wherein generating an answer corresponding to the question input by the user in combination with the knowledge graph based on the question input by the user and the extracted semantic information comprises:

acquiring the candidate path subgraphs in the knowledge graph that are related to the question input by the user, scoring and ranking the candidate path subgraphs according to the question input by the user, and taking a candidate path subgraph as the answer to the question input by the user if its score meets a certain threshold.

5. The method of claim 4, wherein, for each entity object e in the entity object set E_r obtained by entity linking, a candidate query path set Q is obtained by dynamic expansion, and the size of the candidate path set Q is reduced by applying corresponding pruning strategies during expansion, balancing effectiveness and efficiency.

6. The method of claim 1, further comprising the step of FAQ library update: and after the answer of the question input by the user is obtained, prompting the user to evaluate whether the question is solved, and if the question is solved, adding the corresponding question and answer into an FAQ library.

7. An intelligent question answering device suitable for an open source software supply chain, comprising:

8. The apparatus of claim 7, further comprising an FAQ library update module for prompting the user, after completion of a conversation, to evaluate whether the question has been solved, and, if so, adding the corresponding question and answer to the FAQ library.

9. An electronic apparatus, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a computer, implements the method of any one of claims 1 to 7.