CN110390050B

CN110390050B - Software development question-answer information automatic acquisition method based on deep semantic understanding

Info

Publication number: CN110390050B
Application number: CN201910620493.1A
Authority: CN
Inventors: 孙海龙; 王旭; 张振羽; 刘旭东
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2019-07-10
Filing date: 2019-07-10
Publication date: 2021-12-07
Anticipated expiration: 2039-07-10
Also published as: CN110390050A

Abstract

The invention provides a method for automatically acquiring software development question-answer information based on deep semantic understanding, comprising the following steps: step 1, using a text search engine to retrieve n posts related to a question Q from a plurality of developer question-and-answer forum websites, based on seven features of the posts; step 2, for the question Q and the n related posts, constructing n data pairs "<Q, post>" and inputting them into a usefulness inference network, KnowNet, for inference and prediction, the network outputting a usefulness grade for each post and a probability value for that grade; and step 3, based on the inference results of KnowNet, sorting the posts within each usefulness grade by probability value, then ordering the grades from highest to lowest usefulness, and finally selecting the K most useful posts and returning them to the questioner.

Description

Software development question-answer information automatic acquisition method based on deep semantic understanding

Technical Field

The invention relates to an automatic information acquisition method, in particular to a software development question-answer information automatic acquisition method based on deep semantic understanding.

Background

During software development, developers routinely run into problems such as debugging bugs or working out how to call an API. To help solve these problems, software development communities such as blogs and online question-and-answer forums have been established on the internet, for example Stack Overflow and CSDN. On these platforms, any developer can ask a question, other developers who can answer it can post their own answers, and the questioner marks the answer that solves the problem as the accepted answer.

Although these communities give developers a platform to discuss questions and share (or obtain) answers online, there is no guarantee that a question will be answered in a timely manner, so many developers prefer to find the knowledge they need by browsing large numbers of related community posts rather than asking directly. Because so many users participate and share, these communities hold a large amount of knowledge for solving developers' problems, but that knowledge is severely fragmented, the platforms do not analyze how pieces of knowledge relate to each other, and there is no effective information acquisition method to help developers quickly find what they need.

In order to help developers to efficiently acquire software development question-answer information, the prior art mainly includes the following two categories:

1) Recommending an expert suited to answering a certain type of question. These methods typically build a profile of each community developer from the characteristics of the questions the developer has answered in the past, describing the developer's professional abilities. For a new question, they then recommend experts able to answer it by analyzing the content of the question against the profiled abilities of the developers.

2) Recommending relevant information for the question. For example, according to the domain characteristics of software development, information related to an API is retrieved to help a developer understand and find the required API; some methods also perform bilingual searches to recommend relevant information to non-English-speaking developers. More recently, many methods use deep learning to analyze the relationship between question-and-answer posts and the post features of community websites, and recommend related information based on that relationship.

Although the prior art can help developers to obtain more relevant information to a certain extent, the following two problems exist:

1) a recommended expert merely has the ability to answer the question; this does not mean the expert has the time and energy to deal with it promptly, so the question may still go unanswered in a timely manner;

2) when relevant information is recommended for software development questions, purely retrieval-based methods lack a deep understanding of the semantics of the question and the related text, while semantic analysis methods based on deep learning have poor general semantic understanding because of the weakness of their models. Furthermore, "relevant information" is not the same as "useful information", and such methods do not model or analyze the usefulness of the search results.

In summary, no existing method automatically reasons about the usefulness of the retrieved related information, so useful information cannot be provided for a question in a targeted manner.

Disclosure of Invention

The invention provides a method for automatically acquiring software development question-answer information based on deep semantic understanding, which offers a useful-information acquisition service for automatic question answering in software development. Based on the big data accumulated by software development question-and-answer communities, the method retrieves posts related to a given question, infers how useful each post is for answering the question, and returns the highly useful posts to the questioner.

The invention provides a method for automatically acquiring software development question-answer information based on deep semantic understanding, comprising the following steps:

Step 1, using a text search engine to search for n posts related to the question from a plurality of developer question-and-answer forum websites, based on seven features of the posts. For each post, the question text and the answer text are extracted to form a plain-text description, mathematical expressions are replaced with NUM, and code is replaced with CODE. This gives a fuzzy representation of the semantics of mathematical expressions and code, and retains as much text information as possible given that current research lacks the ability to understand mathematics and code.

Step 2, for the question Q and the n posts related to it, constructing n data pairs "<Q, post>" and inputting them into the usefulness inference network (KnowNet) for inference and prediction; the network outputs a usefulness grade for each post (four grades, 0, 1, 2 and 3, in order of decreasing usefulness) and a probability value for that grade.

Step 3, based on the inference results of the usefulness inference network KnowNet, sorting the posts within each usefulness grade by probability value, then ordering the grades from highest to lowest usefulness, and finally selecting the K most useful posts and returning them to the questioner.
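The plain-text preprocessing mentioned in step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not say how code or mathematical expressions are detected, so the HTML-tag and regular-expression patterns below, and the function name `to_plain_text`, are assumptions chosen for the example.

```python
import re

# Hypothetical detection rules: the patent does not specify them.
CODE_BLOCK = re.compile(r"<pre>.*?</pre>|<code>.*?</code>", re.DOTALL)
TAG = re.compile(r"<[^>]+>")
# A number, optionally followed by "operator number" groups, e.g. "1 + 2".
MATH = re.compile(r"\d+(?:\.\d+)?(?:\s*[-+*/^=<>]\s*\d+(?:\.\d+)?)*")


def to_plain_text(question_html: str, answer_html: str) -> str:
    """Join question and answer text, masking code as CODE and math as NUM."""
    text = question_html + " " + answer_html
    text = CODE_BLOCK.sub(" CODE ", text)   # mask code first so its digits survive as CODE
    text = TAG.sub(" ", text)               # drop remaining markup
    text = MATH.sub("NUM", text)            # mask numeric expressions
    return " ".join(text.split())           # collapse whitespace


if __name__ == "__main__":
    q = "<p>Why is 1 + 2 wrong?</p><pre>int x = 3;</pre>"
    a = "<p>Use <code>foo()</code> instead.</p>"
    print(to_plain_text(q, a))
```

Masking code before masking numbers keeps literals inside code from being turned into NUM separately; a real system would likely need a more careful math detector than this pattern.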

Drawings

FIG. 1 is an overall flow chart of the present invention;

FIG. 2 is a block diagram of the usefulness inference network KnowNet;

FIG. 3 is a diagram illustrating the analysis of the post link graph according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

Step 1, using a text search engine to search for n posts related to the question from a plurality of developer question-and-answer forum websites, based on seven features of the posts.

Step 2, for the question Q and the n posts related to it, constructing n data pairs "<Q, post>" and inputting them into the usefulness inference network (KnowNet) for inference and prediction; the network outputs a usefulness grade for each post (four grades, 0, 1, 2 and 3, in order of decreasing usefulness) and a probability value for that grade, where n is a positive integer.

The seven post features processed by the text search engine are the question title, the question body, the answer, the code, the Tag information, the numbers of votes and comments, and the time of the post.

For the question title, the text search engine extracts keywords from the title of question Q and from the question title of each related post, and matches them;

the question body is the detailed description of the question discussed in the post; for it, the text search engine vectorizes the text with tf-idf and computes the cosine similarity with the tf-idf vector of the body of question Q;

the answer is the part of the post that discusses how to solve the problem; the text search engine processes it in a way similar to the question body;

the code is the code contained in the post that relates to the question Q; the text search engine removes control symbols and common tokens (such as if, else and while) from it, extracts the API sequence, and performs keyword matching between the processed API sequence and the API sequence of the code in question Q;

the question of each post is tagged with several keywords describing its topic; the text search engine performs keyword matching between the topic tags of question Q and those of the related post;

the vote count of a post reflects its quality and the comment count reflects its popularity, and a post of good quality and high popularity is relatively more useful; the text search engine measures this popularity feature as "α × vote count + (1 - α) × comment count" (the suggested value of α is 0.8);

because software is updated and iterated quickly, posts are time-sensitive; the text search engine represents the time feature as the difference between the time of the current question and the average of the post's publication time and last active time (modification, answer, comment, etc.).
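Three of the per-feature scores described above can be sketched in a few lines. This is an illustrative sketch only: the smoothed idf formula, the function names, and the use of plain numeric timestamps are assumptions, since the text names tf-idf and cosine similarity but does not fix a particular variant.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (an assumption) keeps weights positive on tiny corpora.
        vectors.append({
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def popularity(votes, comments, alpha=0.8):
    """alpha x vote count + (1 - alpha) x comment count, with alpha = 0.8 as suggested."""
    return alpha * votes + (1 - alpha) * comments


def time_gap(question_time, posted_time, last_active_time):
    """Question time minus the average of posting time and last active time."""
    return question_time - (posted_time + last_active_time) / 2.0


if __name__ == "__main__":
    bodies = [["null", "pointer", "exception"], ["null", "pointer", "crash"]]
    q_vec, post_vec = tfidf_vectors(bodies)
    print(cosine(q_vec, post_vec), popularity(10, 5), time_gap(100.0, 40.0, 60.0))
```

The raw values produced here live on very different scales, which is why a normalization step is needed before they can be added together.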

The seven feature scores (question title, question body, answer, code, Tag information, votes and comments, and post time) have different dimensions, so the text search engine normalizes each type of feature score with the z-score normalization method, x' = (x - mean)/std, where x is the raw feature score, mean is the mean of the feature, and std is the standard deviation of the feature. The sum of the seven normalized feature scores is used as the score of each post, and the related posts are then sorted to obtain the n search results with the highest scores.
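A minimal sketch of this normalize-sum-sort step is shown below. The data layout (`feature_table` mapping a post id to its list of raw feature scores), the use of the population standard deviation, and the handling of a constant feature are assumptions made for the example.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Apply x' = (x - mean) / std to one feature column."""
    mu = mean(values)
    sigma = pstdev(values)  # population std; the text does not say which variant
    if sigma == 0:
        # A constant feature carries no ranking information (assumed handling).
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


def top_n(feature_table, n):
    """feature_table: {post_id: [raw score per feature]} -> n best post ids."""
    post_ids = list(feature_table)
    # Transpose so that each column holds one feature across all posts.
    columns = list(zip(*(feature_table[p] for p in post_ids)))
    normalized = [z_scores(col) for col in columns]
    totals = {p: sum(col[i] for col in normalized) for i, p in enumerate(post_ids)}
    return sorted(post_ids, key=lambda p: totals[p], reverse=True)[:n]


if __name__ == "__main__":
    table = {"a": [1.0, 10.0], "b": [2.0, 20.0], "c": [3.0, 30.0]}
    print(top_n(table, 2))
```

Normalizing per feature before summing means a feature measured in large units, such as a time difference in seconds, does not dominate a similarity score bounded by 1.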

FIG. 2 shows the KnowNet network structure. The network is built on the BERT encoding model, introduces an attention mechanism based on the semantic vectors of the question-and-answer data to pool features from the encoding results, and computes the usefulness result with a softmax layer. The model comprises 4 parts: word embedding, the BERT encoder, heterogeneous attention, and the softmax layer. The BERT encoder may be replaced by GPT-2, XLNet, etc.

The specific description is as follows:

In word embedding, Position Embedding maps the position index of a word to a vector representing that position. Token Embedding maps a word to a vector representing its shallow semantics. Segment Embedding maps the type of the text (in this model, question type 0 and context type 1) to a vector representing the type information. The three resulting vectors for each word are summed to form the output of this part.
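The summation of the three embeddings can be illustrated with toy lookup tables. In a trained model these tables hold learned parameters; here they are filled with random numbers, and the vocabulary, the dimension of 4, and the helper names are assumptions made purely for the example.

```python
import random

DIM = 4  # toy embedding size, chosen for illustration only


def make_table(num_rows, seed):
    """Random stand-in for a learned embedding table."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(num_rows)]


# Hypothetical vocabulary; [CLS] and [SEP] follow the usual BERT input layout.
VOCAB = {"[CLS]": 0, "[SEP]": 1, "how": 2, "to": 3, "call": 4,
         "api": 5, "use": 6, "requests": 7}
token_table = make_table(len(VOCAB), seed=0)
position_table = make_table(16, seed=1)
segment_table = make_table(2, seed=2)  # 0 = question, 1 = context


def embed(tokens, segments):
    """Sum token, position and segment embeddings for every input word."""
    out = []
    for pos, (tok, seg) in enumerate(zip(tokens, segments)):
        t = token_table[VOCAB[tok]]
        p = position_table[pos]
        s = segment_table[seg]
        out.append([a + b + c for a, b, c in zip(t, p, s)])
    return out


if __name__ == "__main__":
    tokens = ["[CLS]", "how", "to", "call", "api", "[SEP]", "use", "requests", "[SEP]"]
    segments = [0] * 6 + [1] * 3
    print(len(embed(tokens, segments)), "vectors of size", DIM)
```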

The BERT encoder describes text semantics using weights obtained by unsupervised training on massive amounts of text. Through the encoder's self-attention mechanism, each input vector captures information from its input context, so that each input word is finally represented by a vector describing its contextual semantics in the text. The KnowNet network structure uses the BERT encoder to encode the contextual semantics of the question and the context into corresponding vectors. In addition, the encoder outputs a special vector, CLS, which KnowNet uses to represent the semantic features of the question.

Attention mechanism based on the semantic vectors of the question-and-answer data. The KnowNet network structure takes the dot product of CLS with each encoding result vector (Q_1, …, Q_m, P_1, …, P_n) as the attention score between CLS and that vector, and weights the encoding result vectors (Q_1, …, Q_m, P_1, …, P_n) by their attention scores. The symbol ^ in FIG. 2 denotes the attention-weighted encoding result vectors, which represent the semantic relationship feature CLS_Attention between the question Q and the post. CLS and CLS_Attention are concatenated to form the input features.
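The pooling step can be sketched as below. The dot-product score and the concatenation of CLS with CLS_Attention come from the description; normalizing the scores with a softmax and summing the weighted vectors into a single vector are assumptions, since the text only says that the encoding result vectors are weighted by their attention scores.

```python
import math


def attention_pool(cls, encoded):
    """Return ([CLS ; CLS_Attention], attention weights).

    cls     -- the CLS vector produced by the encoder
    encoded -- the encoding result vectors (Q_1..Q_m, P_1..P_n)
    """
    # Attention score of each vector: its dot product with CLS.
    scores = [sum(c * x for c, x in zip(cls, vec)) for vec in encoded]

    # Assumption: scores are softmax-normalized before weighting.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Assumption: the weighted vectors are summed into one CLS_Attention vector.
    cls_attention = [
        sum(w * vec[d] for w, vec in zip(weights, encoded))
        for d in range(len(cls))
    ]
    return cls + cls_attention, weights


if __name__ == "__main__":
    features, weights = attention_pool([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
    print(features, weights)
```

The concatenated vector is twice the encoder width, so the classification layer that consumes it has to be sized accordingly.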

The softmax layer has 4 neurons, and the neuron outputs are normalized with the softmax function so that the output of each neuron is the probability of the corresponding usefulness level. The KnowNet network structure selects the maximum probability and its corresponding usefulness level as the final output of KnowNet.
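The output stage, together with the ranking of step 3, can be sketched as follows. The logits are made-up numbers, and sorting by descending probability within a grade is an assumption: the text says posts in the same grade are sorted by probability value without stating the direction.

```python
import math


def softmax(logits):
    """Normalize raw neuron outputs into probabilities."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return (usefulness level, probability) for the most likely level."""
    probs = softmax(logits)
    level = max(range(len(probs)), key=probs.__getitem__)
    return level, probs[level]


def rank_posts(predictions, k):
    """predictions: {post_id: (level, probability)}; level 0 is the most useful.

    Posts are ordered by level first, then by probability within a level
    (descending, which is an assumption).
    """
    ordered = sorted(predictions, key=lambda p: (predictions[p][0], -predictions[p][1]))
    return ordered[:k]


if __name__ == "__main__":
    print(predict([2.0, 0.5, 0.1, -1.0]))
    preds = {"a": (1, 0.9), "b": (0, 0.6), "c": (0, 0.8), "d": (3, 0.99)}
    print(rank_posts(preds, 3))
```

Sorting by level before probability means a confident prediction of a low grade never outranks a less confident prediction of a higher one.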

In one embodiment, to train KnowNet to predict usefulness, a post link graph is constructed, as shown in FIG. 3. On developer question-and-answer websites such as Stack Overflow, users typically mark the relationships between posts. A link between two duplicate posts is recorded as an edge with weight 0; every other link is recorded as an edge with weight 1, indicating that the user considered the linked post useful for answering the question or cited its content when answering. FIG. 3 shows 5 posts and their marked link relations (solid lines); the dotted lines are related links completed with the Dijkstra algorithm. Two posts whose distance is greater than 3 (that is, an edge with weight greater than 3 in the post link graph) are considered unrelated and are therefore omitted from the post link graph. The smaller the weight of an edge in the graph, the higher the usefulness. A sufficient number of "post A + link + post B" samples are drawn from the link graph; the question of post A and the whole content of post B (or vice versa) are used as the network input, and the link value between them is used as the usefulness-level label for training the network.
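The label construction described in this embodiment can be sketched with a standard Dijkstra search over the link graph. Treating links as undirected, the edge-list input format, and the example posts A to F are assumptions made for the illustration; the weights 0 and 1 and the cut-off at distance 3 follow the description.

```python
import heapq


def shortest_distances(graph, source):
    """Dijkstra's algorithm over a {node: {neighbor: weight}} graph."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist


def training_labels(edges, max_distance=3):
    """edges: (post_a, post_b, weight) with weight 0 for duplicates, 1 otherwise.

    Returns {(post_a, post_b): usefulness level}; pairs farther apart than
    max_distance are treated as unrelated and left out.
    """
    graph = {}
    for a, b, w in edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w  # assumption: links are undirected
    labels = {}
    for a in graph:
        for b, d in shortest_distances(graph, a).items():
            if a != b and d <= max_distance:
                labels[(a, b)] = d
    return labels


if __name__ == "__main__":
    links = [("A", "B", 0), ("B", "C", 1), ("C", "D", 1), ("D", "E", 1), ("E", "F", 1)]
    for pair, level in sorted(training_labels(links).items()):
        print(pair, level)
```

Each resulting pair corresponds to one "post A + link + post B" sample, with the shortest-path distance serving as the usefulness level 0 to 3.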

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A software development question-answer information automatic acquisition method based on deep semantic understanding, characterized by comprising the following steps: step 1, using a text search engine to search for n posts related to a question from a plurality of developer question-and-answer forum websites based on the features of the posts; step 2, for the question Q and the n posts related to it, constructing n data pairs <Q, post>, inputting them into a usefulness inference network KnowNet for inference and prediction, and outputting a usefulness grade of each post and a probability value indicating the grade; step 3, based on the inference result of the usefulness inference network KnowNet, ranking the posts within the same usefulness grade according to the probability value, then ranking all the grades together from high to low, and selecting the K posts with the highest usefulness to return to the questioner; in step 1, the features of the posts processed by the text search engine comprise question titles, question bodies, answers, codes, Tag information, the numbers of votes and comments, and the time of the posts; in step 1, the specific way of searching for n posts related to the question is to normalize the feature scores by a z-score normalization method, x' = (x - mean)/std, where x is the raw feature score, mean is the mean of the feature, and std is the standard deviation of the feature, and to use the sum of the feature scores as the score of each post to sort the related posts, so as to obtain the n search results with the highest scores; in step 2, the usefulness inference network KnowNet comprises word embedding, a BERT encoder, heterogeneous attention and a softmax layer; the word embedding comprises Position Embedding, Token Embedding and Segment Embedding, wherein Position Embedding maps the position index of a word to a vector representing the position, Token Embedding maps a word to a vector representing its shallow semantics, Segment Embedding maps the type of a text to a vector representing type information, and the word embedding sums the three types of representation vectors of each word as its output; the BERT encoder describes text semantics using weights obtained by unsupervised learning training on massive texts, each input vector captures information of its input context through the self-attention mechanism of the BERT encoder, the BERT encoder encodes the contextual semantics of the question and the context into corresponding vectors and outputs a vector CLS, and the vector CLS represents the semantic features of the question; the self-attention mechanism of the BERT encoder adopts the dot product of the CLS and each encoding result vector as the attention score between the CLS and that vector, and weights the encoding result vectors by the attention scores to obtain the semantic relationship feature CLS_Attention between the question Q and the post, and the CLS and the CLS_Attention are connected together as the input feature of the softmax layer.

2. The method of claim 1, wherein the softmax layer has 4 neurons, wherein the neuron outputs are normalized using a softmax function such that the output of each neuron corresponds to the probability of the corresponding usefulness level, and wherein the maximum probability and its corresponding usefulness level are selected as the final output of KnowNet.