CN110390050B - Software development question-answer information automatic acquisition method based on deep semantic understanding - Google Patents

Publication number
CN110390050B
Authority
CN
China
Prior art keywords
question
posts
usefulness
post
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910620493.1A
Other languages
Chinese (zh)
Other versions
CN110390050A (en)
Inventor
孙海龙
王旭
张振羽
刘旭东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN201910620493.1A
Publication of CN110390050A
Application granted
Publication of CN110390050B
Active legal status
Anticipated expiration legal status

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for automatically acquiring software development question-answer information based on deep semantic understanding, comprising the following steps: step 1, using a text search engine to retrieve, based on seven features of the posts, n posts related to a question Q from a plurality of developer question-and-answer forum websites; step 2, for the question Q and the n related posts, constructing n data pairs of the form <Q, post> and inputting them into a usefulness inference network, KnowNet, for inference and prediction, the network outputting for each post a usefulness grade and a probability value for that grade; and step 3, based on the inference results of KnowNet, sorting the posts within each usefulness grade by probability value, then ordering the grades from most useful to least useful, and finally selecting the K most useful posts and returning them to the questioner.

Description

Software development question-answer information automatic acquisition method based on deep semantic understanding
Technical Field
The invention relates to an automatic information acquisition method, in particular to a software development question-answer information automatic acquisition method based on deep semantic understanding.
Background
During software development, developers routinely encounter problems such as debugging a bug or working out how to call an API. To help solve these problems, software development communities such as blogs and online question-and-answer forums have been established on the Internet, for example Stack Overflow and CSDN. On these community platforms, any developer can ask a question, other developers who are able to answer it can provide their own answers, and the questioner marks the answer that solves the problem as the accepted answer.
Although these communities give developers a platform for discussing questions and sharing (or obtaining) answers online, there is no guarantee that a question will be answered in a timely manner, so many developers prefer to obtain the knowledge they need by browsing large numbers of related community posts rather than asking directly. Because so many users participate and share, these communities contain a large amount of knowledge for solving developers' problems, but that knowledge is heavily fragmented, the community platforms do not analyze the relationships among pieces of knowledge, and there is no effective information acquisition method that helps developers quickly find what they need.
In order to help developers to efficiently acquire software development question-answer information, the prior art mainly includes the following two categories:
1) Recommending an expert suited to answering a certain type of question. These methods generally build a user profile of each community developer from the characteristics of the questions that developer has answered in the past, describing the developer's professional abilities; for a new question, they then recommend experts able to answer it by analyzing the question content against the profiled abilities.
2) Recommending relevant information for the question. For example, some methods use the domain characteristics of software development to retrieve API-related information, helping a developer understand and find the required API; others perform bilingual search to recommend relevant information to developers who are not English speakers. More recently, many methods use deep learning to analyze the relationships between question-and-answer posts and the post features of community websites, and recommend related information based on those relationships.
Although the prior art can help developers to obtain more relevant information to a certain extent, the following two problems exist:
1) a recommended expert merely has the ability to answer the question; this does not mean that the expert has the time and energy to deal with it promptly, so the question may still go unanswered for a long time;
2) when relevant information is currently recommended for questions in the field of software development, purely retrieval-based methods lack a deep understanding of the semantics of the question and the related text, while deep-learning-based semantic analysis methods have poor general semantic understanding capability because the underlying models are weak. Furthermore, "relevant information" is not the same as "useful information", and such methods do not model or analyze the usefulness of the search results.
In summary, no existing method automatically reasons about the usefulness of the retrieved related information, so useful information cannot be provided for a question in a targeted manner.
Disclosure of Invention
The invention provides a method for automatically acquiring software development question-answer information based on deep semantic understanding, which provides a useful-information acquisition service for automatic question answering in software development. Based on the big data accumulated by software development question-and-answer communities, the method retrieves posts related to a given question, infers how useful each post is for answering the software development question, and returns the posts with high usefulness to the questioner.
The method for automatically acquiring software development question-answer information based on deep semantic understanding comprises the following steps:
Step 1, using a text search engine to retrieve, based on seven features of the posts, n posts related to the question from a plurality of developer question-and-answer forum websites. For each post, the question text and the answer text are extracted to form a plain-text description, mathematical expressions are replaced with NUM, and code is replaced with CODE. This gives a fuzzy representation of the semantics of mathematical expressions and code and retains as much text information as possible, given that current research lacks the ability to understand mathematics and code.
Step 2, for the question Q and the n related posts, constructing n data pairs of the form <Q, post> and inputting them into a usefulness inference network (KnowNet) for inference and prediction; the network outputs for each post a usefulness grade (one of four grades, 0, 1, 2 and 3, with usefulness decreasing from 0 to 3) and a probability value for that grade.
Step 3, based on the inference results of the usefulness inference network KnowNet, sorting the posts within each usefulness grade by probability value, then ordering the grades from most useful to least useful, and finally selecting the K most useful posts and returning them to the questioner.
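The fuzzy representation described in step 1 can be illustrated with a minimal sketch. It assumes that a post arrives as HTML with code wrapped in <pre> or <code> elements, as on Stack Overflow, and that a "mathematical expression" is a number or a short arithmetic expression over numbers; the patent does not specify either detection rule, and the function name mask_post and the regular expressions are hypothetical.

```python
import re

# Hypothetical detection rules: the patent does not say how code and
# mathematical expressions are recognised in a post.
CODE_BLOCK = re.compile(r"<pre>.*?</pre>|<code>.*?</code>", re.DOTALL)
HTML_TAG = re.compile(r"<[^>]+>")
MATH_EXPR = re.compile(r"\b\d+(?:\.\d+)?(?:\s*[-+*/^=<>]\s*\d+(?:\.\d+)?)*")


def mask_post(html: str) -> str:
    """Return plain text with code replaced by CODE and math replaced by NUM."""
    text = CODE_BLOCK.sub(" CODE ", html)   # mask code first so its digits are not touched
    text = HTML_TAG.sub(" ", text)          # drop the remaining markup
    text = MATH_EXPR.sub("NUM", text)       # mask numbers and simple arithmetic
    return " ".join(text.split())           # collapse whitespace


if __name__ == "__main__":
    print(mask_post("<p>Why does 3 + 4 fail?</p><pre><code>int x = 3 + 4;</code></pre>"))
```

Masking code before stripping tags keeps the code body out of the plain text entirely, which matches the stated goal of keeping the surrounding prose while treating code as an opaque token.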
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a block diagram of the usefulness inference network KnowNet;
FIG. 3 is a diagram illustrating the analysis of the post link Graph according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The method for automatically acquiring software development question-answer information based on deep semantic understanding comprises the following steps:
Step 1, using a text search engine to retrieve, based on seven features of the posts, n posts related to the question from a plurality of developer question-and-answer forum websites.
Step 2, for the question Q and the n related posts, constructing n data pairs of the form <Q, post> and inputting them into a usefulness inference network (KnowNet) for inference and prediction; the network outputs for each post a usefulness grade (one of four grades, 0, 1, 2 and 3, with usefulness decreasing from 0 to 3) and a probability value for that grade, wherein n is a positive integer.
Step 3, based on the inference results of the usefulness inference network KnowNet, sorting the posts within each usefulness grade by probability value, then ordering the grades from most useful to least useful, and finally selecting the K most useful posts and returning them to the questioner.
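The ordering in step 3 can be sketched as a two-key sort. The sketch assumes that grade 0 is the most useful grade, as stated in step 2, and that a higher probability within a grade ranks first; the Prediction type and the function name top_k are illustrative names, not part of the patent.

```python
from typing import List, NamedTuple


class Prediction(NamedTuple):
    post_id: str
    grade: int          # usefulness grade from KnowNet: 0 = most useful, 3 = least useful
    probability: float  # probability KnowNet assigned to that grade


def top_k(predictions: List[Prediction], k: int) -> List[Prediction]:
    """Order by grade (most useful first), then by descending probability, and keep K posts."""
    ranked = sorted(predictions, key=lambda p: (p.grade, -p.probability))
    return ranked[:k]


if __name__ == "__main__":
    preds = [
        Prediction("a", 1, 0.90),
        Prediction("b", 0, 0.60),
        Prediction("c", 0, 0.80),
        Prediction("d", 3, 0.99),
    ]
    for p in top_k(preds, 3):
        print(p.post_id, p.grade, p.probability)
```

Note that a confident grade-3 prediction ("d") still ranks below every grade-0 and grade-1 post, because the grade is the primary key and the probability only breaks ties within a grade.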
The seven post features processed by the text search engine are the question title, the question body, the answer, the code, the Tag information, the numbers of votes and comments, and the time of the post.
For the question title, the text search engine extracts keywords from the title of question Q and from the question titles of the related posts and matches them;
the question body is a detailed description of the question discussed in the post; for this feature the text search engine vectorizes the text with tf-idf and computes the cosine similarity with the tf-idf vector of the body of question Q;
the answer is the part of the post that discusses how to solve the problem; the text search engine processes it in the same way as the question body;
the post may contain code related to question Q; for this feature the text search engine removes control symbols and common keywords (such as "if", "else" and "while") from the code, extracts the API sequence, and performs keyword matching between the processed API sequence and the API sequence of the code in question Q;
the question of each post is assigned several keywords (Tags) that describe its topic; the text search engine performs keyword matching between the topic of question Q and the topics of the related posts;
the vote count of a post reflects its quality and the comment count reflects its popularity, and a post of good quality and high popularity is relatively more useful; the text search engine measures this popularity feature as "alpha × vote count + (1 - alpha) × comment count" (the suggested value of alpha is 0.8);
because software is updated and iterated quickly, posts are time-sensitive; the text search engine represents the time feature as the difference between the time of the current question and the average of the post's publication time and its last active time (modification, answer, comment, etc.).
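Three of the feature scores above lend themselves to a short sketch: the tf-idf cosine similarity used for the question body and the answer, the popularity score, and the time feature. The sketch makes several assumptions that are not in the patent: texts are already tokenized, the idf term uses a smoothed variant, and the time difference is expressed in days. All function names are hypothetical.

```python
import math
from collections import Counter
from datetime import datetime


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (smoothed idf, an assumed variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def popularity(votes, comments, alpha=0.8):
    """alpha x vote count + (1 - alpha) x comment count, with the suggested alpha of 0.8."""
    return alpha * votes + (1 - alpha) * comments


def time_gap_days(question_time, posted, last_active):
    """Days between the question and the midpoint of the post's publication and last activity."""
    midpoint = posted + (last_active - posted) / 2
    return (question_time - midpoint).total_seconds() / 86400


if __name__ == "__main__":
    docs = [["null", "pointer", "exception"], ["null", "pointer", "crash"], ["css", "layout"]]
    q, p1, p2 = tfidf_vectors(docs)
    print(cosine(q, p1), cosine(q, p2))
    print(popularity(10, 5))
    print(time_gap_days(datetime(2019, 7, 10), datetime(2019, 7, 1), datetime(2019, 7, 5)))
```

A larger time gap presumably indicates a less useful post, so a real scorer would need to flip the sign of this feature (or otherwise orient it) before adding it to the other scores; the patent does not say how this is handled.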
The seven feature scores (question title, question body, answer, code, Tag information, votes and comments, and post time) are on different scales. The text search engine therefore normalizes each feature score with z-score normalization, x' = (x - mean)/std, where x is the original feature score, mean is the mean of that feature, and std is the standard deviation of that feature; it then uses the sum of the seven normalized feature scores as the score of each post and sorts the related posts by this score to obtain the n highest-scoring search results.
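The normalization and summation can be sketched as follows. The sketch assumes the population standard deviation (the patent does not say whether the population or sample form is used), treats a feature with zero variance as contributing 0 to every post, and assumes every raw score is oriented so that larger means better. The names zscore and rank_posts are illustrative.

```python
from statistics import mean, pstdev


def zscore(values):
    """x' = (x - mean) / std for one feature across all candidate posts."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # a constant feature cannot discriminate between posts
    return [(x - mu) / sigma for x in values]


def rank_posts(feature_table, n):
    """feature_table maps post id -> list of raw feature scores in a fixed feature order."""
    ids = list(feature_table)
    columns = list(zip(*(feature_table[pid] for pid in ids)))  # one tuple per feature
    normalized = [zscore(list(col)) for col in columns]
    totals = {pid: sum(col[row] for col in normalized) for row, pid in enumerate(ids)}
    return sorted(ids, key=lambda pid: totals[pid], reverse=True)[:n]


if __name__ == "__main__":
    table = {"a": [1, 10], "b": [2, 20], "c": [3, 30]}
    print(rank_posts(table, 2))
```

Normalizing per feature before summing prevents a feature with a large numeric range, such as vote counts, from dominating features bounded in [0, 1], such as cosine similarity.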
FIG. 2 shows the KnowNet network structure. The network is built on the BERT encoding model, introduces an attention mechanism based on question-answer data semantic vectors to pool the features of the encoding results, and computes the usefulness result with a softmax layer. The model comprises four parts: word embedding, the BERT encoder, heterogeneous attention, and the softmax layer. The BERT encoder may be replaced by GPT-2, XLNet, or a similar model.
The specific description is as follows:
In word embedding, Position Embedding maps the position index of a word to a vector representing that position. Token Embedding maps a word to a vector representing its shallow semantics. Segment Embedding maps the type of the text (in this model, type 0 for the question and type 1 for the context) to a vector representing the type information. For each word, the three resulting vectors are summed to form the output of this part.
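The sum of the three embeddings can be illustrated with toy lookup tables. The table sizes, the 4-dimensional vectors, and the random initialization are placeholders chosen for this sketch; in a real BERT model the tables are learned and far larger.

```python
import random


def make_table(rows, dim, seed):
    """Toy embedding table with random values (a stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]


DIM = 4
token_table = make_table(rows=10, dim=DIM, seed=0)    # toy vocabulary of 10 token ids
position_table = make_table(rows=8, dim=DIM, seed=1)  # supports sequences of up to 8 tokens
segment_table = make_table(rows=2, dim=DIM, seed=2)   # 0 = question, 1 = context


def embed(token_ids, segment_ids):
    """Sum token, position and segment embeddings for every input position."""
    output = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        output.append([
            t + p + s
            for t, p, s in zip(token_table[tok], position_table[pos], segment_table[seg])
        ])
    return output


if __name__ == "__main__":
    vectors = embed(token_ids=[1, 2, 3], segment_ids=[0, 0, 1])
    print(len(vectors), len(vectors[0]))
```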
The BERT encoder uses weights obtained by unsupervised training on massive text corpora to represent text semantics. Through the encoder's self-attention mechanism, each input vector captures information from its input context, so that each input word is finally represented by a vector describing its contextual semantics in the text. KnowNet uses the BERT encoder to encode the contextual semantics of the question and of the context into corresponding vectors. In addition, the encoder outputs a special vector CLS, which KnowNet uses to represent the semantic features of the question.
The attention mechanism based on question-answer data semantic vectors works as follows. KnowNet uses the dot product of CLS and each encoding-result vector (Q_1, ..., Q_m, P_1, ..., P_n) as the attention score between CLS and that vector, and weights the encoding-result vectors (Q_1, ..., Q_m, P_1, ..., P_n) by their attention scores. In FIG. 2, ^ denotes the attention-weighted encoding-result vector, which represents the semantic relation feature CLS_Attention between the question Q and the post. CLS and CLS_Attention are concatenated to represent the input features.
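The attention pooling can be sketched in a few lines. The dot-product score follows the wording of claim 1; normalizing the scores with a softmax before weighting is an assumption of this sketch, since the text only says that the vectors are weighted by their attention scores. The function name attention_pool is hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(cls, encoded):
    """Return CLS concatenated with CLS_Attention, the attention-weighted sum of encoded vectors."""
    scores = [dot(cls, vec) for vec in encoded]  # dot product of CLS with each Q_i / P_j
    weights = softmax(scores)                    # assumption: scores are softmax-normalized
    cls_attention = [
        sum(w * vec[d] for w, vec in zip(weights, encoded))
        for d in range(len(cls))
    ]
    return cls + cls_attention                   # list concatenation: [CLS ; CLS_Attention]


if __name__ == "__main__":
    cls = [1.0, 0.0]
    encoded = [[1.0, 0.0], [0.0, 1.0]]  # toy stand-ins for (Q_1, ..., Q_m, P_1, ..., P_n)
    print(attention_pool(cls, encoded))
```

The resulting feature vector is twice the encoder's hidden size, which is the input width the final classification layer would need under this reading.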
The softmax layer has 4 neurons, whose outputs are normalized with the softmax function so that the output of each neuron corresponds to the probability of the corresponding usefulness level. KnowNet selects the maximum probability and its corresponding usefulness level as its final output.
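A minimal sketch of this output layer follows. It assumes a single linear layer with one weight row per usefulness level; the weights and bias shown are toy values, and the function name predict_grade is hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_grade(features, weights, bias):
    """weights has 4 rows, one per usefulness level; returns (level, probability of that level)."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    level = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    return level, probs[level]


if __name__ == "__main__":
    toy_weights = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
    toy_bias = [0.0, 0.0, 0.0, 0.0]
    print(predict_grade([1.0, 2.0], toy_weights, toy_bias))
```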
In one embodiment, to train KnowNet to predict usefulness, a post link Graph is constructed, as shown in FIG. 3. On developer question-and-answer websites such as Stack Overflow, users typically mark relationships between posts. A link between two duplicate posts is recorded as an edge with weight 0; every other link is recorded as an edge with weight 1, meaning that the user considered the linked post useful for answering the question or referred to its content when answering. FIG. 3 shows 5 posts together with the marked link relations (solid lines); the dotted lines are related links completed with the Dijkstra algorithm. Two posts whose distance is greater than 3 (that is, whose edge weight in the post link Graph would be greater than 3) are regarded as unrelated, so such links are omitted from the post link Graph. The smaller the weight of an edge in the Graph, the higher the usefulness. A sufficient number of "post A + link + post B" samples are drawn from the link Graph; the question of post A and the full content of post B (or vice versa) are used as the network input, and the link value between them is used as the usefulness-level label for training the network.
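The construction of training labels from the link Graph can be sketched with a standard Dijkstra search. The sketch assumes that links are undirected (the text allows the pair to be used in either direction) and that the shortest-path distance between two posts, capped at 3, is used directly as the usefulness label; the post identifiers and function names are hypothetical.

```python
import heapq


def build_graph(links):
    """links: iterable of (post_a, post_b, weight) with weight 0 for duplicates, 1 otherwise."""
    graph = {}
    for a, b, weight in links:
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight  # assumption: links are treated as undirected
    return graph


def shortest_distances(graph, source):
    """Dijkstra's algorithm; valid here because all edge weights are non-negative."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def training_pairs(links, max_distance=3):
    """Map (post A, post B) -> usefulness label, dropping pairs farther apart than max_distance."""
    graph = build_graph(links)
    pairs = {}
    for a in graph:
        for b, d in shortest_distances(graph, a).items():
            if a != b and d <= max_distance:
                pairs[(a, b)] = d
    return pairs


if __name__ == "__main__":
    links = [("A", "B", 0), ("B", "C", 1), ("C", "D", 1), ("D", "E", 1), ("E", "F", 1)]
    for pair, label in sorted(training_pairs(links).items()):
        print(pair, label)
```

In this toy graph, A and E end up with label 3 through the completed path A-B-C-D-E, while A and F are 4 apart and are therefore left out of the training set.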
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (2)

1. A method for automatically acquiring software development question-answer information based on deep semantic understanding, characterized by comprising the following steps: step 1, using a text search engine to retrieve, based on features of the posts, n posts related to a question Q from a plurality of developer question-and-answer forum websites; step 2, for the question Q and the n related posts, constructing n <Q, post> data pairs, inputting them into a usefulness inference network KnowNet for inference and prediction, and outputting a usefulness grade of each post and a probability value for that grade; step 3, based on the inference results of the usefulness inference network KnowNet, ranking the posts within each usefulness grade by probability value, then ordering all the grades from high to low usefulness, and selecting the K most useful posts and returning them to the questioner; in step 1, the features of the posts processed by the text search engine comprise the question title, the question body, the answer, the code, the Tag information, the numbers of votes and comments, and the time of the post; in step 1, the n posts related to the question are retrieved by normalizing the feature scores with z-score normalization, x' = (x - mean)/std, where x is the original feature score, mean is the mean of the feature and std is the standard deviation of the feature, using the sum of the feature scores as the score of each post, and sorting the related posts to obtain the n highest-scoring search results; in step 2, the usefulness inference network KnowNet comprises word embedding, a BERT encoder, heterogeneous attention and a softmax layer; the word embedding comprises Position Embedding, Token Embedding and Segment Embedding, wherein Position Embedding maps the position index of a word to a vector representing the position, Token Embedding maps a word to a vector representing its shallow semantics, Segment Embedding maps the type of a text to a vector representing type information, and the word embedding sums the three resulting vectors of each word as its output; the BERT encoder uses weights obtained by unsupervised training on massive texts to describe text semantics, each input vector captures information from its input context through the self-attention mechanism of the BERT encoder, and the BERT encoder encodes the contextual semantics of the question and of the context into corresponding vectors and outputs a vector CLS, the vector CLS representing the semantic features of the question; the self-attention mechanism of the BERT encoder uses the dot product of CLS and each encoding-result vector as the attention score between CLS and that vector, and weights the encoding-result vectors by their attention scores to obtain the semantic relation feature CLS_Attention between the question Q and the post; CLS and CLS_Attention are concatenated as the input feature of the softmax layer.
2. The method of claim 1, wherein the softmax layer has 4 neurons, the neuron outputs are normalized using a softmax function such that the output of each neuron corresponds to the probability of the corresponding usefulness level, and the maximum probability and its corresponding usefulness level are selected as the final output of KnowNet.
CN201910620493.1A 2019-07-10 2019-07-10 Software development question-answer information automatic acquisition method based on deep semantic understanding Active CN110390050B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910620493.1A CN110390050B (en) 2019-07-10 2019-07-10 Software development question-answer information automatic acquisition method based on deep semantic understanding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910620493.1A CN110390050B (en) 2019-07-10 2019-07-10 Software development question-answer information automatic acquisition method based on deep semantic understanding

Publications (2)

Publication Number Publication Date
CN110390050A CN110390050A (en) 2019-10-29
CN110390050B CN110390050B (en) 2021-12-07

Family

ID=68286374

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910620493.1A Active CN110390050B (en) 2019-07-10 2019-07-10 Software development question-answer information automatic acquisition method based on deep semantic understanding

Country Status (1)

Country Link
CN (1) CN110390050B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111488441B (en) * 2020-04-08 2023-08-01 北京百度网讯科技有限公司 Question analysis method and device, knowledge graph question answering system and electronic equipment
CN112765326B (en) * 2021-01-27 2023-04-21 西安电子科技大学 Question-answering community expert recommendation method, system and application

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106991161A (en) * 2017-03-31 2017-07-28 北京字节跳动科技有限公司 A kind of method for automatically generating open-ended question answer
CN109977428A (en) * 2019-03-29 2019-07-05 北京金山数字娱乐科技有限公司 A kind of method and device that answer obtains

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11836611B2 (en) * 2017-07-25 2023-12-05 University Of Massachusetts Method for meta-level continual learning


Also Published As

Publication number Publication date
CN110390050A (en) 2019-10-29

Similar Documents

Publication Publication Date Title
CN108021616B (en) Community question-answer expert recommendation method based on recurrent neural network
CN110427463B (en) Search statement response method and device, server and storage medium
CN109635083B (en) Document retrieval method for searching topic type query in TED (tele) lecture
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN110390049B (en) Automatic answer generation method for software development questions
CN109614480B (en) Method and device for generating automatic abstract based on generation type countermeasure network
CN112328800A (en) System and method for automatically generating programming specification question answers
CN113342958B (en) Question-answer matching method, text matching model training method and related equipment
CN110597968A (en) Reply selection method and device
CN112463944A (en) Retrieval type intelligent question-answering method and device based on multi-model fusion
CN110390050B (en) Software development question-answer information automatic acquisition method based on deep semantic understanding
CN113282711A (en) Internet of vehicles text matching method and device, electronic equipment and storage medium
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN113011172A (en) Text processing method and device, computer equipment and storage medium
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN112035629B (en) Method for implementing question-answer model based on symbolized knowledge and neural network
CN115795018B (en) Multi-strategy intelligent search question-answering method and system for power grid field
CN112836027A (en) Method for determining text similarity, question answering method and question answering system
CN113741759B (en) Comment information display method and device, computer equipment and storage medium
CN114329181A (en) Question recommendation method and device and electronic equipment
CN113688633A (en) Outline determination method and device
CN113821610A (en) Information matching method, device, equipment and storage medium
CN113569124A (en) Medical title matching method, device, equipment and storage medium
CN111159366A (en) Question-answer optimization method based on orthogonal theme representation
CN114969291B (en) Automatic question and answer method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant