CN113076127A

CN113076127A - Method, system, electronic device and medium for extracting question and answer content in programming environment

Info

Publication number: CN113076127A
Application number: CN202110449778.0A
Authority: CN
Inventors: 陈林; 赵恒辉; 李言辉
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2021-04-25
Filing date: 2021-04-25
Publication date: 2021-07-06
Anticipated expiration: 2041-04-25
Also published as: WO2022226714A1; CN113076127B

Abstract

The invention discloses a method, a system, an electronic device and a medium for extracting question and answer content in a programming environment. The system comprises: a data processing module, which preprocesses the input online question and answer text data, removes useless information and performs word segmentation; an entity identification module, which performs software-engineering-domain entity recognition on the text processed by the data processing module; a document reading module, which inputs the text processed by the entity identification module into a neural network for document reading; and a summary extraction module, which extracts the key content of the question and answer text using another neural network. The invention can extract the key content of technical questions and answers, reduce the time developers spend browsing, and improve development efficiency during programming.

Description

Method, system, electronic device and medium for extracting question and answer content in programming environment

Technical Field

The invention relates to a method, a system, an electronic device and a medium for extracting question and answer content in a programming environment, and belongs to the technical field of Internet technology.

Background

Software development is a flexible and challenging task that demands strong learning and problem-solving abilities. When developers run into problems while programming, besides consulting reference books they frequently seek help online: they ask other developers who have met similar problems and draw on their solutions, which avoids duplicated effort and improves development efficiency. As a result, software question and answer communities have become increasingly active, giving developers a platform for helping one another and recording problems.

More and more developers are active on technical question and answer platforms. They ask questions, answer them, and offer ideas to other developers who face similar problems. However, not every question can be solved on the platform, and the large amount of redundant and irrelevant information there hinders the help developers can get. A question on such a platform often has more than one answer: some answers are irrelevant to the question, some repeat one another, and some are only partly relevant or partly repetitive. Platforms have made considerable efforts to address these situations; Stack Overflow, for example, lets users score each answer to a question so that highly scored answers are seen by more people. This reduces the interference of irrelevant information to some extent but still has considerable limitations. If all the answers under one question are treated as a single document, and a summary is extracted from them with the key content marked, an effect similar to 'highlighting' is achieved, which helps users shorten browsing time and improves development efficiency during programming.

Text summarization techniques convert a text or a collection of texts into a short summary containing the key information. By output type, summaries are divided into extractive summaries and abstractive (generated) summaries; an extractive summary is formed by directly extracting several sentences from the original text and then ordering and recombining them. Applying extractive summarization to technical question and answer communities makes it possible to extract the key content of answers and help developers quickly locate the answer content they need.

In recent years, scholars have proposed a number of methods for extractive summarization. Julian Kupiec et al. proposed treating summary extraction as a classic classification problem: given a set of training documents and manually extracted summaries, a classifier is trained to estimate the probability that a given sentence should be included in the summary. Conroy and O'Leary proposed using a hidden Markov model for summary extraction, which outperformed the other models of the time. Erkan and Radev proposed the graph-based algorithm LexPageRank: when the cosine similarity of two sentences exceeds a certain threshold, a corresponding edge is added to a connectivity matrix, and the importance of each sentence is computed from that matrix. Woodsend et al. proposed a model of joint content selection and compression for document summarization, which uses integer linear programming to select and combine phrases into a summary subject to length, coverage and grammatical constraints. Kågebäck et al. computed the similarity between sentences using continuous vector space representations and extracted document summaries with a recursive auto-encoder. Yin et al. projected sentences into a continuous vector space with a convolutional neural network (CNN), minimized a cost based on "prestige" and "diversity" to select appropriate sentences, and achieved good results on a multi-document extractive summarization task. Cao et al. also used a CNN to address query-oriented multi-document summarization; they represented documents by weighted sum pooling over sentence representations, with the weights learned by a query-based attention mechanism over the sentence representations. Cheng et al. proposed an automatic summarization framework based on a hierarchical document encoder and an attention mechanism, which achieves relatively robust summary extraction without resorting to linguistic annotation. However, the existing work on summary extraction targets the general domain, and no scholars have yet proposed techniques or methods for summary extraction in the field of software engineering.

Disclosure of Invention

The invention aims to provide a system for automatically extracting the key content of technical questions and answers during programming, which can extract that key content, reduce the time developers spend browsing, and improve development efficiency. A second object of the invention is to provide a method for automatically extracting the key content of technical questions and answers during programming.

The invention specifically adopts the following technical scheme: the system for extracting the question and answer content in the programming environment comprises:

a data processing module for performing: preprocessing input network question and answer text data, removing useless information and performing word segmentation;

an entity identification module to perform: performing entity recognition in the field of software engineering on the text processed by the data processing module;

a document reading module to perform: inputting the text identified by the entity identification module into a neural network for document reading;

a summary extraction module for performing: extracting key contents in the question and answer text by using another neural network.

As a preferred embodiment, the data processing module specifically executes the following steps: starting from an initial state; processing code segments in the question and answer text; processing HTML tags; processing URLs; processing emoticons; processing "@" information; performing word segmentation with the nltk tool; and finishing the data processing.

As a preferred embodiment, the entity identification module specifically executes the following steps: starting from an initial state; computing the spelling features of each word, including whether its first letter is capitalized, whether it contains an underscore, and whether it contains "-"; computing the context features of the word, specifically using a window of [-2, 2] and adding the two preceding and the two following words within the window as features; computing the bit stream features of the word, specifically using large-scale unlabeled text from the field of software engineering to cluster words with similar distributions into one class, the class being represented by bit streams of different lengths that serve as features; computing the external dictionary features of the word, specifically collecting a large number of known entities to form an external dictionary and checking whether the word exists in the external dictionary; performing entity recognition with a CRF model trained using the tool CRF++; and finishing the entity recognition.

As a preferred embodiment, the document reading module specifically executes the following steps: starting from an initial state; obtaining sentence-level vector representations through a single-layer convolutional neural network with max pooling; converting the sentence-level vector representations into a document-level vector representation through a recurrent neural network; and finishing the document reading. The summary extraction module specifically executes the following steps: starting from an initial state; drawing on the idea of the attention mechanism, using a recurrent neural network to label sequentially whether each sentence can be taken as part of the summary; and finishing the summary extraction.

The invention also provides a method for extracting the question and answer content in the programming environment, which comprises the following steps:

a data processing step, specifically comprising: preprocessing input online question and answer text data, removing useless information and performing word segmentation;

an entity identification step, specifically comprising: performing entity recognition in the field of software engineering on the text processed in the data processing step;

a document reading step, specifically comprising: inputting the text processed in the entity identification step into a neural network for document reading;

a summary extraction step, specifically comprising: extracting key contents in the question and answer text by using another neural network.

As a preferred embodiment, the data processing step specifically includes: starting from an initial state; processing code segments in the question and answer text; processing HTML tags; processing URLs; processing emoticons; processing "@" information; performing word segmentation with the nltk tool; and finishing the data processing.

As a preferred embodiment, the entity identification step specifically includes: starting from an initial state; computing the spelling features of each word, including whether its first letter is capitalized, whether it contains an underscore, and whether it contains "-"; computing the context features of the word, specifically using a window of [-2, 2] and adding the two preceding and the two following words within the window as features; computing the bit stream features of the word, specifically using large-scale unlabeled text from the field of software engineering to cluster words with similar distributions into one class, the class being represented by bit streams of different lengths that serve as features; computing the external dictionary features of the word, specifically collecting a large number of known entities to form an external dictionary and checking whether the word exists in the external dictionary; performing entity recognition with a CRF model trained using the tool CRF++; and finishing the entity recognition.

As a preferred embodiment, the document reading step specifically includes: starting from an initial state; obtaining sentence-level vector representations through a single-layer convolutional neural network with max pooling; converting the sentence-level vector representations into a document-level vector representation through a recurrent neural network; and finishing the document reading. The summary extraction step specifically includes: starting from an initial state; drawing on the idea of the attention mechanism, using a recurrent neural network to label sequentially whether each sentence can be taken as part of the summary; and finishing the summary extraction.

The invention also proposes an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method are implemented when the processor executes the program.

The invention also proposes a medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method.

The invention achieves the following beneficial effects: (1) The system for automatically extracting the key content of technical questions and answers during programming can extract that key content, reduce the time developers spend browsing, and improve development efficiency. (2) The system extracts the key content of technical questions and answers automatically, without manual annotation, which greatly reduces the cost of extracting key content. (3) The method for automatically extracting the key content of technical questions and answers is a new attempt oriented to the field of software engineering and fills a gap in that field regarding key content extraction.

Drawings

FIG. 1 is a flow chart of a method for extracting question and answer content in a programming environment of the present invention.

Fig. 2 is a schematic diagram of an example of the structure of CNN of the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.

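Example 1: the invention provides a system for extracting question and answer content in a programming environment, which comprises: a data processing module for preprocessing input online question and answer text data, removing useless information and performing word segmentation; an entity identification module for performing entity recognition in the field of software engineering on the text processed by the data processing module; a document reading module for inputting the text processed by the entity identification module into a neural network for document reading; and a summary extraction module for extracting key contents in the question and answer text by using another neural network.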

Preferably, the data processing module specifically executes the following steps: starting from an initial state; processing code segments in the question and answer text; processing HTML tags; processing URLs; processing emoticons; processing "@" information; performing word segmentation with the nltk tool; and finishing the data processing.

Preferably, the entity identification module specifically executes the following steps: starting from an initial state; computing the spelling features of each word, including whether its first letter is capitalized, whether it contains an underscore, and whether it contains "-"; computing the context features of the word, specifically using a window of [-2, 2] and adding the two preceding and the two following words within the window as features; computing the bit stream features of the word, specifically using large-scale unlabeled text from the field of software engineering to cluster words with similar distributions into one class, the class being represented by bit streams of different lengths that serve as features; computing the external dictionary features of the word, specifically collecting a large number of known entities to form an external dictionary and checking whether the word exists in the external dictionary; performing entity recognition with a CRF model trained using the tool CRF++; and finishing the entity recognition.

Preferably, the document reading module specifically executes the following steps: starting from an initial state; obtaining sentence-level vector representations through a single-layer convolutional neural network with max pooling; converting the sentence-level vector representations into a document-level vector representation through a recurrent neural network; and finishing the document reading. The summary extraction module specifically executes the following steps: starting from an initial state; drawing on the idea of the attention mechanism, using a recurrent neural network to label sequentially whether each sentence can be taken as part of the summary; and finishing the summary extraction.

Example 2: the invention also provides a method for extracting question and answer content in a programming environment. The general framework of the invention is shown in Fig. 1, and the method comprises the following 4 steps:

Step 1: for question and answer text from the web, first clear the content of all <pre> tags; because code segments in questions and answers appear inside <pre> tags, clearing the content of the <pre> tags also clears the code segments. Then delete all HTML tags, e.g., <pre>, <p>, <div>, etc. Next, replace URLs appearing in the text with "@u@", replace emoticons with "@e@", and replace "@" content that addresses other users with "@a@". Finally, segment the text into words using the nltk word segmentation tool; the segmentation must keep an API name as a whole, e.g., os.
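The cleaning order in Step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the regular expressions, the emoticon pattern and the function name `preprocess` are assumptions, and a plain regex tokenizer from the standard library stands in for the nltk tool named in the text so that the sketch has no external dependencies.

```python
import re

# All patterns below are illustrative assumptions, not taken from the patent.
PRE_BLOCK = re.compile(r"<pre\b[^>]*>.*?</pre>", re.IGNORECASE | re.DOTALL)
HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")
URL = re.compile(r"https?://\S+")
EMOTICON = re.compile(r"(?<!\w)[:;=][-^]?[)(DPp](?!\w)")   # e.g. :) ;-) :D
MENTION = re.compile(r"(?<![\w@])@\w+(?!@)")               # skips @u@ / @e@ placeholders
# Keep placeholders and dotted API names (e.g. os.path.join) as single tokens.
TOKEN = re.compile(r"@[uea]@|\w+(?:\.\w+)*|[^\w\s]")


def preprocess(html: str) -> list[str]:
    """Clean one question-and-answer post and split it into tokens."""
    text = PRE_BLOCK.sub(" ", html)       # drop <pre> blocks, and the code inside them
    text = HTML_TAG.sub(" ", text)        # drop the remaining HTML tags
    text = URL.sub(" @u@ ", text)         # URLs      -> "@u@"
    text = EMOTICON.sub(" @e@ ", text)    # emoticons -> "@e@"
    text = MENTION.sub(" @a@ ", text)     # mentions  -> "@a@"
    return TOKEN.findall(text)


post = ("<p>Use <code>os.path.join</code> instead :) "
        "see https://docs.python.org/3/ @alice</p>"
        "<pre><code>print(1)</code></pre>")
print(preprocess(post))
# ['Use', 'os.path.join', 'instead', '@e@', 'see', '@u@', '@a@']
```

URLs are replaced before mentions so that an "@" inside a URL is not mistaken for a user mention, and the mention pattern refuses to match the placeholders inserted by the earlier substitutions.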

Step 2: perform entity recognition on the text after data processing. The entity recognition method is mainly based on a conditional random field (CRF) model, implemented with the tool CRF++. The features of the CRF model include:

(1) Spelling features of the word, such as whether its first letter is capitalized, whether it contains an underscore, and whether it contains "-";

(2) Context features. Using a window of [-2, 2], the two preceding and the two following words within the window are added as features;

(3) Bit stream features of the word. Using large-scale unlabeled text from the software domain and the Brown clustering algorithm, words appearing in similar contexts are grouped into one class, with the number of classes set to 1000; words in the same class are represented by the same bit stream, which serves as a feature;

(4) External dictionary features. A large number of known entities are collected in advance to form an external dictionary, and each word is checked against the external dictionary.
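The four feature groups above can be illustrated with a small feature function. It is only a sketch under stated assumptions: `token_features`, the Brown-cluster lookup table, the prefix lengths (4, 8, 12) and the sample data are hypothetical, and CRF++ itself reads column-formatted training files with a feature template rather than Python dictionaries, so the values computed here would have to be written out as columns before training.

```python
def token_features(tokens, i, brown_clusters, dictionary, prefix_lengths=(4, 8, 12)):
    """Build the features of tokens[i] for a CRF-style sequence labeller.

    brown_clusters maps a lower-cased word to its cluster bit stream
    (a string of 0s and 1s); dictionary is a set of known entities.
    Both are hypothetical inputs prepared in advance.
    """
    word = tokens[i]
    feats = {
        # (1) spelling features
        "init_cap": word[:1].isupper(),
        "has_underscore": "_" in word,
        "has_hyphen": "-" in word,
        # (4) external dictionary feature
        "in_dictionary": word.lower() in dictionary,
    }
    # (2) context features: two words before and two words after, window [-2, 2]
    for offset in (-2, -1, 1, 2):
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"word[{offset:+d}]"] = tokens[j].lower() if inside else "<pad>"
    # (3) bit stream features: prefixes of the Brown cluster path
    bits = brown_clusters.get(word.lower())
    if bits is not None:
        for n in prefix_lengths:
            feats[f"brown[:{n}]"] = bits[:n]
    return feats


tokens = ["Call", "os.path.join", "with", "my_var"]
brown = {"os.path.join": "1101001110", "my_var": "0110"}   # made-up cluster paths
known_entities = {"os.path.join"}                          # made-up dictionary
features = token_features(tokens, 1, brown, known_entities)
```

Using prefixes of several lengths is one common way to obtain bit streams "of different lengths" from a single Brown cluster path; whether the original system does exactly this is not stated in the text.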

Step 3: read and encode the text after entity recognition. First, a single-layer convolutional neural network (CNN) is used to obtain sentence-level vector representations; a vector representation of the document is then constructed using a recurrent neural network (RNN). The CNN operates at the word level to obtain sentence-level representations, which are then used as input to the RNN, so that the document-level representation is obtained hierarchically. The embedding dimensions of words, sentences and documents are set to 150, 300 and 750, respectively.

In the single-layer convolutional neural network, each convolution kernel uses multiple feature maps to compute a series of features; the number of feature maps is also 300, matching the sentence dimension. Convolution kernels of different sizes, from 1 to 7, yield different feature vectors for a sentence, and these vectors are summed to obtain the final sentence vector representation. The lower part of Fig. 2 shows an example CNN structure: the word dimension is 5 and the example sentence has 6 words; the two colors represent two convolution kernels, the blue kernel of size 2 and the red kernel of size 3, each with 6 feature maps. After pooling, each feature map corresponds to one dimension of the resulting vector, so the two kernels produce two vectors of dimension 6, which are summed to obtain the final sentence vector.
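The Fig. 2 configuration (word dimension 5, a 6-word sentence, kernel sizes 2 and 3, 6 feature maps each) can be traced with a small pure-Python sketch. The function names and the random weights are assumptions made for illustration; bias terms and any nonlinearity are omitted because the text does not specify them.

```python
import random


def conv_max_pool(words, feature_maps):
    """Slide each feature map over the sentence and keep its maximum response.

    words: list of word vectors (n_words x dim).
    feature_maps: list of weight matrices (width x dim), all of the same width.
    Assumes the sentence has at least `width` words.
    """
    pooled = []
    for kernel in feature_maps:
        width = len(kernel)
        responses = []
        for start in range(len(words) - width + 1):
            window = words[start:start + width]
            responses.append(sum(w * x
                                 for k_row, x_row in zip(kernel, window)
                                 for w, x in zip(k_row, x_row)))
        pooled.append(max(responses))       # max pooling over positions
    return pooled


def encode_sentence(words, kernel_bank):
    """kernel_bank maps a kernel width to its feature maps.

    Each width yields one pooled vector; the vectors are summed element-wise.
    """
    vectors = [conv_max_pool(words, maps) for maps in kernel_bank.values()]
    return [sum(values) for values in zip(*vectors)]


random.seed(0)
dim, n_words, n_maps = 5, 6, 6
sentence = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n_words)]
kernel_bank = {
    width: [[[random.uniform(-1, 1) for _ in range(dim)] for _ in range(width)]
            for _ in range(n_maps)]
    for width in (2, 3)
}
sentence_vector = encode_sentence(sentence, kernel_bank)
print(len(sentence_vector))   # 6: one dimension per feature map
```

With the dimensions given in Step 3, the same structure would use 150-dimensional word vectors, kernel sizes 1 to 7 and 300 feature maps per size, giving a 300-dimensional sentence vector.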

The recurrent neural network (RNN) is a single-layer long short-term memory (LSTM) network, which mitigates the vanishing gradient problem when training on long sequences.

Step 4: drawing on the idea of the attention mechanism, a recurrent neural network sequentially labels whether each sentence can serve as key content; the labeling process takes into account whether the sentences are mutually independent and whether their meaning is repeated. As shown in the upper right part of Fig. 2, the label of the next sentence depends not only on the current input but also on the label of the previous sentence.
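The dependency described in Step 4, where each label is conditioned on the previous one, can be illustrated with a toy recurrent labeller. This is a deliberately simplified sketch: it uses a plain tanh recurrent cell instead of an LSTM, does not implement an attention mechanism, and its weights are hand-picked rather than trained. The function name and all parameters are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def label_sentences(sentence_vectors, w_input, w_state, w_prev_label, w_out,
                    threshold=0.5):
    """Label sentences one by one; each step sees the previous step's probability.

    w_input: hidden x dim, w_state: hidden x hidden,
    w_prev_label: hidden, w_out: hidden (all hypothetical weights).
    Returns a list of (is_key_content, probability) pairs.
    """
    hidden = [0.0] * len(w_state)
    prev_prob = 0.0
    labels = []
    for vec in sentence_vectors:
        new_hidden = []
        for h in range(len(w_state)):
            pre = sum(w * x for w, x in zip(w_input[h], vec))       # current input
            pre += sum(w * s for w, s in zip(w_state[h], hidden))   # history
            pre += w_prev_label[h] * prev_prob                      # previous label
            new_hidden.append(math.tanh(pre))
        hidden = new_hidden
        prev_prob = sigmoid(sum(w * s for w, s in zip(w_out, hidden)))
        labels.append((prev_prob >= threshold, prev_prob))
    return labels


# Two identical one-dimensional "sentences": a strongly negative weight on the
# previous label suppresses the second one once the first has been selected.
repeated = [[1.0], [1.0]]
result = label_sentences(repeated, w_input=[[1.0]], w_state=[[0.0]],
                         w_prev_label=[-10.0], w_out=[5.0])
print([selected for selected, _ in result])   # [True, False]
```

The toy weights only show how feeding the previous decision back into the network lets the labeller avoid selecting repeated content; a trained model would learn such behaviour from data.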

Example 3: the invention also proposes an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method are implemented when the processor executes the program.

Example 4: the invention also proposes a medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method.

The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims

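1. A system for extracting question and answer content in a programming environment, characterized by comprising: a data processing module for preprocessing input online question and answer text data, removing useless information and performing word segmentation; an entity identification module for performing entity recognition in the field of software engineering on the text processed by the data processing module; a document reading module for inputting the text processed by the entity identification module into a neural network for document reading; and a summary extraction module for extracting key contents in the question and answer text by using another neural network.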

2. The system for extracting question and answer content in a programming environment according to claim 1, wherein the data processing module specifically executes the following steps: starting from an initial state; processing code segments in the question and answer text; processing HTML tags; processing URLs; processing emoticons; processing "@" information; performing word segmentation with the nltk tool; and finishing the data processing.

3. The system for extracting question and answer content in a programming environment according to claim 1, wherein the entity identification module specifically executes the following steps: starting from an initial state; computing the spelling features of each word, including whether its first letter is capitalized, whether it contains an underscore, and whether it contains "-"; computing the context features of the word, specifically using a window of [-2, 2] and adding the two preceding and the two following words within the window as features; computing the bit stream features of the word, specifically using large-scale unlabeled text from the field of software engineering to cluster words with similar distributions into one class, the class being represented by bit streams of different lengths that serve as features; computing the external dictionary features of the word, specifically collecting a large number of known entities to form an external dictionary and checking whether the word exists in the external dictionary; performing entity recognition with a CRF model trained using the tool CRF++; and finishing the entity recognition.

4. The system for extracting question and answer content in a programming environment according to claim 1, wherein the document reading module specifically executes the following steps: starting from an initial state; obtaining sentence-level vector representations through a single-layer convolutional neural network with max pooling; converting the sentence-level vector representations into a document-level vector representation through a recurrent neural network; and finishing the document reading; and the summary extraction module specifically executes the following steps: starting from an initial state; drawing on the idea of the attention mechanism, using a recurrent neural network to label sequentially whether each sentence can be taken as part of the summary; and finishing the summary extraction.

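5. A method for extracting question and answer content in a programming environment, characterized by comprising the following steps: a data processing step of preprocessing input online question and answer text data, removing useless information and performing word segmentation; an entity identification step of performing entity recognition in the field of software engineering on the text processed in the data processing step; a document reading step of inputting the text processed in the entity identification step into a neural network for document reading; and a summary extraction step of extracting key contents in the question and answer text by using another neural network.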

6. The method for extracting question and answer content in a programming environment according to claim 5, wherein the data processing step specifically comprises: starting from an initial state; processing code segments in the question and answer text; processing HTML tags; processing URLs; processing emoticons; processing "@" information; performing word segmentation with the nltk tool; and finishing the data processing.

7. The method for extracting question and answer content in a programming environment according to claim 5, wherein the entity identification step specifically comprises: starting from an initial state; computing the spelling features of each word, including whether its first letter is capitalized, whether it contains an underscore, and whether it contains "-"; computing the context features of the word, specifically using a window of [-2, 2] and adding the two preceding and the two following words within the window as features; computing the bit stream features of the word, specifically using large-scale unlabeled text from the field of software engineering to cluster words with similar distributions into one class, the class being represented by bit streams of different lengths that serve as features; computing the external dictionary features of the word, specifically collecting a large number of known entities to form an external dictionary and checking whether the word exists in the external dictionary; performing entity recognition with a CRF model trained using the tool CRF++; and finishing the entity recognition.

8. The method for extracting question and answer content in a programming environment according to claim 5, wherein the document reading step specifically comprises: starting from an initial state; obtaining sentence-level vector representations through a single-layer convolutional neural network with max pooling; converting the sentence-level vector representations into a document-level vector representation through a recurrent neural network; and finishing the document reading; and the summary extraction step specifically comprises: starting from an initial state; drawing on the idea of the attention mechanism, using a recurrent neural network to label sequentially whether each sentence can be taken as part of the summary; and finishing the summary extraction.

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 5 to 8 are implemented when the processor executes the program.

10. A medium on which a computer program is stored, characterized in that the computer program implements the steps of the method according to any of claims 5 to 8 when executed by a processor.