CN113076127B

CN113076127B - Method, system, electronic device and medium for extracting question and answer content in programming environment

Info

Publication number: CN113076127B
Application number: CN202110449778.0A
Authority: CN
Inventors: 陈林; 赵恒辉; 李言辉
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2021-04-25
Filing date: 2021-04-25
Publication date: 2023-08-29
Anticipated expiration: 2041-04-25
Also published as: WO2022226714A1; CN113076127A

Abstract

The invention discloses a method, a system, an electronic device and a medium for extracting question and answer content in a programming environment. The system comprises: a data processing module for preprocessing the input online question-and-answer text data, removing useless information and performing word segmentation; an entity identification module for performing software-engineering-domain entity identification on the text processed by the data processing module; a document reading module for inputting the text processed by the entity identification module into a neural network for document reading; and an abstract extraction module for extracting the key content of the question-and-answer text by using another neural network. The invention can extract the key content of technical questions and answers, reduce the time developers spend browsing and improve on-site programming efficiency.

Description

Method, system, electronic device and medium for extracting question and answer content in programming environment

Technical Field

The invention relates to a method, a system, electronic equipment and a medium for extracting question and answer contents in a programming environment, and belongs to the technical field of Internet.

Background

Software development is a flexible and challenging task that requires developers to have strong learning and problem-solving abilities. While programming, a developer who encounters a problem can consult a reference book, and can also search the web for help, ask other developers who have met similar problems, and draw on their solutions, which avoids repeated work and improves development efficiency. Software question-and-answer communities have therefore become increasingly active; they aim to provide a platform on which developers help each other and record problems.

More and more developers are active on technical question-and-answer platforms. They ask and answer questions, and in doing so offer solution ideas to other developers who encounter similar problems. However, not every question can be solved on the platform, and the large amount of redundant and irrelevant information on the platform hinders developers seeking help. A question on a technical question-and-answer platform often has more than one answer: some answers are unrelated to the question, some answers largely repeat one another, and some answers are only partly relevant or partly repetitive. Platforms have made considerable efforts to address these situations. Stack Overflow, for example, lets users score each answer to a question so that the highest-scoring answers are seen by more people. This reduces the interference of irrelevant information to some extent, but considerable limitations remain. If all the answers under one question are treated as a single document, a summary is extracted from them and the key content is marked, the result can act like 'highlighting', helping users reduce browsing time and improving on-site programming efficiency.

Text summarization techniques convert a text or a collection of texts into a short summary containing the key information. According to the type of output, text summarization can be divided into extractive summarization and abstractive (generative) summarization; an extractive summary is formed by directly extracting several sentences from the original text and then ordering and recombining them. Applying extractive summarization to a technical question-and-answer community makes it possible to extract the key content of answers and helps developers quickly locate the answer content they want.

In recent years, scholars have proposed a number of methods for summary extraction. Julian Kupiec et al. proposed treating summary extraction as a classical classification problem: given a set of training documents and manually extracted summaries, a classifier is trained that outputs the probability that a given sentence should be included in the summary. Conroy and O'Leary proposed summarization with a hidden Markov model, which achieved the best results among the models of its time. Erkan and Radev proposed the graph-based algorithm LexPageRank: when the cosine similarity of two sentences exceeds a certain threshold, a corresponding edge is added to a connectivity matrix, and the importance of each sentence is then computed from that matrix. Woodsend et al. proposed a model of joint content selection and compression for document summarization, which uses integer linear programming to select and combine words into a summary subject to length, coverage and grammatical constraints. Kågebäck et al. computed the similarity between sentences through continuous vector space representations and extracted document summaries using a recursive autoencoder. Yin et al. projected sentences into a continuous vector space with a convolutional neural network (CNN) and selected suitable sentences by minimizing a cost based on 'prestige' and 'diversity', obtaining good results on multi-document extractive summarization tasks. Cao et al. also used a CNN to solve the query-oriented multi-document summarization problem; they represented documents by weighted sum pooling over sentence representations, with the weights learned by an attention mechanism over the query and sentence representations. Cheng et al. proposed an automatic summarization framework based on a hierarchical document encoder and an attention mechanism, which achieves fairly good summary extraction without relying on linguistic annotation.
However, current summary extraction work targets the general domain, and no technique or method has yet been proposed for summary extraction in the field of software engineering.

Disclosure of Invention

The first object of the invention is to provide a system for automatically extracting the key content of technical questions and answers in a programming environment, which can extract that key content, reduce the time developers spend browsing and improve on-site programming efficiency. The second object of the invention is to provide a method for automatically extracting the key content of technical questions and answers in a programming environment.

The invention adopts the following technical scheme: the system for extracting the question and answer content in the programming environment comprises:

a data processing module for executing: preprocessing the input online question-and-answer text data, removing useless information and performing word segmentation;

an entity identification module for executing: performing software-engineering-domain entity identification on the text processed by the data processing module;

a document reading module for executing: inputting the text processed by the entity identification module into a neural network for document reading;

an abstract extraction module for executing: extracting the key content of the question-and-answer text by using another neural network.

As a preferred embodiment, the data processing module specifically performs the following steps: starting from an initial state; processing code segments in the question-and-answer text; processing HTML tags; processing URLs; processing emoticons; processing "@" mentions; performing word segmentation with the nltk tool; and finishing the data processing.

As a preferred embodiment, the entity identification module specifically performs the following steps: starting from an initial state; computing the spelling pattern of each word, including whether the initial letter of the word is capitalized, whether an underscore is included and whether the word is included; computing the context features of each word, specifically using a window of [-2, 2] and adding the words in the window, namely the preceding and following words, as features; computing the bit-stream features of each word, specifically using large-scale unlabeled text from the software engineering field, clustering similarly distributed words into one class with a clustering method, and representing the classes by bit streams of different lengths as features; computing the external dictionary feature of each word, specifically collecting a large number of known entities to form an external dictionary and checking whether the word exists in that dictionary; performing entity identification with a CRF model trained using the CRF++ tool; and finishing entity identification.

As a preferred embodiment, the document reading module specifically performs the following steps: starting from an initial state; obtaining a sentence-level vector representation through a single-layer convolutional neural network with max pooling; converting the sentence-level vector representations into a document-level vector representation through a recurrent neural network; and finishing the document reading. The abstract extraction module specifically performs the following steps: starting from an initial state; drawing on the idea of the attention mechanism, using a recurrent neural network to label, in sequence, whether each sentence should be included in the summary; and finishing abstract extraction.

The invention also provides a method for extracting the question and answer content in the programming environment, which comprises the following steps:

a data processing step, which specifically comprises: preprocessing the input online question-and-answer text data, removing useless information and performing word segmentation;

an entity identification step, which specifically comprises: performing software-engineering-domain entity identification on the text processed in the data processing step;

a document reading step, which specifically comprises: inputting the text processed in the entity identification step into a neural network for document reading;

an abstract extraction step, which specifically comprises: extracting the key content of the question-and-answer text by using another neural network.

As a preferred embodiment, the data processing step specifically includes: starting from an initial state; processing code segments in the question-and-answer text; processing HTML tags; processing URLs; processing emoticons; processing "@" mentions; performing word segmentation with the nltk tool; and finishing the data processing.

As a preferred embodiment, the entity identification step specifically includes: starting from an initial state; computing the spelling pattern of each word, including whether the initial letter of the word is capitalized, whether an underscore is included and whether the word is included; computing the context features of each word, specifically using a window of [-2, 2] and adding the words in the window, namely the preceding and following words, as features; computing the bit-stream features of each word, specifically using large-scale unlabeled text from the software engineering field, clustering similarly distributed words into one class with a clustering method, and representing the classes by bit streams of different lengths as features; computing the external dictionary feature of each word, specifically collecting a large number of known entities to form an external dictionary and checking whether the word exists in that dictionary; performing entity identification with a CRF model trained using the CRF++ tool; and finishing entity identification.

As a preferred embodiment, the document reading step specifically includes: starting from an initial state; obtaining a sentence-level vector representation through a single-layer convolutional neural network with max pooling; converting the sentence-level vector representations into a document-level vector representation through a recurrent neural network; and finishing the document reading. The abstract extraction step specifically includes: starting from an initial state; drawing on the idea of the attention mechanism, using a recurrent neural network to label, in sequence, whether each sentence should be included in the summary; and finishing abstract extraction.

The invention also proposes an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method when executing the program.

The invention also proposes a medium on which a computer program is stored which, when being executed by a processor, implements the steps of the method.

The invention has the following beneficial effects: (1) The system provided by the invention for automatically extracting the key content of technical questions and answers in a programming environment can extract that key content, reduce the time developers spend browsing and improve on-site programming efficiency. (2) The system can extract the key content of technical questions and answers automatically, without manual labeling, which greatly reduces the cost of key content extraction. (3) The method provided by the invention for automatically extracting the key content of technical questions and answers in a programming environment is a new attempt directed at the software engineering field and fills a gap in key content extraction for that field.

Drawings

Fig. 1 is a flowchart of a method of extracting question-answer contents in a programming environment of the present invention.

Fig. 2 is a schematic diagram of an example of the CNN structure of the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.

Example 1: the invention provides a system for extracting question and answer content in a programming environment, configured as follows:

Preferably, the data processing module specifically performs the following steps: starting from an initial state; processing code segments in the question-and-answer text; processing HTML tags; processing URLs; processing emoticons; processing "@" mentions; performing word segmentation with the nltk tool; and finishing the data processing.

Preferably, the entity identification module specifically performs the following steps: starting from an initial state; computing the spelling pattern of each word, including whether the initial letter of the word is capitalized, whether an underscore is included and whether the word is included; computing the context features of each word, specifically using a window of [-2, 2] and adding the words in the window, namely the preceding and following words, as features; computing the bit-stream features of each word, specifically using large-scale unlabeled text from the software engineering field, clustering similarly distributed words into one class with a clustering method, and representing the classes by bit streams of different lengths as features; computing the external dictionary feature of each word, specifically collecting a large number of known entities to form an external dictionary and checking whether the word exists in that dictionary; performing entity identification with a CRF model trained using the CRF++ tool; and finishing entity identification.

Preferably, the document reading module specifically performs the following steps: starting from an initial state; obtaining a sentence-level vector representation through a single-layer convolutional neural network with max pooling; converting the sentence-level vector representations into a document-level vector representation through a recurrent neural network; and finishing the document reading. The abstract extraction module specifically performs the following steps: starting from an initial state; drawing on the idea of the attention mechanism, using a recurrent neural network to label, in sequence, whether each sentence should be included in the summary; and finishing abstract extraction.

Example 2: the invention also provides a method for extracting question and answer content in a programming environment. The general framework of the invention is shown in Fig. 1, and the method comprises the following 4 steps:

Step 1: for question-and-answer text from the web, first clear the content of all <pre> tags; because the code segments in questions and answers appear inside <pre> tags, clearing their content removes the code segments. Then delete all HTML tags, such as <pre>, <p> and <div>. Next, replace each URL appearing in the text with "@u@", each emoticon such as ":" with "@e@", and each "@" mention of another user with "@a@". Finally, segment the text into words with the nltk word segmentation tool, with the requirement that an API name be kept whole; for example, os.path.join(path) must remain a single, unsplit word.
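The cleaning order in Step 1 can be sketched as follows. This is a minimal illustration rather than the patented implementation: every regular expression, the helper name `preprocess`, and the placeholder spellings are assumptions, and a regex tokenizer stands in for the nltk tool so that the example needs only the standard library.

```python
import html
import re

# Placeholder tokens named in Step 1; their exact spelling is an assumption.
URL_TOKEN, EMOTICON_TOKEN, MENTION_TOKEN = "@u@", "@e@", "@a@"

PRE_BLOCK = re.compile(r"<pre\b[^>]*>.*?</pre\s*>", re.IGNORECASE | re.DOTALL)
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+")
# Hypothetical emoticon pattern covering a few common ASCII forms such as :) ;-) =D
EMOTICON = re.compile(r"(?<!\S)[:;=][-']?[()DPp](?!\S)")
# "@name" mentions; the trailing (?!@) keeps the placeholders above intact.
MENTION = re.compile(r"(?<![\w@])@\w+\b(?!@)")
# Tokens: placeholders, dotted API names with an optional call, words, punctuation.
TOKEN = re.compile(
    r"@[uea]@|[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)+(?:\([^()\s]*\))?|\w+|[^\w\s]"
)


def preprocess(post_html):
    """Clean one question-and-answer post and return its word tokens."""
    text = PRE_BLOCK.sub(" ", post_html)      # drop code segments with their <pre> tags
    text = HTML_TAG.sub(" ", text)            # drop all remaining HTML tags
    text = html.unescape(text)                # decode entities such as &amp;
    text = URL.sub(URL_TOKEN, text)           # URLs -> @u@
    text = EMOTICON.sub(EMOTICON_TOKEN, text) # emoticons -> @e@
    text = MENTION.sub(MENTION_TOKEN, text)   # mentions of other users -> @a@
    return TOKEN.findall(text)                # API names stay whole


sample = (
    "<p>Thanks @bob :) see https://example.com/docs and use "
    "os.path.join(path).</p><pre><code>import os</code></pre>"
)
print(preprocess(sample))
```

Replacing URLs before mentions matters in this sketch, because a URL may itself contain an "@" character that would otherwise be mistaken for a mention.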

Step 2: perform entity identification on the text after data processing. The core of the entity identification method is a conditional random field (CRF) model implemented with the CRF++ tool. The features of the CRF model include:

(1) Word spelling features, such as whether the initial letter of the word is uppercase, whether the word contains an underscore, and whether it contains ";

(2) Context features. A window of [-2, 2] is used, and the words in the window, namely the preceding and following words, are added as features;

(3) Bit-stream features of words. Using large-scale unlabeled text from the software field, the Brown clustering algorithm groups words that appear in similar contexts into one class; the total number of word classes is set to 1000, and the words in the same class are represented by the same bit stream, which is used as a feature;

(4) External dictionary features. A large number of known entities are collected in advance to form an external dictionary, and each word is checked for membership in that dictionary.
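The four feature groups can be illustrated with a small per-token feature function. This is a sketch under stated assumptions: the function name `token_features`, the feature keys, the padding symbol and the toy cluster table and dictionary are all hypothetical, and the third spelling check is left out because the text above does not legibly name the character it tests for. Training the CRF itself (with CRF++ in the description) is not shown.

```python
def token_features(tokens, i, clusters, dictionary):
    """Build the feature dict for tokens[i].

    clusters:   word -> Brown-cluster bit stream (toy mapping, assumed precomputed)
    dictionary: set of known entity names (the external dictionary)
    """
    word = tokens[i]
    features = {
        # (1) spelling features
        "initial_upper": word[:1].isupper(),
        "has_underscore": "_" in word,
        # (3) bit-stream feature from clustering on unlabeled text
        "cluster_bits": clusters.get(word.lower(), "NONE"),
        # (4) external dictionary feature
        "in_dictionary": word.lower() in dictionary,
    }
    # (2) context features: every word in the [-2, 2] window
    for offset in range(-2, 3):
        j = i + offset
        inside = 0 <= j < len(tokens)
        features[f"word[{offset:+d}]"] = tokens[j].lower() if inside else "<PAD>"
    return features


tokens = ["Use", "os.path.join", "in", "Python"]
toy_clusters = {"python": "1101", "java": "1100"}   # hypothetical bit streams
toy_dictionary = {"python", "os.path.join"}          # hypothetical entity list
print(token_features(tokens, 3, toy_clusters, toy_dictionary))
```

Words that fall in the same cluster share one bit stream, so an entity that never appeared in the labeled training data can still be recognized through the cluster of a similar, known word.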

Step 3: read and encode the text after entity identification. First, a single-layer convolutional neural network (CNN) is used to obtain sentence-level representation vectors; a recurrent neural network (RNN) is then used to build a vector representation of the document. The CNN operates at the word level to obtain sentence-level representations, which are then used as input to the RNN, so that the document-level representation is obtained hierarchically. The embedding dimensions of words, sentences and documents are set to 150, 300 and 750, respectively.

In the single-layer convolutional neural network, each convolution kernel size uses multiple feature maps to compute a series of features; the number of feature maps is 300, matching the sentence dimension. Convolution kernels of different widths, from 1 to 7, are used to obtain different feature vectors for the sentence, and these vectors are finally added to obtain the final sentence vector representation. The lower half of Fig. 2 shows an example of the CNN structure. The word dimension is 5 and the illustrated sentence contains 6 words. The two colors represent two convolution kernels: the blue kernel has width 2 and the red kernel has width 3, and each of the two kernel widths has 6 feature maps. After pooling, each feature map corresponds to one dimension of the resulting vector, so the two convolution kernels yield two 6-dimensional vectors, which are summed to obtain the final sentence vector.
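The Fig. 2 example can be traced in a few lines of plain Python. This is a shape-level sketch only: the weights are random, the bias and nonlinearity are omitted because the description does not specify them, and the function names are assumptions.

```python
import random


def conv_max_pool(sentence, kernel):
    """One feature map: slide a (width x word_dim) kernel over the words, keep the max."""
    width = len(kernel)
    responses = []
    for start in range(len(sentence) - width + 1):
        window = sentence[start:start + width]
        responses.append(
            sum(w * x
                for k_row, x_row in zip(kernel, window)
                for w, x in zip(k_row, x_row))
        )
    return max(responses)


def encode_sentence(sentence, kernels_by_width):
    """Sum the pooled vectors of all kernel widths into one sentence vector.

    kernels_by_width: {width: [kernel, ...]} with the same number of
    feature maps for every width.
    """
    n_maps = len(next(iter(kernels_by_width.values())))
    vector = [0.0] * n_maps
    for kernels in kernels_by_width.values():
        pooled = [conv_max_pool(sentence, k) for k in kernels]
        vector = [a + b for a, b in zip(vector, pooled)]
    return vector


def figure_2_example(seed=0):
    """6 words of dimension 5, kernel widths 2 and 3, 6 feature maps each."""
    rng = random.Random(seed)
    sentence = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(6)]
    kernels = {
        width: [[[rng.uniform(-1, 1) for _ in range(5)] for _ in range(width)]
                for _ in range(6)]
        for width in (2, 3)
    }
    return encode_sentence(sentence, kernels)


print(len(figure_2_example()))  # one value per feature map
```

With 6 words, the width-2 kernel sees 5 windows and the width-3 kernel sees 4; max pooling reduces each feature map to a single number, which is why both kernel widths produce vectors of the same length and can be summed.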

The recurrent neural network (RNN) is a single-layer long short-term memory (LSTM) network, which is used to mitigate the vanishing-gradient problem that arises when training on long sequences.
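For reference, the commonly used formulation of an LSTM cell is reproduced below. The description does not give its own equations, so the notation is an assumption: x_t is the t-th sentence vector, h_t the hidden state, c_t the cell state, sigma the logistic function, and W, U, b are learned parameters.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The cell state c_t is updated additively, so gradients can flow across many time steps without being repeatedly squashed, which is the usual explanation for why an LSTM suffers less from vanishing gradients than a plain RNN.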

Step 4: drawing on the idea of the attention mechanism, a recurrent neural network labels, in sequence, whether each sentence can be regarded as key content; the labeling process can take into account whether sentences are independent of each other and whether their meanings are repeated. As shown in the upper right part of Fig. 2, the label of the next sentence depends not only on the current input but also on the label of the previous sentence.
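The dependence of each label on the previous one can be shown with a deliberately simplified stand-in. This toy labeler is not the recurrent network of the description: it scores each sentence with a single logistic unit, and the parameter names, the threshold and the hand-picked weights are all assumptions made for illustration.

```python
import math


def label_sentences(sentence_vectors, weights, prev_label_weight, bias,
                    threshold=0.5):
    """Label sentences in order; each decision also sees the previous label."""
    labels, prev = [], 0
    for vec in sentence_vectors:
        score = (sum(w * x for w, x in zip(weights, vec))
                 + prev_label_weight * prev
                 + bias)
        prob = 1.0 / (1.0 + math.exp(-score))  # probability of "key content"
        prev = 1 if prob >= threshold else 0
        labels.append(prev)
    return labels


# Three similar sentences followed by an irrelevant one (1-dimensional toy vectors).
vectors = [[2.0], [2.0], [2.0], [-3.0]]
print(label_sentences(vectors, [1.0], prev_label_weight=0.0, bias=0.0))
print(label_sentences(vectors, [1.0], prev_label_weight=-3.0, bias=0.0))
```

With a negative weight on the previous label, a sentence that closely follows an already selected one is less likely to be selected again, which is one simple way in which sequential labeling can discourage repeated content.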

Example 3: the invention also proposes an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method when executing the program.

Example 4: the invention also proposes a medium on which a computer program is stored which, when being executed by a processor, implements the steps of the method.

The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims

1. A system for extracting question and answer content in a programming environment, characterized by comprising:

an entity identification module for executing: performing software-engineering-domain entity identification on the text processed by the data processing module; the entity identification module specifically performs the following steps: starting from an initial state; computing the spelling pattern of each word, including whether the initial letter of the word is capitalized, whether an underscore is included and whether the word is included; computing the context features of each word, specifically using a window of [-2, 2] and adding the words in the window, namely the preceding and following words, as features; computing the bit-stream features of each word, specifically using large-scale unlabeled text from the software engineering field, clustering similarly distributed words into one class with a clustering method, and representing the classes by bit streams of different lengths as features; computing the external dictionary feature of each word, specifically collecting a large number of known entities to form an external dictionary and checking whether the word exists in that dictionary; performing entity identification with a CRF model trained using the CRF++ tool; finishing entity identification;

2. The system for extracting question and answer content in a programming environment of claim 1, wherein the data processing module specifically performs the following steps: starting from an initial state; processing code segments in the question-and-answer text; processing HTML tags; processing URLs; processing emoticons; processing "@" mentions; performing word segmentation with the nltk tool; and finishing the data processing.

3. The system for extracting question and answer content in a programming environment of claim 1, wherein the document reading module specifically performs the following steps: starting from an initial state; obtaining a sentence-level vector representation through a single-layer convolutional neural network with max pooling; converting the sentence-level vector representations into a document-level vector representation through a recurrent neural network; and finishing the document reading; and the abstract extraction module specifically performs the following steps: starting from an initial state; drawing on the idea of the attention mechanism, using a recurrent neural network to label, in sequence, whether each sentence should be included in the summary; and finishing abstract extraction.

4. A method for extracting question and answer content in a programming environment, characterized by comprising the following steps:

an entity identification step, which specifically comprises: performing software-engineering-domain entity identification on the text processed in the data processing step; the entity identification step specifically includes: starting from an initial state; computing the spelling pattern of each word, including whether the initial letter of the word is capitalized, whether an underscore is included and whether the word is included; computing the context features of each word, specifically using a window of [-2, 2] and adding the words in the window, namely the preceding and following words, as features; computing the bit-stream features of each word, specifically using large-scale unlabeled text from the software engineering field, clustering similarly distributed words into one class with a clustering method, and representing the classes by bit streams of different lengths as features; computing the external dictionary feature of each word, specifically collecting a large number of known entities to form an external dictionary and checking whether the word exists in that dictionary; performing entity identification with a CRF model trained using the CRF++ tool; finishing entity identification;

a document reading step, which specifically comprises: inputting the text processed in the entity identification step into a neural network for document reading;

5. The method for extracting question and answer content in a programming environment according to claim 4, wherein the data processing step specifically comprises: starting from an initial state; processing code segments in the question-and-answer text; processing HTML tags; processing URLs; processing emoticons; processing "@" mentions; performing word segmentation with the nltk tool; and finishing the data processing.

6. The method for extracting question and answer content in a programming environment according to claim 4, wherein the document reading step specifically comprises: starting from an initial state; obtaining a sentence-level vector representation through a single-layer convolutional neural network with max pooling; converting the sentence-level vector representations into a document-level vector representation through a recurrent neural network; and finishing the document reading; and the abstract extraction step specifically comprises: starting from an initial state; drawing on the idea of the attention mechanism, using a recurrent neural network to label, in sequence, whether each sentence should be included in the summary; and finishing abstract extraction.

7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 4 to 6 when the program is executed.

8. A medium having stored thereon a computer program, which when executed by a processor performs the steps of the method according to any of claims 4 to 6.