CN114398905A

CN114398905A - Crowd-sourcing-oriented problem and solution automatic extraction method, corresponding storage medium and electronic device

Info

Publication number: CN114398905A
Application number: CN202210002150.0A
Authority: CN
Inventors: 石琳; 江子攸; 王青
Original assignee: Institute of Software of CAS
Current assignee: Institute of Software of CAS
Priority date: 2022-01-04
Filing date: 2022-01-04
Publication date: 2022-04-26

Abstract

The invention provides a crowd-sourcing-oriented method for automatically extracting problems and solutions, together with a corresponding storage medium and electronic device. The method is based on a customized, enhanced deep-learning technique for natural language processing and involves two basic tasks: 1) dialogue decoupling of real-time chat logs, in which chronologically ordered linear text is automatically decomposed into independent dialogues using data preprocessing and a feedforward neural network; and 2) extraction of problems and solutions using a new problem-solution prediction network comprising a statement coding layer, a context-dependent statement coding layer and an output layer, from which a problem-solution knowledge base for the corpus is constructed. The invention requires no complex rule set for extraction and enables fully automatic recommendation of problem-solution pairs. Experiments show that the crowd-sourcing model can promote knowledge sharing and improve problem-solving efficiency, thereby promoting software development based on chat communities.

Description

Crowd-sourcing-oriented problem and solution automatic extraction method, corresponding storage medium and electronic device

Technical Field

The invention belongs to the technical field of computers, and particularly relates to an automatic extraction method for crowd-sourcing-oriented problems and solutions, a corresponding storage medium and an electronic device.

Background

With the continuous development of online chat platforms, synchronous communication through real-time chat allows developers to seek information and technical support, share opinions and ideas, and discuss problems arising during development more efficiently than asynchronous channels such as e-mail or forums. Real-time chat has thus become an integral part of most software development processes. It helps open-source communities of globally distributed developers to form, and within software companies it facilitates internal team communication and coordination, particularly in accommodating the remote work brought about by the COVID-19 pandemic. Real-time chat platforms are used to resolve a variety of software development issues, such as installation and configuration, bug fixing, and building and compiling. Developers ask questions about specific issues and rely on others' answers for potential solutions.

Automated "problem-solution" extraction techniques have been studied extensively, for example the SVM-based Casper method, the rule-set-based DECA, the CNN-based CNC, and UIT, which uses a context classifier. However, none of these methods addresses the following three challenges in mining real-time chat: (1) entangled dialogues: real-time chat data is voluminous, and multiple concurrent discussions of different problems are often interleaved; (2) high labor costs: chat logs typically contain a large number of informal conversations covering a wide range of technologies and complex topics; (3) noisy data: chat logs contain duplicate and unreadable messages that provide no valuable information. These issues reduce the accuracy and efficiency of extraction and make existing methods unsuitable for wide adoption in industry.

Disclosure of Invention

In view of the above problems, the invention provides a crowd-sourcing-oriented technique for automatically extracting problems and solutions. It aims to extract a large number of problem-solution pairs automatically from complex community real-time chat text through natural language processing and information extraction, thereby expanding the knowledge base of difficult problems encountered during development and enabling solutions to be recommended automatically, based on historical experience, on online question-and-answer platforms.

The invention relates to an automatic extraction method for a crowd-sourcing-oriented problem and a solution, which comprises the following steps:

decoupling conversations of the real-time chat logs, and decomposing linear texts arranged in time sequence into independent conversations;

and extracting the problems and the solutions from the decomposed conversation by using a new problem-solution prediction network, and constructing a problem and solution knowledge base in the corpus by using the extracted problems and solutions.

Further, the decoupling of the dialogs of the real-time chat log comprises the steps of data preprocessing through text analysis and splitting of the dialogs using a dialogue decoupling model.

Further, the data preprocessing comprises the following steps:

1) capturing linear text data from online platforms with a crawler, collecting chat records over a certain period from a chat platform that is divided by project and organized chronologically, such as Gitter;

2) segmenting the conversation into words and replacing low-frequency words with specific symbols to reduce interference;

3) replacing emoticons in the text with standard character strings;

4) computing the coherence of adjacent sentences via perplexity, using Baidu AI Cloud, and merging adjacent sentences whose perplexity is lower than a set threshold (for example, 40) into a new sentence.

Further, the dialogue decoupling model is a linear feedforward neural network with 2 layers and 512-dimensional hidden vectors. This network achieved the best test results on an online chat dialogue decoupling data set of 77563 samples, reaching an accuracy rate of 74.9% and a recall rate of 79.7%.

Further, the "problem-solution" predictive network contains a statement coding layer, a context dependent statement coding layer, and an output layer.

Further, the statement coding layer comprises:

1) a BERT model for encoding sentences, pre-trained on a 2500M text corpus and fine-tuned on the decoupled dialogue data;

2) a triple for context encoding, which gathers the current sentence and its k adjacent preceding and following sentences into an independent window vector used for subsequent dialogue encoding.

Further, the context-dependent statement coding layer uses three feature extractors to produce encodings that contain both the context information of the dialogue and the feature information of the sentence itself. The three feature extractors are:

1) a text feature extractor based on a convolutional network, which uses three convolution layers with max-pooling to reduce the dimensionality of the original sentence encoding while preserving the sentence semantics;

2) an attribute-based heuristic feature extractor, which comprises heuristic feature encodings of keywords, structure, topic, sentiment and role, and is used to extract high-level semantic information of the sentence;

3) a triple-based context feature extractor, which obtains weighted encodings using a local attention mechanism so as to capture the semantic information of the context.

Further, the output layer takes the concatenated text feature vector, heuristic feature vector and context feature vector and uses two fully connected layers (FC₁, FC₂) to predict whether a sentence is a problem and whether it is a solution, respectively.

A storage medium having a computer program stored therein, wherein the computer program performs the above method.

An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the above method.

Compared with the prior art, the invention has the advantages that:

The invention enables automated and intelligent extraction of problems and solutions from open-source community chat systems.

The method does not require complex rules for extraction, adapts across domains, and reduces the overhead of the problem-solution extraction algorithm.

Tests on the text data sets of eight major representative projects show that the designed "problem-solution" extraction algorithm achieves higher accuracy, recall and harmonic mean (F1).

The knowledge base constructed by the invention can cover most problems that may remain unsolved, which facilitates knowledge reuse and automatic solution recommendation.

By understanding online chat documents written in natural language, the invention separates independent dialogues from complex linear text, and uses shared text feature encoding, heuristic feature encoding and context feature encoding layers for both problem prediction and solution prediction. Building on semantic analysis and text mining, this simplifies the prediction task and locates "problem-solution" pairs more accurately. The automatic extraction algorithm is more robust to noisy data, reduces the cost of manual extraction, and achieves higher F1 scores; because the model completes analysis and recommendation across the metrics of multiple projects, it has higher industrial value.

Drawings

FIG. 1 is a flow chart of dialogue decoupling in the model of the present invention.

FIG. 2 is a hierarchical flow chart of model prediction in the present invention.

FIG. 3 is a flow chart of the application of the model of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the present invention shall be described in further detail with reference to the following detailed description and accompanying drawings.

The invention provides a method for automatically extracting "problem-solution" pairs from open-source communities and constructing a domain knowledge base. After semantic analysis and natural language processing, preprocessing is used to construct dialogue samples from plain text. The problem and solution labels are then located by the shared multi-layer encoding and prediction model. Finally, all predicted "problem-solution" pairs are integrated to construct a complete question-answer knowledge base. The present invention is further illustrated by the following specific embodiments.

FIG. 1 is a block diagram illustrating dialogue decoupling in the model of the present invention. It comprises four main steps: spell checking, low-frequency word replacement, acronym and emoticon replacement, and dialogue decoupling.

Step 1.1: Spell checking is performed first, that is, all text is collated and potentially misspelled words and incorrect tenses are replaced with standard vocabulary.

Step 1.2: Low-frequency words that share uniform characteristics and belong to special types are replaced with specific wildcards, usually identifiers enclosed in "[ ]". Five common types of low-frequency words are covered: uniform resource locators ([URL]), e-mail addresses ([EMAIL]), web links ([HTML]), source code ([CODE]) and identity information ([ID]).
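The wildcard replacement of Step 1.2 can be sketched with regular expressions. This is a minimal illustration only: the patterns, their ordering, the name `replace_low_frequency_tokens`, and the reading of [HTML] as HTML tags and [ID] as "@" mentions are all assumptions, since the source names the five placeholder types but not the matching rules.

```python
import re

# Hypothetical patterns (not from the source). Order matters: code spans and
# e-mail addresses are replaced before bare URLs and "@" mentions.
PLACEHOLDER_PATTERNS = [
    ("[CODE]", re.compile(r"`[^`]+`")),                             # inline code spans
    ("[HTML]", re.compile(r"<[^>]+>")),                             # assumed: HTML tags
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),   # e-mail addresses
    ("[URL]", re.compile(r"https?://\S+")),                         # http(s) links
    ("[ID]", re.compile(r"(?<!\w)@\w+")),                           # assumed: user mentions
]


def replace_low_frequency_tokens(text: str) -> str:
    """Replace each matched span with its placeholder, pattern by pattern."""
    for placeholder, pattern in PLACEHOLDER_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


message = "@alice see https://example.com/docs or mail bob@example.com, then run `npm install`"
print(replace_low_frequency_tokens(message))
```

A production rule set would likely need more careful patterns (for example multi-line code blocks), which the source does not describe.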

Step 1.3: Commonly used abbreviations are replaced using a standard abbreviation list (e.g., IDK → I don't know), and special Unicode-encoded emoticons are replaced with standard ASCII characters. Based on these replacements, a plain ASCII-encoded document can be constructed for training.
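A small sketch of the abbreviation and emoticon replacement of Step 1.3 follows. The lookup tables are hypothetical (the source gives only "IDK" as an example and does not list its emoticon mapping), and dropping any remaining non-ASCII characters is an assumption made here so that the output is plain ASCII.

```python
import re

# Hypothetical lookup tables; only the "IDK" entry comes from the source.
ABBREVIATIONS = {"idk": "I don't know", "imo": "in my opinion"}
EMOTICONS = {"\U0001F600": ":)", "\U0001F44D": "+1"}

_WORD = re.compile(r"[A-Za-z]+")


def expand_abbreviations(text: str) -> str:
    """Replace whole words found in the abbreviation list (case-insensitive)."""
    return _WORD.sub(lambda m: ABBREVIATIONS.get(m.group(0).lower(), m.group(0)), text)


def normalize_emoticons(text: str) -> str:
    """Replace known Unicode emoticons with an ASCII stand-in."""
    for emoji, ascii_form in EMOTICONS.items():
        text = text.replace(emoji, ascii_form)
    return text


def to_plain_ascii(text: str) -> str:
    text = normalize_emoticons(expand_abbreviations(text))
    # Assumption: characters that are still non-ASCII are dropped.
    return text.encode("ascii", errors="ignore").decode("ascii")


print(to_plain_ascii("IDK, works for me \U0001F44D"))
```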

Step 1.4: First, Baidu AI Cloud is used to compute the coherence of adjacent sentences via perplexity, and adjacent sentences whose perplexity is lower than 40 are merged into a new sentence. Second, a multilayer feedforward network is selected as the dialogue decoupling model $f$, which decouples the original interleaved dialogue into a set of independent dialogues:

$$f: [u_1, u_2, \ldots, u_n] \rightarrow [D_1, D_2, \ldots, D_n],$$

$$D_i = \{u_{d_1}, u_{d_2}, \ldots, u_{d_i}\}$$

where $[u_1, u_2, \ldots, u_n]$ are the time-ordered sentences of the original linear text and $D = [D_1, D_2, \ldots, D_n]$ is the list of decoupled dialogues. Each dialogue $D_i$ is composed of sentences $u_{d_1}, u_{d_2}, \ldots, u_{d_i}$ extracted from the original linear text. The extracted samples $D$ serve as the input of the model.
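The two parts of Step 1.4 can be sketched as plain list operations. The perplexity scorer is passed in as a hypothetical callable `perplexity(prev, nxt)`, standing in for the cloud service, whose interface the source does not describe; the per-sentence dialogue labels are assumed to come from the decoupling model $f$. The greedy left-to-right merge is also an assumption.

```python
from typing import Callable, Hashable, List


def merge_adjacent(utterances: List[str],
                   perplexity: Callable[[str, str], float],
                   threshold: float = 40.0) -> List[str]:
    """Greedily merge an utterance into its predecessor when the pair scores below threshold."""
    merged: List[str] = []
    for utt in utterances:
        if merged and perplexity(merged[-1], utt) < threshold:
            merged[-1] = merged[-1] + " " + utt
        else:
            merged.append(utt)
    return merged


def group_into_dialogs(utterances: List[str], dialog_ids: List[Hashable]) -> List[List[str]]:
    """Group time-ordered utterances into dialogues D_i, given one label per utterance."""
    dialogs = {}
    for utt, dialog_id in zip(utterances, dialog_ids):
        dialogs.setdefault(dialog_id, []).append(utt)
    return list(dialogs.values())


# Toy scorer for illustration only: treats a continuation starting with "because" as coherent.
toy_perplexity = lambda prev, nxt: 10.0 if nxt.startswith("because") else 100.0

print(merge_adjacent(["it fails", "because of the proxy", "anyone using v2?"], toy_perplexity))
print(group_into_dialogs(["a", "b", "c", "d"], [0, 1, 0, 1]))
```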

FIG. 2 is a flow chart of a hierarchy of model prediction according to the present invention. The model level flow chart comprises two main parts: problem prediction models and solution extraction models. The two models share the same model structure and different parameters, and are respectively used for predicting whether the current conversation contains a problem or not and extracting a statement corresponding to a solution.

Step 2.1: Each dialogue $D_i$ is first divided into two parts: a head $H_i$, corresponding to the part containing the problem, and a body $B_i$, containing the solution to be extracted. The pair $D_i = \langle H_i, B_i \rangle$ makes up the entire candidate dialogue. The head can therefore be fed to the problem model to train problem prediction, and the body can be used to train solution prediction, which simplifies the training steps and reduces overhead.

Step 2.2.1: BERT-based independent statement coding is used. The "[CLS]" output is selected, encoding the word sequence into an 800-dimensional sentence encoding. Based on this encoding, the model constructs a context window of size $2k+1$: a triple formed by the set of preceding-context sentence encodings $(u_{i-k}, \ldots, u_{i-1})$, the current sentence encoding $u_i$, and the set of following-context sentence encodings $(u_{i+1}, \ldots, u_{i+k})$, which is used for subsequent context-based encoding:

$$win_i = [u_{i-k}, \ldots, u_i, \ldots, u_{i+k}]$$
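The window triple can be illustrated with a short helper. The function name `context_window` and the padding of positions that fall outside the dialogue with `pad` are assumptions; the source does not say how windows at dialogue boundaries are handled.

```python
def context_window(encodings, i, k, pad=None):
    """Return (preceding, current, following) for sentence i, with k neighbours per side.

    Positions outside the dialogue are filled with `pad` (an assumption),
    so the window always holds 2k + 1 entries in total.
    """
    n = len(encodings)
    preceding = [encodings[j] if j >= 0 else pad for j in range(i - k, i)]
    following = [encodings[j] if j < n else pad for j in range(i + 1, i + k + 1)]
    return preceding, encodings[i], following


# Strings stand in for sentence encodings to keep the example readable.
sentences = ["u0", "u1", "u2", "u3"]
print(context_window(sentences, 1, 2))
```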

Step 2.2.2: The context-dependent statement coding consists of three components: a text feature extractor, a heuristic feature extractor and a context feature extractor.

The text feature extractor uses a three-layer convolutional deep network to reduce the dimensionality of the sentence features while preserving their semantics. Given a convolution kernel and the sentence embedding $x = u_i$, the constructed feature vector is:

$$\gamma_t = \mathrm{ReLU}(W \cdot x_{t:t+h-1} + b)$$

$$\gamma = [\gamma_1, \gamma_2, \ldots, \gamma_{n-h+1}]$$

where ReLU is the activation function, $W$ and $b$ are the convolution kernel parameters, $\gamma_t$ is the output feature map at position $t$, $t$ is a position in the sentence encoding, $h$ is the size of the convolution kernel (the encoding window), $n$ is the encoding length of a single sentence, $x_{t:t+h-1}$ is the sub-vector of length $h$ starting at position $t$ in the sentence embedding $x$, and $\gamma$ is the set of all feature maps output after sliding the window. The feature vector of each layer is output through a max-pooling layer.

The output dimensions of the three layers are 1024, 512 and 256, respectively, so the text feature extractor finally outputs a 256-dimensional text feature vector as the reduced sentence encoding.
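As a worked illustration of the convolution and max-pooling formulas, the following pure-Python sketch applies a single one-dimensional kernel to a toy embedding. It covers one layer only, not the three-layer stack with output sizes 1024, 512 and 256; the function names and the toy numbers are assumptions.

```python
def conv_feature_map(x, W, b):
    """gamma_t = ReLU(W . x[t:t+h] + b) for every window position t (0-based here)."""
    h = len(W)
    return [
        max(0.0, sum(w * v for w, v in zip(W, x[t:t + h])) + b)
        for t in range(len(x) - h + 1)
    ]


def max_pool(feature_map):
    """Max-pooling over the whole feature map keeps its largest activation."""
    return max(feature_map)


def text_features(x, kernels):
    """One pooled feature per (W, b) kernel; the number of kernels sets the output size."""
    return [max_pool(conv_feature_map(x, W, b)) for W, b in kernels]


x = [1.0, -2.0, 3.0, 0.5]                    # toy sentence embedding, n = 4
gamma = conv_feature_map(x, [1.0, 1.0], 0.0)  # kernel size h = 2
print(gamma)                                  # n - h + 1 = 3 values
print(max_pool(gamma))
```

With these toy values the window sums are -1.0, 1.0 and 3.5, so ReLU gives [0.0, 1.0, 3.5] and pooling keeps 3.5.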

The heuristic feature extractor is attribute-based. It comprises heuristic feature encodings of keywords, structure, topic, sentiment and role, and finally outputs a 29-dimensional heuristic feature encoding $\xi_i$, which is used to extract high-level semantic information of the sentence. The specific heuristic feature categories, variables, descriptions and examples are shown in Table 1.

TABLE 1

The context feature extractor combines a window mechanism with a local attention mechanism: weight vectors are used to predict the weight relating a given sentence to each specific sentence in its window context, and the context-dependent sentence encoding is obtained as a weighted sum. The model constructs a triple in key-value form: $(h_Q, h_K, h_V) = W^{QKV} \cdot (u_i, u_s, u_s)$, where $h_Q$ is the attention query vector, $h_K$ is the key vector matched against the query, $h_V$ is the corresponding value vector, $W^{QKV}$ is the fully connected layer matrix that encodes Q, K and V, and $u_i$, $u_s$ are the encoding of the current candidate sentence and the encoding of a sentence at a specific position in the window, satisfying $u_s \in win_i$, where $win_i$ is the window vector of size $2k+1$ defined above. The model computes the attention weight of a specific position using dot-product similarity.

Here, $\mathrm{score}(h_Q, h_K)$ is the score of the query against the key, $s$ and $i$ are the positions of the two sentences $u_s$ and $u_i$ between which the local attention weight is computed, $k$ is the window size, half of which is taken as the $\sigma$ of the normal distribution, and $a_s$ is the resulting local attention weight between $u_s$ and $u_i$.

The weights of the specific positions are then accumulated to obtain the final encoding vector, where $d$ is the dimension of a single sentence vector within the window; a 128-dimensional context-dependent sentence vector is ultimately output. The vectors output by the three components are concatenated to obtain the complete context-dependent statement code.
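The attention formulas themselves are not reproduced in this text, so the following is only a sketch of one common form of Gaussian-weighted local attention that fits the description. Everything specific here is an assumption: the softmax over dot-product scores, the scaling by the square root of $d$, the Gaussian factor with $\sigma = k/2$, and the omission of the learned projection $W^{QKV}$ (queries, keys and values are the raw sentence vectors).

```python
import math


def local_attention(window, center, k):
    """window: list of 2k+1 sentence vectors; center: index of the current sentence.

    Returns (weights, context), where context is the weighted sum of the window vectors.
    """
    query = window[center]
    d = len(query)
    sigma = k / 2.0  # assumption: half the window size is used as sigma

    # Dot-product similarity between the query and every key in the window.
    scores = [sum(q * v for q, v in zip(query, key)) / math.sqrt(d) for key in window]

    # Numerically stable softmax over the window.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)

    # Gaussian factor favours positions close to the current sentence.
    weights = [
        (e / total) * math.exp(-((pos - center) ** 2) / (2 * sigma ** 2))
        for pos, e in enumerate(exps)
    ]
    context = [sum(w * vec[j] for w, vec in zip(weights, window)) for j in range(d)]
    return weights, context


window = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # toy 2-dimensional sentence vectors, k = 1
weights, context = local_attention(window, center=1, k=1)
print(weights, context)
```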

Step 2.2.3: Fully connected prediction. The statement codes are fed through two fully connected layers, one per model, to judge whether a sentence is a problem and whether it is a solution, respectively. Given the statement code of the head and the statement code of the body, the fully connected layers perform binary classification to predict the problem and extract the solution.

Here, → denotes a function mapping, FC denotes a fully connected layer, $I$ is the problem indicator of the head sentence, $u_H$ is the head sentence of the split dialogue, $P(I \mid u_H)$ is the probability that the head sentence is predicted to be a problem, $S$ is the solution indicator of a body sentence, $u_B$ is the set of body sentences of the split dialogue, and $P(S \mid u_B)$ is the probability of each body sentence being predicted as a solution.
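The two prediction heads can be sketched as follows. The sigmoid output, the single-unit layers and all function names are assumptions, since the source states only that two fully connected layers perform binary classification. If the three feature vectors are simply concatenated, their stated sizes (256, 29 and 128) give a 413-dimensional statement code.

```python
import math


def concat_features(text_vec, heuristic_vec, context_vec):
    """Concatenate the three extractor outputs into one statement code."""
    return list(text_vec) + list(heuristic_vec) + list(context_vec)


def fully_connected(x, weights, bias):
    """A single-output fully connected layer: weights . x + bias."""
    return sum(w * v for w, v in zip(weights, x)) + bias


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_problem(head_code, fc1):
    """P(I | u_H): probability that the head sentence states a problem."""
    weights, bias = fc1
    return sigmoid(fully_connected(head_code, weights, bias))


def predict_solutions(body_codes, fc2):
    """P(S | u_B): one probability per body sentence."""
    weights, bias = fc2
    return [sigmoid(fully_connected(code, weights, bias)) for code in body_codes]


code = concat_features([0.0] * 256, [0.0] * 29, [0.0] * 128)
print(len(code))
print(predict_problem(code, ([0.0] * len(code), 0.0)))  # untrained layer: 0.5
```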

To optimize the model, this step uses cross-entropy to measure the loss between the predicted probability and the true label, and trains the model accordingly:

$$\mathrm{Loss}_I = -y_H \cdot \log P(I \mid u_H),$$

where $\mathrm{Loss}_I$ is the loss function of problem prediction, $y_H$ is the true label for problem prediction, $\mathrm{Loss}_S$ is the loss function of solution prediction, $y_i$ is the true solution label of the $i$-th body sentence, $u_{B_i}$ is the $i$-th body sentence, and $B$ is the body indicator.
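A numeric sketch of the losses follows. `problem_loss` implements the problem loss exactly as written above. The expression for the solution loss is not reproduced in this text, so `solution_loss` assumes the same term summed over the body sentences; that form is a guess based on the symbol definitions.

```python
import math


def problem_loss(y_h, p_problem):
    """Loss_I = -y_H * log P(I | u_H)."""
    return -y_h * math.log(p_problem)


def solution_loss(labels, probs):
    """Assumed form: the same term summed over all body sentences."""
    return -sum(y * math.log(p) for y, p in zip(labels, probs))


print(problem_loss(1, 0.5))                 # log 2, about 0.693
print(solution_loss([1, 0], [0.5, 0.9]))    # only the labelled solution contributes
```

Note that the term as written penalizes only positive examples; a full binary cross-entropy would also add a $(1 - y)\log(1 - P)$ term for negatives.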

FIG. 3 is a flow chart of the application of the model, which combines the problem model and the solution model after training has converged.

Step 3.1, dialogue decoupling: a real-time chat log is input into the crowd-sourcing model, and structured dialogue samples are obtained through the dialogue decoupling technique.

Step 3.2, model prediction: the records of the existing samples are input one at a time. After the head and body are separated, the problem model detects whether the head is a problem arising in the development process. If not, the current record is discarded and the next record is selected; otherwise, the body of the current record is extracted and fed to the solution model, which detects the sentences that describe a solution.

Step 3.3, integration and archiving: the set of dialogues predicted to contain a problem is extracted, and the predicted sentences are combined and stored in a candidate question-answer knowledge base. Specific examples of the resulting "problem-solution" knowledge base and the recommendation strategy are shown in Table 2.
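The application flow of Steps 3.2 and 3.3 reduces to a short filtering loop. In this sketch the trained models are passed in as hypothetical callables `is_problem` and `is_solution`, and `split` separates head from body; treating the first utterance as the head is an assumption made only for the toy example.

```python
def build_knowledge_base(dialogs, split, is_problem, is_solution):
    """Keep dialogues whose head is predicted to be a problem, with their predicted solutions."""
    knowledge_base = []
    for dialog in dialogs:
        head, body = split(dialog)
        if not is_problem(head):
            continue  # discard the record and move on to the next one
        solutions = [sentence for sentence in body if is_solution(sentence)]
        knowledge_base.append({"problem": head, "solutions": solutions})
    return knowledge_base


# Toy stand-ins for the trained models, for illustration only.
split = lambda dialog: (dialog[0], dialog[1:])
is_problem = lambda head: head.endswith("?")
is_solution = lambda sentence: sentence.startswith("try")

dialogs = [["how to fix X?", "hmm", "try Y"], ["hello all", "hi"]]
print(build_knowledge_base(dialogs, split, is_problem, is_solution))
```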

TABLE 2

The invention evaluated the F1 score of extraction on 171 "problem-solution" instances against multiple baselines and across multiple projects. The F1 score exceeds the baselines by more than 30% in problem detection and by more than 20% in solution extraction, with relatively high accuracy and stability. In addition, 30K problem-solution pairs extracted from 11 other community projects have been made public, demonstrating that the crowd-sourcing model can promote knowledge sharing and improve problem-solving efficiency, thereby promoting software development based on chat communities.

Another embodiment of the present invention provides a storage medium having a computer program stored therein, the computer program performing the method of the present invention.

Another embodiment of the present invention provides an electronic device, which includes a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to perform the method of the present invention.

Other embodiments of the invention:

1) to address the problem that the position of the problem shifts within the dialogue data handled by the context feature extractor, a Graph Attention Network (GAT) can be used to extract problems and solutions more accurately;

2) to address iterative updates of "problem-solution" pairs that may be caused by changes in project version information, temporal features such as the open-source project version can be added to the heuristic features in Table 1;

3) the existing problem prediction and solution extraction models may encounter multi-stage problems (for example, the solution S₁ to problem I₁ may cause a new problem I₂, which requires S₂ before I₁ is fully solved; two "problem-solution" knowledge pairs may then be output: <I₁, [step1: S₁; step2: S₂]> and <I₂, S₂>); a more complete knowledge base can be constructed using an extraction method that combines a neural network with rules;

4) to address extracted solution sentences that are not fluent enough, extractive summarization combined with connective-word prediction can be used to construct higher-quality solutions;

5) to address the time cost of the manual "problem-solution" recommendation in Table 2, an intelligent recommendation algorithm can be established; since a single problem may have multiple possible solutions, a de-duplicated knowledge base and a solution confidence ranking algorithm can also be developed to automatically recommend multiple candidate solutions for unsolved Stack Overflow questions, ranked by confidence.

The particular embodiments of the present invention disclosed above are illustrative only and are not intended to be limiting, since various alternatives, modifications, and variations will be apparent to those skilled in the art without departing from the spirit and scope of the invention. The invention should not be limited to the disclosure of the embodiments in the present specification, but the scope of the invention is defined by the appended claims.

Claims

1. A crowd-sourcing-oriented problem and solution automatic extraction method is characterized by comprising the following steps:

and adopting a problem-solution prediction network, extracting problems and solutions from the decomposed conversation, and constructing a problem and solution knowledge base by using the extracted problems and solutions.

2. The method of claim 1, wherein decoupling conversations of the real-time chat log comprises preprocessing data by text analysis and splitting conversations using a conversation decoupling model.

3. The method of claim 1, wherein the data preprocessing comprises:

1) capturing linear text data in online platform texts by using a crawler, and collecting chat records of a certain duration through a chat platform;

4) computing the coherence of adjacent sentences via perplexity, using Baidu AI Cloud, and merging adjacent sentences whose perplexity is lower than a set threshold into a new sentence.

4. The method of claim 1, wherein the dialogue decoupling model employs a linear feedforward neural network comprising 2-layer, 512-dimensional hidden layer vectors.

5. The method of claim 1, wherein the problem-solution prediction network comprises a statement coding layer, a context-dependent statement coding layer, and an output layer.

6. The method of claim 5, wherein the statement coding layer comprises:

1) a BERT model for coding the sentence, the model being pre-trained on the text and fine-tuned on the decoupled dialogue data;

7. The method of claim 5, wherein the context dependent sentence coding layer uses three feature extractors to extract codes containing context information of the dialog and feature information of the sentence itself, the three feature extractors comprising:

8. The method of claim 5, wherein the output layer uses the concatenated text feature vector, heuristic feature vector, and context feature vector to predict, using two fully connected layers, whether a sentence is a problem and whether it is a solution, respectively.

9. A storage medium, characterized in that a computer program is stored in the storage medium, which computer program performs the method of any of claims 1-8.

10. An electronic device, comprising a memory having a computer program stored therein and a processor configured to execute the computer program to perform the method of any of claims 1-8.