CN114239576A

CN114239576A - Issue label classification method based on topic model and convolutional neural network

Info

Publication number: CN114239576A
Application number: CN202111566439.7A
Authority: CN
Inventors: 张卫丰; 徐俊辉
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2021-12-20
Filing date: 2021-12-20
Publication date: 2022-03-25

Abstract

The invention discloses an issue label classification method based on a topic model and a convolutional neural network, comprising the following steps: 1) data collection: acquiring the required issue data as a data set through GitHub Archive; 2) data processing: cleaning the collected issue text; 3) LDA topic and word extraction: applying an LDA model to each issue text; 4) topic customization: defining custom topics and counting the high-frequency words under each topic; 5) topic fusion: combining the LDA topics with the user-defined topics to construct a topic word dictionary; 6) vector splicing: splicing the word vector and the topic vector; 7) data rebalancing: balancing the training set with a data rebalancing technique; 8) model training: using a convolutional neural network to identify and classify the issues. The method realizes automatic classification and identification of issue labels.

Description

Issue label classification method based on topic model and convolutional neural network

Technical Field

The invention belongs to the field of development and maintenance of software engineering, and particularly relates to an issue label classification method based on a topic model and a convolutional neural network.

Background

GitHub is currently one of the most popular platforms for collaborative project development, communication and sharing. It helps developers coordinate development using wikis and Git, thereby improving working efficiency. To date, GitHub hosts more than 12 million open source projects, and this number is still growing.

Maintenance is a vital task during the life cycle of a software project. On the one hand, the source code should be kept up to date, and any potential deficiencies in performance and correctness should be eliminated. On the other hand, maintenance personnel must spend as little time and effort as possible on these tasks to keep the cost of software maintenance low. The issue tracking system is an important means for maintenance personnel to carry out rigorous and efficient software evolution tasks. In the issue tracking system, maintenance personnel report problem tickets or potential problems, manage them and track their progress.

GitHub provides an integrated, lightweight issue tracking system: a submitter need only provide a short text summary (a title and an optional description) to report a new problem to a project hosted on GitHub. This simplified approach lowers the barrier to participation and attracts more inexperienced external contributors, but it complicates the development team's task of maintaining the software. To address this, GitHub provides a customizable labeling system that developers can use to label and manage problem reports. Labels give immediate clues about a problem, but during actual team development many issues are often pending at the same time, and manually assigning labels is a labor-intensive and time-consuming task. In fact, the labeling mechanism on GitHub is not fully utilized. There is therefore a need for a method that can classify issues automatically based on the issue title and description information.

The invention aims to classify issue labels based on a topic model and a convolutional neural network, so that the label of an issue is predicted automatically. After a new issue is submitted, the maintainers of the project can learn its topic promptly, which saves their working time and improves working efficiency.

Disclosure of Invention

In order to solve the above problems, the present invention provides an issue label classification method based on a topic model and a convolutional neural network, comprising the following steps:

1) cleaning the issue data set, and extracting and processing it into a data set that meets the requirements;

2) based on an LDA topic model, expressing the topics of the title and description information as probability distributions to obtain the topics and the word set under each topic, and fusing these with the word sets under the user-defined topics to form the final topic word dictionary;

3) fusing the topic vector and the word vector to form the final input vector, and rebalancing the data set with random over-sampling to optimize the input, thereby improving the classification effect of the model;

4) based on a convolutional neural network, taking the fused vector as input, extracting features with the convolutional layer, reducing dimensionality with the pooling layer while keeping the main features, outputting the probability of each category through softmax, and verifying the label classification effect by 10-fold cross-validation after model training.

Further, the specific steps of step 1) are as follows: texts in other languages exist in the issue data set and need to be deleted, so that only pure English text remains; links, code segments and emoticons in the title and description of the issue are deleted; abbreviations are then expanded, which makes topic-related words easier to identify; finally, word tokenization divides each sentence into individual words.
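The cleaning steps above can be sketched in a few lines of standard-library Python. This is only an illustration: the regular expressions, the `ABBREVIATIONS` table and the function name `clean_issue` are assumptions introduced here, not part of the disclosed method, and a real abbreviation table would be far larger.

```python
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"can't": "cannot", "doesn't": "does not", "i'm": "i am"}

CODE = re.compile(r"```.*?```|`[^`]*`", re.DOTALL)  # fenced and inline code
LINK = re.compile(r"https?://\S+")                   # bare URLs
# ":shortcode:" emoticons plus some common emoji code-point ranges (assumed)
EMOJI = re.compile(r":[a-z0-9_+\-]+:|[\U0001F300-\U0001FAFF\u2600-\u27BF]")
TOKEN = re.compile(r"[a-z]+")                        # plain English words


def clean_issue(title: str, body: str) -> list[str]:
    """Remove code, links and emoticons, expand abbreviations, tokenize."""
    text = f"{title} {body}"
    text = CODE.sub(" ", text)    # code first, so URLs inside code go too
    text = LINK.sub(" ", text)
    text = EMOJI.sub(" ", text)
    text = text.lower()
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return TOKEN.findall(text)


tokens = clean_issue(
    "App crashes :bug:",
    "see https://example.com/x ```print(1)``` it doesn't start",
)
```

Code segments are stripped before links so that a URL inside a code block does not leave a partial fragment behind.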

Further, the specific steps of step 2) are as follows: the processed text is input into an LDA model, and the topics and the words under each topic are extracted; the parameters of the model are selected, namely the prior α, which reflects the distribution of hidden topics in the text, the prior β, which reflects the distribution of words under each hidden topic, and the number of topics K, and the resulting topics are checked for reasonableness; key topics are user-defined, high-frequency words are counted, and 10 high-frequency words are selected as the word set under each topic; Word2Vec is trained on the tokenized data set, and the cosine similarity between each word and the topic words is calculated; if the cosine similarity is greater than 0.75, the word is added to the word set under the corresponding topic; a word that appears in more than one set is assigned by manual judgment.
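The similarity-based expansion of the topic word sets can be illustrated as follows. The sketch assumes word vectors are already available (in the method they come from Word2Vec). The toy 2-D vectors, the topic names and the helper names are hypothetical, and comparing each word against every seed word of a topic is one possible reading of "each word and the topic words".

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_topic_words(topic_words, vectors, threshold=0.75):
    """Add every word whose similarity to any seed word exceeds the threshold."""
    expanded = {topic: set(seeds) for topic, seeds in topic_words.items()}
    for word, vec in vectors.items():
        for topic, seeds in topic_words.items():
            if any(s in vectors and cosine(vec, vectors[s]) > threshold
                   for s in seeds):
                expanded[topic].add(word)
    return expanded


def ambiguous_words(expanded):
    """Words that landed in more than one set and need manual judgment."""
    seen = {}
    for topic, words in expanded.items():
        for w in words:
            seen.setdefault(w, []).append(topic)
    return {w: sorted(t) for w, t in seen.items() if len(t) > 1}


# Toy vectors and seed words (hypothetical, for illustration only).
vectors = {
    "crash": (1.0, 0.0), "error": (0.9, 0.1),
    "feature": (0.6, 0.8), "support": (0.1, 0.9),
    "request": (0.9, 0.4),
}
topic_words = {"defect": ["crash"], "improvement": ["feature"]}
expanded = expand_topic_words(topic_words, vectors)
to_review = ambiguous_words(expanded)
```

Here "request" is close to both seed words, so it ends up in both sets and is reported by `ambiguous_words` for manual assignment.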

Further, the specific steps of step 3) are as follows: the topic word set obtained by LDA is fused with the user-defined topic word set, and similar topics are merged; the text is traversed, and if it contains no word from a topic's word set, that topic is skipped; otherwise the corresponding topic is marked; word vectors are trained, with Word2Vec mining word senses at the word level to give a fine-grained semantic representation of the text; the topic vector is kept at the same dimension as the word vector, and the two are spliced to form a new input vector, so that both word-sense features and topic-level semantic features are preserved; the data set is rebalanced using random over-sampling to optimize the training and classification effects of the model.
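One way to read the topic marking and vector splicing is sketched below. The text does not specify the exact splicing layout, so the per-token concatenation used here, the zero vector for tokens without a topic or word vector, and all names (`tag_topics`, `splice`, `DIM`) are assumptions for illustration.

```python
DIM = 3  # hypothetical size; the method only requires equal dimensions


def tag_topics(tokens, topic_words):
    """Return the topics whose word set shares at least one token with the text."""
    return [topic for topic, words in topic_words.items()
            if any(tok in words for tok in tokens)]


def splice(tokens, word_vecs, topic_vecs, topic_words):
    """Concatenate each token's word vector with the vector of its topic.

    Tokens without a word vector or without a topic get zeros (an assumption).
    """
    zero = [0.0] * DIM
    rows = []
    for tok in tokens:
        word_vec = word_vecs.get(tok, zero)
        topic = next((t for t, words in topic_words.items() if tok in words), None)
        topic_vec = topic_vecs[topic] if topic is not None else zero
        rows.append(list(word_vec) + list(topic_vec))
    return rows


# Toy inputs (hypothetical).
word_vecs = {"crash": [0.9, 0.1, 0.0], "button": [0.2, 0.5, 0.3]}
topic_vecs = {"defect": [1.0, 0.0, 0.0]}
topic_words = {"defect": {"crash", "error"}}
tokens = ["crash", "button", "unknown"]

marked = tag_topics(tokens, topic_words)
matrix = splice(tokens, word_vecs, topic_vecs, topic_words)
```

Each row of `matrix` has length `2 * DIM`, which is the shape a convolutional layer would then slide over.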

Further, the specific steps of step 4) are as follows: based on a convolutional neural network, the pre-trained word vectors and topic vectors are used as the embedding layer to obtain an embedding matrix; convolution kernels perform feature extraction on the input data and reduce the dimension of the features; the pooling layer extracts the most important features, reduces the parameters of the next layer, speeds up computation and mitigates overfitting to a certain extent; a fully connected layer outputs the probability of each category through softmax; 10-fold cross-validation is performed after model training.
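The forward pass through the layers named above (convolution, pooling, fully connected layer, softmax) can be sketched without any deep-learning framework. This is a minimal illustration, not the trained model: the weights are random, the kernel heights 2, 3 and 4 are an assumption, and ReLU with max-over-time pooling is a common TextCNN choice that the text does not spell out.

```python
import math
import random


def conv_max_pool(matrix, kernel, bias=0.0):
    """Slide an (h x d) kernel over an (n x d) matrix; ReLU, then keep the max."""
    h, d = len(kernel), len(kernel[0])
    best = 0.0  # starting at 0 makes max() act as ReLU followed by max pooling
    for i in range(len(matrix) - h + 1):
        s = bias + sum(kernel[r][c] * matrix[i + r][c]
                       for r in range(h) for c in range(d))
        best = max(best, s)
    return best


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [x / total for x in e]


def text_cnn_forward(matrix, kernels, fc_weights, fc_bias):
    """One pooled feature per kernel, then a dense layer and softmax."""
    features = [conv_max_pool(matrix, k) for k in kernels]
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(fc_weights, fc_bias)]
    return softmax(logits)


# Random toy input: 6 tokens, 8-dimensional spliced vectors, 3 labels.
rng = random.Random(0)
n_tokens, dim, n_classes = 6, 8, 3
matrix = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_tokens)]
kernels = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(h)]
           for h in (2, 3, 4)]
fc_weights = [[rng.uniform(-1, 1) for _ in kernels] for _ in range(n_classes)]
fc_bias = [0.0] * n_classes
probs = text_cnn_forward(matrix, kernels, fc_weights, fc_bias)
```

Pooling reduces each kernel's output to a single number regardless of text length, which is why the fully connected layer can have a fixed size.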

Based on the topic model and the convolutional neural network, the invention can complete automatic label classification from the basic information of an issue when no label is provided. The algorithm applies a topic model to the issue title and description information and fuses the topic and word vectors. Because bug and enhancement account for most issue labels while query labels are fewer, a data rebalancing method is applied to improve the performance of the convolutional neural network model.

Drawings

FIG. 1 is a general design flow diagram in an embodiment of the present invention;

FIG. 2 is a structural diagram of a TextCNN in an embodiment of the present invention;

Detailed Description of the Preferred Embodiments

The technical solution of the present invention is further explained with reference to the embodiments according to the drawings.

Examples

In the process of software development, developers create issues to track bugs and to hold software-related discussions, which makes management easier. A system for managing issues is called a bug tracking system (BTS); GitHub also provides this function, and it can serve as a communication tool among software developers. As shown in FIGS. 1-2, the issue label classification method based on a topic model and a convolutional neural network in this embodiment obtains issue text data through GitHub Archive and realizes automatic identification and classification of issue labels, comprising the following steps:

1) data collection: acquiring required issue data as a data set through a GitHub Archive;

2) data processing: cleaning the collected issue text;

3) LDA topic and word extraction: applying an LDA model to each issue text;

4) topic customization: defining custom topics and counting the high-frequency words under each topic;

5) topic fusion: fusing the topics and splicing the vectors;

6) data rebalancing: balancing the training set using the random over-sampling examples technique;

7) model training: identifying and classifying the issues with a convolutional neural network.

Step 1, data collection: executing a BigQuery query to acquire issue information from GitHub Archive; selecting the issues labeled bug, enhancement or query, and acquiring the corresponding title and description text. These labels account for the majority of issue labels, and the recognition performance of the final method is evaluated on them.
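The selection step can be illustrated in Python on event records that have already been exported. The embodiment itself uses BigQuery, so this sketch swaps in a local filter. It assumes newline-delimited JSON events whose issue data sits under `payload.issue` with `title`, `body` and `labels[].name` fields, as in GitHub's issue event payloads; keeping only issues with exactly one target label is an additional assumption.

```python
import json

TARGET_LABELS = {"bug", "enhancement", "query"}


def select_issues(lines, labels=TARGET_LABELS):
    """Keep title, body and label of issue events carrying one target label."""
    rows = []
    for line in lines:
        event = json.loads(line)
        if event.get("type") != "IssuesEvent":
            continue
        issue = event.get("payload", {}).get("issue", {})
        names = {lab.get("name", "").lower() for lab in issue.get("labels", [])}
        matched = names & labels
        if len(matched) == 1:  # skip unlabeled and multi-label issues (assumed)
            rows.append({
                "title": issue.get("title") or "",
                "body": issue.get("body") or "",  # body may be null
                "label": next(iter(matched)),
            })
    return rows


def _event(kind, title, label_names):
    """Build one sample event line (hypothetical data)."""
    issue = {"title": title, "body": None,
             "labels": [{"name": n} for n in label_names]}
    return json.dumps({"type": kind, "payload": {"issue": issue}})


sample = [
    _event("IssuesEvent", "Crash on start", ["bug"]),
    _event("PushEvent", "not an issue", ["bug"]),
    _event("IssuesEvent", "Two labels", ["bug", "enhancement"]),
    _event("IssuesEvent", "Other label", ["wontfix"]),
]
selected = select_issues(sample)
```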

Step 2, data processing: the issue text must be in English; the collected data contains Chinese, Korean and Japanese text, which is deleted to avoid language bias when classifying issues; code segments, links and emoticons in the issue text are deleted to reduce text noise; stop words, which carry no clear meaning in the text, are deleted; abbreviations are replaced by their full forms; word tokenization then divides each sentence into individual words.
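The language filter and stop-word removal might look like the sketch below. The ASCII-letter ratio is a crude heuristic assumed here for illustration: it screens out Chinese, Korean and Japanese text but would not separate English from other Latin-script languages. The stop-word list is a short hypothetical sample.

```python
# Hypothetical, deliberately short stop-word list.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "it", "on"}


def is_english(text, min_ascii_ratio=0.9):
    """Heuristic: treat text as English if most of its letters are ASCII."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    ascii_letters = sum(ch.isascii() for ch in letters)
    return ascii_letters / len(letters) >= min_ascii_ratio


def remove_stop_words(tokens):
    """Drop tokens that carry no clear meaning on their own."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


issues = ["The app crashes on start", "应用崩溃 when opening"]
english_only = [text for text in issues if is_english(text)]
kept = remove_stop_words(["the", "app", "is", "slow"])
```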

Step 3, topic and word extraction: training LDA on the data set to output the text-topic and topic-word matrices, and extracting the topics and the words under each topic; selecting the important parameters of the model, namely the prior α, which reflects the distribution of hidden topics in the text, the prior β, which reflects the distribution of words under each hidden topic, and the number of topics K, and checking whether the resulting topics are reasonable; constructing a word set under each topic according to the probability of each word.
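Building the word set of each topic from the topic-word matrix amounts to ranking words by probability. The sketch below assumes the matrix has already been produced by a trained LDA model; the probabilities and vocabulary are made-up toy values, and `top_n` is a free choice. For orientation, in scikit-learn's `LatentDirichletAllocation` the three parameters named above roughly correspond to `doc_topic_prior` (α), `topic_word_prior` (β) and `n_components` (K).

```python
def topic_word_sets(topic_word_probs, vocab, top_n=10):
    """For each topic row, return the top_n words by probability."""
    sets = []
    for row in topic_word_probs:
        ranked = sorted(range(len(vocab)), key=lambda j: row[j], reverse=True)
        sets.append([vocab[j] for j in ranked[:top_n]])
    return sets


# Toy topic-word matrix with K = 2 topics over a 5-word vocabulary.
vocab = ["crash", "error", "feature", "request", "docs"]
topic_word_probs = [
    [0.40, 0.30, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.50, 0.30, 0.10],
]
word_sets = topic_word_sets(topic_word_probs, vocab, top_n=2)
```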

Step 4, topic customization: considering the function of issues, and in order to better represent the topic of an issue, key topics are customized according to the text in the data set; word frequencies are counted, and high-frequency words are selected as the word set under each topic; Word2Vec is trained on the tokenized data set, and the cosine similarity between each word and the topic words is calculated; if the cosine similarity is greater than 0.75, the word is added to the word set under the corresponding topic; a word may appear in more than one set, in which case it is assigned by manual judgment.
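Selecting the high-frequency words for a user-defined topic is a plain frequency count. The sketch assumes that the tokenized texts belonging to the custom topic have already been picked out by hand; how that grouping is done is not specified here, and the sample texts are hypothetical.

```python
from collections import Counter


def top_words_for_topic(docs, top_n=10):
    """Return the top_n most frequent tokens across the topic's documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [word for word, _ in counts.most_common(top_n)]


# Tokenized texts manually grouped under one custom topic (toy data).
defect_docs = [
    ["crash", "error", "crash"],
    ["crash", "error", "log"],
]
seed_words = top_words_for_topic(defect_docs, top_n=2)
```

The resulting seed words are the ones that a similarity threshold such as 0.75 would then be measured against.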

Step 5, topic fusion: the topic word set obtained by LDA is fused with the user-defined topic word set; since related topics may have similar semantics, such topics are merged, and the words under them are merged as well; the text is traversed, and if it contains no word from a topic's word set, that topic is skipped; otherwise the corresponding topic is marked; word vectors are trained, with Word2Vec mining word senses at the word level to give a fine-grained semantic representation of the text; topic vectors are trained, the topic distribution matrix is output after the model is trained, and its vector dimension is kept consistent with that of the word vector; the word vector and the topic matrix are spliced to form a new input vector, so that both word-sense features and topic-level semantic features are preserved.

Step 6, data rebalancing: in the issue data set, bug and enhancement labels are more numerous while query labels are fewer, so the random over-sampling examples technique is applied to rebalance the training set before the model is trained, optimizing the training effect.
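The rebalancing idea can be illustrated with the simplest form of random over-sampling, which duplicates randomly chosen minority samples until every class matches the largest one. This is a simplified stand-in: it generates no new synthetic examples, so it should not be taken as the exact technique used in the embodiment. The function name and sample data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority samples until all classes are equally large."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in zip(samples, labels):
        by_label.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_label.values())
    out_x, out_y = [], []
    for y, xs in by_label.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y


# Imbalanced toy training set: 6 bug, 3 enhancement, 1 query.
train_y = ["bug"] * 6 + ["enhancement"] * 3 + ["query"]
train_x = [f"issue-{i}" for i in range(len(train_y))]
bal_x, bal_y = random_oversample(train_x, train_y)
balanced_counts = Counter(bal_y)
```

Only the training set should be resampled; over-sampling before the train/test split would leak duplicated samples into the evaluation data.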

Step 7, model training: a convolutional neural network is used to train the model, with the pre-trained word vectors serving as the embedding layer to obtain an embedding matrix; convolution kernels perform feature extraction on the input data and reduce the dimension of the features; the pooling layer extracts the most important features, reduces the parameters of the next layer, speeds up computation and mitigates overfitting to a certain extent; a fully connected layer outputs the probability of each category through softmax; the model is cross-validated after training, using Precision, Recall and F-measure as the evaluation metrics.
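The evaluation can be sketched as a fold splitter plus per-label metrics. Both helpers are illustrative assumptions: the folds are assigned round-robin without shuffling or stratification, and the F-measure is computed as the balanced F1 score, since the text does not state a weighting.

```python
def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for k folds, assigned round-robin."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def precision_recall_f1(y_true, y_pred, label):
    """Precision, recall and F1 for one label, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy predictions (hypothetical).
y_true = ["bug", "bug", "enhancement", "query"]
y_pred = ["bug", "enhancement", "enhancement", "bug"]
bug_scores = precision_recall_f1(y_true, y_pred, "bug")
splits = list(k_fold_indices(20, k=10))
```

In a full run, the model would be retrained on each `train` split and scored on the matching `test` split, and the per-fold metrics averaged.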

The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims

1. An issue label classification method based on a topic model and a convolutional neural network, characterized in that: the issue data set is first processed, and an LDA model then extracts the topics of the text and the word set under each topic; topics are then user-defined and the word sets under the corresponding topics are counted; vector splicing is completed after a topic word dictionary is constructed; finally, the training set is balanced using the random over-sampling examples technique, and a convolutional neural network model is trained to classify the issues.

2. The issue label classification method based on the topic model and the convolutional neural network as described in claim 1, which comprises the following steps:

3. The issue label classification method based on the topic model and the convolutional neural network as claimed in claim 2, wherein in step 1), texts in other languages contained in the issue data set are deleted, so that only pure English text remains; links, code segments and emoticons in the title and description of the issue are deleted; abbreviations are then expanded, which makes topic-related words easier to identify; finally, word tokenization divides each sentence into individual words.

4. The issue label classification method based on the topic model and the convolutional neural network as claimed in claim 2, wherein in step 2), the processed text is input into the LDA model, and the topics and the words under each topic are extracted; the parameters of the model are selected, namely the prior α, which reflects the distribution of hidden topics in the text, the prior β, which reflects the distribution of words under each hidden topic, and the number of topics K, and the resulting topics are checked for reasonableness; key topics are user-defined, high-frequency words are counted, and 10 high-frequency words are selected as the word set under each topic; Word2Vec is trained on the tokenized data set, and the cosine similarity between each word and the topic words is calculated; if the cosine similarity is greater than 0.75, the word is added to the word set under the corresponding topic; a word that appears in more than one set is assigned by manual judgment.

5. The issue label classification method based on the topic model and the convolutional neural network as claimed in claim 2, wherein in step 3), the topic word set obtained by LDA and the user-defined topic word set are fused, and topics with similar semantics are merged; the text is traversed, and if it contains no word from a topic's word set, that topic is skipped; otherwise the corresponding topic is marked; word vectors are trained, with Word2Vec mining word senses at the word level to give a fine-grained semantic representation of the text; the topic vector is kept at the same dimension as the word vector, and the two are spliced to form a new input vector, so that both word-sense features and topic-level semantic features are preserved; the data set is rebalanced using random over-sampling to optimize the training and classification effects of the model.

6. The issue label classification method based on the topic model and the convolutional neural network as claimed in claim 2, wherein in step 4), based on the convolutional neural network, pre-trained word vectors and topic vectors are used as the embedding layer to obtain an embedding matrix; convolution kernels perform feature extraction on the input data and reduce the dimension of the features; the pooling layer extracts the most important features, reduces the parameters of the next layer, speeds up computation and mitigates overfitting to a certain extent; a fully connected layer outputs the probability of each category through softmax; 10-fold cross-validation is performed after model training.