CN114239576A - Issue label classification method based on topic model and convolutional neural network - Google Patents

Issue label classification method based on topic model and convolutional neural network

Info

Publication number
CN114239576A
CN114239576A CN202111566439.7A CN202111566439A CN114239576A CN 114239576 A CN114239576 A CN 114239576A CN 202111566439 A CN202111566439 A CN 202111566439A CN 114239576 A CN114239576 A CN 114239576A
Authority
CN
China
Prior art keywords
issue
word
topic
model
theme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111566439.7A
Other languages
Chinese (zh)
Inventor
张卫丰
徐俊辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-12-20
Filing date
2021-12-20
Publication date
2022-03-25
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202111566439.7A
Publication of CN114239576A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an issue label classification method based on a topic model and a convolutional neural network, comprising the following steps: 1) data collection: acquiring the required issue data as a data set through GitHub Archive; 2) data processing: cleaning the collected issue text; 3) LDA topic and word extraction: processing each issue text with an LDA model; 4) topic customization: defining custom topics and counting the high-frequency words under each topic; 5) topic fusion: combining the LDA topics with the custom topics to construct a topic word dictionary; 6) vector splicing: concatenating the word vector and the topic vector; 7) data rebalancing: balancing the training set by applying a data rebalancing technique; 8) model training: using a convolutional neural network to identify and classify issues. The method thereby realizes automatic classification and identification of issue labels.

Description

Issue label classification method based on topic model and convolutional neural network
Technical Field
The invention belongs to the field of development and maintenance of software engineering, and particularly relates to an issue label classification method based on a topic model and a convolutional neural network.
Background
GitHub is currently one of the most popular platforms for collaborative project development, communication and sharing. It helps developers coordinate development using Wiki and git, thereby improving working efficiency. To date, GitHub hosts more than 12 million open source projects, and this number is still growing.
Maintenance is a vital task throughout the life cycle of a software project. On the one hand, the source code should be kept up to date, and any potential deficiencies in performance and correctness should be eliminated. On the other hand, maintenance personnel must spend as little time and effort as possible on these tasks in order to keep the cost of software maintenance low. The issue tracking system is an important means for maintenance personnel to carry out software evolution tasks rigorously and efficiently. In an issue tracking system, maintenance personnel report actual or potential problems as tickets, manage them and track their progress.
GitHub provides an integrated lightweight issue tracking system: a submitter need only provide a short text summary (a title and an optional description) to report a new problem to a project hosted on GitHub. This simplified approach lowers the barrier to participation and attracts more inexperienced external contributors, but it complicates the development team's task of maintaining the software. To address this, GitHub provides a customizable labeling system that developers can use to label and manage issue reports. Labels give immediate clues about a problem, but in actual team development many issues are often pending at the same time, and manually assigning labels is labor-intensive and time-consuming. In fact, the labeling mechanism on GitHub is not fully utilized. There is therefore a need for a method that can classify issues automatically based on their title and description.
The invention realizes label classification of issues based on a topic model and a convolutional neural network: after a new issue is submitted, its label is predicted automatically, so that the maintainers of the project can learn the topic of the issue in time, which greatly saves their working time and improves working efficiency.
Disclosure of Invention
In order to solve the above problems, the present invention provides an issue label classification method based on a topic model and a convolutional neural network, which is characterized in that:
1) cleaning the issue data set, and extracting and processing it into a data set that meets the requirements;
2) based on an LDA topic model, expressing the topics of the issue title and description as probability distributions to obtain the topics and the word set under each topic, and fusing these with the word sets under the user-defined topics to form the final topic word library;
3) fusing the topic vector and the word vector to form the final input vector, and rebalancing the data set with a random over-sampling algorithm to optimize the input, thereby improving the classification effect of the model;
4) based on a convolutional neural network, taking the fused vector as input, extracting features in the convolutional layer, reducing dimensionality in the pooling layer while keeping the main features, outputting the probability of each category through softmax, and verifying the label classification effect by 10-fold cross-validation after model training.
Step 1) specifically comprises: the issue data set contains texts in other languages, which need to be deleted so that only pure English text remains; links, code segments and emoticons in the title and description of each issue are deleted; abbreviations are then expanded to facilitate the identification of topic-related words; finally, tokenization divides each sentence into individual words.
Step 2) specifically comprises: the processed text is input into the LDA model, and the topics and the words under each topic are extracted; the model parameters are selected, namely the prior distribution alpha reflecting the latent topics in the text, the prior distribution beta reflecting the words under each latent topic, and the number of topics K, and the rationality of the resulting topics is checked; key topics are user-defined, high-frequency words are counted, and 10 high-frequency words are selected as the word set under each topic; Word2Vec is trained on the tokenized data set, and the cosine similarity between each word and the topic words is calculated; if the cosine similarity is greater than 0.75, the word is added to the word set under the corresponding topic, and words that appear in more than one set are assigned by manual judgment.
Step 3) specifically comprises: the topic word set obtained by LDA is fused with the user-defined topic word set, and similar topics are merged; the text is traversed, and a text that contains no word from a topic's word set is skipped for that topic, while a text that does is marked with the corresponding topic; word vectors are trained, with Word2Vec mining word senses at word granularity to give a fine-grained semantic representation of the text; the topic vector is kept at the same dimension as the word vector, and the two are concatenated to form a new input vector, so that both word-sense features and semantic features are preserved; the data set is rebalanced with a random over-sampling algorithm to optimize the training and classification effect of the model.
Step 4) specifically comprises: based on a convolutional neural network, pre-trained word vectors and topic vectors are used as the embedding layer to obtain an embedding matrix; convolution kernels extract features from the input data and reduce the dimensionality of the features; the pooling layer extracts the most important features, reduces the parameters of the next layer, speeds up computation and mitigates overfitting to a certain extent; a fully connected layer outputs the probability of each category through softmax; 10-fold cross-validation is performed after model training.
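The convolution, pooling and softmax stages described above can be illustrated with a minimal pure-Python forward pass. This is only a sketch under stated assumptions: the ReLU activation, the kernel heights, all weight values and the function names (`conv1d`, `text_cnn_forward`) are illustrative and are not specified in the text, and the input is assumed to have at least as many rows as the tallest kernel.

```python
import math

def conv1d(matrix, kernel, bias=0.0):
    """Slide a kernel of shape (h, d) over an (n, d) matrix; apply ReLU (assumed) per window."""
    h = len(kernel)
    features = []
    for i in range(len(matrix) - h + 1):
        s = bias
        for r in range(h):
            s += sum(x * w for x, w in zip(matrix[i + r], kernel[r]))
        features.append(max(0.0, s))
    return features

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def text_cnn_forward(matrix, kernels, dense_weights, dense_bias):
    """Convolution, max-over-time pooling, fully connected layer, softmax."""
    pooled = [max(conv1d(matrix, k)) for k in kernels]  # one pooled value per kernel
    logits = [sum(p * w for p, w in zip(pooled, row)) + b
              for row, b in zip(dense_weights, dense_bias)]
    return softmax(logits)

# Toy input: 4 tokens, each a 2-dimensional embedding (illustrative values only).
embedding_matrix = [[0.1, 0.3], [0.5, -0.2], [0.0, 0.4], [0.7, 0.1]]
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],              # kernel spanning 2 tokens
    [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],  # kernel spanning 3 tokens
]
dense_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # 3 output categories
dense_bias = [0.0, 0.0, 0.0]

probs = text_cnn_forward(embedding_matrix, kernels, dense_weights, dense_bias)
print(probs)
```

Max-over-time pooling keeps one value per kernel regardless of the text length, which is what lets the fully connected layer have a fixed number of inputs.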
With the topic model and the convolutional neural network, label classification can be completed automatically from the basic information of an issue even when no label is provided. The algorithm combines the issue title and description, applies a topic model, and fuses the topic and word vectors; considering that bug and enhancement account for most issue labels while query labels are fewer, it applies a data rebalancing method to improve the performance of the convolutional neural network model.
Drawings
FIG. 1 is a general design flow diagram in an embodiment of the present invention;
FIG. 2 is a structural diagram of a TextCNN in an embodiment of the present invention;
Detailed Description of the Preferred Embodiments
The technical solution of the present invention is further explained with reference to the embodiments according to the drawings.
Examples
During software development, developers create issues to track bugs and hold software-related discussions, which facilitates management. A system for managing issues is called a bug tracking system (BTS); GitHub also provides this function, which can serve as a communication tool among software developers. As shown in FIGS. 1 and 2, the issue label classification method based on a topic model and a convolutional neural network in this embodiment obtains issue text data through GitHub Archive and realizes automatic identification and classification of issue labels, including the following steps:
1) data collection: acquiring required issue data as a data set through a GitHub Archive;
2) data processing: cleaning the collected issue text;
3) LDA extracts topics and words: carrying out LDA model processing on each issue text;
4) topic customization: defining custom topics and counting the high-frequency words under each topic;
5) topic fusion: fusing the topics and splicing the vectors;
6) data rebalancing: balancing the training set using the random over-sampling examples technique;
7) model training: identifying and classifying issues using a convolutional neural network.
Step 1, data collection: BigQuery is executed against GitHub Archive to obtain issue information; issues labeled bug, enhancement or query are selected, and the corresponding title and description text is acquired. These labels account for the majority of issue labels, and the identification effect of the final method is evaluated on them.
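The label selection in this step can be sketched as a simple filter over already-downloaded issue records. The record layout (`title`, `body`, `labels` keys) is hypothetical, since the text does not give the GitHub Archive schema or the query itself, and the rule of keeping only issues with exactly one of the three target labels is an assumption made to keep the example unambiguous.

```python
# Target labels as named in the text; the record layout below is hypothetical.
TARGET_LABELS = {"bug", "enhancement", "query"}

def select_labeled_issues(records):
    """Return (title, description, label) tuples for issues with exactly one target label."""
    selected = []
    for rec in records:
        labels = {name.lower() for name in rec.get("labels", [])}
        hits = labels & TARGET_LABELS
        if len(hits) != 1:
            continue  # skip unlabeled issues and issues with several target labels (assumption)
        selected.append((rec.get("title", ""), rec.get("body") or "", hits.pop()))
    return selected

sample = [
    {"title": "Crash on start", "body": "Stack trace attached", "labels": ["Bug"]},
    {"title": "Add dark mode", "body": None, "labels": ["enhancement", "ui"]},
    {"title": "Docs typo", "body": "", "labels": ["documentation"]},
]
print(select_labeled_issues(sample))
```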
Step 2, data processing: the issue text must be English; the collected data contains Chinese, Korean and Japanese text, which is deleted to avoid language bias when classifying issues; code segments, links and emoticons in the issue text are deleted to reduce text noise; stop words that carry no clear meaning are deleted; abbreviations are replaced by their full forms; tokenization divides each sentence into individual words.
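A minimal sketch of this cleaning step, using only the standard library, might look as follows. The regular expressions, the non-ASCII ratio used as an English check, and the tiny abbreviation and stop-word lists are all illustrative assumptions; the text does not say which rules or word lists were used.

```python
import re

CODE_FENCE = re.compile(r"```.*?```", re.DOTALL)   # fenced code blocks
INLINE_CODE = re.compile(r"`[^`]*`")               # inline code spans
URL = re.compile(r"https?://\S+")                  # links
EMOJI_SHORTCODE = re.compile(r":[a-z0-9_+-]+:")    # emoticons written as :name:
TOKEN = re.compile(r"[a-z]+")                      # lowercase alphabetic tokens

# Tiny illustrative lists; a real run would use fuller resources.
ABBREVIATIONS = {"can't": "cannot", "doesn't": "does not", "it's": "it is"}
STOP_WORDS = {"the", "a", "an", "is", "it", "on", "to", "and", "of", "this", "when", "i"}

def is_english(text, max_non_ascii_ratio=0.1):
    """Crude language check (assumption): reject text with many non-ASCII characters."""
    if not text:
        return False
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    return non_ascii / len(text) <= max_non_ascii_ratio

def clean_issue(text):
    """Return a list of tokens, or None when the text is not considered English."""
    if not is_english(text):
        return None
    for pattern in (CODE_FENCE, INLINE_CODE, URL, EMOJI_SHORTCODE):
        text = pattern.sub(" ", text)
    text = text.lower()
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)  # expand abbreviations before tokenizing
    return [tok for tok in TOKEN.findall(text) if tok not in STOP_WORDS]

print(clean_issue("App doesn't start :cry: see https://example.com/log and `main()`"))
```

Noise is removed before lowercasing and tokenization so that fragments of URLs or code never reach the token list.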
Step 3, topic and word extraction: LDA is trained on the data set and outputs the document-topic and topic-word matrices, from which the topics and the words under each topic are extracted; the important model parameters are selected, namely the prior distribution alpha reflecting the latent topics in the text, the prior distribution beta reflecting the words under each latent topic, and the number of topics K, and the rationality of the resulting topics is checked; a word set is constructed under each topic according to the probability of each word.
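The last part of this step, building a word set per topic from word probabilities, can be sketched without any LDA library. The sketch assumes a topic-word probability matrix has already been produced by some LDA implementation; the matrix values, the vocabulary and the cut-off of three words per topic are illustrative only.

```python
def top_words_per_topic(topic_word_probs, vocabulary, n_words=3):
    """Map each topic index to its n_words most probable words."""
    topic_sets = {}
    for k, row in enumerate(topic_word_probs):
        # Rank vocabulary indices by descending probability under topic k.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        topic_sets[k] = [vocabulary[j] for j in ranked[:n_words]]
    return topic_sets

# Hypothetical LDA output for K = 2 topics over a 6-word vocabulary.
vocabulary = ["crash", "error", "button", "feature", "support", "install"]
topic_word_probs = [
    [0.40, 0.30, 0.05, 0.05, 0.05, 0.15],
    [0.05, 0.05, 0.30, 0.35, 0.20, 0.05],
]

print(top_words_per_topic(topic_word_probs, vocabulary))
```

Inspecting these word lists for different values of K, alpha and beta is one simple way to judge whether the resulting topics are reasonable.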
Step 4, topic customization: considering the function of issues, key topics are user-defined according to the text in the data set in order to better represent the subject of each issue; word frequencies are counted, and high-frequency words are selected as the word set under each topic; Word2Vec is trained on the tokenized data set, and the cosine similarity between each word and the topic words is calculated; if the cosine similarity is greater than 0.75, the word is added to the word set under the corresponding topic; some words may appear in more than one set, in which case they are assigned by manual judgment.
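The similarity-based expansion can be sketched as follows. The 0.75 threshold comes from the text; the two-dimensional embeddings stand in for trained Word2Vec vectors and are invented for illustration, as are the function names.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_topic_words(seed_words, embeddings, threshold=0.75):
    """Add every word whose similarity to any seed word exceeds the threshold."""
    seed_vecs = [embeddings[s] for s in seed_words if s in embeddings]
    expanded = set(seed_words)
    for word, vec in embeddings.items():
        if word not in expanded and any(cosine(vec, sv) > threshold for sv in seed_vecs):
            expanded.add(word)
    return expanded

def ambiguous_words(topic_sets):
    """Words that occur in more than one topic set and need manual judgment."""
    counts = {}
    for words in topic_sets.values():
        for w in set(words):
            counts[w] = counts.get(w, 0) + 1
    return {w for w, c in counts.items() if c > 1}

# Toy embeddings standing in for Word2Vec output (illustrative values only).
embeddings = {
    "crash": [1.0, 0.0],
    "freeze": [0.9, 0.1],
    "feature": [0.0, 1.0],
    "request": [0.2, 0.9],
}

print(expand_topic_words(["crash"], embeddings))
print(ambiguous_words({"bug": {"crash", "error"}, "install": {"error", "setup"}}))
```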
Step 5, topic fusion: the topic word set obtained by LDA is fused with the user-defined topic word set; since related topics may have similar semantics, such topics are merged, and the words under them are merged as well; the text is traversed, and a text that contains no word from a topic's word set is skipped for that topic, while a text that does is marked with the corresponding topic; word vectors are trained, with Word2Vec mining word senses at word granularity to give a fine-grained semantic representation of the text; topic vectors are trained: after training, the model outputs the topic distribution matrix, and the vector dimension is kept consistent with the word vector; the word vector and the topic matrix are concatenated to form a new input vector, so that both word-sense features and semantic features are preserved.
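One possible reading of the splicing step is sketched below: each token's word vector is concatenated with the vector of the topic whose word set contains that token, and a zero vector is used when no topic matches. The text does not fix these details, so the per-token concatenation, the zero-vector fallback and all names and values here are assumptions.

```python
def splice(tokens, word_vectors, topic_vectors, topic_words, dim):
    """Concatenate each token's word vector with its topic vector (zeros if none)."""
    zero = [0.0] * dim
    # Invert topic -> words into word -> topic for quick lookup.
    word_to_topic = {w: t for t, words in topic_words.items() for w in words}
    rows = []
    for tok in tokens:
        word_vec = word_vectors.get(tok, zero)
        topic = word_to_topic.get(tok)
        topic_vec = topic_vectors[topic] if topic is not None else zero
        rows.append(list(word_vec) + list(topic_vec))  # row length is 2 * dim
    return rows

# Illustrative values; real vectors would come from Word2Vec and the topic model.
word_vectors = {"crash": [0.1, 0.2], "button": [0.3, 0.4]}
topic_vectors = {"bug": [1.0, 0.0]}
topic_words = {"bug": {"crash"}}

matrix = splice(["crash", "button", "unknown"], word_vectors, topic_vectors, topic_words, dim=2)
print(matrix)
```

Keeping the topic vector at the same dimension as the word vector, as the text requires, gives every row the same length, so the result can be fed to the network as a regular matrix.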
Step 6, data rebalancing: in the issue data set, bug and enhancement labels are more numerous and query labels are fewer; the random over-sampling examples technique is applied to rebalance the training set before the model is trained, which optimizes the training effect.
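The idea of rebalancing can be illustrated with plain random over-sampling, where minority-class samples are duplicated at random until every class matches the largest one. This is a simpler stand-in chosen for illustration and is not necessarily the exact technique named in the text; the function name and the fixed seed are assumptions.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes are equally sized."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in zip(samples, labels):
        by_label.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_label.values())
    out_x, out_y = [], []
    for y, xs in by_label.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]  # sampled with replacement
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

# Imbalanced toy training set: 5 bug, 3 enhancement, 1 query.
train_labels = ["bug"] * 5 + ["enhancement"] * 3 + ["query"] * 1
train_samples = list(range(len(train_labels)))

balanced_x, balanced_y = random_oversample(train_samples, train_labels)
print(Counter(balanced_y))
```

As the text states, rebalancing is applied to the training set only, so that evaluation data keeps its original class distribution.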
Step 7, model training: a convolutional neural network is trained, with pre-trained word vectors used as the embedding layer to obtain an embedding matrix; convolution kernels extract features from the input data and reduce the dimensionality of the features; the pooling layer extracts the most important features, reduces the parameters of the next layer, speeds up computation and mitigates overfitting to a certain extent; a fully connected layer outputs the probability of each category through softmax; cross-validation is performed after training, with Precision, Recall and F-measure as the evaluation metrics.
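The evaluation part of this step can be sketched with a fold splitter and per-class metrics. The sketch assumes k = 10 folds, as stated elsewhere in the document, takes F-measure to mean the balanced F1 score, and omits shuffling for brevity; the function names and the toy predictions are illustrative.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed one class at a time."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores

folds = list(k_fold_indices(25, k=10))
y_true = ["bug", "bug", "enhancement", "query"]
y_pred = ["bug", "enhancement", "enhancement", "query"]
scores = per_class_scores(y_true, y_pred)
print(len(folds), scores)
```

In a full run, the model would be retrained on each fold's training indices and the per-class scores averaged over the ten test folds.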
The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as falling within the protection scope of the present invention.

Claims (6)

1. An issue label classification method based on a topic model and a convolutional neural network, characterized in that: the issue data set is processed, and the topics and the word set under each topic are then extracted from the text by an LDA model; topics are then user-defined and the word sets under the corresponding topics are counted; a topic word dictionary is then constructed and vector splicing is completed; finally, the training set is balanced using the random over-sampling examples technique, and a convolutional neural network model is trained to classify issues.
2. The issue label classification method based on the topic model and the convolutional neural network as claimed in claim 1, comprising the following steps:
1) cleaning the issue data set, and extracting and processing it into a data set that meets the requirements;
2) based on an LDA topic model, expressing the topics of the issue title and description as probability distributions to obtain the topics and the word set under each topic, and fusing these with the word sets under the user-defined topics to form the final topic word library;
3) fusing the topic vector and the word vector to form the final input vector, and rebalancing the data set with a random over-sampling algorithm to optimize the input, thereby improving the classification effect of the model;
4) based on a convolutional neural network, taking the fused vector as input, extracting features in the convolutional layer, reducing dimensionality in the pooling layer while keeping the main features, outputting the probability of each category through softmax, and verifying the label classification effect by 10-fold cross-validation after model training.
3. The issue label classification method based on the topic model and the convolutional neural network as claimed in claim 2, wherein in step 1), the issue data set contains texts in other languages, which need to be deleted so that only pure English text remains; links, code segments and emoticons in the title and description of each issue are deleted; abbreviations are then expanded to facilitate the identification of topic-related words; finally, tokenization divides each sentence into individual words.
4. The issue label classification method based on the topic model and the convolutional neural network as claimed in claim 2, wherein in step 2), the processed text is input into the LDA model, and the topics and the words under each topic are extracted; the model parameters are selected, namely the prior distribution alpha reflecting the latent topics in the text, the prior distribution beta reflecting the words under each latent topic, and the number of topics K, and the rationality of the resulting topics is checked; key topics are user-defined, high-frequency words are counted, and 10 high-frequency words are selected as the word set under each topic; Word2Vec is trained on the tokenized data set, and the cosine similarity between each word and the topic words is calculated; if the cosine similarity is greater than 0.75, the word is added to the word set under the corresponding topic, and words that appear in more than one set are assigned by manual judgment.
5. The issue label classification method based on the topic model and the convolutional neural network as claimed in claim 2, wherein in step 3), the topic word set obtained by LDA is fused with the user-defined topic word set, and topics with similar semantics are merged; the text is traversed, and a text that contains no word from a topic's word set is skipped for that topic, while a text that does is marked with the corresponding topic; word vectors are trained, with Word2Vec mining word senses at word granularity to give a fine-grained semantic representation of the text; the topic vector is kept at the same dimension as the word vector, and the two are concatenated to form a new input vector, so that both word-sense features and semantic features are preserved; the data set is rebalanced with a random over-sampling algorithm to optimize the training and classification effect of the model.
6. The issue label classification method based on the topic model and the convolutional neural network as claimed in claim 2, wherein in step 4), based on the convolutional neural network, pre-trained word vectors and topic vectors are used as the embedding layer to obtain an embedding matrix; convolution kernels extract features from the input data and reduce the dimensionality of the features; the pooling layer extracts the most important features, reduces the parameters of the next layer, speeds up computation and mitigates overfitting to a certain extent; a fully connected layer outputs the probability of each category through softmax; 10-fold cross-validation is performed after model training.
CN202111566439.7A 2021-12-20 2021-12-20 Issue label classification method based on topic model and convolutional neural network Pending CN114239576A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111566439.7A CN114239576A (en) 2021-12-20 2021-12-20 Issue label classification method based on topic model and convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111566439.7A CN114239576A (en) 2021-12-20 2021-12-20 Issue label classification method based on topic model and convolutional neural network

Publications (1)

Publication Number Publication Date
CN114239576A true CN114239576A (en) 2022-03-25

Family

ID=80759871

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111566439.7A Pending CN114239576A (en) 2021-12-20 2021-12-20 Issue label classification method based on topic model and convolutional neural network

Country Status (1)

Country Link
CN (1) CN114239576A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117493619A (en) * 2023-12-29 2024-02-02 安徽思高智能科技有限公司 Event graph-based method and system for predicting closing time of issues
CN117493619B (en) * 2023-12-29 2024-03-26 安徽思高智能科技有限公司 Event graph-based method and system for predicting closing time of issues

Similar Documents

Publication Publication Date Title
CN110825882B (en) Knowledge graph-based information system management method
US11113477B2 (en) Visualizing comment sentiment
CN112270379A (en) Training method of classification model, sample classification method, device and equipment
CN113807098A (en) Model training method and device, electronic equipment and storage medium
WO2022226716A1 (en) Deep learning-based java program internal annotation generation method and system
CN113609838B (en) Document information extraction and mapping method and system
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN113255321A (en) Financial field chapter-level event extraction method based on article entity word dependency relationship
CN113934909A (en) Financial event extraction method based on pre-training language and deep learning model
Fazayeli et al. Towards auto-labelling issue reports for pull-based software development using text mining approach
CN112328475A (en) Defect positioning method for multiple suspicious code files
CN114818718A (en) Contract text recognition method and device
CN114356924A (en) Method and apparatus for extracting data from structured documents
CN114239576A (en) Issue label classification method based on topic model and convolutional neural network
CN112181814B (en) Multi-label marking method for defect report
CN112015866B (en) Method, device, electronic equipment and storage medium for generating synonymous text
Pittaras et al. A taxonomic system for failure cause analysis of open source AI incidents
CN113806538B (en) Label extraction model training method, device, equipment and storage medium
KR20220068937A (en) Standard Industrial Classification Based on Machine Learning Approach
CN114443803A (en) Text information mining method and device, electronic equipment and storage medium
CN112748951B (en) XGboost-based self-acceptance technology debt multi-classification method
CA3088692C (en) Visualizing comment sentiment
CN117874261B (en) Question-answer type event extraction method based on course learning and related equipment
CN117852553B (en) Language processing system for extracting component transaction scene information based on chat record
US20220253728A1 (en) Method and System for Determining and Reclassifying Valuable Words

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination