CN111797198A

CN111797198A - Method for recognizing bad taste discussion of software architecture from text

Info

Publication number: CN111797198A
Application number: CN202010539516.9A
Authority: CN
Inventors: 梁鹏; 鲁帆; 田方超; 李雪莹
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2020-06-14
Filing date: 2020-06-14
Publication date: 2020-10-20

Abstract

The invention discloses a method for identifying discussions of software architecture bad taste from text, comprising the following steps: 1) crawling the text of question-and-answer posts from a professional software development question-and-answer community and constructing a data set for identifying discussions of software architecture bad taste; 2) preprocessing the text in the data set to obtain simplified text content; 3) extracting text features from the text of step 2) using natural language processing techniques to obtain a feature vector data set; 4) after the features of each text are obtained, training binary classifiers using a training set; 5) using the trained classifiers to predict the documents in the test set to obtain classification results, and evaluating how well each classifier identifies software architecture bad taste; 6) comparing the results and analyzing the optimal combination of feature extraction technique and classifier. The invention provides an automated method for identifying discussions of software architecture bad taste that can quickly obtain the optimal combination of feature extraction technique and classification model for a given configuration.

Description

Method for recognizing bad taste discussion of software architecture from text

Technical Field

The invention relates to the technical field of software engineering, in particular to a method for recognizing bad taste discussion of a software architecture from texts.

Background

In an era in which software defines everything, the complexity of software systems keeps increasing. Because software development costs are rising and existing software architectures are gradually maturing, developers tend to extend and adapt existing systems to meet new requirements rather than build completely new ones. Developers are therefore also required to maintain and upgrade software applications over the long term. Throughout the software life cycle, the code undergoes evolutionary modification. During this evolution, the software may develop "bad tastes" (commonly called "bad smells") that have a significant negative impact on subsequent evolution, and developers need to correct the bad tastes found in the system in order to maintain it. By granularity, bad tastes fall into three categories: architecture bad taste, design bad taste and code bad taste. All three damage software quality in different ways. Architecture bad taste is a higher-level design problem that continuously and cumulatively harms system maintenance, and refactoring away architecture bad taste is more time-consuming and laborious than refactoring away code bad taste or design bad taste. Researchers therefore need to discuss and identify the various types of bad taste. Developers and researchers typically study software architecture bad taste by consulting documents, books or online resources; even when relevant examples are found, their quality constrains research progress. The shortage of research and of example cases makes software architecture bad taste difficult to study.
Therefore, the way in which examples of software architecture bad taste are acquired and identified needs to be improved: examples relevant to software architecture bad taste must be distinguished from irrelevant ones in search results, so that developers can quickly obtain research cases and advance related research.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method for identifying discussions of software architecture bad taste from text, which addresses the problem of identifying content on a specific subject in text. The method analyzes the text of question-and-answer posts from a professional software development question-and-answer community and uses automatic classification to divide the posts into those relevant to software architecture bad taste and those irrelevant to it, thereby providing example discussions of software architecture bad taste.

The technical scheme adopted by the invention for solving the technical problems is as follows: a method of identifying a bad taste discussion of a software architecture from text, comprising the steps of:

1) crawling the text of question-and-answer posts from a professional software development question-and-answer community, manually labeling the posts as relevant or irrelevant to software architecture bad taste, and using the labeled posts as the training set and the test set, thereby constructing a data set for identifying discussions of software architecture bad taste;

2) preprocessing the text in the data set to obtain simplified text content;

3) extracting text features from the text of step 2) using natural language processing techniques to obtain processed feature vector data sets, which comprise: a BoW feature vector data set, a TF-IDF feature vector data set, and a Word2Vec feature vector data set;

4) after the features of each text are obtained, dividing the data set obtained from the processing of step 3) into a training set and a test set, and training binary classifiers using the training set;

step 4) specifically comprises the following:

training an LR classifier, an RF classifier, an SVM classifier and a KNN classifier on each of the three feature data sets obtained in step 3), yielding classifiers for the 3 × 4 = 12 combinations of feature extraction technique and classification model; and predicting the documents in the test set using the trained classifiers to obtain classification results;

5) using the trained classifiers to predict the documents in the test set to obtain classification results, and evaluating how well each classifier identifies question-and-answer posts about software architecture bad taste, wherein the performance evaluation uses the following four metrics: Accuracy, Precision, Recall, and F1-score;

6) comparing the results, analyzing them to obtain the optimal combination of feature extraction technique and classifier, and performing identification using the classification model of the final combination.

According to the scheme, the step 1) specifically comprises the following steps:

step 1.1) crawling text data: first, searching a software development question-and-answer community for question-and-answer posts using keywords related to software architecture bad taste, extracting from the search results all posts relevant to software architecture bad taste, and recording their URL links; then randomly selecting a similar number of posts from the irrelevant posts screened out of the search results, and recording their URL links, thereby forming a balanced data set;

step 1.2) using the URL links to crawl the title, question and answers of each post, manually labeling each post as relevant or irrelevant to software architecture bad taste, and storing the posts in a CSV file for use in the subsequent steps.
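The balancing and storage described in steps 1.1) and 1.2) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation used in the invention: the column names (`url`, `title`, `question`, `answer`, `label`), the example URLs and the helper names `build_balanced_dataset` and `write_csv` are all assumptions introduced here, and the crawling itself is omitted.

```python
import csv
import io
import random


def build_balanced_dataset(relevant, irrelevant, seed=0):
    """Keep every relevant post and sample a similar number of irrelevant ones."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(len(relevant), len(irrelevant))
    sampled = rng.sample(irrelevant, k)
    rows = [dict(post, label=1) for post in relevant]
    rows += [dict(post, label=0) for post in sampled]
    rng.shuffle(rows)
    return rows


def write_csv(rows, handle):
    """Store the labeled posts as CSV (hypothetical column layout)."""
    fields = ["url", "title", "question", "answer", "label"]
    writer = csv.DictWriter(handle, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)


def make_post(i):
    # Placeholder post; real posts would come from the crawler.
    return {"url": f"https://example.org/q/{i}", "title": f"post {i}",
            "question": "question text", "answer": "answer text"}


relevant = [make_post(i) for i in range(3)]
irrelevant = [make_post(i) for i in range(3, 10)]

rows = build_balanced_dataset(relevant, irrelevant)
buffer = io.StringIO()  # stands in for an open CSV file
write_csv(rows, buffer)
```

Sampling with a fixed seed keeps the irrelevant subset stable between runs, which makes later comparisons between classifiers easier to reproduce.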

According to the scheme, the preprocessing in step 2) comprises: data cleaning, removal of useless words, and reduction of words to their base forms;

the data cleaning deletes useless characters and escape characters contained in the web page text;

the removal of useless words deletes words shorter than 3 letters and removes English stop words from the text;

and the reduction of words to their base forms comprises stemming and lemmatization, in which the inflected forms of all words in the text are reduced to their base forms using the NLTK toolkit.
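The first two preprocessing operations (data cleaning and removal of useless words) can be sketched as below. This is a hedged illustration only: the regular expressions, the function names and the tiny `STOP_WORDS` set are assumptions made for this example, whereas the scheme itself relies on the English stop-word list of the NLTK toolkit. Stemming and lemmatization are not shown here.

```python
import html
import re

# Tiny stand-in stop-word list (assumption); the scheme uses NLTK's English list.
STOP_WORDS = {"the", "and", "for", "with", "that", "this", "are", "was", "have"}


def clean(text):
    """Data cleaning: strip markup, escape sequences and non-letter characters."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    text = html.unescape(text)                # decode entities such as &amp;
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # drop digits, punctuation, symbols
    return text.lower()


def remove_useless_words(text, min_len=3):
    """Drop words shorter than min_len letters and English stop words."""
    return [w for w in text.split() if len(w) >= min_len and w not in STOP_WORDS]


raw = "<p>The God class &amp; cyclic dependency are bad for the architecture.</p>"
tokens = remove_useless_words(clean(raw))
# tokens == ['god', 'class', 'cyclic', 'dependency', 'bad', 'architecture']
```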

According to the scheme, in step 3), extracting text features from the text of step 2) using natural language processing techniques to obtain the processed feature vector data sets specifically comprises the following steps:

step 3.1) processing the data set obtained in step 2) using the Bag-of-Words (BoW) technique: counting the frequency of each word in each document of the text data set, combining the frequencies of all words into the feature vector of the document, and storing the feature vectors of all documents obtained in this step as the BoW feature vector data set;

step 3.2) processing the data set obtained in step 2) using the TF-IDF technique: calculating the TF value and the IDF value of each word in each document of the text data set, multiplying the TF value by the IDF value to obtain the TF-IDF value used as the feature of the document, and storing the results as the TF-IDF feature vector data set;

step 3.3) processing the data set obtained in step 2) using the Word2Vec technique: converting each word in each document of the text data set into a vector in a feature space through a mapping function, averaging the vectors of all words in a text to obtain the feature of the document, and storing the feature vectors of all documents as the Word2Vec feature vector data set.
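As an illustration of step 3.1), the following minimal sketch builds Bag-of-Words vectors from already tokenized documents. The function name `bag_of_words`, the alphabetical ordering of the vocabulary and the toy documents are assumptions of this example and are not specified by the scheme.

```python
from collections import Counter


def bag_of_words(documents):
    """Return the vocabulary and one word-count vector per tokenized document."""
    vocabulary = sorted({word for doc in documents for word in doc})
    vectors = []
    for doc in documents:
        counts = Counter(doc)  # frequency of each word in this document
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


docs = [
    ["cyclic", "dependency", "module", "dependency"],
    ["module", "layer"],
]
vocabulary, vectors = bag_of_words(docs)
# vocabulary == ['cyclic', 'dependency', 'layer', 'module']
# vectors    == [[1, 2, 0, 1], [0, 0, 1, 1]]
```

Every document is mapped onto the same vocabulary, so all feature vectors have equal length and can be passed to a classifier directly.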

The invention has the following beneficial effects: an automated technique for recognizing software architecture bad taste discussions from text is provided that can quickly obtain an optimal combination of feature extraction and classification models based on settings.

Drawings

The invention will be further described with reference to the accompanying drawings and examples, in which:

FIG. 1 is a flow chart of a method of an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in FIG. 1, a method for identifying discussions of software architecture bad taste from text comprises the following steps:

Step 1, crawling question-and-answer posts from a professional software development question-and-answer community and manually labeling the posts as relevant or irrelevant to software architecture bad taste, thereby constructing a data set for automatically identifying discussions of software architecture bad taste;

Step 1.1, crawling the experimental data. First, a total of 5950 results are retrieved from the software development question-and-answer community using 14 keywords, such as "architecture cell", "architecture base defect", "architecture vision", "architecture workflow reagent", "architecture reagent-pattern", and "architecture reagent-pattern". The 50 most relevant results for each keyword (700 in total) are selected for manual screening, and duplicate question-and-answer posts are removed. After each post is read carefully, all posts relevant to software architecture bad taste (208 in total) are extracted and their URL links are recorded; then a comparable number of irrelevant posts (187 in total) are randomly selected from the remaining 492 irrelevant posts and their URL links are recorded, thereby forming a balanced data set.

Step 1.2, using the URL links to crawl the title, question and answers of each post, manually labeling each post as relevant or irrelevant to software architecture bad taste, and storing the posts in a CSV file for use in the subsequent steps.

Step 2, preprocessing the text in the data set, for example by removing irrelevant characters and words, to obtain relatively simplified text content;

Step 2 specifically comprises the following steps:

Step 2.1, data cleaning. Useless characters such as "… …" and "/", together with the escape characters often contained in web page text, are removed first.

Step 2.2, removal of useless words. Using the NLTK toolkit, words shorter than 3 letters are deleted and English stop words are removed from the text.

Step 2.3, stemming and lemmatization. Using the NLTK toolkit, the inflected forms of all words in the text are reduced to their base forms; for example, "locked" and "locking" are reduced to "lock".
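The "locked"/"locking" example of step 2.3 can be reproduced with the Porter stemmer shipped with NLTK. This is only a sketch under assumptions: the description names the NLTK toolkit but not which stemmer or lemmatizer is used, so `PorterStemmer` is a choice made here for illustration, and the helper name `stem_tokens` is hypothetical. Lemmatization (for instance with NLTK's `WordNetLemmatizer`) additionally requires downloading the WordNet corpus and is therefore left out.

```python
from nltk.stem import PorterStemmer

# PorterStemmer needs no extra corpus downloads.
stemmer = PorterStemmer()


def stem_tokens(tokens):
    """Reduce each token to its stem."""
    return [stemmer.stem(token) for token in tokens]


stems = stem_tokens(["locked", "locking", "lock"])
# stems == ['lock', 'lock', 'lock']
```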

Step 3, extracting text features from the data set processed in step 2 using natural language processing techniques;

In this embodiment, the data set obtained in step 2 is processed using the TF-IDF feature extraction technique: the TF value and the IDF value of each word in each document of the text data set are calculated and multiplied to obtain the TF-IDF value used as the feature of the document, and the results are stored as the feature vector data set.
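The TF-IDF computation can be sketched as follows. The description does not state which TF and IDF definitions are used, so this example assumes the common textbook forms, TF = (count of the word in the document) / (document length) and IDF = ln(N / df), where N is the number of documents and df the number of documents containing the word; library implementations often apply smoothed variants instead. The function name `tf_idf` and the toy documents are illustrative, and documents are assumed to be non-empty.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return the vocabulary and one TF-IDF vector per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    vocabulary = sorted(doc_freq)
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([
            (counts[word] / len(doc)) * math.log(n_docs / doc_freq[word])
            for word in vocabulary
        ])
    return vocabulary, vectors


docs = [["cyclic", "dependency", "module"], ["module", "layer"]]
vocabulary, vectors = tf_idf(docs)
```

With these definitions a word that occurs in every document, such as "module" here, receives an IDF of ln(1) = 0 and so contributes nothing to any feature vector, while words that occur in few documents are weighted up.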

Step 4, after the features of each text are obtained, dividing the data set obtained from the processing of step 3 into a training set and a test set, training a Random Forest binary classifier using the training set, and predicting the documents in the test set using the trained classifier;

Step 5, using the trained classifier to predict the documents in the test set to obtain classification results, and evaluating how well the classifier identifies question-and-answer posts about software architecture bad taste, wherein the performance evaluation uses the following four metrics: Accuracy, Precision, Recall, and F1-score.

According to the calculation, the classification results obtained with this scheme have an accuracy of 0.643, a precision of 0.645, a recall of 0.740 and an F1-score of 0.678, which shows that the method can effectively identify discussions of software architecture bad taste from text and divide question-and-answer posts into those relevant and those irrelevant to software architecture bad taste, thereby providing example discussions of software architecture bad taste.
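For reference, the four metrics can be computed from the true and predicted labels as sketched below, using the standard definitions for binary classification with label 1 denoting a post relevant to software architecture bad taste. The function name `binary_metrics` and the toy labels are assumptions of this example; the labels are not data from the embodiment.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1-score for binary labels (1 = relevant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
scores = binary_metrics(y_true, y_pred)
```

In this toy case there are 2 true positives, 2 true negatives, 1 false positive and 1 false negative, so all four metrics equal 2/3.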

The classifier in the above scheme is suitable for identifying discussions of software architecture bad taste in the text of a professional software development question-and-answer community. To broaden the scope of application of the technical scheme, the invention further provides a method for identifying discussions of software architecture bad taste from text, comprising the following steps:

Step 1.1, crawling the experimental data. First, a total of 5950 results are retrieved from the software development question-and-answer community using 14 keywords, such as "architecture cell", "architecture base defect", "architecture vision", "architecture workflow reagent", "architecture reagent-pattern", and "architecture reagent-pattern". The 50 most relevant results for each keyword (700 in total) are selected for manual screening, and duplicate question-and-answer posts are removed. After each post is read carefully, all posts relevant to software architecture bad taste (208 in total) are extracted and their URL links are recorded; then a similar number of irrelevant posts (187 in total) are randomly selected from the posts screened out as irrelevant to software architecture bad taste, and their URL links are recorded.

The feature extraction of step 3 specifically comprises the following steps:

Step 3.1, processing the data set obtained in step 2 using the Bag-of-Words (BoW) technique: the frequency of each word in each document of the text data set is counted, and the frequencies of all words are combined into the feature vector of the document. The feature vectors of all documents obtained in this step are stored as the BoW feature vector data set.

Step 3.2, processing the data set obtained in step 2 using the TF-IDF (Term Frequency-Inverse Document Frequency) technique: the TF value and the IDF value of each word in each document of the text data set are calculated and multiplied to obtain the TF-IDF value used as the feature of the document, and the results are stored as the TF-IDF feature vector data set.

Step 3.3, processing the data set obtained in step 2 using the Word2Vec technique: each word in each document of the text data set is converted into a vector in a feature space through a mapping function, and the vectors of all words in a text are averaged to serve as the feature of the document. The feature vectors of all documents are stored as the Word2Vec feature vector data set.

Specifically, steps 3.1 to 3.3 each operate on the output of step 2 and are executed in parallel.
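The averaging in step 3.3 can be sketched as follows. The 2-dimensional `embeddings` dictionary is a hypothetical stand-in for the word-to-vector mapping that a trained Word2Vec model would provide, and skipping out-of-vocabulary words (or returning a zero vector when no word is known) is an assumption of this example rather than something the description specifies.

```python
def average_vector(tokens, embeddings, dim):
    """Average the vectors of all known words in a document."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim  # no known word: fall back to the zero vector
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical embeddings; a trained Word2Vec model would supply real ones.
embeddings = {
    "cyclic": [1.0, 0.0],
    "dependency": [0.0, 1.0],
    "module": [1.0, 1.0],
}

doc_vector = average_vector(["cyclic", "dependency", "unknown"], embeddings, dim=2)
# doc_vector == [0.5, 0.5]
```

Averaging yields a fixed-length document vector regardless of document length, which is what allows the same classifiers to be trained on this feature set as on the BoW and TF-IDF feature sets.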

Step 4, after the features of each text are obtained, dividing the data set obtained from the processing of step 3 into a training set and a test set, training binary classifiers using the training set, and predicting the documents in the test set using the trained classifiers.

Step 4.1, using the LR (Logistic Regression) classification technique, training an LR classifier on each of the three feature data sets obtained in step 3 (the parameters of the LR classifier use default values), and predicting the classification results of the documents in the test set using the trained LR classifier.

Step 4.2, using the RF (Random Forest) classification technique, training an RF classifier on each of the three feature data sets obtained in step 3 (the parameters of the RF classifier use default values), and predicting the classification results of the documents in the test set using the trained RF classifier.

Step 4.3, using the SVM (Support Vector Machine) classification technique, training an SVM classifier on each of the three feature data sets obtained in step 3 (the parameters of the SVM classifier use default values), and predicting the classification results of the documents in the test set using the trained SVM classifier.

Step 4.4, using the KNN (k-Nearest Neighbors) classification technique, training a KNN classifier on each of the three feature data sets obtained in step 3 (the parameters of the KNN classifier use default values), and predicting the classification results of the documents in the test set using the trained KNN classifier.

Specifically, steps 4.1 to 4.4 each operate on the output of step 3 and are executed in parallel.

Step 5, evaluating how well the classifiers identify question-and-answer posts about software architecture bad taste using four metrics: Accuracy, Precision, Recall and F1-score.

Step 5.1, calculating the four evaluation metrics (Accuracy, Precision, Recall and F1-score) for each of the 12 combinations of the 3 feature extraction techniques and the 4 classification models.

Step 5.2, comparing the results and determining the best feature extraction technique, the best classification model and the best combination of the two.
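The train, evaluate and compare loop of steps 4 and 5 can be sketched with scikit-learn as below. Several points are assumptions of this example and not part of the described scheme: the description does not name scikit-learn, the toy posts and labels are invented for illustration, only the BoW and TF-IDF feature sets are covered (the Word2Vec feature set needs a trained embedding model and is omitted, giving 8 of the 12 combinations), the Random Forest is given a fixed `random_state` for reproducibility although the scheme uses default parameters, and selecting the best combination by F1-score is one possible comparison criterion.

```python
from itertools import product

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Toy posts (invented for illustration); label 1 = relevant to architecture bad taste.
train_docs = [
    "cyclic dependency between modules hurts the architecture",
    "god component couples every layer of the architecture",
    "architecture smell caused by scattered functionality",
    "unstable dependency makes the architecture hard to evolve",
    "how to parse a json file in python",
    "button click event not firing in the browser",
    "sort a list of strings by length",
    "connect to a database using a connection string",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_docs = [
    "ambiguous interface is an architecture smell",
    "dependency cycle across architecture layers",
    "read a csv file line by line",
    "center a button with css",
]
test_labels = [1, 1, 0, 0]

extractors = {"BoW": CountVectorizer, "TF-IDF": TfidfVectorizer}
classifiers = {
    "LR": lambda: LogisticRegression(),
    "RF": lambda: RandomForestClassifier(random_state=0),  # seed is an assumption
    "SVM": lambda: SVC(),
    "KNN": lambda: KNeighborsClassifier(),
}

results = {}
for (f_name, make_vec), (c_name, make_clf) in product(
        extractors.items(), classifiers.items()):
    vectorizer = make_vec()
    x_train = vectorizer.fit_transform(train_docs)  # fit vocabulary on training set only
    x_test = vectorizer.transform(test_docs)
    predictions = make_clf().fit(x_train, train_labels).predict(x_test)
    results[(f_name, c_name)] = {
        "accuracy": accuracy_score(test_labels, predictions),
        "precision": precision_score(test_labels, predictions, zero_division=0),
        "recall": recall_score(test_labels, predictions, zero_division=0),
        "f1": f1_score(test_labels, predictions, zero_division=0),
    }

# Pick the combination with the highest F1-score.
best_combination = max(results, key=lambda key: results[key]["f1"])
```

Keeping the scores in one dictionary keyed by (feature extraction, classifier) makes it straightforward to rank the combinations by any of the four metrics.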

It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims

1. A method for recognizing a bad taste discussion of a software architecture from text, comprising the steps of:

2) preprocessing the text in the data set to obtain simplified text content;

4) after the features of each text are obtained, dividing the data set obtained from the processing of step 3) into a training set and a test set, and training the binary classifiers in a classifier set using the training set;

5) using the trained classifiers to predict the documents in the test set to obtain classification results, and evaluating how well each classifier identifies question-and-answer posts about software architecture bad taste, wherein the performance evaluation uses the following four metrics: accuracy, precision, recall and F1-score;

6) comparing the results, analyzing them to obtain the optimal combination of feature extraction technique and classifier, and identifying discussions of software architecture bad taste from text using the classification model of the final combination.

2. The method for identifying a bad taste discussion of a software architecture from a text according to claim 1, wherein the step 1) comprises the following steps:

3. The method for recognizing software architecture bad taste discussions from text according to claim 1, wherein the preprocessing in step 2) comprises: data cleaning, removal of useless words, and reduction of words to their base forms;

4. The method for identifying bad taste discussions in software architecture from texts as claimed in claim 1, wherein in step 3), extracting text features from the text of step 2) using natural language processing techniques to obtain the processed feature vector data sets specifically comprises the following steps:

5. The method for recognizing a bad taste discussion of a software architecture from a text according to claim 1, wherein the step 4) is specifically as follows:

training an LR classifier, an RF classifier, an SVM classifier and a KNN classifier on each of the three feature data sets obtained in step 3), yielding classifiers for the 3 × 4 = 12 combinations of feature extraction technique and classification model; and predicting the documents in the test set using the trained classifiers to obtain classification results.