CN113743096A

CN113743096A - Crowdsourcing test report similarity detection method based on natural language processing

Info

Publication number: CN113743096A
Application number: CN202010487202.9A
Authority: CN
Inventors: 房春荣; 曹振飞; 王旭; 虞圣呈; 恽叶霄; 李彤宇
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2020-05-27
Filing date: 2020-05-27
Publication date: 2021-12-03

Abstract

A method for detecting the similarity of crowdsourcing test reports based on natural language processing. The method applies natural language processing to detect similarity among the complex test reports submitted by crowdsourcing workers. The crowdsourcing reports are first preprocessed with Chinese word segmentation, stop-word removal, and similar steps; the resulting word groups are represented as word vectors using Word2Vec; cosine similarity is chosen as the measure of distance between the vectors; a semantic model trained on a large amount of previous crowdsourcing report data is applied; and the vectors are used as input to K-means cluster analysis, which groups similar reports into the same class according to a set similarity threshold, so that the similarity between crowdsourcing test reports is measured accurately.

Description

Crowdsourcing test report similarity detection method based on natural language processing

Technical Field

The invention belongs to the field of software engineering and relates to the application of natural language processing in software engineering, specifically to detecting the similarity of crowdsourcing test reports.

Background

Detecting similar crowdsourcing test reports is a key technology for improving the utilization of crowdsourcing reports and reducing the effort testers spend reading duplicate reports. A crowdsourcing test report is the result returned to the tester after crowdsourcing workers complete a task defined by the initiator; the tester uses the report to guide reproduction and localization of the Bug. If many of the numerous test reports describe the same Bug, the tester cannot know in advance whether the Bug described in a given report has already been reported, and so wastes a large amount of time reading duplicate reports that do not help reproduce or locate the Bug. Researchers therefore pay close attention to the problem of detecting similar crowdsourcing test reports in order to improve the effectiveness of such reports.

Similarity among crowdsourcing test reports may arise for two reasons:

1) The first reason: since every crowdsourcing worker participates in all testing tasks, it is inevitable that multiple workers find the same Bug and describe it with similar but differently worded sentences, which results in duplicate content across multiple crowdsourcing reports.

2) The second reason: since crowdsourcing testing offers monetary incentives, some malicious workers may copy others' test reports to obtain rewards fraudulently.

For similar reports arising from the second reason, the similarity is usually very high and most of the text is identical, so traditional text similarity analysis detects them well. For the first reason, however, the words and sentences in the reports are similar in meaning but not identical, and traditional plain-text similarity analysis does not detect such reports well.

Many methods for detecting similar Bug reports exist today. Runeson et al. observed that, because defect reports are mostly written in structured natural language, it is difficult to compare two reports with a formal method, so they identified duplicate reports using plain-text natural language processing techniques. Sun et al. proposed a retrieval tool for measuring the similarity between two reports, which exploits not only the textual similarity of the summary and description fields but also the similarity of non-textual fields such as product, component, and version. The tool also extends BM25F, an effective similarity formula from the information retrieval community, and uses a two-round stochastic gradient descent method to optimize the retrieval process automatically for a specific Bug repository through supervised learning, further improving the accuracy of similar-report detection. The first method detects similar reports well when most of their text is identical, but does not address the first reason. The second method does address the first reason, but because the Bug repository it relies on is not specific to crowdsourcing reports, its similarity detection for crowdsourcing reports is still not ideal.

Disclosure of Invention

The problem to be solved by the invention is that current similarity detection for crowdsourcing test reports performs poorly on similar reports that have the same meaning but different text.

The technical scheme of the invention is as follows: a method for detecting similarity of crowdsourcing test reports based on natural language processing comprises the following steps:

1) First, a crowdsourcing test corpus is established, and a supervised semantic model is trained on the basis of that corpus:

1.1) Typical Bug scenarios are selected from the large body of past crowdsourcing test report data. For each scenario, an expert team first gives a typical description of the Bug, and two types of descriptions of the scenario are then collected. Descriptions of the first type are not identical to the typical description but use expressions with similar meaning to state clearly how the Bug manifests in the scenario. Descriptions of the second type share most of their text with the typical description but, by modifying a small amount of content (e.g., the subject or predicate), describe a completely different Bug.

1.2) From the collected data, a proper-noun corpus for crowdsourcing tests is constructed, and a crowdsourcing test synonym library is compiled.

1.3) The collected data are manually labeled to indicate whether descriptions are similar, and a crowdsourcing test report semantic model is trained using a neural-network-based computational model. After multiple iterations and parameter tuning, once the desired detection performance is reached on the test set, training of the semantic model is complete.

2) Next, input processing is performed. The input crowdsourcing test report is first preprocessed by removing stop words using the previously compiled stop-word list. The report is then segmented with the Chinese word segmentation tool jieba; from the segmentation result, only words contained in the corpus compiled in step 1) are kept, and synonyms are then replaced according to the synonym library. This completes input preprocessing and outputs the word group corresponding to the report.

3) Then, taking the word group list of each report as input, Word2Vec is used to represent the segmented word groups in a form a computer can compute with, and a word embedding vector is calculated for each word group.

4) Features such as word frequency, n-grams, and part of speech are selected and vectorized (Word2Vec itself uses the co-occurrence of text within a window), and the resulting feature vectors serve as input to the semantic model.

5) The word embedding vectors and the feature vectors corresponding to the word groups are fed into the semantic model trained in step 1). Cosine similarity is used as the measure of distance between vectors, and the distance between vectors, i.e., the similarity between reports, is calculated.

6) Cluster analysis is performed on the calculated results using the K-means method: highly similar reports are placed in the same class, finally yielding the clustering result and the clusters of similar reports.
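The filtering part of step 2) can be illustrated with a minimal sketch. It assumes the report has already been segmented into tokens (the method uses jieba for this) and, for simplicity, applies stop-word removal to the tokens rather than to the raw text. The function name `preprocess` and the tiny stop-word list, corpus vocabulary, and synonym table are hypothetical stand-ins, not the data compiled for the invention.

```python
def preprocess(tokens, stop_words, corpus_vocab, synonyms):
    """Filter and normalize the segmented tokens of one report.

    tokens       -- words produced by a segmenter (jieba in the method)
    stop_words   -- set of words to discard
    corpus_vocab -- set of words kept in the crowdsourcing test corpus
    synonyms     -- dict mapping a word to its canonical synonym
    """
    result = []
    for tok in tokens:
        if tok in stop_words:
            continue  # drop stop words
        if tok not in corpus_vocab:
            continue  # keep only words found in the corpus
        result.append(synonyms.get(tok, tok))  # replace synonyms
    return result


# Hypothetical example data (English stand-ins for segmented Chinese words).
STOP_WORDS = {"the", "after", "a"}
CORPUS_VOCAB = {"app", "crashes", "quits", "login", "clicking"}
SYNONYMS = {"quits": "crashes"}

tokens = ["the", "app", "quits", "after", "clicking", "login", "today"]
word_group = preprocess(tokens, STOP_WORDS, CORPUS_VOCAB, SYNONYMS)
print(word_group)  # ['app', 'crashes', 'clicking', 'login']
```

Mapping synonyms to one canonical word before vectorization means that two reports using different wording for the same behavior produce the same word group.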

The invention is characterized by: 1. constructing a crowdsourcing test corpus and a synonym library by collecting typical Bug scenarios and descriptions from the field of crowdsourcing test reports; 2. training a supervised, neural-network-based semantic model for crowdsourcing test reports; 3. computing word embedding vectors from the word groups using Word2Vec; 4. calculating the similarity between reports using cosine similarity; 5. performing cluster analysis with the K-means method to derive clusters of highly similar reports.
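As an illustration of the cosine similarity measure named above, the following minimal sketch computes cos θ = (u · v) / (|u| |v|) for two report vectors. The function name and the example vectors are hypothetical, and returning 0.0 for a zero vector is an assumption of this sketch, not something the method specifies.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional report vectors (the method uses 200 dimensions).
report_a = [0.2, 0.7, 0.1]
report_b = [0.4, 1.4, 0.2]   # same direction as report_a
report_c = [0.9, 0.0, 0.3]

print(cosine_similarity(report_a, report_b))
print(cosine_similarity(report_a, report_c))
```

Because the measure depends only on the angle, `report_a` and `report_b` score as essentially identical even though their magnitudes differ.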

The beneficial effects of the invention are as follows: by means of the compiled typical Bug scenarios and descriptions in the field of crowdsourcing test reports and the supervised, neural-network-based semantic model, similarity among multiple reports that differ in content but describe the same Bug can be identified effectively.

Drawings

FIG. 1 is an overall flow chart of the present invention.

FIG. 2 is an example of part of the crowdsourcing report data set used in the present invention.

FIGS. 3-6 are examples of some of the similar-report clusters obtained after cluster analysis of the data set.

Detailed Description

The invention relates to several key technologies, namely jieba word segmentation, the Word2Vec model, K-means clustering, and the LSTM-DSSM deep learning model.

1. jieba word segmentation

jieba is a widely used Python component for Chinese word segmentation, with three main characteristics: 1. it supports three segmentation modes: precise mode, full mode, and search engine mode; 2. it supports segmentation of traditional Chinese characters; 3. it supports custom dictionaries.

2. Word2Vec model

Word2Vec is a group of related models used to generate word vectors. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. The network takes a word as input and predicts the words in adjacent positions. Under the bag-of-words assumption in Word2Vec, the order of the words is unimportant. After training, the Word2Vec model can map each word to a vector that represents word-to-word relationships; this vector is the hidden layer of the neural network.
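To make the "predict the words in adjacent positions" idea concrete, the sketch below shows how skip-gram training pairs can be extracted from a word group with a fixed context window. It only generates (center word, context word) pairs; training the network itself is not shown. The function name is hypothetical, and the fixed window is a simplification (library implementations may vary the effective window size).

```python
def skipgram_pairs(words, window=5):
    """Return (center, context) pairs for every word and each neighbor
    within `window` positions on either side."""
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, words[j]))
    return pairs


# Hypothetical word group; window=1 keeps the output short.
words = ["app", "crashes", "login"]
for center, context in skipgram_pairs(words, window=1):
    print(center, "->", context)
```

With `window=5`, each word is paired with up to 10 neighbors (5 before and 5 after), and the position of a neighbor inside the window does not affect the pair that is produced.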

3. K-means clustering

The K-means clustering algorithm is an iterative cluster analysis algorithm. K objects are randomly selected as initial cluster centers; the distance between each object and each cluster center is then calculated, and each object is assigned to the nearest cluster center. A cluster center together with the objects assigned to it represents a cluster. Each time a sample is assigned, the center of its cluster is recalculated from the objects currently in the cluster. This process is repeated until a termination condition is met. The termination condition may be that no object is reassigned to a different cluster, that no cluster center changes any more, or that the sum of squared errors reaches a local minimum.
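The procedure can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the invention: it recomputes all centers once per pass (the batch variant) rather than after each individual assignment, uses squared Euclidean distance, stops when no object changes cluster, and keeps the previous center if a cluster becomes empty. The function names and the `seed` parameter are assumptions.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Cluster `points` into k groups; return (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial cluster centers
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # termination: no object was reassigned
        labels = new_labels
        # Update step: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = mean_point(members)
    return labels, centers


# Two well-separated hypothetical groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = k_means(points, k=2)
print(labels)
print(centers)
```

The cluster numbering depends on the random initial centers, so only the grouping of the points, not the label values, is meaningful.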

4. LSTM-DSSM deep learning model

LSTM-DSSM builds on the RNN and mainly addresses the inability of CNN-DSSM to capture context features over longer distances. It is a variant of the DSSM model and is mainly applied to calculating semantic similarity in natural language processing.

The following describes the steps of the method with a specific example and shows the results.

We picked 20000 Bug reports of different content written in a uniform format.

The experimental environment is as follows: Ubuntu 16.04 LTS, 8 GB of RAM, 512 GB SSD storage.

The overall process of the invention is shown in FIG. 1, and the specific implementation steps are as follows:

1) Input preprocessing is performed. Stop words are first removed from the 20000 crowdsourcing test reports; the reports are then segmented with the Chinese word segmentation tool jieba, only words included in the corpus are retained, synonyms are replaced, and the word group corresponding to each report is output;

2) The word group corresponding to each report is used as input to the word2vec algorithm, and the word embedding vector corresponding to each report is calculated. The generated word vectors are 200-dimensional, the context window covers 10 words (the 5 preceding and the 5 following), and the skip-gram method is adopted;

3) Features are selected using the variance selection method: the variance of each feature is calculated, features whose variance exceeds a threshold are selected, and features whose values change little are removed. A statistical index is calculated independently for each variable, and the indices are used to judge which features are important and which should be removed. After feature selection, the vectorized feature vectors are output;

4) The word embedding vectors and feature vectors corresponding to the reports are used as input to the trained semantic model;

5) Cosine similarity is chosen as the similarity measure. The distance between two vectors is measured by the angle between them: the smaller the angle, the smaller the distance, i.e., the more similar the two vectors. The angle between the word embedding vectors of two reports is therefore calculated to obtain the similarity of the two reports;

6) The result produced by the model is used as input to the K-means clustering method. A K value is first selected according to an empirical formula. Since the experimental sample contains 20000 records, the K value is 100. Accordingly, 100 points are randomly selected as cluster centers, the distance from each point to the 100 cluster centers is calculated, and each point is assigned to the nearest cluster center, forming 100 clusters. The mean of each cluster is then recalculated. These steps are repeated until the means no longer change. Examples of the clusters output after clustering are shown in FIGS. 3-6.
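Two of the steps above, variance-based feature selection and the choice of K, can be sketched as follows. The variance filter uses the population variance and an arbitrary threshold, both of which are assumptions. The empirical formula for K is not reproduced in this text; the rule of thumb K ≈ sqrt(n / 2) used below is only an assumption, chosen because it is a commonly cited heuristic and happens to give 100 for 20000 samples. All function names are hypothetical.

```python
import math


def variance(values):
    """Population variance of a sequence of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def select_by_variance(rows, threshold):
    """Keep only the feature columns whose variance exceeds `threshold`.

    rows -- list of equal-length feature vectors, one per report
    Returns (kept_column_indices, reduced_rows).
    """
    columns = list(zip(*rows))
    keep = [j for j, col in enumerate(columns) if variance(col) > threshold]
    reduced = [[row[j] for j in keep] for row in rows]
    return keep, reduced


def rule_of_thumb_k(n_samples):
    """Assumed heuristic: K is roughly sqrt(n / 2)."""
    return round(math.sqrt(n_samples / 2))


# Hypothetical feature matrix: column 0 never changes, so it is removed.
rows = [[1.0, 0.0, 5.0],
        [1.0, 1.0, 3.0],
        [1.0, 0.0, 1.0]]
keep, reduced = select_by_variance(rows, threshold=0.1)
print(keep)                    # [1, 2]
print(rule_of_thumb_k(20000))  # 100
```

A constant column carries no information for distinguishing reports, which is why features with small variance are dropped before the vectors are passed on.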

Claims

1. A method for detecting the similarity of crowdsourcing test reports based on natural language processing, characterized by: constructing a crowdsourcing test corpus; training a supervised semantic model on the basis of the crowdsourcing test corpus; performing input preprocessing using the corpus words, stop words, and a synonym library; calculating the word embedding vectors corresponding to the reports using Word2Vec; selecting and vectorizing features; calculating report similarity with cosine similarity as the measure; and performing cluster analysis with K-means to finally obtain clusters of highly similar reports.

2. The method for detecting the similarity of crowdsourcing test reports based on natural language processing as claimed in claim 1, wherein a crowdsourcing test corpus is constructed and a supervised semantic model is trained on the basis of the crowdsourcing test corpus, mainly through the following steps:

1) first, typical Bug scenarios are selected from the large body of past crowdsourcing test report data; for each scenario, an expert team first gives a typical description of the Bug, and two types of descriptions of the scenario are collected; descriptions of the first type are not identical to the typical description but use expressions with similar meaning to state clearly how the Bug manifests in the scenario; descriptions of the second type share most of their text with the typical description but, by modifying a small amount of content (e.g., the subject or predicate), describe a completely different Bug;

2) from the collected data, a proper-noun corpus for crowdsourcing tests is constructed, and a crowdsourcing test synonym library is compiled;

3) the collected data are manually labeled to indicate whether descriptions are similar, and a crowdsourcing test report semantic model is trained using a neural-network-based computational model; after multiple iterations and parameter tuning, once the desired detection performance is reached on the test set, training of the semantic model is complete.

3. The method for detecting the similarity of crowdsourcing test reports based on natural language processing as claimed in claim 1, wherein input preprocessing is performed using the corpus words, stop words, and a synonym library, and the word embedding vectors corresponding to the reports are calculated using Word2Vec; the input crowdsourcing test report is first preprocessed by removing stop words using the previously compiled stop-word list; the report is then segmented with the Chinese word segmentation tool jieba, only words stored in the compiled corpus are kept from the segmentation result, and synonyms are replaced according to the synonym library; this completes input preprocessing and outputs the word group corresponding to the report; the word group corresponding to the report is used as input to the word2vec algorithm and the word embedding vector corresponding to each report is calculated, wherein the generated word vectors are 200-dimensional, the context window covers 10 words (the 5 preceding and the 5 following), and the skip-gram method is adopted.

4. The method for detecting the similarity of crowdsourcing test reports based on natural language processing as claimed in claim 1, wherein features are selected and vectorized and report similarity is calculated with cosine similarity as the measure; features are selected using the variance selection method, in which the variance of each feature is calculated, features whose variance exceeds a threshold are selected, and features whose values change little are removed; a statistical index is calculated independently for each variable, and the indices are used to judge which features are important and which should be removed; after feature selection, the vectorized feature vectors are output; report similarity is calculated with cosine similarity as the measure, the distance between two vectors being measured by the angle between them, such that the smaller the angle, the smaller the distance, i.e., the more similar the two vectors; the angle between the word embedding vectors of two reports is therefore calculated to obtain the similarity of the two reports.

5. The method for detecting the similarity of crowdsourcing test reports based on natural language processing as claimed in claim 1, wherein K-means is used for cluster analysis to obtain clusters of highly similar reports; the result produced by the model is used as input to the K-means clustering method, and a K value is first selected according to an empirical formula; since the experimental sample contains 20000 records, the K value is 100; accordingly, 100 points are randomly selected as cluster centers, the distance from each point to the 100 cluster centers is calculated, and each point is assigned to the nearest cluster center, forming 100 clusters; the mean of each cluster is then recalculated; these steps are repeated until the means no longer change.