CN110928764A - Automated mobile application crowdsourcing test report evaluation method and computer storage medium - Google Patents

Automated mobile application crowdsourcing test report evaluation method and computer storage medium

Info

Publication number
CN110928764A
CN110928764A (application CN201910957929.6A; granted as CN110928764B)
Authority
CN
China
Prior art keywords
index
test report
test
defect
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910957929.6A
Other languages
Chinese (zh)
Other versions
CN110928764B (en)
Inventor
姚奕
刘语婵
刘佳洛
顾晓东
杨帆
陈文科
刘伟豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN201910957929.6A priority Critical patent/CN110928764B/en
Publication of CN110928764A publication Critical patent/CN110928764A/en
Application granted granted Critical
Publication of CN110928764B publication Critical patent/CN110928764B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • G06F11/3692 Test management for test results analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

The invention discloses an automated evaluation method for mobile application crowdsourcing test reports and a computer storage medium, wherein the method comprises the following steps: 1) inputting a test report set and the historical credibility of workers, eliminating invalid test reports, and performing word segmentation and stop-word removal on the remaining test reports; 2) clustering the test reports according to the defects they reveal to form several classes of defect test reports, and selecting the level with the largest proportion as the defect level of each class; 3) constructing several normative indexes and corresponding step-type metric functions, evaluating each test report against them, and converting the evaluation into a normative score; 4) obtaining the final score of each test report from the defect level and the normative score. The invention overcomes the lack of content-quality evaluation in existing crowdsourcing test report quality evaluation methods and improves the overall performance of crowdsourcing test platforms.

Description

Automated mobile application crowdsourcing test report evaluation method and computer storage medium
Technical Field
The present invention relates to automated evaluation methods and computer storage media, and more particularly, to automated evaluation methods and computer storage media for crowdsourcing test reports for mobile applications.
Background
Mobile application crowdsourcing testing distributes test tasks for mobile application software, which used to be executed by in-house employees, to anonymous network users over the Internet. Because the diversity and complementarity of the crowd improve defect discovery efficiency, crowdsourcing testing has received wide attention in industry, and many commercial crowdsourcing test platforms (such as Applause, Baidu MTC, Mooctest, Testin and the like) have appeared. Crowdsourcing testing helps task requesters reach a large number of freelance workers whose collective wisdom can solve practical problems. However, during testing, some malicious workers do not work seriously in order to maximize their own benefit, the quality of the submitted test results is low, and the test quality is difficult to guarantee, which may cause serious loss to task requesters. In response to this problem, many researchers have started from the test reports themselves. Some studies attempt to reduce the cost of manual review by reducing the number of test reports to be reviewed: they propose crowdsourced test report prioritization, using textual and screenshot information to rank the reports so that developers can inspect test reports revealing different defects as far as possible within limited resources and time. Other studies address the problem by fuzzy clustering of crowdsourced test reports, dividing the reports into clusters with an automated method so that developers need only review a representative test report in each cluster, greatly reducing the number of test reports reviewed.
However, these studies neglect the effect of test report quality on the efficiency of manual review: the review process still cannot break free of manual work to achieve automated evaluation. In response to this problem, crowdsourced test report quality assessment frameworks have been proposed that model the quality of test reports automatically. They typically implement quality assessment by defining a series of quantifiable indicators that measure the characteristics or attributes desired in defect reports and requirements specifications. First, NLP techniques are employed to preprocess the crowdsourced test reports. The framework then defines a series of quantifiable indicators to measure the desired attributes of the test reports and determines a numerical value for each indicator from the textual content of each report. Finally, the numerical value of each indicator is converted into a nominal value (i.e., good or bad) by a step transformation function, and the nominal values of all indicators are aggregated to predict the quality of the test report.
Although the existing crowdsourced test report quality assessment frameworks can, to a certain extent, measure the quality of a test report from different aspects through quantifiable indicators, the assessment is still limited to the formal normativity of the report's description information and does not involve the specific defects in the report content. Therefore, such frameworks only evaluate the normativity of the test report's description information and lack an evaluation of the quality of the test results.
Disclosure of Invention
The purpose of the invention is as follows: the invention provides an automated evaluation method for mobile application crowdsourcing test reports and a computer storage medium, which remedy the lack of content-quality evaluation in existing crowdsourcing test report quality evaluation methods. The method evaluates worker performance from two aspects, the level of the discovered defects and the normativity of the test report, so that it can correctly measure the quality with which workers complete tasks, screen out workers with perfunctory attitudes or malicious behavior, motivate capable workers to complete tasks with high quality, and improve the overall performance of the crowdsourcing test platform.
The technical scheme is as follows: the invention discloses an automatic evaluation method for a crowdsourcing test report of mobile application, which comprises the following steps:
(1) inputting a test report set and historical credibility of workers, eliminating invalid test reports, and performing word segmentation and stop word processing on the remaining test report sets;
(2) clustering the test reports processed in the step (1) according to the found defects to form a plurality of types of defect test reports, setting the historical credibility of workers as the grade weight of the defects for weighting, and selecting the grade with the maximum proportion as the defect grade of the type of defect test reports;
(3) constructing a plurality of normative indexes and corresponding step type measurement functions, evaluating the test report, and converting the evaluation into a normative score of the test report;
(4) obtaining a final score of the test report according to the defect grade in the step (2) and the normative score in the step (3);
wherein steps (2) and (3) may be executed in either order.
Further, the method for eliminating the invalid test report in the step (1) comprises the following steps:
if the text length of the defect description information of a test report is less than or equal to 4, the test report is rejected; and if the defect description information of a test report matches the regular pattern ([A][P])|([N][O])|([N][D], the test report is rejected.
Further, the method for clustering according to the found defects in the step (2) comprises the following steps:
(1) calculating the TF-IDF value of the test report through a TF-IDF algorithm;
(2) taking the TF-IDF values of all test reports as the clustering data objects O_n = {x_1, x_2, ..., x_n} and clustering them, where n is the number of data objects.
Further, the clustering method is an MMDBK algorithm, and the specific steps are as follows:
(1) from the data objects O_n = {x_1, x_2, ..., x_n}, selecting the two objects that are farthest apart;
(2) finding out all objects with the distance to the clustering center smaller than the threshold value d through neighbor searching, adding the objects into the neighbor class of the center, and recalculating the center of the neighbor class;
(3) calculating a DBI clustering index;
(4) judging whether the DBI clustering index is at its minimum; if not, repeating step (2) and step (3); if so, stopping the loop and assigning the remaining data objects to their nearest classes;
(5) and outputting a clustering result.
Further, the normative indexes in the step (3) comprise: text length, readability, action words, object words, negative words, fuzzy words, and interface elements;
the text length index and the readability index correspond to a convex expansion measurement function, and the convex expansion measurement function is that when the index is in x2And x3In between, the index is good comment, when the index is in x1And x2Or at x3And x4When the index is less than x, the index is a middle score1Or greater than x4When the index is poor, x1、x2、x3、x4Setting parameters;
the action word index and the interface element index correspond to an increased expansion measurement function, the increased expansion measurement function is that when the index is less than or equal to 1, the index is poor, when the index is greater than 1 and less than or equal to 2, the index is medium, and when the index is greater than 2, the index is good;
the target word index and the negative word index correspond to convex measurement functions, the convex measurement functions are that when the index is less than or equal to 1 or greater than 2, the index is poor, and when the index is greater than 1 and less than or equal to 2, the index is good;
the fuzzy word index corresponds to a descending type expansion measurement function, the descending type expansion measurement function is that when the index is less than or equal to 1, the index is good evaluation, when the index is greater than 1 and less than or equal to 2, the index is medium evaluation, and when the index is greater than 2, the index is bad evaluation.
Further, the conversion method of the normative score is as follows:
all the test reports are sorted according to the numbers of good, medium and poor ratings; a report whose 7 indexes are all rated good receives the normative score max, a report whose 7 indexes are all rated poor receives the normative score min, and the normative score of the i-th test report, sorted from low to high, is (max - min) × i/32 + min.
Further, the method for calculating the final score in the step (4) comprises the following steps:
final score = 0.7 × defect grade score + 0.3 × normative score.
The computer storage medium of the present invention has a computer program stored thereon, and is characterized in that the computer program, when executed by a computer processor, implements the method of any one of claims 1 to 7.
Has the advantages that: the invention provides an automated evaluation method for mobile application crowdsourcing test reports based on a clustering algorithm and normativity metrics. It evaluates crowdsourcing test reports automatically from the two aspects of report content and report normativity: it is not limited to the format and normativity of the test report but also takes the specific defect levels in the report content into account, so that both the normativity and the quality of the test results in the report are evaluated comprehensively. This allows the crowdsourcing test platform to measure objectively, from multiple aspects, the quality with which crowdsourcing test workers complete test tasks. The whole evaluation process is computed automatically, so the evaluation no longer depends on manual work and the efficiency of reviewing test reports is greatly improved.
Drawings
FIG. 1 is an overall flowchart of the method in the present embodiment;
FIG. 2 is a diagram of four metric functions of the normative index;
FIG. 3 is a diagram illustrating four extended metric functions of the normative indexes.
Detailed Description
According to the specific implementation of the invention, the crowdsourcing test reports are first preprocessed. Then, starting from the two aspects of defect severity and test report normativity, on one hand the number of defect classes is determined by clustering the test reports with the MMDBK algorithm and the severity level of each class is evaluated according to the historical credibility weights of the workers; on the other hand, the normative score of each test report is calculated from the normativity metric indexes of the report description and the discrete metric functions. Finally, the score of each test report is calculated by combining the two aspects. The flow of the method of this embodiment is shown in FIG. 1; the mobile application crowdsourcing test report set TR and the worker historical credibility GU are input, and the method comprises the following steps:
Step 1, input the test report set and the workers' historical credibility, eliminate invalid test reports with the filter rules and rate them directly as 0 points, and perform word segmentation and stop-word removal on the remaining valid reports.
Owing to human laziness and profit-seeking, some workers reduce their workload in order to obtain more benefit. Therefore, the algorithm needs to screen out inferior or invalid test reports to ensure the quality of the test reports used in the later report analysis. Analysis of several test report data sets shows that some special test reports exist whose Bug description field contains one of the following 2 special kinds of statement: (1) short sentences: the Bug description field is empty or contains only a few words without any readable description, and only a few test steps are described; (2) pseudo-sentences: the Bug description field mainly states that the test case was executed successfully, no defect exists, and the test passed. Test reports containing either of the above are invalid test reports and do not contain any information about software defects. For example, if the Bug description field is empty or has length 1, the test report is meaningless; if the Bug description field is 'Bug not found', the test passed. Therefore, in order to improve the processing efficiency of the test report data set, the invalid test reports must be filtered out before processing.
Analysis and summarization show that, on one hand, test reports with too little description information can be filtered directly. On the other hand, the description information of test reports containing pseudo-sentences mostly consists of declarative statements; a random sample of 10% of them shows that the description information generally contains special character strings such as "test passed", "executed successfully", "not found", "no Bug" and the like. From this, the following two filter rules can be derived:
(1) if the text length of the Bug description is less than or equal to 4, the test report is filtered;
(2) if the Bug description matches the regular pattern ([A][P])|([N][O])|([N][D], the test report is filtered, wherein A is an action word, P is a positive word, N is a negative word, O is an object word, D is an action word and Q is a quantitative word; the specific words they contain are shown in Table 1.
TABLE 1 Regular sentence glossary (the table is provided as an image in the original publication and is not reproduced here)
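A minimal sketch of the two filter rules is given below, assuming a simple regular expression built in the spirit of ([A][P])|([N][O]); the word lists are illustrative placeholders, not the patent's actual Table 1 glossary:

import re

# Illustrative stand-ins for the Table 1 glossary; the patent's actual word lists are not reproduced here.
ACTION, POSITIVE, NEGATIVE, OBJECT = "执行|运行|测试", "成功|通过|正常", "未|没有|无", "问题|缺陷|bug"

PSEUDO_PATTERN = re.compile(
    f"(({ACTION}).*({POSITIVE}))|(({NEGATIVE}).*({OBJECT}))"   # rough analogue of ([A][P])|([N][O])
)

def is_invalid(bug_description: str) -> bool:
    """Rule 1: description too short; rule 2: reads like a 'test passed / no bug found' pseudo-sentence."""
    text = bug_description.strip()
    if len(text) <= 4:                                 # rule (1): text length <= 4
        return True
    return PSEUDO_PATTERN.search(text) is not None     # rule (2): regular-expression match

reports = ["未发现bug", "点击登录按钮后应用闪退，无法进入主页"]
valid_reports = [r for r in reports if not is_invalid(r)]   # invalid reports are rated 0 and removed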
After the invalid test reports are filtered out, the remaining test reports need further preprocessing. Because the test reports are written in Chinese natural language, they are processed with NLP techniques, mainly word segmentation and stop-word removal, where stop-word removal means deleting meaningless words from the reports. Given the difficulty of Chinese word segmentation, this processing is performed with the specialized Chinese NLP tool NLPIR, used through Python.
Step 2, compute the term frequency and inverse document frequency (TF-IDF) values of the valid report set with the TF-IDF algorithm, take them as the data objects for clustering, and cluster the contents of the test reports with the MMDBK (Max-Min and Davies-Bouldin Index based K-means) clustering algorithm, so that test reports in the same class report the same defect.
Because each test report corresponds to one discovered defect, the method clusters the test reports directly: the TF-IDF value of each preprocessed test report is calculated with the TF-IDF algorithm (TF-IDF denotes term frequency and inverse document frequency), the TF-IDF values of all test reports are taken as the clustering data objects O_n = {x_1, x_2, ..., x_n}, and the test reports are clustered with the MMDBK algorithm.
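As one possible realization, the TF-IDF vectors of the segmented reports can be computed with scikit-learn's TfidfVectorizer; this library choice is an assumption, since the text only specifies the TF-IDF algorithm itself:

from sklearn.feature_extraction.text import TfidfVectorizer

# Each report has already been word-segmented; joining tokens with spaces lets the vectorizer split on whitespace.
segmented_reports = [
    "点击 登录 按钮 应用 闪退",
    "登录 界面 图片 加载 失败",
]

vectorizer = TfidfVectorizer(token_pattern=r"(?u)\S+")        # keep every whitespace-separated token
tfidf_matrix = vectorizer.fit_transform(segmented_reports)    # shape: (number of reports, vocabulary size)

# Each row of the matrix is one clustering data object x_i of O_n = {x_1, ..., x_n}.
X = tfidf_matrix.toarray()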
The MMDBK algorithm is a clustering algorithm that improves on the shortcomings of the K-means algorithm, namely the determination of the number of clusters K and the selection of the K cluster centers: it uses the Davies-Bouldin Index (DBI) clustering indicator and the max-min distance method to determine the optimal number of clusters and to select new cluster centers, so that the similarity between clusters stays small. The implementation flow is as follows (a code sketch follows step 6 below):
step 1: from the n data objects O_n = {x_1, x_2, ..., x_n}, select the two objects x_1 and x_2 that are farthest apart. To avoid the fuzzy cluster boundaries that arise when the K-means algorithm selects cluster centers that are too close together, the two objects farthest apart are chosen to form the two initial clusters, which ensures low similarity between different classes in the subsequent computation.
step 2: through neighbor search, find all objects whose distance to a cluster center is less than the threshold d, add them to that center's neighbor class, and recompute the center of the neighbor class.
After the initial two cluster centers have been found and the class centers updated, the number of clusters must be determined and the remaining K - 2 cluster centers found while keeping the similarity among cluster centers as low as possible, which is the key of the algorithm. Specifically: let c_1 and c_2 be the two initial cluster centers, and for each remaining object compute its distances D_j1 and D_j2 to c_1 and c_2. If D_k = max{min(D_j1, D_j2)}, j = 1, 2, ..., n, and D_k > θ·D_12, where D_12 is the distance between c_1 and c_2, then take the corresponding x_j as the 3rd cluster center, c_3 = x_j. If c_3 exists, compute D_k = max{min(D_j1, D_j2, D_j3)}, j = 1, 2, ..., n; if D_k > θ·D_12, find the 4th cluster center, and so on, until D_k > θ·D_12 no longer holds, at which point the search for cluster centers ends.
step 3: compute and update the DBI value and compare it with that of the previous round (the first round is compared with the initial value); if the DBI value is less than or equal to the previous round's value, the loop condition is satisfied.
The Davies-Bouldin Index (DBI) clustering indicator is used to evaluate clustering results. It is a non-fuzzy cluster evaluation indicator based mainly on two factors, the degree of separation between classes and the degree of cohesion within classes: dissimilarity between different classes should be high and the similarity of data objects within the same class should be high. When the distances between data objects within a class are smaller and the distances between classes are larger, the DBI value is smaller and the clustering result for that number of clusters is better. The intra-class and inter-class distances are calculated as follows:
S_i = ( (1/|C_i|) · Σ_{x ∈ C_i} ||x - v_i||² )^(1/2)

d_{i,j} = ||v_i - v_j||
where x denotes a data object within the i-th class, v_i denotes the centroid of the i-th class, |C_i| denotes the number of data objects in the i-th class, ||·|| denotes the Euclidean distance, S_i is the standard deviation of the data objects in the i-th class from the centroid v_i, and d_{i,j} denotes the Euclidean distance between the centroids of the i-th and j-th classes. The Davies-Bouldin Index formula is:
DBI = (1/K) · Σ_{i=1}^{K} max_{j≠i} ( (S_i + S_j) / d_{i,j} )
wherein S_i denotes the intra-class distance of the i-th class, S_j denotes the intra-class distance of the j-th class, d_{i,j} denotes the distance between the i-th and j-th classes, and K denotes the number of clusters.
A good clustering result has small intra-class distances within each class and large distances between classes; when this condition is satisfied, that is, when the numerator is smaller and the denominator is larger, the DBI value is smaller, so the optimal number of clusters can be obtained from this value.
step 4: if the condition is satisfied, search for a new cluster center and repeat steps 2-3;
step 5: if the condition is not satisfied, stop the loop and assign the remaining data objects to their nearest classes;
step 6: output the clustering result.
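A compact sketch of the MMDBK flow of steps 1-6 is given below, assuming Euclidean distance on the TF-IDF vectors; the threshold d, the coefficient θ and the maximum number of clusters are tuning parameters that the text does not fix, so the values used here are placeholders:

import numpy as np
from sklearn.metrics import davies_bouldin_score   # DBI clustering indicator

def maxmin_new_center(X, centers, theta, d12):
    # Max-min distance rule: return the index of a new center candidate, or None if D_k <= theta * D_12.
    dists = np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2)
    nearest = dists.min(axis=1)                      # min distance of each object to the existing centers
    k = int(nearest.argmax())                        # D_k = max over objects of that min distance
    return k if nearest[k] > theta * d12 else None

def mmdbk(X, d=0.5, theta=0.5, max_k=10):
    # step 1: the two objects farthest apart form the two initial cluster centers
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    i, j = np.unravel_index(dists.argmax(), dists.shape)
    centers, d12 = [X[i], X[j]], dists[i, j]
    best_labels, best_dbi = None, np.inf
    while len(centers) <= max_k:
        # step 2: assign objects within threshold d to their nearest center, then recompute the centers
        cdist = np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2)
        labels, near = cdist.argmin(axis=1), cdist.min(axis=1) <= d
        centers = [X[(labels == c) & near].mean(axis=0) if ((labels == c) & near).any() else centers[c]
                   for c in range(len(centers))]
        # steps 3-5: keep the clustering with the smallest DBI; stop once the DBI no longer decreases
        labels = np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2).argmin(axis=1)
        dbi = davies_bouldin_score(X, labels) if len(set(labels)) > 1 else np.inf
        if dbi <= best_dbi:
            best_labels, best_dbi = labels, dbi
        else:
            break
        new = maxmin_new_center(X, centers, theta, d12)   # look for another center by the max-min rule
        if new is None:
            break
        centers.append(X[new])
    return best_labels    # step 6: every object ends up in its nearest class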
After clustering is complete, each class can be regarded as the set of test reports of one defect of the software under test, and each defect is evaluated according to the severity levels filled in by the workers in their test reports. Test reports from workers with high historical test credibility are given priority: the historical credibilities of the participating workers recorded by the platform are normalized and used as weight coefficients to compute the proportion coefficients of the slight, general, severe and fatal levels within the test report set of each defect, and the defect level with the highest proportion is selected as the final level of that defect.
The magnitude of a defect's impact is defined as its severity and is summarized in four levels:
(1) Slight: small defects, such as wrongly written characters or poor text layout, which have little influence on the functions, and the software can still be used normally.
(2) General: less serious errors, such as partially unimplemented secondary functions, a poor user interface, or long response times.
(3) Severe: a main function module is not implemented, a main function is partially lost, or a secondary function is completely lost.
(4) Fatal: refers to defects that can cause system crashes, data loss, complete loss of the main functions, and the like.
The whole test report data set is clustered into N defect classes, with test report sets Cla_1, Cla_2, ..., Cla_N. Each class contains test reports of 4 levels, and the set of the j-th level of the i-th class contains m_{i,j} test reports (j = 1, 2, 3, 4 for slight, general, severe and fatal respectively). Cla_{i,j,k} denotes the k-th test report of the j-th level in the i-th defect class, and the historical credibility of the worker who submitted it is U_{i,j,k}, which expresses the contribution degree of the submitter of that test report; |TR| is the number of all test reports, i.e. the corresponding number of workers. To normalize the contribution degrees, the mean of the historical credibilities of all users is computed as:
μ = (1/|TR|) · Σ U_{i,j,k}
the variance of the user historical contribution values is:
σ² = (1/|TR|) · Σ (U_{i,j,k} - μ)²
after normalization, the method comprises the following steps:
U'_{i,j,k} = (U_{i,j,k} - μ) / σ
For each defect class, the proportion coefficients of the slight, general, severe and fatal levels are computed respectively as:
B_{i,j} = ... (the exact expression is given as an image in the original publication; it combines the number m_{i,j} of level-j reports in class i, the normalized credibilities U'_{i,j,k}, and the coefficient δ)
Where δ is the test report coefficient, typically set to 0.8.
The level j for which B_{i,j} = max(B_{i,1}, B_{i,2}, B_{i,3}, B_{i,4}) is taken as the level of the i-th defect class. Score values are also set for each level: fatal, severe, general and slight defects score 10, 7.5, 5 and 2.5 points respectively.
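A sketch of this defect level scoring is given below. The exact form of the proportion coefficient B_{i,j} is only available as an image in the original publication, so the δ-weighted combination of vote share and normalized credibility used here is an assumption:

import numpy as np

LEVEL_SCORES = {"fatal": 10.0, "severe": 7.5, "general": 5.0, "slight": 2.5}

def defect_level_score(report_levels, report_credibilities, delta=0.8):
    """report_levels[k] is the level chosen by worker k; report_credibilities[k] is that worker's U_{i,j,k}."""
    u = np.asarray(report_credibilities, dtype=float)
    u_norm = (u - u.mean()) / (u.std() + 1e-12)        # z-score normalization of the historical credibility

    proportions = {}
    for level in LEVEL_SCORES:
        mask = np.array([lv == level for lv in report_levels])
        # Assumed form of B_{i,j}: delta-weighted vote share plus (1 - delta)-weighted credibility sum.
        proportions[level] = delta * mask.mean() + (1 - delta) * u_norm[mask].sum() if mask.any() else 0.0

    best = max(proportions, key=proportions.get)       # level with the largest proportion coefficient
    return best, LEVEL_SCORES[best]

level, score = defect_level_score(["severe", "severe", "general", "fatal"], [0.9, 0.7, 0.4, 0.95])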
Step 3, for each class of test reports, set the workers' historical credibility as the level weight of the test reports, determine the proportion of each level in combination with the number of reports of each level in the defect class, select the level with the largest proportion as the final level of that defect, and obtain the corresponding defect level score according to the score settings above.
Step 4, starting from the normativity of the report description, define 7 quantifiable normative indexes and corresponding step-type metric functions, convert the quantitative value of each index into a quality rating ("good", "medium" or "poor") through the metric functions, and obtain the corresponding normative score by linear interpolation according to the numbers of "good", "medium" and "poor" ratings over the 7 indexes.
The normativity of a test report reflects the ability and attitude of the crowdsourcing test worker in completing the task and is one of the factors for evaluating a worker's test report. To evaluate the normativity of test reports more accurately, the method constructs 7 quantifiable indexes to measure test report quality, mainly evaluating the defect description information or the test steps:
(1) text length: the text length refers to the number of Chinese characters contained in the defect description information in the test report, and the test report with the text length kept at a proper value is good in quality.
(2) Readability: measures the reading difficulty of the text; the metric formula is Y = 14.9596·X_1 + 39.07746·X_2 - 2.48·X_3, where X_1 denotes the ratio of difficult words, X_2 denotes the number of sentences, and X_3 denotes the average number of strokes per Chinese character.
(3) Action words: when describing defects, sequences of actions are often described in the testing step, and these actions are the key to the test worker triggering the interface or interface events of the software. Therefore, attention needs to be paid to action words in the test report, such as "open", "click", "exit", and the like.
(4) The object word: when a tester finds a Bug in software, they describe the behavior with words that can reflect system errors, such as "problem", "Bug", and so on.
(5) Negative words: when the tester finds the software defect, they will use negative words to describe the system function missing, such as "missing", "failed", etc.
(6) Fuzzy words: during testing, a tester, if encountering ambiguous or ambiguous defects, prefers to use a somewhat ambiguous vocabulary for description, which may make understanding of the test report difficult. Ambiguous words include "almost", "few", "possibly", "generally", etc.
(7) Interface elements: the mobile application software interface is composed of a plurality of interactive components, corresponding components need to be clicked, input, slid and the like during software testing, and interface elements such as buttons, sliders and the like are certainly included when action sequences are described in the testing step.
Considering that the length of the test steps depends on the complexity of the software and cannot be measured with a single suitable text length, the text length index quantifies the defect description information alone, while the remaining 6 indexes quantify the defect description information and the test steps together, on the preprocessed word-segmented text. Finally, a 7-dimensional index vector is generated for each test report, so each test report can be represented by a 7-dimensional vector; a sketch of extracting this vector is given below. The evaluated quality values are expressed as "good", "medium" and "poor", and metric functions are constructed to convert the continuous values into these discrete values.
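A sketch of building the 7-dimensional index vector from the segmented text follows; the word lists are illustrative placeholders standing in for the patent's glossaries, and the readability value is passed in precomputed from the formula quoted above:

# Placeholder vocabularies; the patent uses its own Chinese word lists for these indexes.
ACTION_WORDS   = {"打开", "点击", "退出"}
OBJECT_WORDS   = {"问题", "bug", "错误"}
NEGATIVE_WORDS = {"缺失", "失败", "没有"}
FUZZY_WORDS    = {"几乎", "可能", "大概"}
UI_ELEMENTS    = {"按钮", "滑块", "输入框"}

def readability(x1_difficult_ratio, x2_sentences, x3_avg_strokes):
    # Y = 14.9596*X1 + 39.07746*X2 - 2.48*X3, the readability formula quoted in the description
    return 14.9596 * x1_difficult_ratio + 39.07746 * x2_sentences - 2.48 * x3_avg_strokes

def index_vector(description, desc_tokens, step_tokens, readability_y):
    """Build the 7-dimensional normative index vector for one test report."""
    tokens = list(desc_tokens) + list(step_tokens)    # indexes 3-7 use description and test steps together
    count = lambda vocab: sum(t in vocab for t in tokens)
    return [
        len(description),         # 1 text length (defect description only)
        readability_y,            # 2 readability score Y
        count(ACTION_WORDS),      # 3 action words
        count(OBJECT_WORDS),      # 4 object words
        count(NEGATIVE_WORDS),    # 5 negative words
        count(FUZZY_WORDS),       # 6 fuzzy words
        count(UI_ELEMENTS),       # 7 interface elements
    ]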
The metric functions are divided into four categories, growth, descent, convex and concave, as shown in FIG. 2:
(1) Growth metric function: when the index is less than x_1, the index is rated poor; when the index is greater than x_1, the index is rated good.
(2) Descent metric function: when the index is less than x_1, the index is rated good; when the index is greater than x_1, the index is rated poor.
(3) Convex metric function: when the index is between x_1 and x_2, the index is rated good; when the index is less than x_1 or greater than x_2, the index is rated poor.
(4) Concave metric function: when the index is between x_1 and x_2, the index is rated poor; when the index is less than x_1 or greater than x_2, the index is rated good.
However, the above metric functions can only produce the two discrete values good and poor, so they are extended by adding the boundary parameters x_2, x_3 and x_4, allowing the extended metric functions to produce the three discrete values good, medium and poor; a schematic diagram is shown in FIG. 3.
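A sketch of the extended step metric functions is given below, each mapping a numeric index value to one of the ratings "good", "medium" or "poor" through boundary parameters; the exact boundary semantics (open or closed intervals) are an assumption, and the example uses the text-length parameters 9, 15, 23 and 32 reported in the experiments section:

def growth(v, x1, x2):
    # increasing: poor below x1, medium between x1 and x2, good above x2
    return "poor" if v <= x1 else ("medium" if v <= x2 else "good")

def descent(v, x1, x2):
    # decreasing: good below x1, medium between x1 and x2, poor above x2
    return "good" if v <= x1 else ("medium" if v <= x2 else "poor")

def convex(v, x1, x2, x3, x4):
    # good in the middle interval, medium in the two shoulders, poor outside
    if x2 < v <= x3:
        return "good"
    if x1 < v <= x2 or x3 < v <= x4:
        return "medium"
    return "poor"

def concave(v, x1, x2, x3, x4):
    # mirror image of the convex extended metric function
    if x2 < v <= x3:
        return "poor"
    if x1 < v <= x2 or x3 < v <= x4:
        return "medium"
    return "good"

rating = convex(18, 9, 15, 23, 32)   # text length 18 falls between x2 and x3 -> "good"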
In Table 2, the parameter intervals of the growth and descent metric functions are expressed in the form 0-a-b-∞, i.e. as three intervals; for the convex and concave metric functions, the parameter intervals are expressed in the form 0-a-b-c-d-∞, i.e. as five intervals.
TABLE 2 Metric functions of the evaluation indexes (the table is provided as an image in the original publication and is not reproduced here)
The most suitable text length in a test report is 15-30 characters, and text that is too long or too short affects the quality of the test report, so the text length (and likewise the readability) is evaluated with the convex extended metric function, with the specific parameters obtained by experimental tuning. The object word and negative word indexes are expressed by definite counts: if the text contains no object or negative words, or more than 2 of them, the test report is considered low quality, while if it contains only 1, it is considered high quality, following the prototype convex metric function. Action words and interface elements are scored with the growth extended metric function, i.e. the more such words are included, the better the quality of the test report on that index. Only the fuzzy word index uses the descent extended metric function: the more fuzzy words, the worse the quality of the test report.
With the parameters of the metric functions set, a "good", "medium" or "poor" rating can be obtained for each index of a test report. To aggregate the ratings of several indexes, the three ratings correspond to different score levels. Invalid test reports are all set to 0 points; the best test report, with 7 "good" ratings, scores 10; the worst valid test report, with 7 "poor" ratings, is set to a base of 1 point; the scores in between are distributed evenly. That is, the method of converting the 7 index ratings into a normative score is linear interpolation. Five of the indexes can be rated "good", "medium" or "poor" and two can only be rated "good" or "poor", so summarizing the combinations of the numbers of "good", "medium" and "poor" ratings over the 7 indexes gives 33 quality evaluation results in total. The scores of the 33 results are determined by linear interpolation: the highest result, 7 "good" ratings, is max (here 10); the lowest, 7 "poor" ratings, is min (here 1); and the score of the i-th result, ordered from low to high among all quality evaluations, is (max - min) × i/32 + min. The resulting quality scores used for test report normativity are shown in Table 3.
TABLE 3 Normative scores of the test reports (the table is provided as an image in the original publication and is not reproduced here)
Step 5, the final score of the test report is obtained by the weighted sum of the defect level score and the normative score, with a weight of 0.7 for the defect level score and 0.3 for the normative score.
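A sketch of the normative-score interpolation and the final weighted sum is given below; the rank of a rating combination among the 33 outcomes is passed in directly, since the exact enumeration order of the combinations is not spelled out in the text, and a 0-based rank (worst = 0, best = 32) is assumed:

def normative_score(rank_i, max_score=10.0, min_score=1.0):
    # (max - min) * i / 32 + min, as in the description; rank_i is the 0-based rank among the 33 results
    return (max_score - min_score) * rank_i / 32 + min_score

def final_score(defect_level_score, norm_score, a=0.7):
    # step 5: weighted sum with weight 0.7 on the defect level score and 0.3 on the normative score
    return a * defect_level_score + (1 - a) * norm_score

best = normative_score(32)          # 7 "good" ratings  -> 10.0
worst = normative_score(0)          # 7 "poor" ratings  -> 1.0
overall = final_score(7.5, best)    # e.g. a "severe" defect reported in a perfectly written report -> 8.25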
The pseudocode of the program implementing the method is as follows:
Input: test report set TR, worker historical credibility GU
CTRAEA(TR, GU)
1   for i in range(n)                        // preprocessing stage
2       if TR_i matches a filter rule
3           mTR_i = 0, delete TR_i           // an invalid report is scored 0 and removed
4   count the invalid reports                // this count is used to evaluate the accuracy of the filter rules
5   newTR = split(TR)                        // word segmentation and stop-word removal for all valid reports
6   CN = Cluster(newTR)                      // clustering to evaluate defect levels
7   for i in range(N)                        // traverse each defect class
8       for j in range(4)                    // traverse each level of the defect class
9           for k in range(m_i,j)            // traverse all test reports of this level
10              compute B_i,j                // proportion coefficient from the normalized worker credibility (formula given as an image in the original)
11  B_i = max(B_i,1, B_i,2, B_i,3, B_i,4)    // determine the defect level
12  DG_i = ratio(B_i)                        // determine the defect level score
13  for i in range(m)                        // test report normativity metrics
14      ZB_i = newTERQAF(newTR_i)
15      QG_i = search(ZB_i)
16  Rw_i = a*DG_i + (1-a)*QG_i               // weighted sum of defect level score and normative score; a is typically 0.7
17  output mTR_i and Rw_i
The implementation effect of the invention is verified as follows. The experimental data set was collected from the Kibug crowdsourced testing platform, which was founded in 2012 as a platform for distributing, collecting and analyzing crowdsourcing tasks. Four test tasks were collected, for the applications Drawing Music, Shopping, Internet Cloud and Podcasts.
This embodiment mainly extracts the columns with the attributes "device version", "network", "level", "Bug description" and "test step" as the keys for the study. By the time the 4 crowdsourcing test tasks ended, most defects had been detected; each submitted test report had been audited by a manager and marked with a "valid" or "invalid" label, and the annotators also recorded the number of defects of each mobile application. The annotation results are summarized in Table 4 below.
TABLE 4 Test report annotation results (reconstructed from the counts restated in the text below; the original table is provided only as an image)
Application       Collected reports    Invalid reports
Drawing Music     291                  61
Shopping          408                  193
Internet Cloud    238                  149
Podcasts          443                  238
Total             1380                 641
In total, 1380 test reports were collected: 291 for Drawing Music, 408 for Shopping, 238 for Internet Cloud and 443 for Podcasts. The numbers of invalid test reports in the four test report sets are 61, 193, 149 and 238 respectively.
In the preprocessing stage of this embodiment, all test reports are used as the input to the filter, which is run to separate valid and invalid test reports. The numbers of labeled invalid test reports are 61, 193, 149 and 238; after filtering with the two rules, 61, 189, 147 and 232 invalid test reports are correctly filtered out, a filtering accuracy of 97.48%-100%, which shows that the filter rules are effective. A small number of reports are also filtered by mistake: manual inspection shows that these are mainly valid test reports whose statements contain a negative word such as "no" and therefore match the regular expression and are filtered out.
After preprocessing is complete, the test reports are clustered and the level of each defect class is determined with the MMDBK algorithm; the accuracy of the defect level evaluation against the annotated results is compared in Table 5.
TABLE 5 Defect level evaluation results (the table is provided as an image in the original publication and is not reproduced here)
In Table 5, the accuracy of the defect level evaluation results is about 90%, with the highest reaching 93.65%. The results show that comprehensively evaluating the submitter capability and the defect level parameters after clustering yields higher accuracy.
For the text length index used in the quality assessment, the parameters x_1, x_2, x_3 and x_4 are determined by the control variable method: three of the parameters are fixed and the value of the remaining one is increased gradually, the precision of the prediction results is compared using the relative error as the evaluation indicator, and the optimal values of the 4 text length parameters are 9, 15, 23 and 32 respectively. The optimal readability parameters are likewise -5, -1, 6 and 12. Finally, scoring with the parameters obtained above gives a comprehensive score for the importance and normativity of each test report. The relative error of each test report score was computed and averaged to determine the evaluation error of this method on each application, as shown in Table 6 below.
TABLE 6 Final score accuracy results (the table is provided as an image in the original publication and is not reproduced here)
According to the experimental data, the average relative error between the scores given by the method of this embodiment and the annotated scores is 9.24%, and the average relative error for each of the 4 applications does not exceed 10%, which demonstrates the accuracy and efficiency of the method. The invention can be applied to the test report scoring mechanism of a crowdsourcing test platform and automatically evaluates the quality score of each test report, so that the platform can objectively measure, from multiple aspects, how well crowdsourcing test workers complete test tasks, reduce the expert cost required by the platform and increase commercial benefit. For example, when evaluating tests of mobile application software, the method improves evaluation efficiency, reduces evaluation cost, evaluates objectively from both report content and report normativity, and improves the accuracy and reliability of the evaluation results.
The embodiments of the present invention, if implemented in the form of software functional modules and sold or used as independent products, may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. The storage medium includes various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.
Accordingly, embodiments of the present invention also provide a computer storage medium having a computer program stored thereon. The computer program, when executed by a processor, may implement the aforementioned mobile application crowdsourced test report automated evaluation method. For example, the computer storage medium is a computer-readable storage medium.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (8)

1. An automated evaluation method for crowdsourcing test reports for mobile applications is characterized by comprising the following steps:
(1) inputting a test report set and historical credibility of workers, eliminating invalid test reports, and performing word segmentation and stop word processing on the remaining test report sets;
(2) clustering the test reports processed in the step (1) according to the found defects to form a plurality of types of defect test reports, setting the historical credibility of workers as the grade weight of the defects for weighting, and selecting the grade with the maximum proportion as the defect grade of the type of defect test reports;
(3) constructing a plurality of normative indexes and corresponding step type measurement functions, evaluating the test report, and converting the evaluation into a normative score of the test report;
(4) obtaining a final score of the test report according to the defect grade in the step (2) and the normative score in the step (3);
wherein steps (2) and (3) may be executed in either order.
2. The automated mobile application crowdsourcing test report evaluation method according to claim 1, wherein the method for eliminating invalid test reports in step (1) is as follows:
if the text length of the defect description information of a test report is less than or equal to 4, the test report is rejected; and if the defect description information of a test report matches the regular pattern ([A][P])|([N][O])|([N][D], the test report is rejected.
3. The automated mobile application crowdsourcing test report evaluation method according to claim 1, wherein the clustering process according to the detected defects in step (2) comprises the following steps:
(1) calculating the TF-IDF value of the test report through a TF-IDF algorithm;
(2) taking the TF-IDF values of all test reports as the clustering data objects O_n = {x_1, x_2, ..., x_n} and clustering them, where n is the number of data objects.
4. The automated mobile application crowdsourcing test report evaluation method of claim 3, wherein: the clustering method is an MMDBK algorithm, and comprises the following specific steps:
(1) from the data objects O_n = {x_1, x_2, ..., x_n}, selecting the two objects that are farthest apart;
(2) finding out all objects with the distance to the clustering center smaller than the threshold value d through neighbor searching, adding the objects into the neighbor class of the center, and recalculating the center of the neighbor class;
(3) calculating a DBI clustering index;
(4) judging whether the DBI clustering index is at its minimum; if not, repeating step (2) and step (3); if so, stopping the loop and assigning the remaining data objects to their nearest classes;
(5) and outputting a clustering result.
5. The automated mobile application crowdsourcing test report evaluation method of claim 1, wherein: the normative indexes in the step (3) comprise: text length, readability, action words, object words, negative words, fuzzy words, and interface elements;
the text length index and the readability index correspond to a convex extended metric function: when the index is between x_2 and x_3, the index is rated good; when the index is between x_1 and x_2 or between x_3 and x_4, the index is rated medium; when the index is less than x_1 or greater than x_4, the index is rated poor; x_1, x_2, x_3 and x_4 are preset parameters;
the action word index and the interface element index correspond to an increased expansion measurement function, the increased expansion measurement function is that when the index is less than or equal to 1, the index is poor, when the index is greater than 1 and less than or equal to 2, the index is medium, and when the index is greater than 2, the index is good;
the target word index and the negative word index correspond to convex measurement functions, the convex measurement functions are that when the index is less than or equal to 1 or greater than 2, the index is poor, and when the index is greater than 1 and less than or equal to 2, the index is good;
the fuzzy word index corresponds to a descending type expansion measurement function, the descending type expansion measurement function is that when the index is less than or equal to 1, the index is good evaluation, when the index is greater than 1 and less than or equal to 2, the index is medium evaluation, and when the index is greater than 2, the index is bad evaluation.
6. The automated mobile application crowdsourcing test report evaluation method according to claim 5, wherein the normative score is transformed by:
all the test reports are sorted according to the numbers of good, medium and poor ratings; a report whose 7 indexes are all rated good receives the normative score max, a report whose 7 indexes are all rated poor receives the normative score min, and the normative score of the i-th test report, sorted from low to high, is (max - min) × i/32 + min.
7. The automated mobile application crowdsourcing test report evaluation method according to claim 1, wherein the final score in step (4) is calculated by:
final score = 0.7 × defect grade score + 0.3 × normative score.
8. A computer storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a computer processor, implements the method of any one of claims 1 to 7.
CN201910957929.6A 2019-10-10 2019-10-10 Automated evaluation method for crowdsourcing test report of mobile application and computer storage medium Active CN110928764B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910957929.6A CN110928764B (en) 2019-10-10 2019-10-10 Automated evaluation method for crowdsourcing test report of mobile application and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910957929.6A CN110928764B (en) 2019-10-10 2019-10-10 Automated evaluation method for crowdsourcing test report of mobile application and computer storage medium

Publications (2)

Publication Number Publication Date
CN110928764A true CN110928764A (en) 2020-03-27
CN110928764B CN110928764B (en) 2023-08-11

Family

ID=69848814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910957929.6A Active CN110928764B (en) 2019-10-10 2019-10-10 Automated evaluation method for crowdsourcing test report of mobile application and computer storage medium

Country Status (1)

Country Link
CN (1) CN110928764B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784637A (en) * 2018-12-13 2019-05-21 华为终端有限公司 Method and apparatus applied to the analysis of processing platform data
CN111815167A (en) * 2020-07-09 2020-10-23 杭州师范大学 Automatic crowdsourcing test performance assessment method and device
CN112416780A (en) * 2020-11-25 2021-02-26 南京大学 Crowdsourcing test report processing and classifying method
CN112434518A (en) * 2020-11-30 2021-03-02 北京师范大学 Text report scoring method and system
CN112527611A (en) * 2020-09-24 2021-03-19 上海趣蕴网络科技有限公司 Product health degree assessment method and system
CN113743096A (en) * 2020-05-27 2021-12-03 南京大学 Crowdsourcing test report similarity detection method based on natural language processing
CN113780366A (en) * 2021-08-19 2021-12-10 杭州电子科技大学 Crowd-sourced test report clustering method based on AP (Access Point) neighbor propagation algorithm
US11386299B2 (en) 2018-11-16 2022-07-12 Yandex Europe Ag Method of completing a task
US11416773B2 (en) 2019-05-27 2022-08-16 Yandex Europe Ag Method and system for determining result for task executed in crowd-sourced environment
US11475387B2 (en) 2019-09-09 2022-10-18 Yandex Europe Ag Method and system for determining productivity rate of user in computer-implemented crowd-sourced environment
US11481650B2 (en) 2019-11-05 2022-10-25 Yandex Europe Ag Method and system for selecting label from plurality of labels for task in crowd-sourced environment
US11727329B2 (en) 2020-02-14 2023-08-15 Yandex Europe Ag Method and system for receiving label for digital task executed within crowd-sourced environment
US11727336B2 (en) 2019-04-15 2023-08-15 Yandex Europe Ag Method and system for determining result for task executed in crowd-sourced environment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804319A (en) * 2018-05-29 2018-11-13 西北工业大学 A kind of recommendation method for improving Top-k crowdsourcing test platform tasks
JP2018194810A (en) * 2017-05-15 2018-12-06 NAVER Corporation Device controlling method and electronic apparatus
CN109670727A (en) * 2018-12-30 2019-04-23 湖南网数科技有限公司 A kind of participle mark quality evaluation system and appraisal procedure based on crowdsourcing

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018194810A (en) * 2017-05-15 2018-12-06 NAVER Corporation Device controlling method and electronic apparatus
CN108804319A (en) * 2018-05-29 2018-11-13 西北工业大学 A kind of recommendation method for improving Top-k crowdsourcing test platform tasks
CN109670727A (en) * 2018-12-30 2019-04-23 湖南网数科技有限公司 A kind of participle mark quality evaluation system and appraisal procedure based on crowdsourcing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈信: "众包测试报告的挖掘与评估" (Mining and Evaluation of Crowdsourced Test Reports) *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11386299B2 (en) 2018-11-16 2022-07-12 Yandex Europe Ag Method of completing a task
CN109784637A (en) * 2018-12-13 2019-05-21 华为终端有限公司 Method and apparatus applied to the analysis of processing platform data
US11727336B2 (en) 2019-04-15 2023-08-15 Yandex Europe Ag Method and system for determining result for task executed in crowd-sourced environment
US11416773B2 (en) 2019-05-27 2022-08-16 Yandex Europe Ag Method and system for determining result for task executed in crowd-sourced environment
US11475387B2 (en) 2019-09-09 2022-10-18 Yandex Europe Ag Method and system for determining productivity rate of user in computer-implemented crowd-sourced environment
US11481650B2 (en) 2019-11-05 2022-10-25 Yandex Europe Ag Method and system for selecting label from plurality of labels for task in crowd-sourced environment
US11727329B2 (en) 2020-02-14 2023-08-15 Yandex Europe Ag Method and system for receiving label for digital task executed within crowd-sourced environment
CN113743096A (en) * 2020-05-27 2021-12-03 南京大学 Crowdsourcing test report similarity detection method based on natural language processing
CN111815167A (en) * 2020-07-09 2020-10-23 杭州师范大学 Automatic crowdsourcing test performance assessment method and device
CN112527611A (en) * 2020-09-24 2021-03-19 上海趣蕴网络科技有限公司 Product health degree assessment method and system
CN112416780B (en) * 2020-11-25 2022-03-25 南京大学 Crowdsourcing test report processing and classifying method
CN112416780A (en) * 2020-11-25 2021-02-26 南京大学 Crowdsourcing test report processing and classifying method
CN112434518A (en) * 2020-11-30 2021-03-02 北京师范大学 Text report scoring method and system
CN112434518B (en) * 2020-11-30 2023-08-15 北京师范大学 Text report scoring method and system
CN113780366A (en) * 2021-08-19 2021-12-10 杭州电子科技大学 Crowd-sourced test report clustering method based on AP (Access Point) neighbor propagation algorithm
CN113780366B (en) * 2021-08-19 2024-02-13 杭州电子科技大学 Crowd-sourced test report clustering method based on AP neighbor propagation algorithm

Also Published As

Publication number Publication date
CN110928764B (en) 2023-08-11

Similar Documents

Publication Publication Date Title
CN110928764A (en) Automated mobile application crowdsourcing test report evaluation method and computer storage medium
US11449673B2 (en) ESG-based company evaluation device and an operation method thereof
CN108073568B (en) Keyword extraction method and device
Hartson et al. Criteria for evaluating usability evaluation methods
Hartson et al. Criteria for evaluating usability evaluation methods
CN110188047B (en) Double-channel convolutional neural network-based repeated defect report detection method
CN111738589B (en) Big data item workload assessment method, device and equipment based on content recommendation
CN111090735B (en) Performance evaluation method of intelligent question-answering method based on knowledge graph
KR20190110084A (en) Esg based enterprise assessment device and operating method thereof
CN111343147A (en) Network attack detection device and method based on deep learning
CN110750978A (en) Emotional tendency analysis method and device, electronic equipment and storage medium
CN111309577B (en) Spark-oriented batch application execution time prediction model construction method
CN112989621A (en) Model performance evaluation method, device, equipment and storage medium
TW201416887A (en) Methods for sentimental analysis of news text
CN107480126B (en) Intelligent identification method for engineering material category
CN111654853B (en) Data analysis method based on user information
JP2008282111A (en) Similar document retrieval method, program and device
CN113962565A (en) Project scoring method and system based on big data and readable storage medium
CN114610882A (en) Abnormal equipment code detection method and system based on electric power short text classification
CN113792141A (en) Feature selection method based on covariance measurement factor
KR102155692B1 (en) Methods for performing sentiment analysis of messages in social network service based on part of speech feature and sentiment analysis apparatus for performing the same
CN113901203A (en) Text classification method and device, electronic equipment and storage medium
CN112732549A (en) Test program classification method based on cluster analysis
CN116932487B (en) Quantized data analysis method and system based on data paragraph division
CN112465009B (en) Method for positioning software crash fault position

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant