CN110928764A - Automated mobile application crowdsourcing test report evaluation method and computer storage medium - Google Patents

Automated mobile application crowdsourcing test report evaluation method and computer storage medium

Info

Publication number
CN110928764A
CN110928764A (application CN201910957929.6A; granted as CN110928764B)
Authority
CN
China
Prior art keywords
index
test report
test
defect
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910957929.6A
Other languages
Chinese (zh)
Other versions
CN110928764B (en)
Inventor
姚奕
刘语婵
刘佳洛
顾晓东
杨帆
陈文科
刘伟豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN201910957929.6A priority Critical patent/CN110928764B/en
Publication of CN110928764A publication Critical patent/CN110928764A/en
Application granted granted Critical
Publication of CN110928764B publication Critical patent/CN110928764B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • G06F11/3692 Test management for test results analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

The invention discloses an automated evaluation method for mobile application crowdsourcing test reports and a computer storage medium, wherein the method comprises the following steps: 1) inputting a test report set and the historical credibility of workers, eliminating invalid test reports, and performing word segmentation and stop-word removal on the remaining test reports; 2) clustering the test reports according to the defects they reveal to form several classes of defect test reports, and selecting the level with the largest proportion as the defect level of each class; 3) constructing several normative indexes and corresponding step-type metric functions, evaluating each test report against them, and converting the evaluation into a normative score; 4) obtaining the final score of each test report from the defect level and the normative score. The invention overcomes the lack of content-quality evaluation in existing crowdsourcing test report quality evaluation methods and improves the overall performance of crowdsourcing test platforms.

Description

Automated mobile application crowdsourcing test report evaluation method and computer storage medium
Technical Field
The present invention relates to automated evaluation methods and computer storage media, and more particularly, to automated evaluation methods and computer storage media for crowdsourcing test reports for mobile applications.
Background
Mobile application crowdsourcing testing distributes test tasks for mobile application software, which used to be executed by in-house employees, to anonymous network users over the Internet. Because the diversity and complementarity of the crowd improve defect discovery efficiency, crowdsourcing testing has received wide attention in industry, and many commercial crowdsourcing test platforms (such as Applause, Baidu MTC, Mooctest, Testin and the like) have appeared. Crowdsourcing testing helps task requesters reach a large number of freelance workers whose collective wisdom can solve practical problems. However, during testing, some malicious workers do not work seriously in order to maximize their own benefit, the quality of the submitted test results is low, and the test quality is difficult to guarantee, which may cause serious loss to task requesters. In response to this problem, many researchers have started from the test reports themselves. Some studies attempt to reduce the cost of manual review by reducing the number of test reports to be reviewed: they propose crowdsourced test report prioritization, using textual and screenshot information to rank the reports so that developers can inspect test reports revealing different defects as far as possible within limited resources and time. Other studies address the problem by fuzzy clustering of crowdsourced test reports, dividing the reports into clusters with an automated method so that developers need only review a representative test report in each cluster, greatly reducing the number of test reports reviewed.
However, these studies neglect the effect of test report quality on the efficiency of manual review: the review process still cannot break free of manual work to achieve automated evaluation. In response to this problem, crowdsourced test report quality assessment frameworks have been proposed that model the quality of test reports automatically. They typically implement quality assessment by defining a series of quantifiable indicators that measure the characteristics or attributes desired in defect reports and requirements specifications. First, NLP techniques are employed to preprocess the crowdsourced test reports. The framework then defines a series of quantifiable indicators to measure the desired attributes of the test reports and determines a numerical value for each indicator from the textual content of each report. Finally, the numerical value of each indicator is converted into a nominal value (i.e., good or bad) by a step transformation function, and the nominal values of all indicators are aggregated to predict the quality of the test report.
Although the existing crowdsourced test report quality assessment frameworks can, to a certain extent, measure the quality of a test report from different aspects through quantifiable indicators, the assessment is still limited to the formal normativity of the report's description information and does not involve the specific defects in the report content. Therefore, such frameworks only evaluate the normativity of the test report's description information and lack an evaluation of the quality of the test results.
Disclosure of Invention
The purpose of the invention is as follows: the invention provides an automated evaluation method for mobile application crowdsourcing test reports and a computer storage medium, which remedy the lack of content-quality evaluation in existing crowdsourcing test report quality evaluation methods. The method evaluates worker performance from two aspects, the level of the discovered defects and the normativity of the test report, so that it can correctly measure the quality with which workers complete tasks, screen out workers with perfunctory attitudes or malicious behavior, motivate capable workers to complete tasks with high quality, and improve the overall performance of the crowdsourcing test platform.
The technical scheme is as follows: the invention discloses an automatic evaluation method for a crowdsourcing test report of mobile application, which comprises the following steps:
(1) inputting a test report set and historical credibility of workers, eliminating invalid test reports, and performing word segmentation and stop word processing on the remaining test report sets;
(2) clustering the test reports processed in the step (1) according to the found defects to form a plurality of types of defect test reports, setting the historical credibility of workers as the grade weight of the defects for weighting, and selecting the grade with the maximum proportion as the defect grade of the type of defect test reports;
(3) constructing a plurality of normative indexes and corresponding step type measurement functions, evaluating the test report, and converting the evaluation into a normative score of the test report;
(4) obtaining a final score of the test report according to the defect grade in the step (2) and the normative score in the step (3);
wherein steps (2) and (3) may be executed in either order.
Further, the method for eliminating the invalid test report in the step (1) comprises the following steps:
if the text length of the defect description information of a test report is less than or equal to 4, the test report is rejected; and if the defect description information of a test report matches the regular pattern ([A][P])|([N][O])|([N][D], the test report is rejected.
Further, the method for clustering according to the found defects in the step (2) comprises the following steps:
(1) calculating the TF-IDF value of the test report through a TF-IDF algorithm;
(2) taking the TF-IDF values of all test reports as the clustering data objects O_n = {x_1, x_2, ..., x_n} and clustering them, where n is the number of data objects.
Further, the clustering method is an MMDBK algorithm, and the specific steps are as follows:
(1) from the data objects O_n = {x_1, x_2, ..., x_n}, selecting the two objects that are farthest apart;
(2) finding out all objects with the distance to the clustering center smaller than the threshold value d through neighbor searching, adding the objects into the neighbor class of the center, and recalculating the center of the neighbor class;
(3) calculating a DBI clustering index;
(4) judging whether the DBI clustering index is at its minimum; if not, repeating step (2) and step (3); if so, stopping the loop and assigning the remaining data objects to their nearest classes;
(5) and outputting a clustering result.
Further, the normative indexes in the step (3) comprise: text length, readability, action words, object words, negative words, fuzzy words, and interface elements;
the text length index and the readability index correspond to a convex expansion measurement function, and the convex expansion measurement function is that when the index is in x2And x3In between, the index is good comment, when the index is in x1And x2Or at x3And x4When the index is less than x, the index is a middle score1Or greater than x4When the index is poor, x1、x2、x3、x4Setting parameters;
the action word index and the interface element index correspond to an increased expansion measurement function, the increased expansion measurement function is that when the index is less than or equal to 1, the index is poor, when the index is greater than 1 and less than or equal to 2, the index is medium, and when the index is greater than 2, the index is good;
the target word index and the negative word index correspond to convex measurement functions, the convex measurement functions are that when the index is less than or equal to 1 or greater than 2, the index is poor, and when the index is greater than 1 and less than or equal to 2, the index is good;
the fuzzy word index corresponds to a descending type expansion measurement function, the descending type expansion measurement function is that when the index is less than or equal to 1, the index is good evaluation, when the index is greater than 1 and less than or equal to 2, the index is medium evaluation, and when the index is greater than 2, the index is bad evaluation.
Further, the conversion method of the normative score is as follows:
all the test reports are sorted according to the numbers of good, medium and poor ratings; a report whose 7 indexes are all rated good receives the normative score max, a report whose 7 indexes are all rated poor receives the normative score min, and the normative score of the i-th test report, sorted from low to high, is (max - min) × i/32 + min.
Further, the method for calculating the final score in the step (4) comprises the following steps:
final score = 0.7 × defect grade score + 0.3 × normative score.
The computer storage medium of the present invention has a computer program stored thereon, and is characterized in that the computer program, when executed by a computer processor, implements the method of any one of claims 1 to 7.
Has the advantages that: the invention provides an automated evaluation method for mobile application crowdsourcing test reports based on a clustering algorithm and normativity metrics. It evaluates crowdsourcing test reports automatically from the two aspects of report content and report normativity: it is not limited to the format and normativity of the test report but also takes the specific defect levels in the report content into account, so that both the normativity and the quality of the test results in the report are evaluated comprehensively. This allows the crowdsourcing test platform to measure objectively, from multiple aspects, the quality with which crowdsourcing test workers complete test tasks. The whole evaluation process is computed automatically, so the evaluation no longer depends on manual work and the efficiency of reviewing test reports is greatly improved.
Drawings
FIG. 1 is an overall flowchart of the method in the present embodiment;
FIG. 2 is a diagram of four metric functions of the normative index;
FIG. 3 is a diagram illustrating four extended metric functions of the normative indexes.
Detailed Description
According to the specific implementation of the invention, the crowdsourcing test reports are first preprocessed. Then, starting from the two aspects of defect severity and test report normativity, on one hand the number of defect classes is determined by clustering the test reports with the MMDBK algorithm and the severity level of each class is evaluated according to the historical credibility weights of the workers; on the other hand, the normative score of each test report is calculated from the normativity metric indexes of the report description and the discrete metric functions. Finally, the score of each test report is calculated by combining the two aspects. The flow of the method of this embodiment is shown in FIG. 1; the mobile application crowdsourcing test report set TR and the worker historical credibility GU are input, and the method comprises the following steps:
Step 1, input the test report set and the workers' historical credibility, eliminate invalid test reports with the filter rules and rate them directly as 0 points, and perform word segmentation and stop-word removal on the remaining valid reports.
Owing to human laziness and profit-seeking, some workers reduce their workload in order to obtain more benefit. Therefore, the algorithm needs to screen out inferior or invalid test reports to ensure the quality of the test reports used in the later report analysis. Analysis of several test report data sets shows that some special test reports exist whose Bug description field contains one of the following 2 special kinds of statement: (1) short sentences: the Bug description field is empty or contains only a few words without any readable description, and only a few test steps are described; (2) pseudo-sentences: the Bug description field mainly states that the test case was executed successfully, no defect exists, and the test passed. Test reports containing either of the above are invalid test reports and do not contain any information about software defects. For example, if the Bug description field is empty or has length 1, the test report is meaningless; if the Bug description field is 'Bug not found', the test passed. Therefore, in order to improve the processing efficiency of the test report data set, the invalid test reports must be filtered out before processing.
Analysis and summarization show that, on one hand, test reports with too little description information can be filtered directly. On the other hand, the description information of test reports containing pseudo-sentences mostly consists of declarative statements; a random sample of 10% of them shows that the description information generally contains special character strings such as "test passed", "executed successfully", "not found", "no Bug" and the like. From this, the following two filter rules can be derived:
(1) if the text length of the Bug description is less than or equal to 4, the test report is filtered;
(2) if the Bug description matches the regular pattern ([A][P])|([N][O])|([N][D], the test report is filtered, wherein A is an action word, P is a positive word, N is a negative word, O is an object word, D is an action word and Q is a quantitative word; the specific words they contain are shown in Table 1.
TABLE 1 Regular sentence glossary (the table is provided as an image in the original publication and is not reproduced here)
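A minimal sketch of the two filter rules is given below, assuming a simple regular expression built in the spirit of ([A][P])|([N][O]); the word lists are illustrative placeholders, not the patent's actual Table 1 glossary:

import re

# Illustrative stand-ins for the Table 1 glossary; the patent's actual word lists are not reproduced here.
ACTION, POSITIVE, NEGATIVE, OBJECT = "执行|运行|测试", "成功|通过|正常", "未|没有|无", "问题|缺陷|bug"

PSEUDO_PATTERN = re.compile(
    f"(({ACTION}).*({POSITIVE}))|(({NEGATIVE}).*({OBJECT}))"   # rough analogue of ([A][P])|([N][O])
)

def is_invalid(bug_description: str) -> bool:
    """Rule 1: description too short; rule 2: reads like a 'test passed / no bug found' pseudo-sentence."""
    text = bug_description.strip()
    if len(text) <= 4:                                 # rule (1): text length <= 4
        return True
    return PSEUDO_PATTERN.search(text) is not None     # rule (2): regular-expression match

reports = ["未发现bug", "点击登录按钮后应用闪退，无法进入主页"]
valid_reports = [r for r in reports if not is_invalid(r)]   # invalid reports are rated 0 and removed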
After the invalid test reports are filtered out, the remaining test reports need further preprocessing. Because the test reports are written in Chinese natural language, they are processed with NLP techniques, mainly word segmentation and stop-word removal, where stop-word removal means deleting meaningless words from the reports. Given the difficulty of Chinese word segmentation, this processing is performed with the specialized Chinese NLP tool NLPIR, used through Python.
Step 2, compute the term frequency and inverse document frequency (TF-IDF) values of the valid report set with the TF-IDF algorithm, take them as the data objects for clustering, and cluster the contents of the test reports with the MMDBK (Max-Min and Davies-Bouldin Index based K-means) clustering algorithm, so that test reports in the same class report the same defect.
Because each test report corresponds to one discovered defect, the method clusters the test reports directly: the TF-IDF value of each preprocessed test report is calculated with the TF-IDF algorithm (TF-IDF denotes term frequency and inverse document frequency), the TF-IDF values of all test reports are taken as the clustering data objects O_n = {x_1, x_2, ..., x_n}, and the test reports are clustered with the MMDBK algorithm.
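As one possible realization, the TF-IDF vectors of the segmented reports can be computed with scikit-learn's TfidfVectorizer; this library choice is an assumption, since the text only specifies the TF-IDF algorithm itself:

from sklearn.feature_extraction.text import TfidfVectorizer

# Each report has already been word-segmented; joining tokens with spaces lets the vectorizer split on whitespace.
segmented_reports = [
    "点击 登录 按钮 应用 闪退",
    "登录 界面 图片 加载 失败",
]

vectorizer = TfidfVectorizer(token_pattern=r"(?u)\S+")        # keep every whitespace-separated token
tfidf_matrix = vectorizer.fit_transform(segmented_reports)    # shape: (number of reports, vocabulary size)

# Each row of the matrix is one clustering data object x_i of O_n = {x_1, ..., x_n}.
X = tfidf_matrix.toarray()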
The MMDBK algorithm is a clustering algorithm that improves on the shortcomings of the K-means algorithm, namely the determination of the number of clusters K and the selection of the K cluster centers: it uses the Davies-Bouldin Index (DBI) clustering indicator and the max-min distance method to determine the optimal number of clusters and to select new cluster centers, so that the similarity between clusters stays small. The implementation flow is as follows (a code sketch follows step 6 below):
step 1: from the n data objects O_n = {x_1, x_2, ..., x_n}, select the two objects x_1 and x_2 that are farthest apart. To avoid the fuzzy cluster boundaries that arise when the K-means algorithm selects cluster centers that are too close together, the two objects farthest apart are chosen to form the two initial clusters, which ensures low similarity between different classes in the subsequent computation.
step 2: through neighbor search, find all objects whose distance to a cluster center is less than the threshold d, add them to that center's neighbor class, and recompute the center of the neighbor class.
After the initial two cluster centers have been found and the class centers updated, the number of clusters must be determined and the remaining K - 2 cluster centers found while keeping the similarity among cluster centers as low as possible, which is the key of the algorithm. Specifically: let c_1 and c_2 be the two initial cluster centers, and for each remaining object compute its distances D_j1 and D_j2 to c_1 and c_2. If D_k = max{min(D_j1, D_j2)}, j = 1, 2, ..., n, and D_k > θ·D_12, where D_12 is the distance between c_1 and c_2, then take the corresponding x_j as the 3rd cluster center, c_3 = x_j. If c_3 exists, compute D_k = max{min(D_j1, D_j2, D_j3)}, j = 1, 2, ..., n; if D_k > θ·D_12, find the 4th cluster center, and so on, until D_k > θ·D_12 no longer holds, at which point the search for cluster centers ends.
step 3: compute and update the DBI value and compare it with that of the previous round (the first round is compared with the initial value); if the DBI value is less than or equal to the previous round's value, the loop condition is satisfied.
The Davies-Bouldin Index (DBI) clustering indicator is used to evaluate clustering results. It is a non-fuzzy cluster evaluation indicator based mainly on two factors, the degree of separation between classes and the degree of cohesion within classes: dissimilarity between different classes should be high and the similarity of data objects within the same class should be high. When the distances between data objects within a class are smaller and the distances between classes are larger, the DBI value is smaller and the clustering result for that number of clusters is better. The intra-class and inter-class distances are calculated as follows:
S_i = ( (1/|C_i|) · Σ_{x ∈ C_i} ||x - v_i||² )^(1/2)

d_{i,j} = ||v_i - v_j||
where x denotes a data object within the i-th class, v_i denotes the centroid of the i-th class, |C_i| denotes the number of data objects in the i-th class, ||·|| denotes the Euclidean distance, S_i is the standard deviation of the data objects in the i-th class from the centroid v_i, and d_{i,j} denotes the Euclidean distance between the centroids of the i-th and j-th classes. The Davies-Bouldin Index formula is:
DBI = (1/K) · Σ_{i=1}^{K} max_{j≠i} ( (S_i + S_j) / d_{i,j} )
wherein S_i denotes the intra-class distance of the i-th class, S_j denotes the intra-class distance of the j-th class, d_{i,j} denotes the distance between the i-th and j-th classes, and K denotes the number of clusters.
A good clustering result has small intra-class distances within each class and large distances between classes; when this condition is satisfied, that is, when the numerator is smaller and the denominator is larger, the DBI value is smaller, so the optimal number of clusters can be obtained from this value.
step 4: if the condition is satisfied, search for a new cluster center and repeat steps 2-3;
step 5: if the condition is not satisfied, stop the loop and assign the remaining data objects to their nearest classes;
step 6: output the clustering result.
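A compact sketch of the MMDBK flow of steps 1-6 is given below, assuming Euclidean distance on the TF-IDF vectors; the threshold d, the coefficient θ and the maximum number of clusters are tuning parameters that the text does not fix, so the values used here are placeholders:

import numpy as np
from sklearn.metrics import davies_bouldin_score   # DBI clustering indicator

def maxmin_new_center(X, centers, theta, d12):
    # Max-min distance rule: return the index of a new center candidate, or None if D_k <= theta * D_12.
    dists = np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2)
    nearest = dists.min(axis=1)                      # min distance of each object to the existing centers
    k = int(nearest.argmax())                        # D_k = max over objects of that min distance
    return k if nearest[k] > theta * d12 else None

def mmdbk(X, d=0.5, theta=0.5, max_k=10):
    # step 1: the two objects farthest apart form the two initial cluster centers
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    i, j = np.unravel_index(dists.argmax(), dists.shape)
    centers, d12 = [X[i], X[j]], dists[i, j]
    best_labels, best_dbi = None, np.inf
    while len(centers) <= max_k:
        # step 2: assign objects within threshold d to their nearest center, then recompute the centers
        cdist = np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2)
        labels, near = cdist.argmin(axis=1), cdist.min(axis=1) <= d
        centers = [X[(labels == c) & near].mean(axis=0) if ((labels == c) & near).any() else centers[c]
                   for c in range(len(centers))]
        # steps 3-5: keep the clustering with the smallest DBI; stop once the DBI no longer decreases
        labels = np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2).argmin(axis=1)
        dbi = davies_bouldin_score(X, labels) if len(set(labels)) > 1 else np.inf
        if dbi <= best_dbi:
            best_labels, best_dbi = labels, dbi
        else:
            break
        new = maxmin_new_center(X, centers, theta, d12)   # look for another center by the max-min rule
        if new is None:
            break
        centers.append(X[new])
    return best_labels    # step 6: every object ends up in its nearest class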
After clustering is complete, each class can be regarded as the set of test reports of one defect of the software under test, and each defect is evaluated according to the severity levels filled in by the workers in their test reports. Test reports from workers with high historical test credibility are given priority: the historical credibilities of the participating workers recorded by the platform are normalized and used as weight coefficients to compute the proportion coefficients of the slight, general, severe and fatal levels within the test report set of each defect, and the defect level with the highest proportion is selected as the final level of that defect.
The magnitude of a defect's impact is defined as its severity and is summarized in four levels:
(1) Slight: small defects, such as wrongly written characters or poor text layout, which have little influence on the functions, and the software can still be used normally.
(2) General: less serious errors, such as partially unimplemented secondary functions, a poor user interface, or long response times.
(3) Severe: a main function module is not implemented, a main function is partially lost, or a secondary function is completely lost.
(4) Fatal: refers to defects that can cause system crashes, data loss, complete loss of the main functions, and the like.
The whole test report data set is clustered into N defect classes, with test report sets Cla_1, Cla_2, ..., Cla_N. Each class contains test reports of 4 levels, and the set of the j-th level of the i-th class contains m_{i,j} test reports (j = 1, 2, 3, 4 for slight, general, severe and fatal respectively). Cla_{i,j,k} denotes the k-th test report of the j-th level in the i-th defect class, and the historical credibility of the worker who submitted it is U_{i,j,k}, which expresses the contribution degree of the submitter of that test report; |TR| is the number of all test reports, i.e. the corresponding number of workers. To normalize the contribution degrees, the mean of the historical credibilities of all users is computed as:
μ = (1/|TR|) · Σ U_{i,j,k}
the variance of the user historical contribution values is:
σ² = (1/|TR|) · Σ (U_{i,j,k} - μ)²
after normalization, the method comprises the following steps:
U'_{i,j,k} = (U_{i,j,k} - μ) / σ
For each defect class, the proportion coefficients of the slight, general, severe and fatal levels are computed respectively as:
B_{i,j} = ... (the exact expression is given as an image in the original publication; it combines the number m_{i,j} of level-j reports in class i, the normalized credibilities U'_{i,j,k}, and the coefficient δ)
Where δ is the test report coefficient, typically set to 0.8.
The level j for which B_{i,j} = max(B_{i,1}, B_{i,2}, B_{i,3}, B_{i,4}) is taken as the level of the i-th defect class. Score values are also set for each level: fatal, severe, general and slight defects score 10, 7.5, 5 and 2.5 points respectively.
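A sketch of this defect level scoring is given below. The exact form of the proportion coefficient B_{i,j} is only available as an image in the original publication, so the δ-weighted combination of vote share and normalized credibility used here is an assumption:

import numpy as np

LEVEL_SCORES = {"fatal": 10.0, "severe": 7.5, "general": 5.0, "slight": 2.5}

def defect_level_score(report_levels, report_credibilities, delta=0.8):
    """report_levels[k] is the level chosen by worker k; report_credibilities[k] is that worker's U_{i,j,k}."""
    u = np.asarray(report_credibilities, dtype=float)
    u_norm = (u - u.mean()) / (u.std() + 1e-12)        # z-score normalization of the historical credibility

    proportions = {}
    for level in LEVEL_SCORES:
        mask = np.array([lv == level for lv in report_levels])
        # Assumed form of B_{i,j}: delta-weighted vote share plus (1 - delta)-weighted credibility sum.
        proportions[level] = delta * mask.mean() + (1 - delta) * u_norm[mask].sum() if mask.any() else 0.0

    best = max(proportions, key=proportions.get)       # level with the largest proportion coefficient
    return best, LEVEL_SCORES[best]

level, score = defect_level_score(["severe", "severe", "general", "fatal"], [0.9, 0.7, 0.4, 0.95])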
Step 3, for each class of test reports, set the workers' historical credibility as the level weight of the test reports, determine the proportion of each level in combination with the number of reports of each level in the defect class, select the level with the largest proportion as the final level of that defect, and obtain the corresponding defect level score according to the score settings above.
Step 4, starting from the normativity of the report description, define 7 quantifiable normative indexes and corresponding step-type metric functions, convert the quantitative value of each index into a quality rating ("good", "medium" or "poor") through the metric functions, and obtain the corresponding normative score by linear interpolation according to the numbers of "good", "medium" and "poor" ratings over the 7 indexes.
The normativity of a test report reflects the ability and attitude of the crowdsourcing test worker in completing the task and is one of the factors for evaluating a worker's test report. To evaluate the normativity of test reports more accurately, the method constructs 7 quantifiable indexes to measure test report quality, mainly evaluating the defect description information or the test steps:
(1) text length: the text length refers to the number of Chinese characters contained in the defect description information in the test report, and the test report with the text length kept at a proper value is good in quality.
(2) Readability: measures the reading difficulty of the text; the metric formula is Y = 14.9596·X_1 + 39.07746·X_2 - 2.48·X_3, where X_1 denotes the ratio of difficult words, X_2 denotes the number of sentences, and X_3 denotes the average number of strokes per Chinese character.
(3) Action words: when describing defects, sequences of actions are often described in the testing step, and these actions are the key to the test worker triggering the interface or interface events of the software. Therefore, attention needs to be paid to action words in the test report, such as "open", "click", "exit", and the like.
(4) The object word: when a tester finds a Bug in software, they describe the behavior with words that can reflect system errors, such as "problem", "Bug", and so on.
(5) Negative words: when the tester finds the software defect, they will use negative words to describe the system function missing, such as "missing", "failed", etc.
(6) Fuzzy words: during testing, a tester, if encountering ambiguous or ambiguous defects, prefers to use a somewhat ambiguous vocabulary for description, which may make understanding of the test report difficult. Ambiguous words include "almost", "few", "possibly", "generally", etc.
(7) Interface elements: the mobile application software interface is composed of a plurality of interactive components, corresponding components need to be clicked, input, slid and the like during software testing, and interface elements such as buttons, sliders and the like are certainly included when action sequences are described in the testing step.
Considering that the length of the test steps depends on the complexity of the software and cannot be measured with a single suitable text length, the text length index quantifies the defect description information alone, while the remaining 6 indexes quantify the defect description information and the test steps together, on the preprocessed word-segmented text. Finally, a 7-dimensional index vector is generated for each test report, so each test report can be represented by a 7-dimensional vector; a sketch of extracting this vector is given below. The evaluated quality values are expressed as "good", "medium" and "poor", and metric functions are constructed to convert the continuous values into these discrete values.
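A sketch of building the 7-dimensional index vector from the segmented text follows; the word lists are illustrative placeholders standing in for the patent's glossaries, and the readability value is passed in precomputed from the formula quoted above:

# Placeholder vocabularies; the patent uses its own Chinese word lists for these indexes.
ACTION_WORDS   = {"打开", "点击", "退出"}
OBJECT_WORDS   = {"问题", "bug", "错误"}
NEGATIVE_WORDS = {"缺失", "失败", "没有"}
FUZZY_WORDS    = {"几乎", "可能", "大概"}
UI_ELEMENTS    = {"按钮", "滑块", "输入框"}

def readability(x1_difficult_ratio, x2_sentences, x3_avg_strokes):
    # Y = 14.9596*X1 + 39.07746*X2 - 2.48*X3, the readability formula quoted in the description
    return 14.9596 * x1_difficult_ratio + 39.07746 * x2_sentences - 2.48 * x3_avg_strokes

def index_vector(description, desc_tokens, step_tokens, readability_y):
    """Build the 7-dimensional normative index vector for one test report."""
    tokens = list(desc_tokens) + list(step_tokens)    # indexes 3-7 use description and test steps together
    count = lambda vocab: sum(t in vocab for t in tokens)
    return [
        len(description),         # 1 text length (defect description only)
        readability_y,            # 2 readability score Y
        count(ACTION_WORDS),      # 3 action words
        count(OBJECT_WORDS),      # 4 object words
        count(NEGATIVE_WORDS),    # 5 negative words
        count(FUZZY_WORDS),       # 6 fuzzy words
        count(UI_ELEMENTS),       # 7 interface elements
    ]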
The metric functions are divided into four categories, growth, descent, convex and concave, as shown in FIG. 2:
(1) Growth metric function: when the index is less than x_1, the index is rated poor; when the index is greater than x_1, the index is rated good.
(2) Descent metric function: when the index is less than x_1, the index is rated good; when the index is greater than x_1, the index is rated poor.
(3) Convex metric function: when the index is between x_1 and x_2, the index is rated good; when the index is less than x_1 or greater than x_2, the index is rated poor.
(4) Concave metric function: when the index is between x_1 and x_2, the index is rated poor; when the index is less than x_1 or greater than x_2, the index is rated good.
However, the above metric functions can only produce the two discrete values good and poor, so they are extended by adding the boundary parameters x_2, x_3 and x_4, allowing the extended metric functions to produce the three discrete values good, medium and poor; a schematic diagram is shown in FIG. 3.
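A sketch of the extended step metric functions is given below, each mapping a numeric index value to one of the ratings "good", "medium" or "poor" through boundary parameters; the exact boundary semantics (open or closed intervals) are an assumption, and the example uses the text-length parameters 9, 15, 23 and 32 reported in the experiments section:

def growth(v, x1, x2):
    # increasing: poor below x1, medium between x1 and x2, good above x2
    return "poor" if v <= x1 else ("medium" if v <= x2 else "good")

def descent(v, x1, x2):
    # decreasing: good below x1, medium between x1 and x2, poor above x2
    return "good" if v <= x1 else ("medium" if v <= x2 else "poor")

def convex(v, x1, x2, x3, x4):
    # good in the middle interval, medium in the two shoulders, poor outside
    if x2 < v <= x3:
        return "good"
    if x1 < v <= x2 or x3 < v <= x4:
        return "medium"
    return "poor"

def concave(v, x1, x2, x3, x4):
    # mirror image of the convex extended metric function
    if x2 < v <= x3:
        return "poor"
    if x1 < v <= x2 or x3 < v <= x4:
        return "medium"
    return "good"

rating = convex(18, 9, 15, 23, 32)   # text length 18 falls between x2 and x3 -> "good"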
In Table 2, the parameter intervals of the growth and descent metric functions are expressed in the form 0-a-b-∞, i.e. as three intervals; for the convex and concave metric functions, the parameter intervals are expressed in the form 0-a-b-c-d-∞, i.e. as five intervals.
TABLE 2 Metric functions of the evaluation indexes (the table is provided as an image in the original publication and is not reproduced here)
The most suitable text length in a test report is 15-30 characters, and text that is too long or too short affects the quality of the test report, so the text length (and likewise the readability) is evaluated with the convex extended metric function, with the specific parameters obtained by experimental tuning. The object word and negative word indexes are expressed by definite counts: if the text contains no object or negative words, or more than 2 of them, the test report is considered low quality, while if it contains only 1, it is considered high quality, following the prototype convex metric function. Action words and interface elements are scored with the growth extended metric function, i.e. the more such words are included, the better the quality of the test report on that index. Only the fuzzy word index uses the descent extended metric function: the more fuzzy words, the worse the quality of the test report.
With the parameters of the metric functions set, a "good", "medium" or "poor" rating can be obtained for each index of a test report. To aggregate the ratings of several indexes, the three ratings correspond to different score levels. Invalid test reports are all set to 0 points; the best test report, with 7 "good" ratings, scores 10; the worst valid test report, with 7 "poor" ratings, is set to a base of 1 point; the scores in between are distributed evenly. That is, the method of converting the 7 index ratings into a normative score is linear interpolation. Five of the indexes can be rated "good", "medium" or "poor" and two can only be rated "good" or "poor", so summarizing the combinations of the numbers of "good", "medium" and "poor" ratings over the 7 indexes gives 33 quality evaluation results in total. The scores of the 33 results are determined by linear interpolation: the highest result, 7 "good" ratings, is max (here 10); the lowest, 7 "poor" ratings, is min (here 1); and the score of the i-th result, ordered from low to high among all quality evaluations, is (max - min) × i/32 + min. The resulting quality scores used for test report normativity are shown in Table 3.
TABLE 3 Normative scores of the test reports (the table is provided as an image in the original publication and is not reproduced here)
Step 5, the final score of the test report is obtained by the weighted sum of the defect level score and the normative score, with a weight of 0.7 for the defect level score and 0.3 for the normative score.
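A sketch of the normative-score interpolation and the final weighted sum is given below; the rank of a rating combination among the 33 outcomes is passed in directly, since the exact enumeration order of the combinations is not spelled out in the text, and a 0-based rank (worst = 0, best = 32) is assumed:

def normative_score(rank_i, max_score=10.0, min_score=1.0):
    # (max - min) * i / 32 + min, as in the description; rank_i is the 0-based rank among the 33 results
    return (max_score - min_score) * rank_i / 32 + min_score

def final_score(defect_level_score, norm_score, a=0.7):
    # step 5: weighted sum with weight 0.7 on the defect level score and 0.3 on the normative score
    return a * defect_level_score + (1 - a) * norm_score

best = normative_score(32)          # 7 "good" ratings  -> 10.0
worst = normative_score(0)          # 7 "poor" ratings  -> 1.0
overall = final_score(7.5, best)    # e.g. a "severe" defect reported in a perfectly written report -> 8.25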
The pseudocode of the program implementing the method is as follows:
Input: test report set TR, worker historical credibility GU
CTRAEA(TR, GU)
1   for i in range(n)                        // preprocessing stage
2       if TR_i matches a filter rule
3           mTR_i = 0, delete TR_i           // an invalid report is scored 0 and removed
4   count the invalid reports                // this count is used to evaluate the accuracy of the filter rules
5   newTR = split(TR)                        // word segmentation and stop-word removal for all valid reports
6   CN = Cluster(newTR)                      // clustering to evaluate defect levels
7   for i in range(N)                        // traverse each defect class
8       for j in range(4)                    // traverse each level of the defect class
9           for k in range(m_i,j)            // traverse all test reports of this level
10              compute B_i,j                // proportion coefficient from the normalized worker credibility (formula given as an image in the original)
11  B_i = max(B_i,1, B_i,2, B_i,3, B_i,4)    // determine the defect level
12  DG_i = ratio(B_i)                        // determine the defect level score
13  for i in range(m)                        // test report normativity metrics
14      ZB_i = newTERQAF(newTR_i)
15      QG_i = search(ZB_i)
16  Rw_i = a*DG_i + (1-a)*QG_i               // weighted sum of defect level score and normative score; a is typically 0.7
17  output mTR_i and Rw_i
The implementation effect of the invention is verified as follows. The experimental data set was collected from the Kibug crowdsourced testing platform, which was founded in 2012 as a platform for distributing, collecting and analyzing crowdsourcing tasks. Four test tasks were collected, for the applications Drawing Music, Shopping, Internet Cloud and Podcasts.
This embodiment mainly extracts the columns with the attributes "device version", "network", "level", "Bug description" and "test step" as the keys for the study. By the time the 4 crowdsourcing test tasks ended, most defects had been detected; each submitted test report had been audited by a manager and marked with a "valid" or "invalid" label, and the annotators also recorded the number of defects of each mobile application. The annotation results are summarized in Table 4 below.
TABLE 4 Test report annotation results (reconstructed from the counts restated in the text below; the original table is provided only as an image)
Application       Collected reports    Invalid reports
Drawing Music     291                  61
Shopping          408                  193
Internet Cloud    238                  149
Podcasts          443                  238
Total             1380                 641
In total, 1380 test reports were collected: 291 for Drawing Music, 408 for Shopping, 238 for Internet Cloud and 443 for Podcasts. The numbers of invalid test reports in the four test report sets are 61, 193, 149 and 238 respectively.
In the preprocessing stage of this embodiment, all test reports are used as the input to the filter, which is run to separate valid and invalid test reports. The numbers of labeled invalid test reports are 61, 193, 149 and 238; after filtering with the two rules, 61, 189, 147 and 232 invalid test reports are correctly filtered out, a filtering accuracy of 97.48%-100%, which shows that the filter rules are effective. A small number of reports are also filtered by mistake: manual inspection shows that these are mainly valid test reports whose statements contain a negative word such as "no" and therefore match the regular expression and are filtered out.
After preprocessing is complete, the test reports are clustered and the level of each defect class is determined with the MMDBK algorithm; the accuracy of the defect level evaluation against the annotated results is compared in Table 5.
TABLE 5 Defect level evaluation results (the table is provided as an image in the original publication and is not reproduced here)
In Table 5, the accuracy of the defect level evaluation results is about 90%, with the highest reaching 93.65%. The results show that comprehensively evaluating the submitter capability and the defect level parameters after clustering yields higher accuracy.
For the text length index used in the quality assessment, the parameters x_1, x_2, x_3 and x_4 are determined by the control variable method: three of the parameters are fixed and the value of the remaining one is increased gradually, the precision of the prediction results is compared using the relative error as the evaluation indicator, and the optimal values of the 4 text length parameters are 9, 15, 23 and 32 respectively. The optimal readability parameters are likewise -5, -1, 6 and 12. Finally, scoring with the parameters obtained above gives a comprehensive score for the importance and normativity of each test report. The relative error of each test report score was computed and averaged to determine the evaluation error of this method on each application, as shown in Table 6 below.
TABLE 6 Final score accuracy results (the table is provided as an image in the original publication and is not reproduced here)
According to the experimental data, the average relative error between the scores given by the method of this embodiment and the annotated scores is 9.24%, and the average relative error for each of the 4 applications does not exceed 10%, which demonstrates the accuracy and efficiency of the method. The invention can be applied to the test report scoring mechanism of a crowdsourcing test platform and automatically evaluates the quality score of each test report, so that the platform can objectively measure, from multiple aspects, how well crowdsourcing test workers complete test tasks, reduce the expert cost required by the platform and increase commercial benefit. For example, when evaluating tests of mobile application software, the method improves evaluation efficiency, reduces evaluation cost, evaluates objectively from both report content and report normativity, and improves the accuracy and reliability of the evaluation results.
The embodiments of the present invention, if implemented in the form of software functional modules and sold or used as independent products, may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. The storage medium includes various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.
Accordingly, embodiments of the present invention also provide a computer storage medium having a computer program stored thereon. The computer program, when executed by a processor, may implement the aforementioned mobile application crowdsourced test report automated evaluation method. For example, the computer storage medium is a computer-readable storage medium.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (8)

1. An automated evaluation method for crowdsourcing test reports for mobile applications is characterized by comprising the following steps:
(1) inputting a test report set and historical credibility of workers, eliminating invalid test reports, and performing word segmentation and stop word processing on the remaining test report sets;
(2) clustering the test reports processed in the step (1) according to the found defects to form a plurality of types of defect test reports, setting the historical credibility of workers as the grade weight of the defects for weighting, and selecting the grade with the maximum proportion as the defect grade of the type of defect test reports;
(3) constructing a plurality of normative indexes and corresponding step type measurement functions, evaluating the test report, and converting the evaluation into a normative score of the test report;
(4) obtaining a final score of the test report according to the defect grade in the step (2) and the normative score in the step (3);
wherein steps (2) and (3) may be executed in either order.
2. The automated mobile application crowdsourcing test report evaluation method according to claim 1, wherein the method for eliminating invalid test reports in step (1) is as follows:
if the text length of the defect description information of a test report is less than or equal to 4, the test report is rejected; and if the defect description information of a test report matches the regular pattern ([A][P])|([N][O])|([N][D], the test report is rejected.
3. The automated mobile application crowdsourcing test report evaluation method according to claim 1, wherein the clustering process according to the detected defects in step (2) comprises the following steps:
(1) calculating the TF-IDF value of the test report through a TF-IDF algorithm;
(2) taking the TF-IDF values of all test reports as the clustering data objects O_n = {x_1, x_2, ..., x_n} and clustering them, where n is the number of data objects.
4. The automated mobile application crowdsourcing test report evaluation method of claim 3, wherein: the clustering method is an MMDBK algorithm, and comprises the following specific steps:
(1) from the data objects O_n = {x_1, x_2, ..., x_n}, selecting the two objects that are farthest apart;
(2) finding out all objects with the distance to the clustering center smaller than the threshold value d through neighbor searching, adding the objects into the neighbor class of the center, and recalculating the center of the neighbor class;
(3) calculating a DBI clustering index;
(4) judging whether the DBI clustering index is at its minimum; if not, repeating step (2) and step (3); if so, stopping the loop and assigning the remaining data objects to their nearest classes;
(5) and outputting a clustering result.
5. The automated mobile application crowdsourcing test report evaluation method of claim 1, wherein: the normative indexes in the step (3) comprise: text length, readability, action words, object words, negative words, fuzzy words, and interface elements;
the text length index and the readability index correspond to a convex extended metric function: when the index is between x_2 and x_3, the index is rated good; when the index is between x_1 and x_2 or between x_3 and x_4, the index is rated medium; when the index is less than x_1 or greater than x_4, the index is rated poor; x_1, x_2, x_3 and x_4 are preset parameters;
the action word index and the interface element index correspond to an increased expansion measurement function, the increased expansion measurement function is that when the index is less than or equal to 1, the index is poor, when the index is greater than 1 and less than or equal to 2, the index is medium, and when the index is greater than 2, the index is good;
the target word index and the negative word index correspond to convex measurement functions, the convex measurement functions are that when the index is less than or equal to 1 or greater than 2, the index is poor, and when the index is greater than 1 and less than or equal to 2, the index is good;
the fuzzy word index corresponds to a descending type expansion measurement function, the descending type expansion measurement function is that when the index is less than or equal to 1, the index is good evaluation, when the index is greater than 1 and less than or equal to 2, the index is medium evaluation, and when the index is greater than 2, the index is bad evaluation.
6. The automated mobile application crowdsourcing test report evaluation method according to claim 5, wherein the normative score is transformed by:
all the test reports are sorted according to the numbers of good, medium and poor ratings; a report whose 7 indexes are all rated good receives the normative score max, a report whose 7 indexes are all rated poor receives the normative score min, and the normative score of the i-th test report, sorted from low to high, is (max - min) × i/32 + min.
7. The automated mobile application crowdsourcing test report evaluation method according to claim 1, wherein the final score in step (4) is calculated by:
final score = 0.7 × defect grade score + 0.3 × normative score.
8. A computer storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a computer processor, implements the method of any one of claims 1 to 7.
CN201910957929.6A 2019-10-10 2019-10-10 Automated evaluation method for crowdsourcing test report of mobile application and computer storage medium Active CN110928764B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910957929.6A CN110928764B (en) 2019-10-10 2019-10-10 Automated evaluation method for crowdsourcing test report of mobile application and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910957929.6A CN110928764B (en) 2019-10-10 2019-10-10 Automated evaluation method for crowdsourcing test report of mobile application and computer storage medium

Publications (2)

Publication Number Publication Date
CN110928764A true CN110928764A (en) 2020-03-27
CN110928764B CN110928764B (en) 2023-08-11

Family

ID=69848814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910957929.6A Active CN110928764B (en) 2019-10-10 2019-10-10 Automated evaluation method for crowdsourcing test report of mobile application and computer storage medium

Country Status (1)

Country Link
CN (1) CN110928764B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784637A (en) * 2018-12-13 2019-05-21 华为终端有限公司 Method and apparatus applied to the analysis of processing platform data
CN111815167A (en) * 2020-07-09 2020-10-23 杭州师范大学 Automatic crowdsourcing test performance assessment method and device
CN112416780A (en) * 2020-11-25 2021-02-26 南京大学 Crowdsourcing test report processing and classifying method
CN112434518A (en) * 2020-11-30 2021-03-02 北京师范大学 Text report scoring method and system
CN112527611A (en) * 2020-09-24 2021-03-19 上海趣蕴网络科技有限公司 Product health degree assessment method and system
CN113743096A (en) * 2020-05-27 2021-12-03 南京大学 Crowdsourcing test report similarity detection method based on natural language processing
CN113780366A (en) * 2021-08-19 2021-12-10 杭州电子科技大学 Crowd-sourced test report clustering method based on AP (Access Point) neighbor propagation algorithm
US11386299B2 (en) 2018-11-16 2022-07-12 Yandex Europe Ag Method of completing a task
US11416773B2 (en) 2019-05-27 2022-08-16 Yandex Europe Ag Method and system for determining result for task executed in crowd-sourced environment
US11475387B2 (en) 2019-09-09 2022-10-18 Yandex Europe Ag Method and system for determining productivity rate of user in computer-implemented crowd-sourced environment
US11481650B2 (en) 2019-11-05 2022-10-25 Yandex Europe Ag Method and system for selecting label from plurality of labels for task in crowd-sourced environment
US11727329B2 (en) 2020-02-14 2023-08-15 Yandex Europe Ag Method and system for receiving label for digital task executed within crowd-sourced environment
US11727336B2 (en) 2019-04-15 2023-08-15 Yandex Europe Ag Method and system for determining result for task executed in crowd-sourced environment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804319A (en) * 2018-05-29 2018-11-13 西北工业大学 A kind of recommendation method for improving Top-k crowdsourcing test platform tasks
JP2018194810A (en) * 2017-05-15 2018-12-06 NAVER Corporation Device controlling method and electronic apparatus
CN109670727A (en) * 2018-12-30 2019-04-23 湖南网数科技有限公司 A kind of participle mark quality evaluation system and appraisal procedure based on crowdsourcing

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018194810A (en) * 2017-05-15 2018-12-06 NAVER Corporation Device controlling method and electronic apparatus
CN108804319A (en) * 2018-05-29 2018-11-13 西北工业大学 A kind of recommendation method for improving Top-k crowdsourcing test platform tasks
CN109670727A (en) * 2018-12-30 2019-04-23 湖南网数科技有限公司 A kind of participle mark quality evaluation system and appraisal procedure based on crowdsourcing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈信: "众包测试报告的挖掘与评估" (Mining and Evaluation of Crowdsourced Test Reports) *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11386299B2 (en) 2018-11-16 2022-07-12 Yandex Europe Ag Method of completing a task
CN109784637A (en) * 2018-12-13 2019-05-21 华为终端有限公司 Method and apparatus applied to the analysis of processing platform data
US11727336B2 (en) 2019-04-15 2023-08-15 Yandex Europe Ag Method and system for determining result for task executed in crowd-sourced environment
US11416773B2 (en) 2019-05-27 2022-08-16 Yandex Europe Ag Method and system for determining result for task executed in crowd-sourced environment
US11475387B2 (en) 2019-09-09 2022-10-18 Yandex Europe Ag Method and system for determining productivity rate of user in computer-implemented crowd-sourced environment
US11481650B2 (en) 2019-11-05 2022-10-25 Yandex Europe Ag Method and system for selecting label from plurality of labels for task in crowd-sourced environment
US11727329B2 (en) 2020-02-14 2023-08-15 Yandex Europe Ag Method and system for receiving label for digital task executed within crowd-sourced environment
CN113743096A (en) * 2020-05-27 2021-12-03 南京大学 Crowdsourcing test report similarity detection method based on natural language processing
CN111815167A (en) * 2020-07-09 2020-10-23 杭州师范大学 Automatic crowdsourcing test performance assessment method and device
CN112527611A (en) * 2020-09-24 2021-03-19 上海趣蕴网络科技有限公司 Product health degree assessment method and system
CN112416780B (en) * 2020-11-25 2022-03-25 南京大学 Crowdsourcing test report processing and classifying method
CN112416780A (en) * 2020-11-25 2021-02-26 南京大学 Crowdsourcing test report processing and classifying method
CN112434518A (en) * 2020-11-30 2021-03-02 北京师范大学 Text report scoring method and system
CN112434518B (en) * 2020-11-30 2023-08-15 北京师范大学 Text report scoring method and system
CN113780366A (en) * 2021-08-19 2021-12-10 杭州电子科技大学 Crowd-sourced test report clustering method based on AP (Access Point) neighbor propagation algorithm
CN113780366B (en) * 2021-08-19 2024-02-13 杭州电子科技大学 Crowd-sourced test report clustering method based on AP neighbor propagation algorithm

Also Published As

Publication number Publication date
CN110928764B (en) 2023-08-11

Similar Documents

Publication Publication Date Title
CN110928764A (en) Automated mobile application crowdsourcing test report evaluation method and computer storage medium
US11449673B2 (en) ESG-based company evaluation device and an operation method thereof
CN108073568B (en) Keyword extraction method and device
Hartson et al. Criteria for evaluating usability evaluation methods
Hartson et al. Criteria for evaluating usability evaluation methods
CN110188047B (en) Double-channel convolutional neural network-based repeated defect report detection method
CN111738589B (en) Big data item workload assessment method, device and equipment based on content recommendation
CN111090735B (en) Performance evaluation method of intelligent question-answering method based on knowledge graph
KR20190110084A (en) Esg based enterprise assessment device and operating method thereof
CN111343147A (en) Network attack detection device and method based on deep learning
CN110750978A (en) Emotional tendency analysis method and device, electronic equipment and storage medium
CN111309577B (en) Spark-oriented batch application execution time prediction model construction method
CN112989621A (en) Model performance evaluation method, device, equipment and storage medium
TW201416887A (en) Methods for sentimental analysis of news text
CN107480126B (en) Intelligent identification method for engineering material category
CN111654853B (en) Data analysis method based on user information
JP2008282111A (en) Similar document retrieval method, program and device
CN113962565A (en) Project scoring method and system based on big data and readable storage medium
CN114610882A (en) Abnormal equipment code detection method and system based on electric power short text classification
CN113792141A (en) Feature selection method based on covariance measurement factor
KR102155692B1 (en) Methods for performing sentiment analysis of messages in social network service based on part of speech feature and sentiment analysis apparatus for performing the same
CN113901203A (en) Text classification method and device, electronic equipment and storage medium
CN112732549A (en) Test program classification method based on cluster analysis
CN116932487B (en) Quantized data analysis method and system based on data paragraph division
CN112465009B (en) Method for positioning software crash fault position

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant