CN113535549A - Test data expansion method, device, equipment and computer readable storage medium - Google Patents


Info

Publication number
CN113535549A
Authority
CN
China
Prior art keywords
data
expansion
service
service class
subset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110691032.0A
Other languages
Chinese (zh)
Inventor
范超超
于超敏
王思睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202110691032.0A priority Critical patent/CN113535549A/en
Publication of CN113535549A publication Critical patent/CN113535549A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • G06F11/3676 Test management for coverage analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a test data expansion method, a test data expansion device, test data expansion equipment and a computer-readable storage medium. The test data expansion method comprises the following steps: acquiring an original data set, wherein the original data set comprises test data of at least one service class; performing service classification on the original data set by using a first classification model to obtain an original data subset corresponding to each service class; for each service class, performing data expansion on the original data subset of the service class by using the data expansion strategy of that service class to obtain a first expanded data subset of the service class; and merging the first expanded data subsets of the service classes to obtain an expanded data set. With this scheme, the efficiency and quality of test data acquisition can be effectively improved.

Description

Test data expansion method, device, equipment and computer readable storage medium
Technical Field
The present application relates to the field of intelligent interaction technologies, and in particular, to a test data expansion method, an electronic device, and a computer-readable storage medium.
Background
At present, intelligent language interaction engines are widely applied and cover various services and related projects such as intelligent customer service, auxiliary marketing, navigation and outbound calling. In the early stage of each project, data is acquired in a single way: the client side provides a small number of data samples, which are then distributed between research and development and testing in a certain proportion for optimization and testing. The research and development and testing teams sometimes expand the data manually, but project requirements differ and the characteristics and types of business data differ, so the expansion is difficult and time-consuming.
The existing data acquisition mode and manual expansion mode have many defects. First, the amount of data provided by the client side is very small and falls far short of the data requirements of research and development and testing, so that research and development optimization is insufficient, test coverage is incomplete, many problems are not fully exposed, and the scene generalization of the engine is insufficient. Second, manual expansion from a small amount of data requires an expansion mode to be selected manually for each service data type, which wastes manpower.
Disclosure of Invention
The application provides an expansion method of test data, electronic equipment and a computer readable storage medium, which can effectively improve the efficiency and quality of data acquisition.
In order to solve the above problem, a first aspect of the present application provides an expansion method of test data, the expansion method including: acquiring an original data set, wherein the original data set comprises test data of at least one service class; performing service classification on the original data set by using a first classification model to obtain an original data subset corresponding to each service class; for each service class, performing data expansion on the original data subset of the service class by using the data expansion strategy of that service class to obtain a first expanded data subset of the service class; and merging the first expanded data subsets of the service classes to obtain an expanded data set.
In order to solve the above problem, a second aspect of the present application provides an expansion device for test data, including: a data acquisition module for acquiring an original data set, wherein the original data set comprises test data of at least one service class; a data classification module for performing service classification on the original data set by using a first classification model to obtain an original data subset corresponding to each service class; a data expansion module for performing, for each service class, data expansion on the original data subset of the service class by using the data expansion strategy of that service class to obtain a first expanded data subset of the service class; and a data merging module for merging the first expanded data subsets of the service classes to obtain an expanded data set.
In order to solve the above problem, a third aspect of the present application provides an electronic device, which includes a memory and a processor coupled to each other, wherein the processor is configured to execute program instructions stored in the memory to implement the method for expanding test data of the first aspect.
In order to solve the above problem, a fourth aspect of the present application provides a computer-readable storage medium on which program instructions are stored, the program instructions, when executed by a processor, implementing the method for augmenting test data of the first aspect described above.
The beneficial effects of the application are as follows. Different from the prior art, the method acquires an original data set comprising test data of at least one service class, performs service classification on the original data set by using a first classification model to obtain an original data subset corresponding to each service class, performs data expansion on the original data subset of each service class by using the data expansion strategy of that service class to obtain a first expanded data subset of each service class, and merges the first expanded data subsets of the service classes to obtain an expanded data set. Therefore, only a small amount of test data needs to be input: the service class of each piece of input test data is determined, and once the data has been divided by service class, test data is expanded automatically according to the preset data expansion strategy of the corresponding service class and the configured target expansion amount. This effectively improves the quality of test data acquisition, greatly improves its efficiency, and can further improve the generalization and reliability of the engine under test.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating a first embodiment of a method for expanding test data according to the present application;
FIG. 2 is a flowchart illustrating an embodiment of a method for training the first classification model in the method for expanding test data according to the present application;
FIG. 3 is a flowchart illustrating an embodiment of step S13 in FIG. 1;
FIG. 4 is a flowchart illustrating an embodiment of step S131 in FIG. 3;
FIG. 5 is a flowchart illustrating an embodiment of step S1312 in FIG. 4;
FIG. 6 is a flow chart illustrating an application scenario of the method for expanding test data according to the present application;
FIG. 7 is a block diagram of an embodiment of an expansion device for test data according to the present application;
FIG. 8 is a block diagram of an embodiment of an electronic device of the present application;
FIG. 9 is a block diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The following describes in detail the embodiments of the present application with reference to the drawings attached hereto.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the objects before and after it are in an "or" relationship. Further, the term "plurality" herein means two or more.
Referring to FIG. 1, FIG. 1 is a schematic flow chart illustrating a first embodiment of a method for expanding test data according to the present application. Specifically, the method for expanding test data of the present embodiment may include the following steps:
Step S11: acquiring an original data set, wherein the original data set comprises test data of at least one service class.
In the early stage of applying the intelligent language interaction engine, test data is needed for testing and optimization. Because each piece of test data reflects a different scene and condition, a small amount of test data leads to incomplete test coverage, so that many problems are not fully exposed and the scene generalization of the engine is insufficient. The original data set may be provided by the client side or may be formed through manual expansion by research and development testers. However, when a voice interaction model is trained, expression habits differ between application scenarios, so a data source cannot be reused across scenarios; data accumulation is therefore small in some application scenarios, and manually labeling data is very costly. It is thus necessary to provide a test data expansion method to expand the acquired original data set.
Specifically, the test data in the original data set may be interaction-type data, for example, the test data may include a test question and a return value or a syntax file corresponding to the test question, a service class name, and the like. The service category is mainly used for distinguishing services to which the test data belongs, such as services of text customer service, auxiliary marketing, navigation, outbound call and the like.
Step S12: performing service classification on the original data set by using a first classification model to obtain an original data subset corresponding to each service class.
The original data set contains test data whose service class attribute is definite, that is, test data whose service class can be determined directly from the values of some of its service indexes. Such test data is often judged to be an outlier because its value on one or more indexes is too large or too small. Performing service classification on the original data set by using the first classification model may specifically include: setting an extraction rule according to some service indexes of the test data; extracting the test data with a definite service class attribute according to the extraction rule; and determining the service classification result according to a clustering result of the test data that was not extracted together with the test data with a definite service class attribute, so as to obtain the original data subset corresponding to each service class, where each original data subset corresponds to one service class.
Specifically, the original data set may include test data of one or more service classes. When the original data set includes only one service class, the test data in the original data set is treated as a whole and is not divided further; in this case all the test data has the same service class attribute, for example test data of the navigation class or of the auxiliary marketing class. When the original data set includes a plurality of service classes, the test data of the same service class has the same class attribute.
Referring to FIG. 2, FIG. 2 is a flowchart illustrating an embodiment of a method for training the first classification model in the method for expanding test data according to the present application. Specifically, the method for training the first classification model in this embodiment may include the following steps:
Step S21: acquiring a first sample data set, wherein the first sample data set comprises sample test data of multiple service classes, and the difference in the amount of sample test data between any two service classes is smaller than a preset threshold.
Step S22: training a pre-trained first classification model by using the first sample data set.
With the development of artificial intelligence technology, better neural network models place increasing demands on sample scale. For a classification model, if the amounts of sample test data differ greatly between service classes, the trained classification model overfits and the accuracy of model prediction is seriously affected. Specifically, test data of several common service classes is integrated as the first sample data set, and the first classification model is then trained on characteristics such as service keywords, service description modes and data characteristics; the first classification model can be used to determine the service class of the test data in an input original data set. For example, test data of four classes in real scenes, namely text customer service, auxiliary marketing, navigation and outbound calling, is collected, with the amounts of test data for the four services kept as equal as possible. Based on this test data, a pre-trained first classification model is loaded and trained by back propagation to minimize a loss function, yielding the trained first classification model, which is an optimized four-class classification model.
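The balance condition of step S21 can be illustrated with a minimal Python sketch. It assumes samples are `(text, label)` pairs; the function names and the truncation strategy are illustrative assumptions, not something the application specifies.

```python
from collections import Counter, defaultdict


def is_balanced(samples, max_gap):
    """True if the sample counts of any two classes differ by less than max_gap.

    Checking max - min is equivalent to checking every pair of classes.
    """
    counts = Counter(label for _, label in samples).values()
    return max(counts) - min(counts) < max_gap


def truncate_to_smallest(samples):
    """Assumed balancing strategy: keep only as many samples per class
    as the smallest class contains."""
    by_class = defaultdict(list)
    for text, label in samples:
        by_class[label].append((text, label))
    size = min(len(items) for items in by_class.values())
    return [item for items in by_class.values() for item in items[:size]]
```

Truncation discards data, so with very small sample sets it may be preferable to collect more samples for the under-represented classes instead.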
Step S13: for each service class, performing data expansion on the original data subset of the service class by using the data expansion strategy of that service class to obtain a first expanded data subset of the service class.
Step S14: merging the first expanded data subsets of the service classes to obtain an expanded data set.
In order for the present application to perform well when expanding the original data subsets of different service classes, an adaptive data expansion strategy can be adopted to select the best data expansion strategy for the original data subset of the current service class, thereby improving the validity of the resulting first expanded data subset.
After the original data subset of each service class has been expanded with the corresponding data expansion strategy to obtain the first expanded data subset of that service class, the first expanded data subsets of all service classes are merged to obtain the expanded data set corresponding to the original data set.
According to this scheme, an original data set comprising test data of at least one service class is acquired; service classification is performed on the original data set by using a first classification model to obtain an original data subset corresponding to each service class; for each service class, data expansion is performed on the original data subset by using the data expansion strategy of that service class to obtain a first expanded data subset; and the first expanded data subsets of the service classes are merged to obtain an expanded data set. Therefore, only a small amount of test data needs to be input: the service class of each piece of input test data is determined, and once the data has been divided by service class, test data is expanded automatically according to the preset data expansion strategy of the corresponding service class and the configured target expansion amount. This effectively improves the quality of test data acquisition, greatly improves its efficiency, and can further improve the generalization and reliability of the engine under test.
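Steps S11 to S14 can be sketched in Python as follows. This is a sketch under stated assumptions, not the application's implementation: the classifier and the per-class strategies are passed in as plain callables, and `keyword_classifier` and `repeat_with_suffix` are hypothetical stand-ins for the first classification model and a real expansion strategy.

```python
from collections import defaultdict


def expand_test_data(original, classify, strategies, targets):
    """Classify, expand per service class, and merge (steps S12 to S14)."""
    # Step S12: split the original data set into one subset per service class.
    subsets = defaultdict(list)
    for item in original:
        subsets[classify(item)].append(item)

    expanded = []
    for service, subset in subsets.items():
        # Step S13: expand each subset with the strategy of its service class.
        first_expanded_subset = strategies[service](subset, targets[service])
        # Step S14: merge the per-class results into one expanded data set.
        expanded.extend(first_expanded_subset)
    return expanded


def keyword_classifier(text):
    # Hypothetical stand-in for the first classification model.
    return "navigation" if "route" in text else "customer_service"


def repeat_with_suffix(subset, target):
    # Hypothetical stand-in strategy: cycle through the subset until
    # the requested number of items has been produced.
    return [f"{subset[i % len(subset)]} (variant {i})" for i in range(target)]
```

A call might look like `expand_test_data(data, keyword_classifier, {"navigation": repeat_with_suffix, "customer_service": repeat_with_suffix}, {"navigation": 3, "customer_service": 2})`.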
Further, referring to FIG. 3, FIG. 3 is a flowchart illustrating an embodiment of step S13 in FIG. 1. In an embodiment, step S13 may specifically include:
Step S131: acquiring the final expansion weight ratio of the multiple data enhancement modes matched with the service class.
Step S132: based on the final expansion weight ratio of the multiple data enhancement modes, performing data expansion on the original data subset of the service class by using the multiple data enhancement modes to obtain an expanded data subset of the service class.
The original data subsets obtained in step S12 are expanded with different data expansion strategies according to their service classes. In the data expansion strategy corresponding to a given service class, different data enhancement modes, such as the traditional EDA method, the back-translation method, the MixMatch method, the UDA method and the random data generation method, can be used singly or in combination to expand the original data subset.
Specifically, the traditional EDA (Easy Data Augmentation) method performs some simple transformations on the source text and mainly includes four operations: synonym replacement (SR), random insertion (RI), random swap (RS) and random deletion (RD). The synonym replacement operation ignores stop words, randomly selects n words from the sentence, and replaces each with a synonym drawn at random from a synonym dictionary. The random insertion operation ignores stop words, randomly selects a word, then randomly selects a word from that word's synonym set and inserts it at a random position in the original sentence; this process can be repeated n times. The random swap operation randomly selects two words in the sentence and exchanges their positions; this process can be repeated n times. The random deletion operation deletes each word in the sentence at random with probability p. EDA applies explicitly specified transformations whose form can be chosen freely, which makes it suitable for designs targeting a specific scenario; all four operations can be given a ratio, so that only a small part of the source data is transformed rather than all of it.
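The four EDA operations described above can be sketched in Python as follows. The stop-word set and synonym dictionary are tiny hypothetical examples; a real setup would load a proper stop-word list and thesaurus, and the function names are illustrative.

```python
import random

# Hypothetical toy resources, for illustration only.
STOP_WORDS = {"the", "a", "to", "my", "is"}
SYNONYMS = {"check": ["verify", "inspect"], "balance": ["funds"], "account": ["profile"]}


def _replaceable(words):
    """Indices of non-stop words that have an entry in the synonym dictionary."""
    return [i for i, w in enumerate(words) if w not in STOP_WORDS and w in SYNONYMS]


def synonym_replace(words, n, rng):
    """SR: replace up to n randomly chosen non-stop words with a random synonym."""
    out = list(words)
    candidates = _replaceable(out)
    for i in rng.sample(candidates, min(n, len(candidates))):
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insert(words, n, rng):
    """RI: n times, insert a synonym of a random non-stop word at a random position."""
    out = list(words)
    for _ in range(n):
        candidates = _replaceable(out)
        if not candidates:
            break
        synonym = rng.choice(SYNONYMS[out[rng.choice(candidates)]])
        out.insert(rng.randint(0, len(out)), synonym)
    return out


def random_swap(words, n, rng):
    """RS: n times, exchange the positions of two randomly chosen words."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_delete(words, p, rng):
    """RD: delete each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() > p]
    return kept or [rng.choice(words)]
```

Passing a seeded `random.Random` instance as `rng` keeps the generated variants reproducible, which is convenient when the expanded data is used as a regression test set.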
The back-translation method processes the source text by means of translation, generally through two translations: the source text is first translated into another language and then translated back into the source language, yielding a text in the source language that has passed through two translations.
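The round trip can be sketched in Python with the two translators passed in as callables. The application does not name a translation service, so `toy_to_pivot` and `toy_to_source` below are hypothetical word-level stand-ins that only demonstrate how a lossy round trip yields a paraphrase.

```python
def back_translate(text, to_pivot, to_source):
    """Translate into a pivot language, then back into the source language."""
    return to_source(to_pivot(text))


def back_translate_variants(text, pivots):
    """Collect distinct round-trip results that differ from the input.

    `pivots` is a list of (to_pivot, to_source) callable pairs.
    """
    variants = []
    for to_pivot, to_source in pivots:
        candidate = back_translate(text, to_pivot, to_source)
        if candidate != text and candidate not in variants:
            variants.append(candidate)
    return variants


# Hypothetical toy translators: two source words map to one pivot word,
# so the round trip can return a different source word.
_TO_PIVOT = {"purchase": "ACHETER", "buy": "ACHETER"}
_FROM_PIVOT = {"ACHETER": "buy"}


def toy_to_pivot(text):
    return " ".join(_TO_PIVOT.get(w, w) for w in text.split())


def toy_to_source(text):
    return " ".join(_FROM_PIVOT.get(w, w) for w in text.split())
```

Filtering out results identical to the input avoids adding exact duplicates to the expanded subset.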
The MixMatch method is a semi-supervised data enhancement method: it guesses low-entropy labels for the unlabeled samples produced by data enhancement, and mixes the unlabeled data with the labeled data through MixUp. The UDA (Unsupervised Data Augmentation) method applies different data enhancement methods to different tasks; compared with conventional noise such as Gaussian noise and dropout noise, it produces valid, realistic and diverse noise and thus more effective data, and a data enhancement strategy guided by the target and by performance can learn to find the training signals that are missing from, or most needed in, the original labeled set.
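For reference, the label-guessing, sharpening and MixUp steps are usually written as below in the MixMatch literature. The application itself gives no formulas, so the symbols here (K augmentations, temperature T, Beta parameter α) are taken from that literature rather than from this document.

```latex
% Label guess for an unlabeled sample u, averaged over K augmentations \hat{u}_k
\bar{q} = \frac{1}{K} \sum_{k=1}^{K} p_{\text{model}}\left(y \mid \hat{u}_k; \theta\right)

% Sharpening with temperature T lowers the entropy of the guessed label
\operatorname{Sharpen}(\bar{q}, T)_i = \frac{\bar{q}_i^{1/T}}{\sum_{j} \bar{q}_j^{1/T}}

% MixUp of two samples (x_1, p_1) and (x_2, p_2)
\lambda \sim \operatorname{Beta}(\alpha, \alpha), \qquad \lambda' = \max(\lambda, 1 - \lambda)
x' = \lambda' x_1 + (1 - \lambda') x_2, \qquad p' = \lambda' p_1 + (1 - \lambda') p_2
```

As T approaches 0 the sharpened distribution approaches a one-hot label, which is what "low-entropy label" refers to.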
The core idea of the random data generation method is to automatically or semi-automatically generate random data, input it into a program, and monitor program exceptions such as crashes and assertion failures, so as to find possible program errors such as memory leaks.
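A minimal Python sketch of this idea follows. It only catches exceptions raised by the program under test; detecting crashes of a separate process or memory leaks would need additional tooling. `fragile_handler` is a hypothetical program under test, and the function names are illustrative.

```python
import random
import string


def random_inputs(count, max_len, rng):
    """Yield `count` random printable strings of length 0 to max_len."""
    for _ in range(count):
        length = rng.randint(0, max_len)
        yield "".join(rng.choice(string.printable) for _ in range(length))


def fuzz(target, inputs):
    """Feed every input to `target` and record the inputs that raise an exception."""
    failures = []
    for text in inputs:
        try:
            target(text)
        except Exception as exc:
            failures.append((text, exc))
    return failures


def fragile_handler(text):
    # Hypothetical program under test: rejects inputs of three or more characters.
    if len(text) >= 3:
        raise ValueError("input too long")
```

For example, `fuzz(fragile_handler, random_inputs(100, 8, random.Random(42)))` returns the generated inputs that triggered a failure together with the exception each one raised.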
In this embodiment, several data enhancement modes with better effect are selected from the above data enhancement modes, and a different final expansion weight ratio is then set for each data enhancement mode according to the service class. For the original data subset of a given service class, data expansion is performed according to the final expansion weight ratios set for the multiple data enhancement modes, thereby obtaining the expanded data subset of that service class.
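One way to turn a weight ratio into concrete per-mode sample counts is sketched below in Python. The application does not say how fractional counts are rounded, so the largest-remainder rounding and the name `allocate_counts` are assumptions made for illustration.

```python
def allocate_counts(target_total, weights):
    """Split target_total items across enhancement modes in proportion to weights.

    Counts are rounded down first; the leftover items go to the modes with
    the largest fractional remainders, so the counts always sum to target_total.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    exact = {mode: target_total * w / total_weight for mode, w in weights.items()}
    counts = {mode: int(value) for mode, value in exact.items()}
    leftover = target_total - sum(counts.values())
    by_remainder = sorted(exact, key=lambda m: exact[m] - counts[m], reverse=True)
    for mode in by_remainder[:leftover]:
        counts[mode] += 1
    return counts
```

With weights in the ratio 5 : 3 : 2 and a target of 10 items, the three modes would be asked for 5, 3 and 2 items respectively.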
Further, referring to FIG. 4, FIG. 4 is a schematic flowchart illustrating an embodiment of step S131 in FIG. 3. In an embodiment, step S131 may specifically include:
Step S1311: determining initial expansion weight ratios of the multiple data enhancement modes of the service class by using the original data subset of the service class.
Step S1312: adjusting the initial expansion weight ratio to obtain the final expansion weight ratio.
It can be understood that several data enhancement modes are selected for data expansion, and that the expansion weight ratio of each data enhancement mode differs according to the characteristics of each service class. Therefore, the initial expansion weight ratio of the multiple data enhancement modes of a given service class can be determined from the original data subset of that service class, and the initial expansion weight ratio is then adjusted according to the service characteristics and attributes of the service class to obtain the optimal expansion weight ratio, which serves as the final expansion weight ratio of the multiple data enhancement modes of the service class.
Specifically, referring to FIG. 5, FIG. 5 is a schematic flowchart illustrating an embodiment of step S1312 in FIG. 4. In an embodiment, step S1312 may specifically include:
Step S13121: performing data expansion on the original data subset of the service class by using the multiple data enhancement modes based on the initial expansion weight ratio to obtain a second expanded data subset of the service class.
Step S13122: detecting whether the second expanded data subset meets a preset requirement. If not, step S13123 is executed; if so, step S13124 is executed.
Step S13123: adjusting the initial expansion weight ratio, and re-executing step S13121 and the subsequent steps with the adjusted ratio as the initial expansion weight ratio.
Step S13124: taking the initial expansion weight ratio as the final expansion weight ratio.
It can be understood that, after the initial expansion weight ratios of the multiple data enhancement modes of a given service class are determined in step S1311, the original data subset of the service class may be expanded by using the multiple data enhancement modes based on the initial expansion weight ratios to obtain a second expanded data subset of the service class; the second expanded data subset is then checked to determine whether it meets the preset requirement. If the second expanded data subset meets the preset requirement, the data expanded according to the initial expansion weight ratio satisfies the test optimization requirement, and the initial expansion weight ratio can be used as the final expansion weight ratio. If the second expanded data subset does not meet the preset requirement, the data expanded according to the initial expansion weight ratio does not satisfy the test optimization requirement, so the initial expansion weight ratio needs to be adjusted; the original data subset of the service class is then expanded again by using the multiple data enhancement modes based on the adjusted expansion weight ratio to obtain a new second expanded data subset, and the subsequent steps are repeated until the second expanded data subset obtained with the adjusted expansion weight ratio meets the preset requirement.
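The loop formed by steps S13121 to S13124 can be sketched in Python as follows. The expansion, the requirement check and the adjustment rule are passed in as callables because the application leaves their details open; the `max_rounds` cap is an added assumption so that the sketch always terminates.

```python
def find_final_weights(initial_weights, expand, meets_requirement, adjust, max_rounds=10):
    """Adjust the expansion weight ratio until the expanded subset is acceptable."""
    weights = dict(initial_weights)
    for _ in range(max_rounds):
        second_subset = expand(weights)            # step S13121
        if meets_requirement(second_subset):       # step S13122
            return weights                         # step S13124
        weights = adjust(weights, second_subset)   # step S13123
    raise RuntimeError("no weight ratio met the requirement within max_rounds")
```

Keeping the three callables separate lets the same loop be reused for every service class, with only the requirement check and adjustment rule changing per class.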
Further, the preset requirement is that the validity degree of the second expanded data subset satisfies a preset degree threshold. Step S13122 may specifically include: detecting, by using a second classification model, whether the service class of the data in the second expanded data subset is correct, so as to obtain the validity degree of the second expanded data subset; and determining whether the validity degree of the second expanded data subset satisfies the preset degree threshold. In this case, the step of adjusting the initial expansion weight ratio in step S13123 may include: adjusting the initial expansion weight ratio according to the validity degree.
It can be understood that, for any service class, after the original data subset of the service class is expanded based on the initial expansion weight ratio of the multiple data enhancement modes of the service class to obtain a second expanded data subset, the second expanded data subset may be input into the second classification model, which performs validity detection on each piece of data in the second expanded data subset, for example by detecting whether the service class of each piece of data is correct, so as to obtain the validity degree of the second expanded data subset. If the validity degree of the second expanded data subset satisfies the preset degree threshold, the data expanded according to the initial expansion weight ratio satisfies the test optimization requirement, and the initial expansion weight ratio can be used as the final expansion weight ratio. If the validity degree does not satisfy the preset degree threshold, the data expanded according to the initial expansion weight ratio cannot satisfy the test optimization requirement, so the initial expansion weight ratio is adjusted according to the validity degree; the original data subset of the service class is then expanded again by using the multiple data enhancement modes based on the adjusted expansion weight ratio to obtain a new second expanded data subset, and its validity degree is checked against the preset degree threshold again, until the validity degree of the second expanded data subset obtained with the adjusted expansion weight ratio satisfies the preset degree threshold.
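The validity check can be sketched in Python as follows. Defining the validity degree as the fraction of expanded items that the second classification model assigns to the expected service class is an assumption made for illustration; the application only states that the degree is derived from whether each item's service class is correct.

```python
def validity_degree(expanded_subset, expected_class, classify):
    """Fraction of items whose predicted service class equals the expected one.

    `classify` stands in for the second classification model.
    """
    if not expanded_subset:
        return 0.0
    correct = sum(1 for item in expanded_subset if classify(item) == expected_class)
    return correct / len(expanded_subset)


def meets_threshold(expanded_subset, expected_class, classify, threshold):
    """True if the validity degree reaches the preset degree threshold."""
    return validity_degree(expanded_subset, expected_class, classify) >= threshold
```

Computing the same fraction separately for the items produced by each data enhancement mode would show which mode generates the most off-class data, which is one possible input for adjusting the weight ratio.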
In an embodiment, the method for training the second classification model of the present application may include the following steps: acquiring a second sample data set, wherein the second sample data set comprises sample test data of multiple service classes; and training a pre-trained second classification model by utilizing the second sample data set.
Sample test data of multiple service classes is obtained as the second sample data set and used to train the second classification model. The second classification model can then determine the service class of each piece of test data in an input second expanded data subset, from which the validity degree of the second expanded data subset is obtained. Specifically, the test data in the second expanded data subset is analyzed by the pre-trained second classification model to obtain the service class of each piece of test data. If the service class obtained for a piece of test data is the same as the service class of the second expanded data subset, that piece of test data is valid, and the validity degree of the second expanded data subset can be determined according to whether each piece of test data in the second expanded data subset is valid. When the validity degree of the second expanded data subset meets the preset degree threshold, the prediction accuracy of the second classification model meets the requirement; when it does not, the prediction accuracy of the second classification model does not meet the requirement, so the second classification model needs to be optimized until the prediction accuracy of the optimized second classification model meets the requirement.
In an embodiment, the first classification model and/or the second classification model may be a BERT model, whose loss function may be the commonly used cross-entropy loss function J:
Figure BDA0003126729680000101
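The formula itself survives here only as the image reference above. For orientation, the cross-entropy loss in its usual multi-class form is given below. The symbols are assumptions (N training samples, K service classes, one-hot label y, predicted probability ŷ), and the notation in the original figure may differ.

```latex
J = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \, \log \hat{y}_{i,k}
```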
Referring to fig. 6, fig. 6 is a flowchart illustrating an application scenario of the test data expansion method of the present application. In the prediction use stage of the intelligent language interaction engine, N pieces of real test data provided by a client can be obtained, where the N pieces of real test data cover multiple service classes such as text customer service, assisted marketing, navigation and outbound. After the N pieces of real test data are input into the first classification model, they are classified by service into service one, service two, service three and service four, which correspond to N1, N2, N3 and N4 pieces of real test data respectively, where N1+N2+N3+N4=N. In addition, the numbers of data pieces to which service one, service two, service three and service four need to be expanded are preset to O, P, Q and R respectively; data is then expanded automatically for each service class according to the preset optimal expansion weight ratio of the four different data enhancement modes A, B, C and D. For example, for service one, the optimal expansion weight ratio of the four data enhancement modes A, B, C and D is w1, w2, w3, w4, where w1+w2+w3+w4=1, and O pieces of expanded data are obtained for service one; for service two, the optimal expansion weight ratio of the four data enhancement modes is w5, w6, w7, w8, where w5+w6+w7+w8=1, and P pieces of expanded data are obtained for service two; for service three, the optimal expansion weight ratio of the four data enhancement modes is w9, w10, w11, w12, where w9+w10+w11+w12=1, and Q pieces of expanded data are obtained for service three; for service four, the optimal expansion weight ratio of the four data enhancement modes is w13, w14, w15, w16, where w13+w14+w15+w16=1, and R pieces of expanded data are obtained for service four.
Finally, the expanded data sets of the different service classes are obtained, O+P+Q+R pieces in total, and can be used for testing the different service classes of the engine.
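How a target count such as O is split across the four enhancement modes by their weights can be illustrated with a short sketch. The function name `allocate` is hypothetical, and the largest-remainder rounding used to make the per-mode counts sum exactly to the target is an assumption; the text does not say how fractional shares are handled.

```python
import math
from typing import Dict


def allocate(target: int, weights: Dict[str, float]) -> Dict[str, int]:
    """Split `target` items across enhancement modes in proportion to `weights`."""
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("expansion weights must sum to 1")
    raw = {mode: w * target for mode, w in weights.items()}
    counts = {mode: math.floor(share) for mode, share in raw.items()}
    leftover = target - sum(counts.values())
    # Assumed rounding rule: give the remaining items to the modes
    # with the largest fractional shares.
    by_fraction = sorted(raw, key=lambda m: raw[m] - counts[m], reverse=True)
    for mode in by_fraction[:leftover]:
        counts[mode] += 1
    return counts
```

For example, with a target of 10 and weights 0.5, 0.25, 0.125 and 0.125 for modes A to D, the exact shares are 5, 2.5, 1.25 and 1.25, so the one leftover item goes to mode B.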
Referring to fig. 7, fig. 7 is a block diagram illustrating an embodiment of a test data expansion device according to the present application. The test data expansion device 70 in this embodiment includes a data acquisition module 700, a data classification module 702, a data expansion module 704, and a data merging module 706, which are connected to each other. The data acquisition module 700 is configured to acquire an original data set, where the original data set includes test data of at least one service class. The data classification module 702 is configured to perform service classification on the original data set by using a first classification model to obtain an original data subset corresponding to each service class. The data expansion module 704 is configured to, for each service class, perform data expansion on the original data subset of the service class by using the data expansion strategy of the service class to obtain a first expanded data subset of the service class. The data merging module 706 is configured to merge the first expanded data subsets of the service classes to obtain an expanded data set.
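The flow through the four modules can be sketched as a single function. This is an illustrative sketch only: `run_expansion`, `classify` and `strategies` are hypothetical names, the first classification model is replaced by a plain callable, and each per-class strategy is assumed to return the full first expanded data subset for its class.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def run_expansion(
    original: List[str],                                      # acquired original data set
    classify: Callable[[str], str],                           # stand-in for the first classification model
    strategies: Dict[str, Callable[[List[str]], List[str]]],  # one expansion strategy per service class
) -> List[str]:
    # Classification: group the original data by predicted service class.
    subsets: Dict[str, List[str]] = defaultdict(list)
    for item in original:
        subsets[classify(item)].append(item)
    # Expansion: apply the strategy that belongs to each service class.
    # A class without a registered strategy raises KeyError.
    expanded = {label: strategies[label](items) for label, items in subsets.items()}
    # Merging: concatenate the expanded subsets in a fixed class order.
    merged: List[str] = []
    for label in sorted(expanded):
        merged.extend(expanded[label])
    return merged
```

Keeping the strategies in a dictionary keyed by service class mirrors the idea that each class has its own data expansion strategy, so a new class can be supported by registering one more entry.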
In one embodiment, the service class includes at least one of text customer service, assisted marketing, navigation, and outbound; and/or the test data is interaction data.
In an embodiment, the test data expansion device 70 further includes a first model training module (not shown). Before the data classification module 702 performs the step of performing service classification on the original data set by using the first classification model to obtain the original data subset corresponding to each service class, the first model training module is configured to acquire a first sample data set, where the first sample data set includes sample test data of multiple service classes and the difference in the quantity of sample test data between any two service classes is smaller than a preset threshold, and to train a pre-trained first classification model using the first sample data set.
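The balance condition on the first sample data set can be checked before training. In the sketch below the name `is_balanced` is hypothetical. It relies on the fact that the difference between any two classes is below the threshold exactly when the difference between the largest and smallest class is below it.

```python
from collections import Counter
from typing import Iterable


def is_balanced(labels: Iterable[str], max_gap: int) -> bool:
    """True if every pair of service classes differs by fewer than `max_gap` samples."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("the sample data set is empty")
    # Comparing the extremes covers every pair of classes.
    return max(counts.values()) - min(counts.values()) < max_gap
```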
In an embodiment, the data expansion module 704 performs a step of performing data expansion on each original data subset of the service class by using the data expansion policy of the service class to obtain a first expanded data subset of the service class, including: acquiring the final expansion weight ratio of the multiple data enhancement modes matched with the service class; and based on the final expansion weight ratio of the multiple data enhancement modes, performing data expansion on the original data subset of the service class by using the multiple data enhancement modes to obtain an expansion data subset of the service class.
In one embodiment, the data expansion module 704 performs the step of obtaining the final expansion weight ratio of the plurality of data enhancement modes matching the service class, including: determining initial expansion weight ratios of the multiple data enhancement modes of the service class by using the original data subset of the service class; and adjusting the initial expansion weight ratio to obtain the final expansion weight ratio.
In one embodiment, the data expansion module 704 performs the step of adjusting the initial expansion weight ratio to obtain the final expansion weight ratio, including: based on the initial expansion weight ratio, performing data expansion on the original data subset of the service class by using the multiple data enhancement modes to obtain a second expanded data subset of the service class; detecting whether the second expanded data subset meets a preset requirement; if not, adjusting the initial expansion weight ratio, and re-executing the step of performing data expansion on the original data subset of the service class by using the multiple data enhancement modes based on the initial expansion weight ratio to obtain a second expanded data subset of the service class, together with the subsequent steps; if so, taking the initial expansion weight ratio as the final expansion weight ratio.
In one embodiment, the data expansion module 704 performs the step of detecting whether the second expanded data subset meets a preset requirement, including: detecting, by using a second classification model, whether the service class of the data in the second expanded data subset is correct, so as to obtain the validity degree of the second expanded data subset; and judging whether the validity degree of the second expanded data subset meets a preset degree threshold. The data expansion module 704 performs the step of adjusting the initial expansion weight ratio, including: adjusting the initial expansion weight ratio according to the validity degree.
For the specific contents of the method for implementing the test data expansion by the test data expansion device of the present application, please refer to the contents in the above embodiment of the test data expansion method, which is not described herein again.
Referring to fig. 8, fig. 8 is a schematic frame diagram of an embodiment of an electronic device according to the present application. The electronic device 80 comprises a memory 801 and a processor 802 coupled to each other, wherein the processor 802 is configured to execute program instructions stored in the memory 801 to implement the steps of any of the above-described embodiments of the method for augmenting test data. In one particular implementation scenario, the electronic device 80 may include, but is not limited to: microcomputer, server.
In particular, the processor 802 is configured to control itself and the memory 801 to implement the steps of any of the above-described embodiments of the test data expansion method. The processor 802 may also be referred to as a CPU (Central Processing Unit). The processor 802 may be an integrated circuit chip having signal processing capabilities. The processor 802 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 802 may be jointly implemented by multiple integrated circuit chips.
In the above scheme, the processor 802 obtains an original data set including test data of at least one service class, then performs service classification on the original data set by using a first classification model to obtain an original data subset corresponding to each service class, then performs data expansion on the original data subset of the service class by using a data expansion strategy of the service class to obtain a first expanded data subset of the service class, and then merges the first expanded data subsets of the service classes to obtain an expanded data set. Therefore, only a small amount of test data needs to be input, the service type judgment can be carried out on each piece of input test data, after different service types are classified, automatic test data expansion construction can be carried out by combining preset data expansion strategies of corresponding service types based on the set required expansion amount, the quality of test data acquisition can be effectively improved, the efficiency of test data acquisition is greatly improved, and the generalization and the reliability of the tested engine can be further improved.
Referring to fig. 9, fig. 9 is a block diagram illustrating an embodiment of a computer-readable storage medium according to the present application. The computer readable storage medium 90 stores program instructions 900 capable of being executed by a processor, the program instructions 900 being for implementing the steps of any of the test data augmentation method embodiments described above.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into modules or units is only a division by logical function, and other divisions are possible in an actual implementation; for example, units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on network elements. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.

Claims (10)

1. An expansion method of test data, the expansion method comprising:
acquiring an original data set; wherein the original data set comprises test data of at least one service class;
carrying out service classification on the original data set by using a first classification model to obtain an original data subset corresponding to each service class;
for each service class, performing data expansion on the original data subset of the service class by using the data expansion strategy of the service class to obtain a first expanded data subset of the service class;
and merging the first expansion data subsets of the service classes to obtain an expansion data set.
2. The expansion method according to claim 1, wherein the service class comprises at least one of text customer service, assisted marketing, navigation, and outbound; and/or the test data is interaction data.
3. The expansion method according to claim 1, wherein before the service classification of the original data set using the first classification model, the method further comprises the following steps to train the first classification model:
acquiring a first sample data set, wherein the first sample data set comprises sample test data of multiple service classes, and the quantity difference of the sample test data between any two service classes is smaller than a preset threshold value;
training a pre-trained first classification model using the first sample dataset.
4. The expansion method according to claim 1, wherein the data expansion of the original data subset of the service class by using the data expansion strategy of the service class to obtain the expanded data subset of the service class comprises:
acquiring the final expansion weight ratio of the multiple data enhancement modes matched with the service class;
and based on the final expansion weight ratio of the multiple data enhancement modes, performing data expansion on the original data subset of the service class by using the multiple data enhancement modes to obtain an expansion data subset of the service class.
5. The expansion method according to claim 4, wherein the obtaining of the final expansion weight ratio of the plurality of data enhancement modes matching the service class comprises:
determining initial expansion weight ratios of the multiple data enhancement modes of the service class by using the original data subset of the service class;
and adjusting the initial expansion weight ratio to obtain the final expansion weight ratio.
6. The expansion method according to claim 5, wherein the adjusting the initial expansion weight ratio to obtain the final expansion weight ratio comprises:
based on the initial expansion weight ratio, performing data expansion on the original data subset of the service class by using the multiple data enhancement modes to obtain a second expansion data subset of the service class;
detecting whether the second expansion data subset meets a preset requirement;
if not, adjusting the initial expansion weight ratio, and re-executing the step of performing data expansion on the original data subset of the service class by using the multiple data enhancement modes based on the initial expansion weight ratio to obtain a second expanded data subset of the service class, together with the subsequent steps;
if so, the initial expansion weight ratio is taken as the final expansion weight ratio.
7. The expansion method according to claim 6, wherein the detecting whether the second expanded data subset meets a preset requirement comprises:
detecting, by using a second classification model, whether the service class of the data in the second expanded data subset is correct, so as to obtain the validity degree of the second expanded data subset;
judging whether the validity degree of the second expanded data subset meets a preset degree threshold;
the adjusting the initial expansion weight ratio comprises:
adjusting the initial expansion weight ratio according to the validity degree.
8. An expansion device for test data, comprising:
a data acquisition module for acquiring an original data set; wherein the original data set comprises test data of at least one service class;
the data classification module is used for carrying out service classification on the original data set by utilizing a first classification model to obtain an original data subset corresponding to each service class;
the data expansion module is used for performing, for each service class, data expansion on the original data subset of the service class by utilizing the data expansion strategy of the service class to obtain a first expanded data subset of the service class;
and the data merging module is used for merging the first expansion data subsets of the service classes to obtain an expansion data set.
9. An electronic device comprising a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the test data expansion method according to any one of claims 1 to 7.
10. A computer-readable storage medium on which program instructions are stored, which program instructions, when executed by a processor, implement the test data expansion method according to any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110691032.0A CN113535549A (en) 2021-06-22 2021-06-22 Test data expansion method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110691032.0A CN113535549A (en) 2021-06-22 2021-06-22 Test data expansion method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN113535549A (en) 2021-10-22

Family

ID=78096373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110691032.0A Pending CN113535549A (en) 2021-06-22 2021-06-22 Test data expansion method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113535549A (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104850650A (en) * 2015-05-29 2015-08-19 清华大学 Short-text expanding method based on similar-label relation
CN107688630A (en) * 2017-08-21 2018-02-13 北京工业大学 A kind of more sentiment dictionary extending methods of Weakly supervised microblogging based on semanteme
CN110334772A (en) * 2019-07-11 2019-10-15 山东领能电子科技有限公司 A kind of quick mask method of expansion classification formula data
CN110442685A (en) * 2019-08-14 2019-11-12 杭州品茗安控信息技术股份有限公司 Data extending method, apparatus, equipment and the storage medium of architectural discipline dictionary
CN110705212A (en) * 2019-09-09 2020-01-17 广州小鹏汽车科技有限公司 Text sequence processing method, processing device, electronic terminal and medium
CN111291560A (en) * 2020-03-06 2020-06-16 深圳前海微众银行股份有限公司 Sample expansion method, terminal, device and readable storage medium
CN111477216A (en) * 2020-04-09 2020-07-31 南京硅基智能科技有限公司 Training method and system for pronunciation understanding model of conversation robot
CN111651566A (en) * 2020-08-10 2020-09-11 四川大学 Multi-task small sample learning-based referee document dispute focus extraction method
CN112580826A (en) * 2021-02-05 2021-03-30 支付宝(杭州)信息技术有限公司 Business model training method, device and system
CN112633401A (en) * 2020-12-29 2021-04-09 中国科学院长春光学精密机械与物理研究所 Hyperspectral remote sensing image classification method, device, equipment and storage medium
CN112784050A (en) * 2021-01-29 2021-05-11 北京百度网讯科技有限公司 Method, device, equipment and medium for generating theme classification data set

Similar Documents

Publication Publication Date Title
CN109697162B (en) Software defect automatic detection method based on open source code library
CN107943911A (en) Data pick-up method, apparatus, computer equipment and readable storage medium storing program for executing
CN107423278B (en) Evaluation element identification method, device and system
CA3174601A1 (en) Text intent identifying method, device, computer equipment and storage medium
CN111210842B (en) Voice quality inspection method, device, terminal and computer readable storage medium
Srinivasan A study of two sampling methods for analyzing large datasets with ILP
CN108959474B (en) Entity relation extraction method
CN106570180A (en) Artificial intelligence based voice searching method and device
CN112016313B (en) Spoken language element recognition method and device and warning analysis system
CN109145301B (en) Information classification method and device and computer readable storage medium
CN112183102A (en) Named entity identification method based on attention mechanism and graph attention network
CN111539612B (en) Training method and system of risk classification model
CN110968664A (en) Document retrieval method, device, equipment and medium
CN111354354B (en) Training method, training device and terminal equipment based on semantic recognition
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN110737770B (en) Text data sensitivity identification method and device, electronic equipment and storage medium
CN117216214A (en) Question and answer extraction generation method, device, equipment and medium
CN108021595A (en) Examine the method and device of knowledge base triple
CN116451646A (en) Standard draft detection method, system, electronic equipment and storage medium
CN113535549A (en) Test data expansion method, device, equipment and computer readable storage medium
CN113722421B (en) Contract auditing method and system and computer readable storage medium
CN114896962A (en) Multi-view sentence matching model, application method and related device
CN114564942A (en) Text error correction method, storage medium and device for supervision field
CN112632284A (en) Information extraction method and system for unlabeled text data set
CN112015857A (en) User perception evaluation method and device, electronic equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination