CN113807528A - Model optimization method, device and storage medium - Google Patents

Model optimization method, device and storage medium

Info

Publication number
CN113807528A
Authority
CN
China
Prior art keywords
sample
sample sets
sets
marking
data
Legal status
Pending
Application number
CN202010550559.7A
Other languages
Chinese (zh)
Inventor
陈泽晗
赵伟
陈岳峰
何源
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CN202010550559.7A
Publication of CN113807528A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiments of the application provide a model optimization method, device, and storage medium. In these embodiments, a plurality of sample data can be marked in advance to obtain a plurality of sample sets; on that basis, target sample sets meeting a preset requirement can be selected in batches from the sample sets, and the model to be improved is trained on the selected target sample sets. The target sample sets are thus selected in batches by considering both the sample data and its marking information, and are added to the training set. As a result, large batches of target sample sets can be mined efficiently, so that the value of massive reflow data (data flowing back from the model's online use) is fully exploited; target sample sets carrying essential knowledge can be mined from the reflow data more accurately and comprehensively, so that the structure of the training set is optimized, its quality is improved, and model performance keeps improving; in addition, mining target sample sets in batches greatly reduces the number of sample-selection rounds for the model to be improved, which effectively improves the efficiency of model optimization.

Description

Model optimization method, device and storage medium
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a model optimization method, device, and storage medium.
Background
A conventional active learning model is typically expressed as A = (C, Q, S, L, U), where C is a classifier (or set of classifiers), L is the set of labeled samples used for training, Q is a query function used to query the most informative samples from the unlabeled sample pool U, and S is a supervisor that can label the samples queried by Q. The model starts learning from a small number of initially labeled samples L, selects the most useful samples via the query function Q, asks the supervisor S for their labels, then retrains the classifier with the newly gained knowledge and proceeds to the next round of querying.
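For reference, this conventional loop can be sketched as follows. This is a minimal illustration under assumed names (a scikit-learn-style classifier stands in for C, and `oracle` stands in for the supervisor S); it is not an implementation taken from any particular prior-art system:

```python
# A minimal sketch of the conventional active-learning loop A = (C, Q, S, L, U).
# All names and the scikit-learn-style classifier are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression


def uncertainty_query(model, U, batch=1):
    """Query function Q: pick the unlabeled samples the model is least confident about."""
    probs = model.predict_proba(U)
    confidence = probs.max(axis=1)          # confidence of the top prediction
    return np.argsort(confidence)[:batch]   # least-confident first


def active_learning(L_x, L_y, U, oracle, rounds=10):
    model = LogisticRegression(max_iter=1000)   # classifier C
    for _ in range(rounds):
        model.fit(L_x, L_y)                     # train on labeled set L
        idx = uncertainty_query(model, U)       # Q queries the unlabeled pool U
        new_y = oracle(U[idx])                  # supervisor S labels the queried samples
        L_x = np.vstack([L_x, U[idx]])          # fold new knowledge back into L
        L_y = np.concatenate([L_y, new_y])
        U = np.delete(U, idx, axis=0)
    return model
```

Note that each round queries only a handful of samples and retrains; this per-round iteration is the cost the batch selection scheme below avoids.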
However, models trained in this way have reached a performance bottleneck and cannot meet the ever-higher performance requirements placed on them.
Disclosure of Invention
Aspects of the present application provide a model optimization method, device, and storage medium to improve the performance of machine learning models.
The embodiment of the application provides a model optimization method, which comprises the following steps:
obtaining a plurality of sample sets, each sample set comprising sample data and marking information;
selecting, in batches from the plurality of sample sets, target sample sets that meet a preset requirement, based on the sample data and marking information contained in each sample set;
and training a model to be improved according to the target sample sets.
The embodiments of the application also provide a computing device comprising a memory and a processor;
the memory is configured to store one or more computer instructions;
the processor, coupled with the memory, is configured to execute the one or more computer instructions to:
obtain a plurality of sample sets, each sample set comprising sample data and marking information;
select, in batches from the plurality of sample sets, target sample sets that meet a preset requirement, based on the sample data and marking information contained in each sample set;
and train a model to be improved according to the target sample sets.
Embodiments of the present application also provide a computer-readable storage medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform the aforementioned model optimization method.
In the embodiments of the application, a plurality of sample data can be marked in advance to obtain a plurality of sample sets; on that basis, target sample sets meeting a preset requirement can be selected in batches from the sample sets, and the model to be improved is trained on the selected target sample sets. The target sample sets are thus selected in batches by considering both the sample data and its marking information, and are added to the training set. This achieves at least the following technical effects:
1. Large batches of target sample sets can be mined efficiently, greatly increasing the order of magnitude of the training set and fully exploiting the value of massive reflow data;
2. Target sample sets carrying essential knowledge can be mined from the reflow data more accurately and comprehensively, so that the structure of the training set is optimized, its quality is improved, and model performance keeps improving;
3. Mining target sample sets in batches greatly reduces the number of queries made by the query function of the model to be improved, which effectively improves the efficiency of model optimization.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic flow chart diagram illustrating a model optimization method according to an exemplary embodiment of the present application;
FIG. 2 is a logical schematic of a model optimization scheme provided in an exemplary embodiment of the present application;
FIG. 3 is a schematic logical representation of another model optimization scheme provided by an example of the present application;
FIG. 4 is a schematic structural diagram of a computing device according to another exemplary embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
To address the technical problem that existing model training schemes have reached a model performance bottleneck, in some embodiments of the present application: a plurality of sample data are marked in advance to obtain a plurality of sample sets; on that basis, target sample sets meeting a preset requirement are selected in batches from the sample sets, and the model to be improved is trained on the selected target sample sets. The target sample sets are thus selected in batches by considering both the sample data and its marking information, and are added to the training set. In this way, the value of massive reflow data can be fully exploited; the structure of the training set can be optimized and its quality improved, so that model performance keeps improving; and the efficiency of model optimization can be effectively increased.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a model optimization method according to an exemplary embodiment of the present application. Fig. 2 is a logic diagram of a model optimization scheme provided in an exemplary embodiment of the present application. The model optimization method provided by this embodiment can be executed by a model optimization apparatus, which can be implemented as software or as a combination of software and hardware, and which can be integrated into a computing device.
As shown in fig. 1, the method includes:
Step 100, obtaining a plurality of sample sets, each sample set comprising sample data and marking information;
Step 101, selecting, in batches from the plurality of sample sets, target sample sets that meet a preset requirement, based on the sample data and marking information contained in each sample set;
Step 102, training a model to be improved according to the target sample sets.
The model optimization method provided by this embodiment can be used to improve the performance of a machine learning model, particularly an active learning model. The application scenario is not limited in this embodiment; the model to be improved may be a machine learning model used in any of various application scenarios.
In addition, the model to be improved may be a model that has already been optimized with a conventional training method in some application scenario, in which case the model optimization method provided by this embodiment can optimize it further; of course, the model to be improved may also be an initial model, in which case the method can efficiently optimize it to the required performance.
In this embodiment, the model to be improved includes, but is not limited to, a residual network (ResNet) model, a Visual Geometry Group (VGG) model, an Inception V3 model, and the like. These are merely examples, and each of these models can be further subdivided into more specific variants; the model architecture adopted by the model to be improved is not limited in this embodiment, and the method is generally applicable to various machine learning models.
With the development of machine learning technology, machine learning models are being introduced into more and more application scenarios to solve classification, regression, and other problems, and massive reflow data is generated as these models are applied.
In this embodiment, the application scenarios are not limited; the model architectures adopted by the models to be improved may differ across application scenarios, and the model optimization scheme provided in this embodiment is generally applicable to the models used in various application scenarios and can effectively improve their performance. For example, application scenarios may include live streaming, social networking, animation, e-commerce, finance, intelligent transportation, medical management, and the like; these, too, are merely examples, and the model optimization method provided by this embodiment can be applied to any application scenario that uses a machine learning model.
In addition, this embodiment does not limit the problem handled by the model to be improved, which may differ across application scenarios; for example, the model to be improved may handle an image classification problem, an order allocation problem, and so on.
In this embodiment, a data pool may be constructed from these reflow data. Accordingly, in this embodiment, the data pool may be on the order of millions, tens of millions, or even higher.
In this embodiment, the reflow data may be used directly as sample data. Of course, in practical applications, the reflow data may also be pre-screened so that only part of it serves as the sample data in this embodiment. For example, the reflow data in the data pool may be pre-filtered along the information-entropy dimension, discarding the reflow data whose information entropy is not high enough and keeping the rest as the sample data; this embodiment is not limited in this respect.
In this embodiment, the sample data may also be in the order of millions, tens of millions, or even higher.
On this basis, the embodiment can mark the sample data in advance.
In this embodiment, the marking method is not limited. For example, manual marking or batch marking using a marking model may be used, and of course, marking of sample data may be performed using any marking method used now or in the future.
In this embodiment, a sample set may be composed of a sample and the marking information corresponding to the sample. As mentioned above, the sample data may be in the order of millions, tens of millions, or even higher, and accordingly, in this embodiment, sample sets in the order of millions, tens of millions, or even higher are obtained.
In different scenes, the content carried by the marking information may not be completely the same. For example, in a classification scenario, the marking information may be category information, and in a regression scenario, the marking information may be a regression result, and so on.
In step 101, a target sample set meeting a preset requirement may be selected in batch from a plurality of sample sets based on sample data and marking information included in each of the plurality of sample sets.
Therefore, in this embodiment, the target sample set can be selected based on the marked sample data, and added to the training set. The training set refers to a training set used for model training.
In this embodiment, the target sample set may be selected in batches. The selected target sample set may be in the order of millions, tens of millions, or even higher.
No iteration of the model to be improved is needed during the selection of the target sample sets, so the computing resources consumed by model iteration can be effectively reduced and the selection efficiency of the target sample sets improved.
The model optimization scheme provided by this embodiment is therefore particularly suitable for cases where the training set exceeds the order of tens of thousands, and it can efficiently mine target sample sets of the required magnitude.
In this embodiment, the selection of the target sample sets considers not only the sample data itself but also its marking information. The quality of the sample data can thus be evaluated more comprehensively and reasonably, and the structure of the training set can be optimized at a global level. The essential knowledge in massive reflow data can therefore be sufficiently mined, so the sample selection approach provided in this embodiment can greatly improve the quality of the training set.
On this basis, when the training set exceeds the order of tens of thousands, the model optimization approach provided by this embodiment can, compared with conventional model training, mine target sample sets carrying essential knowledge from the reflow data more accurately and comprehensively, yielding a higher-quality training set. It can thus break through the model performance bottleneck reachable by conventional training and further improve model performance. Moreover, as the reflow data is continuously refreshed, the scheme can mine target sample sets carrying new essential knowledge from the new reflow data, so that model performance keeps improving.
In step 102, the target sample sets may be input into the model to be improved all at once to train it. Of course, the target sample sets may also be input in batches; this embodiment is not limited in this respect.
In summary, the model optimization scheme provided by this embodiment achieves at least the following technical effects:
1. Large batches of target sample sets can be mined efficiently, greatly increasing the order of magnitude of the training set and fully exploiting the value of massive reflow data;
2. Target sample sets carrying essential knowledge can be mined from the reflow data more accurately and comprehensively, so that the structure of the training set is optimized, its quality is improved, and model performance keeps improving;
3. Mining target sample sets in batches greatly reduces the number of sample-selection rounds for the model to be improved, which effectively improves the efficiency of model optimization.
In the foregoing or following embodiments, the sample value corresponding to each of the plurality of sample sets may be calculated from the sample data and marking information contained in each sample set.
In this embodiment, the sample value may be used to measure the quality of the sample set. The larger the sample value of the sample set, the greater the role played in the model optimization process.
In an exemplary implementation, marking-information prediction can be performed on the sample data contained in each of the plurality of sample sets to obtain the prediction probability of the prediction result corresponding to each sample data; the marking quality parameter corresponding to each sample set can be determined from the marking information it contains; and the sample value corresponding to each sample set can then be calculated from its marking quality parameter and the prediction probability of its prediction result.
The marking-information prediction can be performed using the model to be improved: each sample data is input into the model to be improved, which outputs the prediction result corresponding to that sample data, and the prediction probability of that prediction result is obtained.
For example, in a classification scenario, the model to be improved can output the classification result corresponding to each sample data, and the prediction probability of that classification result is obtained.
In practical applications, if a sample set is expressed as {x_j, y_j}, with j indexing the sample sets, the prediction probability of the prediction result corresponding to each sample data can be expressed as formula one:

P_m(ŷ_j | x_j) = max_i P_m(y_i | x_j)    (formula one)

where P_m(y_i | x_j) denotes the probability that sample data x_j is predicted as marking information y_i, and ŷ_j denotes the marking information with the maximum prediction probability, i.e. the prediction result.
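As an illustration of formula one only (the `predict_proba` interface is an assumed scikit-learn-style convention, not something mandated by the disclosure), the prediction result and its probability can be computed in one pass:

```python
import numpy as np


def prediction_results(model, X):
    """For each sample x_j, return the prediction result y_hat_j and its
    prediction probability P_m(y_hat_j | x_j), per formula one.
    `model.predict_proba` is an assumed interface returning P_m(y_i | x_j)
    for every marking information y_i."""
    probs = model.predict_proba(X)   # shape: (num_samples, num_classes)
    y_hat = probs.argmax(axis=1)     # marking information with max probability
    p_hat = probs.max(axis=1)        # prediction probability of the result
    return y_hat, p_hat
```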
In this implementation, multiple schemes may be employed to determine marking quality parameters corresponding to each of the plurality of sample sets.
The marking quality parameter is used to measure the quality of the marking information.
In an exemplary scheme, marking quality parameters corresponding to a plurality of sample sets can be determined according to the prediction result and the marking information corresponding to each sample data.
In this exemplary scheme, if the prediction result corresponding to first sample data is consistent with its marking information, the marking quality parameter of the sample set containing the first sample data is determined as marking-correct; if the prediction result corresponding to the first sample data is inconsistent with its marking information, the marking quality parameter is determined as marking-wrong. The first sample data is sample data contained in any one of the plurality of sample sets.
In practical applications, the marking quality parameter can be recorded as 1 when the marking is correct, and as 0 when the marking is wrong.
Following formula one, the marking quality parameter can be expressed as

q_j = 1[ŷ_j = y_j]

that is, if ŷ_j = y_j, the marking quality parameter is 1; otherwise it is 0.
Of course, other schemes may also be adopted to determine the marking quality parameters corresponding to the several sample sets, for example, schemes such as manual evaluation and the like, which are not limited herein.
Therefore, marking quality parameters and prediction probabilities of prediction results corresponding to the sample sets can be obtained.
On this basis, an exemplary calculation scheme is: subtract the prediction probability of the prediction result corresponding to a first sample set from 1 to obtain a first factor; use the marking quality parameter corresponding to the first sample set as a second factor; and calculate the product of the first factor and the second factor as the sample value corresponding to the first sample set, where the first sample set is any one of the plurality of sample sets.
The above exemplary calculation scheme can be expressed as formula two:

V_j = (1 − P_m(ŷ_j | x_j)) · q_j    (formula two)
Carrying over the 0/1 notation for the marking quality parameter: when the marking quality parameter is 0, the sample value of the sample set is also 0; when it is 1, the higher the prediction probability of the prediction result, the lower the sample value of the sample set.
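A direct, hedged rendering of formula two (the array names are assumptions; `y_hat` and `p_hat` are as in the sketch following formula one):

```python
import numpy as np


def sample_values(p_hat, y_hat, y_marked):
    """Formula two: V_j = (1 - P_m(y_hat_j | x_j)) * q_j, where q_j is the
    marking quality parameter (1 if the prediction matches the marking, else 0)."""
    q = (y_hat == y_marked).astype(float)   # marking quality parameter
    return (1.0 - p_hat) * q                # correctly marked samples the model
                                            # is least confident about score highest
```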
Of course, the above is only one exemplary implementation of the sample value; in this embodiment, other implementations may also be used to calculate the sample value from the sample data and marking information contained in each sample set. For example, the marking information and the sample data may each be scored according to a preset evaluation strategy, and the two scores combined by weighted summation to determine the sample value of the sample set; this embodiment is not limited in this respect.
Accordingly, sample values corresponding to the sample sets can be obtained.
On this basis, in this embodiment, a target sample set meeting the preset requirement may be selected in batch from a plurality of sample sets according to the sample values corresponding to the sample sets, respectively.
In practical application, based on the sample values corresponding to the plurality of sample sets, a target sample set meeting preset requirements can be selected from the plurality of sample sets from at least one selection dimension.
Fig. 3 is a logic diagram of another model optimization scheme provided in an example of the present application.
Referring to fig. 3, the at least one selection dimension includes, but is not limited to, a total-amount dimension, a marking quality dimension, or a sample balance dimension.
Taking these exemplary dimensions as examples, the selection scheme of the target sample sets under each dimension is described below.
Total-amount dimension
Under the total-amount dimension, the N sample sets with the largest sample values can be selected from the plurality of sample sets as the target sample sets, where N is a preset selection total.
Under some optimization requirements, the total number of sample sets in the training set may be limited; in that case, the total number N of target sample sets to be selected can be determined, and the N sample sets with the largest sample values are selected from the plurality of sample sets.
In this way, sample sets of high sample value are preferentially used for model optimization, so that when the scale of the training set is limited, the most essential sample sets are used and a better optimization effect is achieved.
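Under this dimension, selection reduces to a top-N by sample value; a one-function sketch (the function and array names are assumptions, not terms from the disclosure):

```python
import numpy as np


def select_top_n(values, n):
    """Total-amount dimension: indices of the N sample sets with the
    largest sample values."""
    return np.argsort(values)[::-1][:n]
```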
Marking quality dimension
Under the marking quality dimension, the plurality of sample sets can be divided into correctly marked sample sets and wrongly marked sample sets according to their sample values; then, according to a preset proportion requirement for wrongly marked sample sets, some of the wrongly marked sample sets are discarded so that the target sample sets meet the proportion requirement.
Because the sample sets in this embodiment are marked in advance, the marking may be inaccurate, and wrongly marked sample sets could drag down model performance. The plurality of sample sets can therefore be cleaned along the marking quality dimension, regulating the proportion of dirty data, where dirty data means the wrongly marked sample sets.
This effectively guarantees the quality of the target sample sets. In particular, when the training set is large, wrongly marked sample sets can pull down model performance; selection under the marking quality dimension avoids this.
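One possible way, among others, to regulate the dirty-data proportion just described, sketched with assumed names; `q` is the 0/1 marking quality parameter from formula two:

```python
import numpy as np


def enforce_dirty_ratio(selected, q, max_dirty_ratio=0.01):
    """Marking quality dimension: discard wrongly marked sample sets (q == 0)
    until they make up at most `max_dirty_ratio` of the selection."""
    selected = np.asarray(selected)
    clean = selected[q[selected] == 1]
    dirty = selected[q[selected] == 0]
    # largest dirty count d satisfying d / (len(clean) + d) <= max_dirty_ratio
    keep = int(len(clean) * max_dirty_ratio / (1.0 - max_dirty_ratio))
    return np.concatenate([clean, dirty[:keep]])
```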
Sample balance dimension
In this embodiment, first, every sample set contains marking information; second, sample value evaluation is performed on every sample set.
Based on the first point, the marking information can serve as the basis for accurately grouping the sample sets, so that the prediction performance of the model to be improved on different prediction results is covered more comprehensively.
A sample balance ratio can be set flexibly under different optimization requirements; the sample balance ratio refers to the ratio between the amounts of sample data under different marking information. From the sample balance ratio, the number of target sample sets selectable under each marking information can be determined.
Therefore, based on the marking information, the selected target sample sets can be guaranteed to conform to the sample balance ratio, which greatly optimizes the structure of the training set and makes the distribution of sample data in the training set more reasonable and more even. Moreover, how the model to be improved covers the prediction performance of different prediction results can be regulated more flexibly; for example, for a prediction result whose performance falls short, a higher proportion can be assigned to the corresponding marking information in the sample balance ratio, so that its prediction performance is optimized with emphasis.
Further, the second point can be combined: based on the sample values of the sample sets, the sample sets matching the number to be selected under each group are selected as the target sample sets.
In practical applications, the sample data under different marking information can be sorted by sample value, i.e. the groups of sample sets are sorted within each group.
In this way, under each group of sample sets, the sample sets with the largest sample values, in the number required for that group by the preset sample balance ratio, can be selected as the target sample sets; a sketch follows this paragraph.
By combining these two points, the distribution of sample data in the training set becomes more reasonable and balanced, and the selected target sample sets are of higher quality and carry more essential knowledge.
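The grouped, in-group-sorted selection can be sketched as follows (illustrative only; `labels` holds each sample set's marking information as integers, `values` its sample values, and `quota` the per-group counts derived from the sample balance ratio):

```python
import numpy as np


def select_balanced(labels, values, quota):
    """Sample balance dimension: within each marking-information group, take
    the sample sets with the highest sample values, up to the group's quota,
    e.g. quota = {0: 300_000, 1: 400_000, 2: 300_000} for a 3:4:3 ratio."""
    selected = []
    for label, n in quota.items():
        group = np.flatnonzero(labels == label)           # group by marking info
        ranked = group[np.argsort(values[group])[::-1]]   # in-group sort by value
        selected.append(ranked[:n])
    return np.concatenate(selected)
```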
Above, the selection schemes of the target sample sets were set out from several exemplary selection dimensions. It should be understood that each selection dimension may be applied individually, and the dimensions may also be flexibly combined to select the target sample sets.
In one exemplary approach, the target sample sets may be selected by combining the sample balance dimension and the marking quality dimension.
Some wrongly marked sample sets can first be discarded from the plurality of sample sets along the marking quality dimension; on this basis, the target sample sets can then be selected from the remaining sample sets along the sample balance dimension. For details, refer to the selection schemes of the sample balance dimension and the marking quality dimension above, which are not repeated here.
Of course, the order of the two selection dimensions can also be swapped, i.e. the selection result of the sample balance dimension is used as the basis for selection along the marking quality dimension; this embodiment is not limited in this respect.
Similarly, the combination and order of the selection dimensions can be adjusted flexibly; the possibilities are not listed exhaustively here, but this should not limit the protection scope of the present application.
In summary, this embodiment provides a brand-new scheme for measuring sample value. Compared with traditional schemes that measure sample value along dimensions such as information entropy alone, it can determine the sample value of sample data more comprehensively and reasonably. On this basis, using the sample values of the sample sets, target sample sets carrying essential knowledge can be mined from the reflow data more accurately and comprehensively, so that the structure of the training set is optimized, its quality is improved, and model performance keeps improving.
The following describes the model optimization scheme using an image classification model as the model to be improved.
In the search-by-image feature in the e-commerce field, the images input by users usually need to be classified first by an image classification model; the categories may include dresses, trousers, and the like. The accuracy of the classification results directly affects the quality of the search results.
Conventional model training methods have reached a model performance bottleneck and cannot improve the model further.
The model optimization method provided by this embodiment can break through this bottleneck and further improve model performance.
The large number of images input by users can serve as the reflow data for the search-by-image feature; typically, this reflow data can reach the order of tens of millions.
Based on this, the reflow data is first marked, for example manually or with a marking model, and each image sample in the reflow data is combined with its marking information into a sample set. In this way, sample sets on the order of tens of millions are obtained. For example, a sample set may be [Picture A, dress].
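A sample set of this kind is simply sample data paired with its marking information; a minimal illustration (the class and field names are assumptions, not terms from the disclosure):

```python
from dataclasses import dataclass


@dataclass
class SampleSet:
    """One sample set: sample data plus its marking information."""
    sample_data: str    # e.g. a path or identifier for "Picture A"
    marking_info: str   # e.g. the category "dress"


example = SampleSet(sample_data="picture_a.jpg", marking_info="dress")
```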
Thereafter, a sufficiently large number of high quality sample sets can be selected at once from the ten million-level sample sets.
In this embodiment, preparation before optimization can be made from the following three aspects:
and classifying and predicting the image samples in each sample set by using an image classification model, and marking the sample sets as dirty data if the prediction result is inconsistent with the marking information.
And calculating the sample value V of each sample set according to the output probability y corresponding to the prediction result, wherein for the non-dirty data, the larger the output probability y is, the lower the sample value V is.
In addition, since the marking information included in the sample set substantially indicates the category, the number of sample sets in different categories may be counted.
Based on the preparation of the three aspects, the sample set can be selected from the dimensions of the required total sample set amount, dirty data proportion regulation, sample balance proportion and the like.
For example, suppose 900,000 to 1,000,000 sample sets are needed, the dirty-data ratio must be below 1%, and the sample balance ratio should be as close as possible to 3:4:3. An exemplary selection process may be:
First, according to the marking information, the sample sets are divided into three groups, the marking information within each group being consistent, and each group is sorted by sample value.
Then, under the three groups, the 300,000, 400,000, and 300,000 sample sets with the highest sample values are selected respectively, according to the sample balance ratio.
Next, whether the proportion of dirty data among the selected 1,000,000 sample sets is below 1% is checked; if not, some dirty data is deleted from them so that the proportion falls below 1%.
At this point, about 1,000,000 high-quality sample sets have been picked out in one pass from tens of millions of sample sets. No iteration of the image classification model is needed during the selection, so the number of sample-selection rounds is effectively reduced.
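Putting the example together end to end, reusing the helpers sketched earlier (`prediction_results`, `sample_values`, `select_balanced`, `enforce_dirty_ratio`); the quotas and the 1% cap are the illustrative figures from this example, and nothing here is claimed as the only possible implementation:

```python
def select_training_set(model, X, y_marked, quota, max_dirty_ratio=0.01):
    """End-to-end sketch: score every sample set once with the image
    classification model, select per-category quotas (e.g. 300k/400k/300k
    for the 3:4:3 ratio) by sample value, then cap dirty data below 1%.
    No iteration of the model is needed during the selection."""
    y_hat, p_hat = prediction_results(model, X)      # one scoring pass
    values = sample_values(p_hat, y_hat, y_marked)   # formula two; dirty data scores 0
    picked = select_balanced(y_marked, values, quota)
    return enforce_dirty_ratio(picked, y_hat == y_marked, max_dirty_ratio)


# e.g. select_training_set(model, images, labels,
#                          quota={0: 300_000, 1: 400_000, 2: 300_000})
```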
Then, performance optimization of the image classification model is achieved simply by inputting the selected roughly 1,000,000 sample sets into the image classification model. Because the selected sample sets are evenly distributed and of high training value, the image classification model can break through the bottleneck of conventional training schemes and improve further; moreover, this sample-set selection scheme can subsequently be executed on new reflow data to keep optimizing the image classification model, so that its performance keeps improving.
It should be noted that the execution subjects of the steps of the methods provided in the above embodiments may be the same device, or different devices may be used as the execution subjects of the methods. For example, the execution subjects of steps 100 to 102 may be device a; for another example, the execution subject of steps 100 and 101 may be device a, and the execution subject of step 102 may be device B; and so on.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations are included in a specific order, but it should be clearly understood that the operations may be executed out of the order presented herein or in parallel, and the sequence numbers of the operations, such as 100, 101, etc., are merely used for distinguishing different operations, and the sequence numbers do not represent any execution order per se. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", and the like in this document are used to distinguish different sample data, sample sets, and the like, and do not represent a sequential order, and do not limit that "first" and "second" are different types.
Fig. 4 is a schematic structural diagram of a computing device according to another exemplary embodiment of the present application. As shown in fig. 4, the computing device includes: a memory 40 and a processor 41.
Memory 40 is used to store computer programs and may be configured to store other various data to support operations on the computing platform. Examples of such data include instructions for any application or method operating on the computing platform, contact data, phonebook data, messages, pictures, videos, and so forth.
The memory 40 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The processor 41, coupled to the memory 40, executes the computer program in the memory to:
obtain a plurality of sample sets, each sample set comprising sample data and marking information;
select, in batches from the plurality of sample sets, target sample sets that meet a preset requirement, based on the sample data and marking information contained in each sample set;
and train a model to be improved according to the target sample sets.
In an optional embodiment, the processor 41, when batch-selecting a target sample set meeting a preset requirement from the sample sets based on the sample data and the marking information included in each of the sample sets, is configured to:
calculating the sample value corresponding to each of the plurality of sample sets according to the sample data and the marking information contained in each of the plurality of sample sets;
and selecting target sample sets meeting preset requirements in batches from the plurality of sample sets according to the sample values corresponding to the plurality of sample sets respectively.
In an optional embodiment, the processor 41, when calculating the sample value corresponding to each of the plurality of sample sets according to the sample data and the marking information included in each of the plurality of sample sets, is configured to:
respectively carrying out marking information prediction on sample data contained in each of the plurality of sample sets to obtain prediction probabilities of prediction results corresponding to each sample data;
respectively determining marking quality parameters corresponding to the sample sets according to marking information contained in the sample sets;
and calculating the sample value corresponding to each of the plurality of sample sets according to the marking quality parameters corresponding to each of the plurality of sample sets and the prediction probability of the prediction result.
In an optional embodiment, the processor 41, when determining the marking quality parameters corresponding to the several sample sets respectively according to the marking information included in the several sample sets respectively, is configured to:
and determining marking quality parameters corresponding to the sample sets according to the prediction results and the marking information corresponding to the sample data.
In an optional embodiment, the processor 41 is configured to, when calculating the marking quality parameter of each of the plurality of sample sets according to the prediction result and the marking information corresponding to each of the sample data:
if the prediction result corresponding to the first sample data is consistent with the marking information, determining that the marking quality parameter of the sample set where the first sample data is located is correct for marking;
and if the prediction result corresponding to the first sample data is inconsistent with the marking information, determining that the marking quality parameter of the sample set where the first sample data is located is a marking error.
The first sample data is sample data contained in any one of a plurality of sample sets.
In an alternative embodiment, the processor 41, when calculating the sample value corresponding to each of the plurality of sample sets according to the marking quality parameter and the prediction probability of the prediction result corresponding to each of the plurality of sample sets, is configured to:
subtracting the prediction probability of the prediction result corresponding to the first sample set from 1 to obtain a first factor;
using the marking quality parameter corresponding to the first sample set as a second factor;
calculating the product of the first factor and the second factor as the sample value corresponding to the first sample set;
wherein the first sample set is any one of a plurality of sample sets.
In an optional embodiment, when performing marking information prediction on sample data included in each of the plurality of sample sets to obtain prediction probabilities of prediction results corresponding to each of the sample data, the processor 41 is configured to:
inputting the sample data contained in each of the plurality of sample sets into the model to be improved;
and predicting, using the model to be improved, the result for the sample data contained in each of the plurality of sample sets, to obtain the prediction probability of the prediction result corresponding to each sample data.
In an alternative embodiment, the processor 41, when batch-selecting a target sample set meeting a preset requirement from the plurality of sample sets according to the sample values corresponding to the plurality of sample sets, is configured to:
and selecting a target sample set meeting preset requirements from the plurality of sample sets according to the sample values corresponding to the plurality of sample sets respectively from at least one selection dimension.
In an alternative embodiment, the at least one selection dimension includes a total amount dimension, and the processor 41, when selecting the target sample set meeting the preset requirement from the plurality of sample sets based on the sample values corresponding to the plurality of sample sets, is configured to:
and selecting N sample sets with the maximum sample value from the plurality of sample sets as target sample sets, wherein N is a preset total selection amount.
In an alternative embodiment, the at least one selected dimension includes a marking quality dimension, and the processor 41, when selecting the target sample set meeting the preset requirement from the plurality of sample sets based on the sample values corresponding to the plurality of sample sets, is configured to:
dividing the plurality of sample sets into a correct marking sample set and an incorrect marking sample set according to the sample values corresponding to the plurality of sample sets respectively;
and according to the preset occupation ratio requirement of the error marking sample set, discarding part of the error marking sample sets in the plurality of sample sets to obtain a target sample set meeting the occupation ratio requirement.
In an alternative embodiment, the at least one selection dimension includes a sample balance dimension, and the processor 41, when selecting a target sample set meeting a preset requirement from the plurality of sample sets based on the sample values corresponding to the plurality of sample sets, is configured to:
grouping a plurality of sample sets according to the marking information to obtain a plurality of groups of sample sets, wherein the sample sets in different groups contain different marking information;
respectively determining the number of sample sets required to be selected under a plurality of groups of sample sets according to a preset sample balance proportion;
and selecting sample sets matched with the number of the sample sets required to be selected under the multiple groups of sample sets respectively as target sample sets based on the sample values corresponding to the multiple sample sets respectively.
In an alternative embodiment, the processor 41, when selecting, as the target sample set, sample sets under multiple sets of sample sets that match the number of sample sets required to be selected under the multiple sets of sample sets based on the sample values corresponding to the sample sets, is configured to:
based on the sample values corresponding to the plurality of sample sets, respectively carrying out in-group sequencing on the plurality of groups of sample sets;
and under the multiple groups of sample sets, respectively selecting the sample set which is matched with the number of the sample sets needing to be selected under the multiple groups of sample sets and has the maximum sample value as a target sample set.
In an alternative embodiment, processor 41 is further configured to:
if the ratio of the wrong marking sample set in the target sample set does not meet the preset ratio requirement;
discarding part of the mistakenly marked sample sets from the determined target sample set so as to enable the remaining target sample set to meet the proportion requirement;
wherein the incorrectly marked sample set is determined according to the sample value of the sample set.
In an alternative embodiment, the target sample set for batch selection is on the order of tens of thousands or more.
In an alternative embodiment, the processor 41, when training the model to be improved based on the target sample sets, is configured to:
input the target sample sets into the model to be improved at one time to train the model to be improved.
Further, as shown in fig. 4, the computing device further includes: communication components 42, power components 43, and the like. Only some of the components are schematically shown in fig. 4, and the computing device is not meant to include only the components shown in fig. 4.
Accordingly, the present application further provides a computer-readable storage medium storing a computer program, where the computer program can implement the steps that can be executed by a computing device in the foregoing method embodiments when executed.
The communication component in fig. 4 is configured to facilitate wired or wireless communication between the device where the communication component is located and other devices. The device where the communication component is located can access a wireless network based on a communication standard, such as WiFi, or a 2G, 3G, 4G/LTE, or 5G mobile communication network, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power supply assembly of fig. 4 described above provides power to the various components of the device in which the power supply assembly is located. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (18)

1. A method of model optimization, comprising:
obtaining a plurality of sample sets, wherein the sample sets comprise sample data and marking information;
selecting target sample sets meeting preset requirements in batches from the sample sets on the basis of sample data and marking information contained in the sample sets;
and training a model to be improved according to the target sample sets.
2. The method of claim 1, wherein the batch selecting a target sample set meeting preset requirements from the sample sets based on the sample data and the marking information included in each of the sample sets comprises:
calculating sample values corresponding to the sample sets according to the sample data and the marking information contained in the sample sets;
and selecting target sample sets meeting preset requirements in batches from the sample sets according to the sample values corresponding to the sample sets respectively.
3. The method according to claim 2, wherein the calculating the sample value corresponding to each of the plurality of sample sets according to the sample data and the marking information included in each of the plurality of sample sets comprises:
respectively carrying out marking information prediction on sample data contained in the plurality of sample sets to obtain prediction probabilities of prediction results corresponding to the sample data;
respectively determining marking quality parameters corresponding to the sample sets according to marking information contained in the sample sets;
and calculating the sample value corresponding to each of the plurality of sample sets according to the marking quality parameters corresponding to each of the plurality of sample sets and the prediction probability of the prediction result.
4. The method according to claim 3, wherein determining the marking quality parameters corresponding to each of the plurality of sample sets according to the marking information included in each of the plurality of sample sets comprises:
determining the marking quality parameters corresponding to each of the plurality of sample sets according to the prediction results and the marking information corresponding to each sample data.
5. The method of claim 4, wherein determining the marking quality parameters of each of the plurality of sample sets based on the prediction results and the marking information corresponding to each sample data comprises:
if the prediction result corresponding to first sample data is consistent with its marking information, determining the marking quality parameter of the sample set containing the first sample data as marking-correct;
if the prediction result corresponding to the first sample data is inconsistent with its marking information, determining the marking quality parameter of the sample set containing the first sample data as marking-error;
wherein the first sample data is sample data contained in any one of the sample sets.
6. The method according to any one of claims 3-5, wherein calculating the sample value corresponding to each of the plurality of sample sets according to the marking quality parameters and the prediction probabilities of the prediction results corresponding to each of the plurality of sample sets comprises:
subtracting the prediction probability of the prediction result corresponding to a first sample set from 1 to obtain a first factor;
taking the marking quality parameter corresponding to the first sample set as a second factor;
and calculating the product of the first factor and the second factor as the sample value corresponding to the first sample set;
wherein the first sample set is any one of the plurality of sample sets.
7. The method according to claim 3, wherein the performing marking information prediction on the sample data included in each of the plurality of sample sets to obtain prediction probabilities of prediction results corresponding to each of the sample data respectively comprises:
inputting sample data contained in each of the plurality of sample sets into the model to be promoted;
and performing result prediction on the sample data contained in each of the sample sets by using the model to be promoted, so as to obtain the prediction probability of the prediction result corresponding to each sample data.
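An illustrative sketch of the prediction step of claim 7, assuming the model to be promoted is a callable returning per-class logits; the numpy-based API and the softmax normalization are assumptions, not part of the claim.

```python
import numpy as np

def predict_with_model(model, sample_batch):
    # Run the model to be promoted over the sample data; `model` is assumed
    # to return an array of logits with shape (n_samples, n_classes).
    logits = np.asarray(model(sample_batch))
    # Numerically stabilized softmax turns logits into probabilities.
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    # The prediction result is the most probable class; also return the
    # probability assigned to it, as used by the sample-value formula.
    pred_labels = probs.argmax(axis=1)
    pred_probs = probs[np.arange(len(pred_labels)), pred_labels]
    return pred_labels, pred_probs
```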
8. The method of claim 2, wherein selecting, in batches, target sample sets meeting preset requirements from the sample sets according to the sample values corresponding to each of the sample sets comprises:
and selecting, in at least one selection dimension, target sample sets meeting preset requirements from the plurality of sample sets based on the sample values corresponding to each of the plurality of sample sets.
9. The method of claim 8, wherein the at least one selection dimension comprises a total volume dimension, and wherein selecting a target sample set from the plurality of sample sets that meets a preset requirement based on the sample value corresponding to each of the plurality of sample sets comprises:
and selecting, from the plurality of sample sets, the N sample sets with the largest sample values as target sample sets, wherein N is a preset total selection amount.
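Claim 9 amounts to a top-N selection by sample value; a minimal sketch using Python's heapq (names are illustrative):

```python
import heapq

def select_top_n(sample_sets, values, n):
    # Pair each sample set with its sample value; the running index breaks
    # ties so the sample sets themselves are never compared.
    paired = zip(values, range(len(sample_sets)), sample_sets)
    return [s for _, _, s in heapq.nlargest(n, paired)]
```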
10. The method of claim 8, wherein the at least one selection dimension comprises a marking quality dimension, and wherein selecting target sample sets meeting preset requirements from the plurality of sample sets based on the sample values corresponding to each of the plurality of sample sets comprises:
dividing the plurality of sample sets into correctly marked sample sets and incorrectly marked sample sets according to the sample values corresponding to each of the plurality of sample sets;
and discarding some of the incorrectly marked sample sets according to a preset proportion requirement for incorrectly marked sample sets, so as to obtain target sample sets meeting the proportion requirement.
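A sketch of the proportion requirement of claim 10, assuming the correct/incorrect split has already been made from the sample values and that the incorrectly marked sets are pre-sorted by descending sample value so the most valuable ones are kept; `max_wrong_ratio` is a hypothetical name for the preset proportion requirement (assumed to be below 1):

```python
def enforce_wrong_ratio(correct_sets, wrong_sets, max_wrong_ratio):
    # Keep at most `kept` incorrectly marked sample sets so that
    # kept / (len(correct_sets) + kept) <= max_wrong_ratio.
    kept = int(max_wrong_ratio * len(correct_sets) / (1.0 - max_wrong_ratio))
    return correct_sets + wrong_sets[:kept]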
11. The method of claim 8, wherein the at least one selection dimension comprises a sample equalization dimension, and wherein selecting a target sample set from the plurality of sample sets that meets a preset requirement based on the sample value corresponding to each of the plurality of sample sets comprises:
grouping the plurality of sample sets according to their marking information to obtain a plurality of groups of sample sets, wherein sample sets in different groups contain different marking information;
determining, for each of the groups, the number of sample sets to be selected according to a preset sample balance proportion;
and selecting, from each group, sample sets matching the number to be selected for that group as target sample sets, based on the sample values corresponding to the sample sets.
12. The method according to claim 11, wherein selecting, from each group, sample sets matching the number to be selected for that group as target sample sets based on the sample values corresponding to the sample sets comprises:
sorting the sample sets within each group based on their corresponding sample values;
and selecting, within each group, the sample sets with the largest sample values, up to the number to be selected for that group, as target sample sets.
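A combined sketch of claims 11 and 12: group the sample sets by marking information, derive each group's quota from a preset sample balance proportion, sort within each group by sample value, and keep the top of each group. The parameter names (`labels`, `balance`, `total_n`) and the dict-based balance proportion are assumptions:

```python
from collections import defaultdict

def select_balanced(sample_sets, values, labels, balance, total_n):
    # Claim 11: group sample sets by their marking information.
    groups = defaultdict(list)
    for value, label, sample_set in zip(values, labels, sample_sets):
        groups[label].append((value, sample_set))
    selected = []
    for label, members in groups.items():
        # Claim 11: per-group quota from the preset sample balance proportion.
        quota = int(round(balance.get(label, 0.0) * total_n))
        # Claim 12: in-group ordering by sample value, largest first.
        members.sort(key=lambda pair: pair[0], reverse=True)
        selected.extend(s for _, s in members[:quota])
    return selected
```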
13. The method of claim 11, further comprising:
if the proportion of incorrectly marked sample sets among the target sample sets does not meet a preset proportion requirement, discarding some of the incorrectly marked sample sets from the determined target sample sets so that the remaining target sample sets meet the proportion requirement;
wherein the incorrectly marked sample sets are determined according to the sample values of the sample sets.
14. The method of claim 1, wherein the number of target sample sets selected in a batch is on the order of tens of thousands or more.
15. The method of claim 14, wherein training the model to be promoted according to the target sample sets comprises:
inputting the target sample sets into the model to be promoted at one time to train the model to be promoted.
16. The method of claim 1, wherein the model to be promoted comprises one or more of a residual network (ResNet) model, a Visual Geometry Group (VGG) model, or an Inception V3 model.
17. A computing device comprising a memory and a processor;
the memory is to store one or more computer instructions;
the processor is coupled with the memory for executing the one or more computer instructions for:
obtaining a plurality of sample sets, wherein the sample sets comprise sample data and marking information;
selecting, in batches, target sample sets meeting preset requirements from the plurality of sample sets based on the sample data and marking information contained in each of the sample sets;
and training a model to be promoted according to the target sample sets.
18. A computer-readable storage medium storing computer instructions, which when executed by one or more processors, cause the one or more processors to perform the model optimization method of any one of claims 1-16.
CN202010550559.7A (priority date 2020-06-16, filing date 2020-06-16): Model optimization method, device and storage medium. Status: Pending.

Priority Applications (1)

Application Number: CN202010550559.7A; Priority Date: 2020-06-16; Filing Date: 2020-06-16; Title: Model optimization method, device and storage medium

Applications Claiming Priority (1)

Application Number: CN202010550559.7A; Priority Date: 2020-06-16; Filing Date: 2020-06-16; Title: Model optimization method, device and storage medium

Publications (1)

Publication Number: CN113807528A

Family ID: 78943324

Family Applications (1)

Application Number: CN202010550559.7A; Title: Model optimization method, device and storage medium; Status: Pending

Country Status (1)

Country: CN; Document: CN113807528A

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
REG: Reference to a national code (ref country code: HK; ref legal event code: DE; ref document number: 40063991)