CN110070143B - Method, device and equipment for acquiring training data and storage medium - Google Patents

Method, device and equipment for acquiring training data and storage medium

Info

Publication number
CN110070143B
CN110070143B (application CN201910356202.2A)
Authority
CN
China
Prior art keywords
training data
target
subset
subsets
initial
Prior art date
Legal status
Active
Application number
CN201910356202.2A
Other languages
Chinese (zh)
Other versions
CN110070143A (en)
Inventor
张志伟
李焱
吴丽军
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201910356202.2A
Publication of CN110070143A
Application granted
Publication of CN110070143B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/23: Clustering techniques
    • G06F 18/232: Non-hierarchical techniques
    • G06F 18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to a method, an apparatus, a device, and a storage medium for acquiring training data. The method includes: acquiring a target training data subset, where the target training data subset is any one of a plurality of training data subsets of an initial training data set; acquiring, from the training data subsets of the initial training data set, a first reference number of training data subsets other than the target training data subset; acquiring a second reference number of training data from each of the first reference number of training data subsets, to obtain a first reference number of groups of training data; and adding the first reference number of groups of training data to the target training data subset, and acquiring target training data for training a machine learning model based on the updated target training data subset and the remaining training data subsets in the initial training data set. The method and the device can improve the recognition accuracy of the model and reduce the time cost of acquiring training data.

Description

Method, device and equipment for acquiring training data and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a device, and a storage medium for acquiring training data.
Background
Deep learning methods are widely applied in fields such as video and image processing, speech recognition, and natural language processing. Taking the Convolutional Neural Network (CNN) as an example, the CNN has greatly improved machine recognition accuracy thanks to its superior fitting capability and end-to-end global optimization capability. For instance, after CNNs were applied to the image classification task, classification accuracy improved greatly, yet still falls short of expectations. This is because the recognition accuracy of a model depends on the cleanliness of the training data, that is, the proportion of noise data in the training data. The less noise data in the training data, the higher the prediction accuracy of the model. Therefore, a training data set containing noise data usually needs to be preprocessed.
In the related art, methods for acquiring training data include: cleansing the data using a data cleansing algorithm; or manually re-labeling the data in the training data set.
However, these methods have the following disadvantages: no general-purpose data cleansing algorithm currently exists, and a dedicated cleansing strategy has to be formulated for the data of each application scenario, so the data cleansing process is time-consuming and costly; likewise, manual labeling is expensive and has a long processing cycle.
Disclosure of Invention
The present disclosure provides a method, an apparatus, a device, and a storage medium for acquiring training data, which can overcome the long processing time and high cost of the training data acquisition approaches in the related art.
According to a first aspect of embodiments of the present disclosure, there is provided a method of acquiring training data, including: acquiring a target training data subset, wherein the target training data subset is any one of a plurality of training data subsets of an initial training data set, and each training data subset of the plurality of training data subsets corresponds to a category label;
acquiring, from the training data subsets of the initial training data set, a first reference number of training data subsets other than the target training data subset;
acquiring a second reference number of training data from each of the first reference number of training data subsets, to obtain a first reference number of groups of training data;
and adding the first reference number of groups of training data to the target training data subset to obtain an updated target training data subset, and acquiring target training data for training a machine learning model based on the updated target training data subset and the remaining training data subsets in the initial training data set.
Optionally, the second reference number is determined according to a reference ratio, the number of training data subsets in the initial training data set, and the number of training data contained in each training data subset, where the reference ratio is used for determining the number of training data to be added.
Optionally, before the obtaining the target training data subset, the method further includes:
acquiring an initial training data set, and dividing the initial training data set into a plurality of training data subsets;
the obtaining target training data for training a machine learning model based on the updated target training data subset and the remaining training data subsets in the initial training data set includes:
selecting one or more training data subsets from the remaining training data subsets in the initial training data set, and processing the selected training data subsets in a manner of processing the target training data subsets to obtain updated training data subsets;
and combining the updated training data subset, the updated target training data subset and the non-updated training data subset in the initial training data set to obtain an updated training data set, and taking the training data contained in the updated training data set as the target training data for training the machine learning model.
Optionally, after obtaining the target training data for training the machine learning model based on the updated target training data subset and the remaining training data subsets in the initial training data set, the method further includes:
training a machine learning model by using the target training data;
and when the accuracy of the trained machine learning model is less than or equal to a target value, re-acquiring the target training data until the accuracy of the trained machine learning model is greater than the target value.
Optionally, the training data subsets are obtained according to label information of training data in the initial training data set, and the label information of the training data represents categories of the training data.
Optionally, the plurality of training data subsets are obtained by clustering the training data according to the acquired physical characteristic information of the training data in the initial training data set.
According to a second aspect of embodiments of the present disclosure, there is provided an apparatus for acquiring training data, the apparatus comprising:
a first obtaining module configured to perform obtaining a target training data subset, where the target training data subset is any one of a plurality of training data subsets of an initial training data set, and each of the plurality of training data subsets corresponds to a category label;
a second obtaining module configured to perform obtaining, among the training data subsets of the initial training data set, a first reference number of training data subsets other than the target training data subset;
a third obtaining module configured to perform acquiring a second reference number of training data from each of the first reference number of training data subsets, to obtain a first reference number of groups of training data;
a training data obtaining module configured to add the training data of the first reference number group to the target training data subset to obtain an updated target training data subset, and obtain target training data for training a machine learning model based on the updated target training data subset and the remaining training data subsets in the initial training data set.
Optionally, the second reference number is determined according to a reference ratio, the number of training data subsets in the initial training data set, and the number of training data contained in each training data subset, where the reference ratio is used for determining the number of training data to be added.
Optionally, the first obtaining module is further configured to perform obtaining an initial training data set, dividing the initial training data set into a plurality of training data subsets;
the training data acquisition module is configured to select one or more training data subsets from the remaining training data subsets in the initial training data set, and process the selected training data subsets in a manner of processing the target training data subsets to obtain updated training data subsets; and combining the updated training data subset, the updated target training data subset and the non-updated training data subset in the initial training data set to obtain an updated training data set, and taking the training data contained in the updated training data set as the target training data for training the machine learning model.
Optionally, the apparatus further comprises:
a training module configured to perform training of a machine learning model using the target training data;
and, when the accuracy of the trained machine learning model is less than or equal to a target value, to re-acquire the target training data until the accuracy of the trained machine learning model is greater than the target value.
Optionally, the plurality of training data subsets are obtained according to label information of the training data in the initial training data set, and the label information of the training data represents the category of the training data.
Optionally, the plurality of training data subsets are obtained by clustering the training data according to the acquired physical characteristic information of the training data in the initial training data set.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to perform the method of the first aspect or any of its possible implementations.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium comprising: the instructions in the storage medium, when executed by a processor of the terminal, enable the terminal to perform the method of the first aspect or any of its possible implementations.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program (product) comprising: computer program code which, when run by a computer, causes the computer to perform the method of the above aspects.
The technical scheme provided by the embodiment of the disclosure at least has the following beneficial effects:
according to the method for acquiring training data provided by the embodiments of the present disclosure, data are acquired within the same initial training data set and added to the target training data subset, and the target training data for training the machine learning model are acquired from the updated target training data subset and the remaining training data subsets in the initial training data set. In this way, the influence of the original noise data on model training can be counteracted by the added noise data, which improves the recognition accuracy of the model while reducing the time, labor, and financial cost of acquiring training data.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic diagram illustrating a method of acquiring training data in accordance with an exemplary embodiment;
FIG. 2 is a flow diagram illustrating a method of obtaining training data in accordance with an exemplary embodiment;
FIG. 3 is a flow diagram illustrating a method of obtaining training data in accordance with an exemplary embodiment;
FIG. 4 is a block diagram illustrating an apparatus for acquiring training data in accordance with an exemplary embodiment;
FIG. 5 is a block diagram illustrating an electronic device in accordance with an exemplary embodiment;
FIG. 6 is a diagram illustrating a terminal in accordance with an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The application background of the embodiments of the present application is described first. In machine learning model training, the training data used typically contain noise data. If the cleanliness of the training data does not meet the requirements, using such data may degrade the recognition accuracy of the machine learning model. To provide a more intuitive understanding of the methods provided by the embodiments of the present application, the following example is given. As shown in fig. 1, when a machine learning model is trained on training data containing noise data, the target recognition result obtained from the positive samples in the training data points in the direction F2, and the recognition result obtained from the noise data points in the direction F1; the resultant then points in the direction F. It can be seen that the recognition result obtained from training data containing noise data deviates from the target recognition result F2; that is, noise data in the training data impair the recognition accuracy of the machine learning model. Also as shown in fig. 1, if some noise data are additionally added to the training data such that the recognition result of the newly added noise data points in the direction F3, the resultant of F3 and F brings the recognition result of the machine learning model into the direction F2, that is, the same direction as the target recognition result obtained from the positive samples. It can be seen that the recognition result of the machine learning model can be improved by adding noise data.
In accordance with the principles described above, FIG. 2 is a flow chart illustrating a method of acquiring training data in accordance with an exemplary embodiment. As shown in fig. 2, the method for acquiring training data is used in a terminal, and includes the following steps:
in step S21, a target training data subset is obtained, where the target training data subset is any one of a plurality of training data subsets of the initial training data set, and each of the plurality of training data subsets corresponds to a category label.
For example, the target training data subset may be determined according to the data type that the machine learning model actually needs to recognize. For instance, if the model is being trained to recognize apples, the training data subset of apples is used as the target training data subset. Alternatively, any one or more of the plurality of training data subsets of the initial training data set may each serve as a target training data subset in turn. One skilled in the art can set the target training data subset according to actual model training needs.
As an alternative embodiment of the present application, the initial training data set may be divided into a plurality of training data subsets according to label information of the training data in the initial training data set, where the label information represents the category of the training data and serves as the class label of each training data subset.
Illustratively, as will be appreciated by those skilled in the art, in order to train a machine learning model, the training data in the training data set used for training need to be labeled. For clarity of description of the solution of the embodiments of the present application, take a machine learning model for identifying fruit as an example, and assume that this model can identify plums, bananas, and apples. That is, the labels inform the machine learning model which data are apples, which are bananas, and which are plums, and through training on a large amount of training data the model reaches the required recognition rate. Accordingly, when dividing the initial training data set into training data subsets, the data labeled as plums can be placed into one training data subset via the class labels, the data labeled as apples into another, and the data labeled as bananas into a third.
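The label-based division above amounts to grouping samples by their class label. A minimal sketch in Python (the sample records and the function name are illustrative, not from the patent):

```python
from collections import defaultdict

def split_by_label(initial_training_data):
    """Divide an initial training data set into subsets keyed by class label."""
    subsets = defaultdict(list)
    for sample, label in initial_training_data:
        subsets[label].append(sample)
    return dict(subsets)

# Illustrative data: (sample, label) pairs for the fruit example.
fruit_data = [("img1", "plum"), ("img2", "apple"), ("img3", "plum"), ("img4", "banana")]
fruit_subsets = split_by_label(fruit_data)
```

Each key of fruit_subsets then acts as the category label of the corresponding training data subset.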
As an optional embodiment of the present application, the initial training data set may also be divided into a plurality of training data subsets by clustering the training data in the initial training data set according to their physical characteristic information.
For example, for the training data of the machine learning model for identifying fruit, the physical characteristic information may be the shape of the fruit, the color of the fruit, or other physical characteristic information suitable for classification. When the physical characteristic information is the shape of the fruit, clustering by shape may yield three subsets: data exhibiting a semi-arc characteristic are classified into one category; data exhibiting a circular characteristic with a radius greater than a certain value into another; and data exhibiting a circular characteristic with a radius less than or equal to that value into a third.
The particular physical characteristic information used can be determined by one skilled in the art based on the actual classification requirements. For example, if only two categories are desired, the training data can be divided by color features: data with the feature "yellow" are classified into one category and data with the feature "red" into another. That is, the number of categories can be controlled by choosing different physical characteristic information. The clustering method adopted in the embodiments of the present application may be the K-Means algorithm or the DBSCAN algorithm. The embodiments of the present application take as an example the division into a training data subset of plums, a training data subset of apples, and a training data subset of bananas, but the present application is not limited thereto.
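As a rough sketch of the clustering-based division, the following bare-bones K-Means operates on one-dimensional feature values (e.g. a fruit-radius measurement) to show how physical characteristic information can induce the subsets. The feature values are invented for illustration, and a real system would likely use an established implementation such as scikit-learn's KMeans:

```python
def kmeans(points, k, iterations=20):
    """Minimal K-Means on 1-D feature values; returns k clusters of points."""
    # Spread the initial centroids across the sorted value range.
    step = max(1, len(points) // k)
    centroids = sorted(points)[::step][:k]
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

# Hypothetical radius measurements: small fruit (plum-like) vs. large fruit.
shape_clusters = kmeans([1.0, 1.1, 0.9, 5.0, 5.2, 4.8], k=2)
```

Here the two resulting clusters play the role of two training data subsets induced by the shape feature.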
In step S22, among the training data subsets of the initial training data set, a first reference number of training data subsets other than the target training data subset are acquired.
Exemplarily, assuming that the training data subset of plums is taken as the target training data subset, the remaining training data subsets are the training data subset of apples and the training data subset of bananas, two subsets in total. A first reference number of training data subsets other than the target training data subset are acquired, and the first reference number may be one or two: the training data subset of apples may be acquired, the training data subset of bananas may be acquired, or both may be acquired. The more training data subsets the initial training data set is divided into, the more remaining subsets there are for any given target training data subset, and the wider the choice for the first reference number. One skilled in the art can acquire any one or more of the remaining training data subsets according to actual usage requirements.
In step S23, a second reference number of training data in each of the training data subsets is obtained from the first reference number of training data subsets, so as to obtain training data of the first reference number group.
Illustratively, when the training data subset of apples and the training data subset of bananas are both acquired, a second reference number of training data are acquired from each of them, yielding two groups of training data. The value of the second reference number for each subset may be determined according to the type of the subset. For example, for the target subset of plums, apples have appearance characteristics more similar to plums than bananas do, so 20 training data may be acquired from the apple subset and 10 from the banana subset; the second reference numbers are then 20 and 10, respectively. Alternatively, without distinguishing subset types, 10 training data may be acquired from each of the apple and banana subsets. The number of training data acquired from each subset can be determined by those skilled in the art according to actual usage requirements.
As an alternative embodiment of the present application, the second reference number may be determined according to a reference ratio (used for determining the number of training data to be added), the number of training data subsets in the initial training data set, and the number of training data contained in each training data subset.
For example, one skilled in the art may select the expression of formula (1) or formula (2) according to actual experiments, and the embodiments of the present application are not limited thereto.

n = k × M (1)

Alternatively,

n = (k × M) / (N − 1) (2)

where n is the second reference number; N is the number of training data subsets into which the initial training data set is divided; M is the number of training data contained in each training data subset; and k is the reference ratio used for determining the number of training data added, e.g., k = 0.1.
Illustratively, continuing with the above machine learning model for identifying fruit, assuming that the training data subset of apples contains 50 training data and the initial training data set is divided into 3 subsets, the second reference number to be acquired from the apple subset according to formula (2) is as shown in formula (3) below. When the calculated second reference number is a decimal, the embodiments of the present application round up; that is, when the second reference number is 2.5, 3 training data are taken from the training data subset of apples. Those skilled in the art may instead round down according to actual usage requirements, and the embodiments of the present application are not limited thereto.

n = (0.1 × 50) / (3 − 1) = 2.5 (3)
The manner of acquiring training data from the training data subset of bananas is the same as for apples and is not repeated here. That is, when the training data subset of bananas contains 100 training data, the second reference number obtained by the above calculation may be 10 (formula (1)) or 5 (formula (2)).
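The two candidate formulas and the rounding-up rule can be captured in a short helper (a sketch; the function name and signature are illustrative, not from the patent):

```python
import math

def second_reference_number(k, m, n=None, formula=1):
    """Second reference number of training data to take from a source subset.

    k: reference ratio (e.g. 0.1); m: number of training data in the source
    subset; n: number of subsets in the initial training data set (needed
    only for formula (2)). Decimal results are rounded up, per the example.
    """
    if formula == 1:
        value = k * m               # formula (1): n = k * M
    else:
        value = k * m / (n - 1)     # formula (2): n = k * M / (N - 1)
    return math.ceil(value)
```

For the worked example (k = 0.1, 3 subsets), the apple subset of 50 yields 5 under formula (1) and 3 under formula (2) (2.5 rounded up); the banana subset of 100 yields 10 or 5.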
In step S24, the training data of the first reference number group is added to the target training data subset to obtain an updated target training data subset, and target training data for training the machine learning model is obtained based on the updated target training data subset and the remaining training data subsets in the initial training data set.
Illustratively, when formula (1) is used, 5 training data are acquired from the training data subset of apples and added to the training data subset of plums, and 10 training data are acquired from the training data subset of bananas and added to the training data subset of plums, finally yielding the updated training data subset for identifying plums, as shown in formula (4).
Train_i = Train_i^0 ∪ Train_1 ∪ Train_2 ∪ … ∪ Train_j (4)

where Train_i is the training data contained in training data subset i after the training data are added; Train_i^0 is the training data contained in training data subset i before the training data are added; Train_j is the group of training data acquired from any remaining training data subset j; and ∪ denotes the union operation.
For example, if the training data subset of plums contains 100 training data before the training data are added, then after the two groups of training data are added, it contains 115. The target training data for training the machine learning model are acquired based on these 115 training data and the remaining training data subsets in the initial training data set.
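Putting steps S22 to S24 together, the update of the target subset and the assembly of the target training data can be sketched as follows (the function and the fruit data are illustrative; the random sampling matches the spirit of the method but is not a selection rule prescribed by the patent):

```python
import math
import random

def update_target_subset(subsets, target_label, k=0.1, seed=0):
    """Add a second reference number of training data from every other subset
    to the target subset, then flatten the updated data set into target
    training data (a sketch of steps S22 to S24)."""
    rng = random.Random(seed)
    updated = dict(subsets)
    target = list(subsets[target_label])
    for label, data in subsets.items():
        if label == target_label:
            continue
        n = math.ceil(k * len(data))        # second reference number, formula (1)
        target.extend(rng.sample(data, n))  # one group per source subset
    updated[target_label] = target          # formula (4): union with the target subset
    # Target training data: the updated target subset plus the remaining subsets.
    return [sample for subset in updated.values() for sample in subset]
```

With 100 plums, 50 apples, and 100 bananas and k = 0.1, the plum subset grows to 115 and the target training data contain 265 samples in total, matching the worked example.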
For example, as known to those skilled in the art, the recognition result given by a machine recognition model is the category with the maximum recognition probability. That is, for a fruit to be identified, when the machine learning model gives a probability of 60% that the fruit is a plum, 50% that it is an apple, and 30% that it is a banana, the fruit type output by the model is plum. If the recognition results obtained in at least 90 of 100 recognition processes are correct, the accuracy of the recognition model can theoretically reach 90% or more.
How the solution described in the embodiments of the present application achieves such accuracy is now explained with an example. When the training data subset of plums is taken as the target training data subset, a certain amount of training data is acquired from the training data subset of bananas and the training data subset of apples respectively and added to the training data subset of plums. Before the training data are added, the noise data contained in the training data subset of plums may include a picture of an apple labeled as a plum. Such training data may mislead the machine learning model into recognizing an apple to be identified as a plum. If the newly added training data from the banana or apple subset contain the same apple photo, that photo is labeled as a banana or an apple. Then, during learning, the machine learning model sees the same photo labeled as possibly a plum, an apple, or a banana, and three recognition results become possible. Thus, the probability that the model recognizes an apple as a plum is relatively reduced. Although no training data for correctly identifying plums has been added, the probability of misidentifying apples as plums decreases, and the accuracy of plum recognition improves through the comparison of probability values.
In the method for acquiring training data provided by the embodiments of the present application, training data of the same machine learning model are more likely to be labeled incorrectly in a crossed manner: for example, a plum is labeled as an apple or a banana, an apple as a banana or a plum, and a banana as an apple or a plum. By selecting training data from subsets of the same initial training data set as the added noise data, the probability of producing a recognition result in the direction F3 in fig. 1 is improved. If adding the noise data does not improve the model recognition accuracy, the noise data can be re-acquired until the recognition accuracy meets the requirement.
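The re-acquisition loop described here (and in the optional embodiment of the summary) can be sketched as follows; both helper callables are hypothetical stand-ins for the acquisition and training procedures, not part of the patent:

```python
def reacquire_target_training_data(dataset, attempt):
    # Hypothetical stand-in for the acquisition method of steps S21 to S24;
    # a real implementation would add a fresh selection of noise data.
    return list(dataset)

def train_until_accurate(dataset, evaluate, target_accuracy=0.9, max_rounds=100):
    """Re-acquire target training data until model accuracy exceeds the target."""
    for attempt in range(1, max_rounds + 1):
        training_data = reacquire_target_training_data(dataset, attempt)
        accuracy = evaluate(training_data, attempt)
        if accuracy > target_accuracy:
            return attempt, accuracy
    raise RuntimeError("accuracy target not reached within max_rounds")
```

With a toy evaluator whose accuracy grows by 0.2 per attempt, the loop stops on the fifth attempt, as soon as the target value of 0.9 is exceeded.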
According to the method provided by this embodiment, the training data subsets for identifying apples and bananas can be obtained in the same way, which is not described again here. The target training data in the obtained training data subset of plums can be used directly to train a single-fruit recognition model that only recognizes plums; the target training data in the training data subset of plums can also be added to the initial training data set for training a multi-class fruit recognition model, so as to improve the recognition accuracy of the machine learning model.
In the method for acquiring training data provided by the embodiments of the present application, data is acquired within the same initial training data set and added to the target training data subset, and the target training data for training the machine learning model is obtained based on the updated target training data subset and the remaining training data subsets in the initial training data set. The added noise data can thus offset the influence of the original noise data on model training, improving the recognition accuracy of the model while reducing the time, labor, and financial costs of acquiring training data.
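The procedure summarized above, selecting a first reference number of other subsets and adding a second reference number of samples from each to the target subset, can be sketched as follows. All function and variable names are illustrative, and uniform random sampling is an assumption; the excerpt does not fix a sampling strategy:

```python
import random

def augment_with_cross_label_noise(subsets, target_label,
                                   first_ref_count, second_ref_count,
                                   seed=0):
    """Add training data from other subsets to the target subset,
    keeping the original (possibly wrong) labels intact.

    `subsets` maps a category label to a list of (photo, label) samples.
    Illustrative sketch only; the patent does not prescribe this API.
    """
    rng = random.Random(seed)
    # Pick `first_ref_count` subsets other than the target subset.
    other_labels = [lbl for lbl in subsets if lbl != target_label]
    chosen = rng.sample(other_labels, first_ref_count)
    updated = list(subsets[target_label])
    for lbl in chosen:
        # Take `second_ref_count` samples from each chosen subset.
        updated.extend(rng.sample(subsets[lbl], second_ref_count))
    return updated

subsets = {
    "plum":   [("plum_photo_%d" % i, "plum") for i in range(100)],
    "apple":  [("apple_photo_%d" % i, "apple") for i in range(100)],
    "banana": [("banana_photo_%d" % i, "banana") for i in range(100)],
}
updated_plum = augment_with_cross_label_noise(subsets, "plum", 2, 15)
print(len(updated_plum))  # 100 original samples + 2 * 15 added = 130
```

The added samples keep the labels of the subsets they came from, which is exactly the cross-labeled noise the embodiment relies on.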
As an alternative embodiment of the present application, as shown in fig. 3, before step S21, the method further includes:
in step S20, an initial training data set is obtained, which is divided into a plurality of training data subsets.
The process of dividing the initial training data set into a plurality of training data subsets may refer to step S21 in the above-mentioned embodiment, and is not described herein again.
In step S24, the obtaining of the target training data for training the machine learning model based on the updated target training data subset and the remaining training data subsets in the initial training data set includes:
in step S241, one or more training data subsets are selected from the remaining training data subsets in the initial training data set, and the selected training data subsets are processed in the same manner as the target training data subset, so as to obtain updated training data subsets.
The updated training data subset obtaining method may refer to steps S21-S24 in the above embodiment, and is not described herein again.
In step S242, the updated training data subset, the updated target training data subset, and the non-updated training data subset in the initial training data set are merged to obtain an updated training data set, and the training data included in the updated training data set is used as the target training data for training the machine learning model.
Illustratively, the merging process may be performed as shown in the following formula (5). That is, the updated training data subsets, the updated target training data subset, and the non-updated training data subsets in the initial training data set are added to one training data set, in which each training data subset is stored as a subfile within the file of the training data set. When a model is trained using the training data set, the files of the entire training data set are imported together.
DB_noise = train_1 ∪ train_2 ∪ … ∪ train_N (5)
In the formula: DB_noise is the training data contained in the training data set; ∪ is the union operation; train_i is the training data included in training data subset i after data addition; and N is the number of training data subsets after data addition.
For example, for the machine learning model for identifying fruit in the present application, when the number of training data in the updated plum subset obtained according to the above embodiment is 115, the number of training data in the apple subset is 115, and the number of training data in the banana subset is 150, the number of target training data for training the machine learning model is 115 + 115 + 150 = 380.
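The merging step of formula (5) can be sketched as below, assuming each subset is represented as a plain Python list rather than a subfile as in the embodiment; the list representation is an illustrative simplification:

```python
def merge_training_sets(updated_subsets):
    """Union of all updated subsets, a sketch of formula (5):
    DB_noise = train_1 ∪ ... ∪ train_N."""
    merged = []
    for subset in updated_subsets:
        merged.extend(subset)
    return merged

# The counts from the example in the text: 115 + 115 + 150 = 380.
plum = ["plum_%d" % i for i in range(115)]
apple = ["apple_%d" % i for i in range(115)]
banana = ["banana_%d" % i for i in range(150)]
target = merge_training_sets([plum, apple, banana])
print(len(target))  # 380
```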
For example, according to the scheme described in the above embodiment, the recognition accuracy of a single-class recognition model can be improved by adding training data to its training data subset. Therefore, for a multi-class recognition model, if one or more training data subsets obtained in the manner of the previous embodiment are added to the training data set, the recognition accuracy of the multi-class recognition model can be improved as well.
As an alternative embodiment of the present application, after obtaining the target training data, the method includes:
first, a machine learning model is trained using target training data.
And secondly, when the accuracy rate of the obtained machine learning model is smaller than or equal to the target value, the target training data is obtained again until the accuracy rate of the obtained machine learning model is larger than the target value.
Illustratively, the target value may be determined according to the actual recognition requirements of the machine learning model. For a machine learning model that requires high accuracy of the recognition result, the target value may be set to 90%. Alternatively, the target value may be determined according to the recognition result obtained by training the machine learning model with the initial training data set. For example, when the recognition accuracy of the machine learning model obtained using the initial training data set is 70%, the target value may be set to 70%, or to any value larger than 70%. The target value can be set by those skilled in the art according to actual needs, and the embodiments of the present application are not limited herein.
For example, with the method provided in this embodiment, after the target training data is obtained, the machine learning model is trained using the target training data, and the accuracy of the trained recognition model is obtained. If the recognition accuracy obtained after training the current machine learning model with the target training data does not meet the set target value, the method of this embodiment is used again to re-acquire the target training data, until the model recognition accuracy meets the set target value.
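The acquire-train-evaluate loop described here can be sketched as follows. `acquire_data` and `train_and_evaluate` are placeholder callables whose signatures the patent does not fix, and the toy stand-ins below merely simulate accuracy improving as more data is acquired:

```python
def acquire_until_accurate(acquire_data, train_and_evaluate, target_value,
                           max_rounds=10):
    """Re-acquire target training data until model accuracy exceeds
    the target value. Illustrative sketch only."""
    for _ in range(max_rounds):
        data = acquire_data()
        accuracy = train_and_evaluate(data)
        if accuracy > target_value:
            return data, accuracy
    raise RuntimeError("target accuracy not reached within max_rounds")

# Toy stand-ins: each acquisition round yields 10 more samples,
# and simulated accuracy grows with the amount of data.
rounds = {"n": 0}
def fake_acquire():
    rounds["n"] += 1
    return ["sample"] * (100 + 10 * rounds["n"])
def fake_eval(data):
    return min(0.95, 0.5 + 0.0015 * len(data))

data, acc = acquire_until_accurate(fake_acquire, fake_eval, 0.7)
```

The loop terminates as soon as the evaluated accuracy exceeds the target value, matching the "re-acquire until larger than the target value" behavior of the embodiment.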
FIG. 4 is a block diagram illustrating an apparatus for acquiring training data in accordance with an exemplary embodiment. Referring to fig. 4, the apparatus includes a first obtaining module 41, a second obtaining module 42, a third obtaining module 43, and a training data obtaining module 44.
A first obtaining module 41, configured to perform obtaining a target training data subset, where the target training data subset is any one of a plurality of training data subsets of the initial training data set, and each of the plurality of training data subsets corresponds to a category label;
a second obtaining module 42 configured to perform obtaining, among the training data subsets of the initial training data set, a first reference number of training data subsets other than the target training data subset;
a third obtaining module 43, configured to obtain a second reference number of training data in each of the training data subsets in the first reference number of training data subsets, so as to obtain training data of the first reference number group;
a training data obtaining module 44 configured to add the training data of the first reference number group to the target training data subset to obtain an updated target training data subset, and obtain target training data for training the machine learning model based on the updated target training data subset and the remaining training data subsets in the initial training data set.
As an alternative embodiment of the present application, the second reference number is determined according to a reference ratio, the number of training data subsets in the initial training data set and the number of training data included in each training data subset, the reference ratio being used for determining the number of training data to be added.
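The exact formula for the second reference number appears elsewhere in the patent and is not part of this excerpt. The sketch below shows one plausible, purely hypothetical reading, in which the reference ratio of the subset size is spread evenly over the other subsets:

```python
import math

def second_reference_number(reference_ratio, num_subsets, subset_size):
    """Hypothetical reading of the text: distribute `reference_ratio`
    of the target subset's size evenly across the other subsets.
    The actual formula in the patent may differ."""
    return math.ceil(reference_ratio * subset_size / (num_subsets - 1))

# With a ratio of 0.3, 3 subsets of 100 samples: ceil(30 / 2) = 15.
print(second_reference_number(0.3, 3, 100))  # 15
```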
As an optional embodiment of the present application, the first obtaining module 41 is further configured to perform obtaining an initial training data set, dividing the initial training data set into a plurality of training data subsets;
a training data obtaining module 44 configured to select one or more training data subsets from the remaining training data subsets in the initial training data set, and process the selected training data subsets in a manner of processing the target training data subsets to obtain updated training data subsets; and combining the updated training data subset, the updated target training data subset and the un-updated training data subset in the training data set to obtain an updated training data set, and taking the training data contained in the updated training data set as the target training data for training the machine learning model.
As an optional embodiment of the present application, after obtaining the target training data, the apparatus further includes:
a training module configured to perform training of a machine learning model with target training data;
and when the accuracy rate of the obtained machine learning model is smaller than or equal to the target value, re-acquiring the target training data until the accuracy rate of the obtained machine learning model is larger than the target value.
As an optional implementation manner of the present application, the plurality of training data subsets are obtained according to label information of the training data in the initial training data set, and the label information of the training data represents a category of the training data.
As an optional implementation manner of the present application, the plurality of training data subsets are obtained by clustering physical characteristic data according to physical characteristic information of training data in the obtained initial training data set.
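The clustering-based construction of subsets from physical characteristic data can be sketched with a minimal pure-Python Lloyd's k-means. In practice a library such as scikit-learn would be used, and the feature vectors (fruit shape, color, and so on) are assumed to be pre-extracted:

```python
def kmeans_cluster(points, k, iters=10):
    """Minimal Lloyd's k-means over feature vectors (tuples of floats).
    Sketch only: naive initialization, fixed iteration count."""
    centroids = points[:k]  # naive init: first k points as centroids
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            groups[j].append(p)
        # Update step: move each centroid to the mean of its group.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    return groups

# Two well-separated "physical feature" clusters, e.g. (color, shape).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
groups = kmeans_cluster(points, 2)
```

Each resulting group would then serve as one training data subset, with a category label assigned to it.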
With the apparatus for acquiring training data provided by the embodiments of the present application, data is acquired within the same initial training data set and added to the target training data subset, and the target training data for training the machine learning model is obtained based on the updated target training data subset and the remaining training data subsets in the initial training data set. The added noise data can thus offset the influence of the original noise data on model training, improving the recognition accuracy of the model while reducing the time, labor, and financial costs of acquiring training data.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Based on the same concept, an embodiment of the present application further provides an electronic device, as shown in fig. 5, the electronic device includes:
a processor 51;
a memory 52 for storing the processor-executable instructions; the processor 51 and the memory 52 are connected by a communication bus 53.
Wherein the processor 51 is configured to execute the method for acquiring training data according to the above embodiment.
It should be understood that the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like. A general-purpose processor may be a microprocessor or any conventional processor. It should be noted that the processor may be a processor supporting the Advanced RISC Machine (ARM) architecture.
Further, in an alternative embodiment, the memory may include both read-only memory and random access memory, and provide instructions and data to the processor. The memory may also include non-volatile random access memory. For example, the memory may also store device type information.
Fig. 6 is a block diagram illustrating a terminal 600 according to an example embodiment. The terminal 600 may be: a smartphone, a tablet, a laptop, or a desktop computer. The terminal 600 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, etc.
In general, the terminal 600 includes: a processor 601 and a memory 602.
The processor 601 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 601 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 601 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, processor 601 may also include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 602 is used to store at least one instruction for execution by processor 601 to implement the method of obtaining training data provided by the method embodiments herein.
In some embodiments, the terminal 600 may further optionally include: a peripheral interface 603 and at least one peripheral. The processor 601, memory 602, and peripheral interface 603 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 603 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 604, a display 605, a camera 606, an audio circuit 607, a positioning component 608, and a power supply 609.
The peripheral interface 603 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 601 and the memory 602. In some embodiments, the processor 601, memory 602, and peripheral interface 603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 601, the memory 602, and the peripheral interface 603 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 604 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 604 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 604 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 604 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 604 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 604 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display 605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 605 is a touch display screen, the display screen 605 also has the ability to capture touch signals on or over the surface of the display screen 605. The touch signal may be input to the processor 601 as a control signal for processing. At this point, the display 605 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 605 may be one, providing the front panel of the terminal 600; in other embodiments, the display 605 may be at least two, respectively disposed on different surfaces of the terminal 600 or in a folded design; in still other embodiments, the display 605 may be a flexible display disposed on a curved surface or on a folded surface of the terminal 600. Even more, the display 605 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The Display 605 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and the like.
The camera assembly 606 is used to capture images or video. Optionally, camera assembly 606 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 606 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
Audio circuitry 607 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 601 for processing or inputting the electric signals to the radio frequency circuit 604 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 600. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 607 may also include a headphone jack.
The positioning component 608 is used for positioning the current geographic location of the terminal 600 to implement navigation or LBS (Location Based Service). The positioning component 608 may be a positioning component based on the United States' GPS (Global Positioning System), China's BeiDou system, Russia's GLONASS system, or the European Union's Galileo system.
Power supply 609 is used to provide power to the various components in terminal 600. The power supply 609 may be ac, dc, disposable or rechargeable. When the power supply 609 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the terminal 600 also includes one or more sensors 610. The one or more sensors 610 include, but are not limited to: acceleration sensor 611, gyro sensor 612, pressure sensor 613, fingerprint sensor 614, optical sensor 615, and proximity sensor 616.
The acceleration sensor 611 may detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 600. For example, the acceleration sensor 611 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 601 may control the touch screen display 605 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 611. The acceleration sensor 611 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 612 may detect a body direction and a rotation angle of the terminal 600, and the gyro sensor 612 and the acceleration sensor 611 may cooperate to acquire a 3D motion of the user on the terminal 600. The processor 601 may implement the following functions according to the data collected by the gyro sensor 612: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 613 may be disposed on a side frame of the terminal 600 and/or on a lower layer of the touch display screen 605. When the pressure sensor 613 is disposed on the side frame of the terminal 600, a user's holding signal of the terminal 600 can be detected, and the processor 601 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 613. When the pressure sensor 613 is disposed at the lower layer of the touch display screen 605, the processor 601 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 605. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 614 is used for collecting a fingerprint of a user, and the processor 601 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 614, or the fingerprint sensor 614 identifies the identity of the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 601 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 614 may be disposed on the front, back, or side of the terminal 600. When a physical button or vendor Logo is provided on the terminal 600, the fingerprint sensor 614 may be integrated with the physical button or vendor Logo.
The optical sensor 615 is used to collect the ambient light intensity. In one embodiment, processor 601 may control the display brightness of touch display 605 based on the ambient light intensity collected by optical sensor 615. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 605 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 605 is turned down. In another embodiment, the processor 601 may also dynamically adjust the shooting parameters of the camera assembly 606 according to the ambient light intensity collected by the optical sensor 615.
A proximity sensor 616, also known as a distance sensor, is typically disposed on the front panel of the terminal 600. The proximity sensor 616 is used to collect the distance between the user and the front surface of the terminal 600. In one embodiment, when the proximity sensor 616 detects that the distance between the user and the front surface of the terminal 600 gradually decreases, the processor 601 controls the touch display 605 to switch from the bright screen state to the dark screen state; when the proximity sensor 616 detects that the distance between the user and the front surface of the terminal 600 gradually increases, the processor 601 controls the touch display 605 to switch from the dark screen state to the bright screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 6 is not intended to be limiting of terminal 600 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
The present application provides a computer program, which when executed by a computer, may cause the processor or the computer to perform the respective steps and/or procedures corresponding to the above-described method embodiments.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice in the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method of obtaining training data, the method being applied to an electronic device, the method comprising:
the electronic equipment acquires a target training data subset according to a fruit type to be identified by a machine learning model for identifying the fruit, the target training data subset is a training data subset corresponding to the fruit type to be identified in a plurality of training data subsets of an initial training data set, each training data subset in the plurality of training data subsets corresponds to a category label, the plurality of training data subsets are acquired according to label information of training data in the initial training data set, the label information of the training data represents the category of the training data, or the plurality of training data subsets are obtained by clustering physical characteristic information according to the physical characteristic information of the training data in the initial training data set, and the physical characteristic information comprises the shape of the fruit, the color of the fruit or other physical characteristic information for classification, the training data comprises photos corresponding to the fruit types;
the electronic equipment acquires a first reference number of training data subsets except the target training data subset from the training data subsets of the initial training data set;
the electronic equipment acquires a second reference number of training data in each training data subset from the first reference number of training data subsets to obtain a first reference number group of training data;
the electronic equipment adds the training data of the first reference quantity group to the target training data subset to obtain an updated target training data subset;
selecting one or more training data subsets from the training data subsets in the initial training data set except for the target training data subset before updating, and processing the selected training data subsets in a mode of processing the target training data subsets to obtain updated training data subsets; and combining the updated training data subset, the updated target training data subset and the non-updated training data subset in the initial training data set to obtain an updated training data set, wherein the training data contained in the updated training data set is used as the target training data for training the machine learning model, and the machine learning model is a multi-type recognition model.
2. The method of claim 1, wherein the second reference number is determined according to a reference ratio, the number of training data subsets in the initial training data set, and the number of training data included in each training data subset, wherein the reference ratio is used for determining the number of training data to be added.
3. The method of obtaining training data according to claim 1, wherein before the electronic device obtains the target training data subset according to the fruit type to be identified by the machine learning model for identifying fruit, the method further comprises:
an initial training data set is obtained, and the initial training data set is divided into a plurality of training data subsets.
4. The method according to claim 1, wherein after the training data included in the updated training data set is used as target training data for training the machine learning model, the method further comprises:
training a machine learning model by using the target training data;
and when the accuracy of the obtained machine learning model is smaller than or equal to the target value, re-acquiring the target training data until the accuracy of the obtained machine learning model is larger than the target value.
5. An apparatus for acquiring training data, the apparatus comprising:
a first obtaining module, configured to obtain a target training data subset according to a fruit type to be identified by a machine learning model for identifying fruit, where the target training data subset is a training data subset corresponding to the fruit type to be identified in a plurality of training data subsets of an initial training data set, each of the plurality of training data subsets corresponds to a category label, the plurality of training data subsets are obtained according to label information of training data in the initial training data set, and the label information of the training data represents a category of the training data, or the plurality of training data subsets are obtained by clustering physical feature information of the training data in the initial training data set, where the physical feature information includes a shape of the fruit, a color of the fruit, or other physical feature information for classification, the training data comprises photos corresponding to the fruit types;
a second obtaining module configured to perform obtaining, among the training data subsets of the initial training data set, a first reference number of training data subsets other than the target training data subset;
a third obtaining module configured to perform obtaining a second reference number of training data in each training data subset of the first reference number of training data subsets, so as to obtain a first reference number group of training data;
a training data acquisition module configured to add the training data of the first reference number group to the target training data subset to obtain an updated target training data subset, select one or more training data subsets from the training data subsets in the initial training data set, except for the target training data subset before updating, and process the selected training data subset in a manner of processing the target training data subset to obtain an updated training data subset; and combining the updated training data subset, the updated target training data subset and the non-updated training data subset in the initial training data set to obtain an updated training data set, wherein the training data contained in the updated training data set is used as the target training data for training the machine learning model, and the machine learning model is a multi-type recognition model.
6. The apparatus for acquiring training data according to claim 5, wherein the second reference number is determined according to a reference ratio, the number of training data subsets in the initial training data set, and the number of training data included in each training data subset, the reference ratio being used to determine the number of training data to be added.
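One hedged reading of claim 6 is that the second reference number is the reference ratio applied to the average subset size, which uses all three quantities the claim names. The exact formula is not stated in the claim, so the averaging below is an assumption:

```python
def second_reference_number(ratio, subsets):
    """Assumed computation of the second reference number from a reference
    ratio, the number of subsets, and the size of each subset."""
    sizes = [len(s) for s in subsets.values()]
    avg = sum(sizes) / len(sizes)      # mean number of training data per subset
    return max(1, round(ratio * avg))  # draw at least one sample per subset
```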
7. The apparatus of claim 5, wherein the first obtaining module is further configured to obtain the initial training data set, the initial training data set being partitioned into the plurality of training data subsets.
8. The apparatus for acquiring training data according to claim 5, further comprising:
a training module configured to train the machine learning model using the target training data;
and, when the accuracy of the resulting machine learning model is less than or equal to a target value, to re-acquire the target training data until the accuracy of the resulting machine learning model is greater than the target value.
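The retrain-until-accurate loop of the training module can be sketched as follows. The callables, the `max_rounds` safety cap, and all names are illustrative assumptions rather than elements of the claim:

```python
def train_until_accurate(acquire_data, train_model, evaluate, target_value,
                         max_rounds=10):
    """Sketch of the claimed loop: re-acquire target training data and
    retrain until the model's accuracy exceeds the target value."""
    for _ in range(max_rounds):
        data = acquire_data()        # re-acquire the target training data
        model = train_model(data)    # train the machine learning model
        if evaluate(model) > target_value:
            return model             # accuracy is greater than the target value
    raise RuntimeError("target accuracy not reached within max_rounds")
```

The `max_rounds` guard is an addition for safety; the claim itself only states that acquisition repeats until the accuracy exceeds the target value.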
9. An electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to perform the method of acquiring training data of any one of claims 1-4.
10. A computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of a terminal, enable the terminal to perform the method of acquiring training data of any one of claims 1-4.
CN201910356202.2A 2019-04-29 2019-04-29 Method, device and equipment for acquiring training data and storage medium Active CN110070143B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910356202.2A CN110070143B (en) 2019-04-29 2019-04-29 Method, device and equipment for acquiring training data and storage medium

Publications (2)

Publication Number Publication Date
CN110070143A CN110070143A (en) 2019-07-30
CN110070143B true CN110070143B (en) 2021-07-16

Family

ID=67369533

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910356202.2A Active CN110070143B (en) 2019-04-29 2019-04-29 Method, device and equipment for acquiring training data and storage medium

Country Status (1)

Country Link
CN (1) CN110070143B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555480A (en) * 2019-09-05 2019-12-10 腾讯科技(深圳)有限公司 Training data generation method and related device
CN112825143A (en) * 2019-11-20 2021-05-21 北京眼神智能科技有限公司 Deep convolutional neural network compression method, device, storage medium and equipment
CN111047050A (en) * 2019-12-17 2020-04-21 苏州浪潮智能科技有限公司 Distributed parallel training method, equipment and storage medium
CN111597934A (en) * 2020-04-30 2020-08-28 重庆科技学院 System and method for processing training data for statistical applications

Citations (4)

Publication number Priority date Publication date Assignee Title
CN106294490A (en) * 2015-06-08 2017-01-04 富士通株式会社 The feature Enhancement Method of data sample and device and classifier training method and apparatus
CN108009228A (en) * 2017-11-27 2018-05-08 咪咕互动娱乐有限公司 A kind of method to set up of content tab, device and storage medium
CN108764296A (en) * 2018-04-28 2018-11-06 杭州电子科技大学 More sorting techniques of study combination are associated with multitask based on K-means
CN109272003A (en) * 2017-07-17 2019-01-25 华东师范大学 A kind of method and apparatus for eliminating unknown error in deep learning model

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US11144616B2 (en) * 2017-02-22 2021-10-12 Cisco Technology, Inc. Training distributed machine learning with selective data transfers

Non-Patent Citations (2)

Title
Multi-Label Learning with Global and Local Label Correlation; Yue Zhu et al.; IEEE Transactions on Knowledge and Data Engineering; 2017-12-21; pp. 1081-1094 *
A self-training semi-supervised method for imbalanced biomedical data; Wang Kai et al.; Journal of Daqing Normal University; 2017-11-30; pp. 226-231 *

Similar Documents

Publication Publication Date Title
CN110070143B (en) Method, device and equipment for acquiring training data and storage medium
CN110650379B (en) Video abstract generation method and device, electronic equipment and storage medium
CN110163296B (en) Image recognition method, device, equipment and storage medium
CN109522863B (en) Ear key point detection method and device and storage medium
CN110933468A (en) Playing method, playing device, electronic equipment and medium
CN111432245B (en) Multimedia information playing control method, device, equipment and storage medium
CN111104980B (en) Method, device, equipment and storage medium for determining classification result
CN111127509A (en) Target tracking method, device and computer readable storage medium
CN111027490A (en) Face attribute recognition method and device and storage medium
CN111680697A (en) Method, apparatus, electronic device, and medium for implementing domain adaptation
CN110991457A (en) Two-dimensional code processing method and device, electronic equipment and storage medium
CN111586279B (en) Method, device and equipment for determining shooting state and storage medium
CN111327819A (en) Method, device, electronic equipment and medium for selecting image
CN111857793A (en) Network model training method, device, equipment and storage medium
CN111898535A (en) Target identification method, device and storage medium
CN110377914B (en) Character recognition method, device and storage medium
CN110163192B (en) Character recognition method, device and readable medium
CN111753606A (en) Intelligent model upgrading method and device
CN110853124A (en) Method, device, electronic equipment and medium for generating GIF dynamic graph
CN111488895A (en) Countermeasure data generation method, device, equipment and storage medium
CN113343709B (en) Method for training intention recognition model, method, device and equipment for intention recognition
CN110414673B (en) Multimedia recognition method, device, equipment and storage medium
CN110853704B (en) Protein data acquisition method, protein data acquisition device, computer equipment and storage medium
CN112907939B (en) Traffic control subarea dividing method and device
CN111294320B (en) Data conversion method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant