CN113743426A - Training method, device, equipment and computer readable storage medium - Google Patents

Training method, device, equipment and computer readable storage medium

Info

Publication number
CN113743426A
CN113743426A
Authority
CN
China
Prior art keywords
sample
samples
model
trained
sample set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010462418.XA
Other languages
Chinese (zh)
Inventor
张梦阳
王兵
周宇飞
郑宜海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010462418.XA priority Critical patent/CN113743426A/en
Priority to PCT/CN2021/091597 priority patent/WO2021238586A1/en
Publication of CN113743426A publication Critical patent/CN113743426A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F18/00 Pattern recognition
                    • G06F18/20 Analysing
                        • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                            • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
                        • G06F18/24 Classification techniques
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N3/00 Computing arrangements based on biological models
                    • G06N3/02 Neural networks
                        • G06N3/04 Architecture, e.g. interconnection topology
                            • G06N3/045 Combinations of networks
                        • G06N3/08 Learning methods
                            • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The application provides a training method, a training apparatus, and related devices. Before a model to be trained is trained, the method first determines the difficulty weight distribution of the samples in a first sample set, then adjusts the first sample set according to the task target of the model to be trained and the difficulty weight distribution to obtain a second sample set, and finally trains the model to be trained using the second sample set. With the training method provided by the application, an appropriate number of difficult samples can be selected for training by combining the complexity of the task target of the model to be trained with the difficulty weight of each sample. This solves the problem that the training accuracy of the model to be trained reaches a bottleneck because difficult samples are hard to label, and improves the training accuracy of the model to be trained.

Description

Training method, device, equipment and computer readable storage medium
Technical Field
The present application relates to the field of Artificial Intelligence (AI), and more particularly, to a training method, apparatus, device, and computer-readable storage medium.
Background
With the continuous development of science and technology, AI models are widely used in video and image processing, speech recognition, natural language processing, and other related fields. AI models typically require a large number of samples for training, and difficult samples (hard samples) tend to be more effective than simple samples when training an AI model. A difficult sample is a sample that the model finds hard to distinguish; it may be blurred, overexposed, or have an unclear outline, or it may be very similar to samples of other classes. In the learning process of an AI model, even a large number of simple samples can hardly bring a large improvement to the model's prediction accuracy, whereas difficult samples often do.
However, in the training process of an AI model, screening difficult samples manually wastes manpower and time, while labeling difficult samples with a computing device yields poor precision. The difficulty of labeling difficult samples therefore creates a bottleneck in the training accuracy of the AI model.
Disclosure of Invention
The application provides a training method, apparatus, device, and computer-readable storage medium, to solve the problem that the training accuracy of an AI model reaches a bottleneck because difficult samples are currently hard to label.
In a first aspect, a training method is provided, which includes the following steps:
the method comprises the steps of: obtaining a first sample set; determining the difficulty weight distribution of the samples in the first sample set; adjusting the first sample set according to the task target of the model to be trained and the difficulty weight distribution to obtain a second sample set; and finally training the model to be trained using the second sample set.
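A minimal Python sketch of these steps, under illustrative assumptions (the helper names, the per-sample `difficulty` field, and the 0.5 hard/simple cut-off are hypothetical, not part of the claims):

```python
def determine_difficulty_weights(samples):
    # Hypothetical stand-in: the application derives each sample's
    # difficulty weight from the distance between its feature vector
    # and the reference feature vector of its class.
    return {s["id"]: s["difficulty"] for s in samples}

def adjust_sample_set(samples, weights, target_hard_ratio):
    # Increase or decrease sample counts until the hard:simple ratio
    # approaches the target implied by the task objective.
    hard = [s for s in samples if weights[s["id"]] >= 0.5]
    simple = [s for s in samples if weights[s["id"]] < 0.5]
    n_simple = max(1, round(len(hard) / target_hard_ratio))
    return hard + simple[:n_simple]

first_set = [
    {"id": 0, "difficulty": 0.9}, {"id": 1, "difficulty": 0.8},
    {"id": 2, "difficulty": 0.7}, {"id": 3, "difficulty": 0.2},
    {"id": 4, "difficulty": 0.1}, {"id": 5, "difficulty": 0.1},
]
weights = determine_difficulty_weights(first_set)
second_set = adjust_sample_set(first_set, weights, target_hard_ratio=3.0)
# second_set keeps all 3 hard samples and only 1 simple sample
```

The second set would then be fed to the ordinary training loop of the model to be trained.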
By implementing the method described in the first aspect, before the model to be trained is trained, the difficulty weight distribution of the samples in the first sample set is determined, the first sample set is then adjusted according to the task target of the model to be trained and the difficulty weight distribution to obtain the second sample set, and the model to be trained is finally trained using the second sample set. In this way, during training, the training device 200 can select an appropriate number of difficult samples by combining the complexity of the task target with the difficulty weight of each sample, which solves the problem that the training accuracy of the model to be trained reaches a bottleneck because difficult samples are hard to label, and improves the training accuracy of the model to be trained.
In a possible implementation manner, the task target of the model to be trained includes one or more of an application scenario after the model to be trained is trained, an event type to be implemented after the model to be trained is trained, and a training precision target of the model to be trained. The model to be trained is an AI model, for example: a neural network model.
By implementing this implementation manner, the task targets of different models to be trained differ in difficulty. When training a model for a simple task target, such as face recognition at an indoor gate, the second sample set used during training may contain more samples with small difficulty weights: a large number of simple samples are used for training, assisted by a small number of difficult samples, so the training speed can be improved while the task target is still achieved. Conversely, when training a model for a complex task target, such as face recognition in an outdoor video surveillance scene, the second sample set may contain more samples with large difficulty weights: a large number of difficult samples are used for training, assisted by a small number of simple samples, so that the model can concentrate on learning the difficult samples, the training accuracy of the model to be trained is improved in a targeted manner, and the purpose of intensive training is achieved.
In a possible implementation manner, when the first sample set is adjusted according to the task target of the model to be trained and the difficulty weight distribution of its samples to obtain the second sample set, a target difficulty weight distribution that the sample set used for training should reach may first be determined according to the task target and the current difficulty weight distribution. The number of samples in the first sample set is then increased or decreased, or the difficulty weights of some samples in the first sample set are changed, to obtain the second sample set, where the difficulty weight distribution of the samples in the second sample set is equal or approximate to the target difficulty weight distribution.
In a specific implementation, the training device that trains the model to be trained may maintain a correspondence library that stores correspondences between a plurality of task targets and a plurality of target difficulty weight distributions. After the training device determines the difficulty weight distribution of the first sample set, it can determine the target difficulty weight distribution corresponding to the task target of the model to be trained from the correspondence library, and then adjust the difficulty weight distribution of the first sample set according to the difference between that distribution and the target difficulty weight distribution, obtaining the second sample set used to train the model.
When the first sample set is adjusted according to the target difficulty weight distribution, the difficulty weight distribution of the samples in the adjusted second sample set may be equal to the target difficulty weight distribution or may approximate it. Here, approximating the target difficulty weight distribution means that the difference between the difficulty weight distribution of the second sample set and the target difficulty weight distribution is less than a third threshold h3. For example, suppose the third threshold h3 is 0.2 and the target difficulty weight distribution is difficult samples : simple samples = 3:2, that is, 1.5, while the difficulty weight distribution of the first sample set is difficult samples : simple samples = 3:7. After the first sample set is adjusted, the difficulty weight distribution of the resulting second sample set may be 8:5, that is, 1.6. The difference between the difficulty weight distribution of the second sample set and the target difficulty weight distribution is then 1.6 − 1.5 = 0.1, which is smaller than the third threshold h3 = 0.2. It should be understood that the above examples are illustrative only and are not to be construed as specific limitations.
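The threshold check in this example can be sketched numerically (the function name is illustrative):

```python
def distribution_close_enough(current_ratio, target_ratio, h3=0.2):
    # The adjusted set is acceptable when the gap between its
    # hard:simple ratio and the target ratio is below threshold h3.
    return abs(current_ratio - target_ratio) < h3

target_ratio = 3 / 2       # target hard:simple = 3:2, i.e. 1.5
first_ratio = 3 / 7        # before adjustment: 3:7
second_ratio = 8 / 5       # after adjustment:  8:5, i.e. 1.6

# |1.6 - 1.5| = 0.1 < h3 = 0.2, so the second set approximates the target
assert not distribution_close_enough(first_ratio, target_ratio)
assert distribution_close_enough(second_ratio, target_ratio)
```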
By implementing this implementation manner, when the difficulty weight distribution of the first sample set is adjusted, the target difficulty weight distribution is determined according to the task target of the model to be trained, and the difficulty weight distribution of the first sample set is then adjusted toward it. The resulting second sample set is therefore better suited to training the model to be trained, the training accuracy can be improved in a targeted manner, and the purpose of intensive training is achieved.
In a possible implementation manner, when determining the difficulty weight distribution of the samples in the first sample set, each sample in the first sample set may be input to a feature extraction model to obtain the feature information of each sample. The reference feature information of each of the multiple classes of samples in the first sample set is then determined from the feature information of the individual samples, and the difficulty weight of each sample is determined based on the similarity between its feature information and the reference feature information of its class, yielding the difficulty weight distribution of the samples in the first sample set.
In a specific implementation, the feature extraction model is used to extract the feature information of a sample and may be an AI model trained before the first sample set is obtained. The feature extraction model may use any AI model for extracting sample features that is already available in the industry, such as the histogram of oriented gradients (HOG) feature descriptor used in target detection, the local binary pattern (LBP), the convolutional layers of a convolutional neural network, and the like, which is not limited in this application. In addition, the sample set may come from a mobile phone or a surveillance camera, local offline data, public Internet data, and so on; the present application is not specifically limited in this respect.
The feature information of each sample extracted by the feature extraction model may specifically be a feature vector or a feature matrix. Assume the first class of samples contains n samples, and the feature information obtained after each sample of the class is input into the feature extraction model is a feature vector B1, B2, …, Bn. The reference feature information of the class may then be the average vector A of the n feature vectors, or the feature vector Bj (j ∈ {1, …, n}) closest to the average vector A among the n feature vectors. It may also be obtained by mapping the feature vectors of each class of samples to a 2D space and taking the feature vector corresponding to the point in the most densely distributed area as the reference feature information of the class. The determination method of the reference feature information is not limited in this application.
It should be noted that, when the feature information is a feature vector, the difficulty weight of each sample may be determined according to the distance between the sample's feature vector and the reference feature vector of its class. The greater that distance, the smaller the similarity between the two vectors and the greater the difficulty weight of the sample; that is, distance and difficulty weight are directly proportional, while similarity and difficulty weight are inversely proportional.
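A small sketch of this distance-based weighting, using the per-class average vector A as the reference feature information (the class data and function names are illustrative):

```python
import math

def class_reference(vectors):
    # One option from the text: the per-class average vector A.
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def difficulty_weight(vec, ref):
    # Distance to the class reference is proportional to difficulty:
    # farther from the class centre means lower similarity, harder sample.
    return math.dist(vec, ref)

# Three typical class members plus one outlier (hypothetical 2-D features)
features = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [3.0, 3.0]]
ref = class_reference(features)
weights = [difficulty_weight(v, ref) for v in features]
# The outlier [3.0, 3.0] receives the largest difficulty weight
```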
By implementing this implementation manner, the feature extraction model is used to extract the feature information of each sample and of each class of samples in the sample set, and the difficulty weight of each sample is determined according to the similarity or distance between its feature information and the reference feature information of its class. The difficulty weight distribution of the first sample set obtained in this way is based on the features of the samples themselves, is independent of the structure of the model being trained and of the training method, reflects the difficulty of the samples well, and labels difficult samples with high precision, thereby solving the problem that the training accuracy of an AI model reaches a bottleneck because difficult samples are hard to label.
In a possible implementation manner, before training the model to be trained by using the second sample set, the method may further include the following steps: and adjusting the weight parameters of the loss function of the model to be trained according to the difficult weight distribution of the samples in the second sample set.
For example, if the common loss function for the task target of the model to be trained is Loss0 and the weight parameter of sample i is αi, then the formula of the loss function Loss1 of the model to be trained can be:
Loss1 = αi · Loss0
By implementing this implementation manner, after a sample with a larger difficulty weight is input into the model to be trained, the resulting loss function value is larger. When this loss function is used for back-propagation supervised training, the model is more inclined to update its parameters using the difficult samples and can learn the features of difficult samples in a more concentrated way, achieving the goal of intensive training on difficult samples and further improving the model's ability to express the features of difficult samples.
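A minimal sketch of this weighted loss, Loss1 = αi · Loss0 (the per-sample base losses and α values are illustrative assumptions):

```python
def weighted_loss(base_loss, alpha):
    # Loss1 = alpha_i * Loss0: alpha_i grows with the sample's
    # difficulty weight, so hard samples dominate the gradient.
    return alpha * base_loss

# Hypothetical per-sample base losses and difficulty-derived weights;
# the last sample is a difficult sample with alpha = 2.0
base_losses = [0.40, 0.35, 0.50]
alphas      = [1.0,  1.0,  2.0]

weighted = [weighted_loss(l, a) for l, a in zip(base_losses, alphas)]
# The difficult sample contributes the largest share of the total loss,
# so back-propagation updates the parameters mainly in its favour.
```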
In a second aspect, a training apparatus is provided, the apparatus comprising: an obtaining unit configured to obtain a first sample set, where the first sample set includes a plurality of samples; a determining unit configured to determine the difficulty weight distribution of the samples in the first sample set; an adjusting unit configured to adjust the first sample set according to the task target of the model to be trained and the difficulty weight distribution of the samples in the first sample set to obtain a second sample set; and a training unit configured to train the model to be trained using the second sample set.
In a possible implementation manner, the task target of the model to be trained includes one or more of an application scenario after the model to be trained is trained, an event type to be implemented after the model to be trained is trained, and a training precision target of the model to be trained.
In a possible implementation manner, the adjusting unit is specifically configured to determine, according to a task target of the model to be trained and the difficult weight distribution of the samples in the first sample set, a target difficult weight distribution that the sample set used for training the model to be trained should reach; the adjusting unit is used for increasing or decreasing the number of samples in the first sample set, or changing the difficulty weight of a part of samples in the first sample set to obtain a second sample set, wherein the difficulty weight distribution of the samples in the second sample set is equal to or approximate to the target difficulty weight distribution.
In a possible implementation manner, the determining unit is specifically configured to input each sample of the first sample set to the feature extraction model, and obtain feature information of each sample, where each sample corresponds to one category; the determining unit is used for determining the reference characteristic information of multiple classes of samples in the first sample set according to the characteristic information of each sample, wherein each class of samples comprises at least one sample with the same class; the determining unit is used for determining the difficulty weight corresponding to each sample based on the similarity between the characteristic information of each sample and the reference characteristic information of the corresponding category; the determining unit is used for obtaining the difficulty weight distribution of the samples in the first sample set according to the difficulty weight of each sample in the first sample set.
In a possible implementation manner, before the model to be trained is trained by using the second sample set, the adjusting unit is further configured to adjust the weight parameter of the loss function of the model to be trained according to the difficult weight distribution of the samples in the second sample set.
In a third aspect, a computer program product is provided, comprising a computer program which, when read and executed by a computing device, implements the method as described in the first aspect.
In a fourth aspect, there is provided a computer-readable storage medium comprising instructions which, when executed on a computing device, cause the computing device to carry out the method as described in the first aspect.
In a fifth aspect, there is provided a computing device comprising a processor and a memory, the processor executing code in the memory to implement the method as described in the first aspect.
In a sixth aspect, a chip is provided that includes a memory and a processor; the memory is coupled to the processor, which comprises a modem processor, the memory for storing computer program code comprising computer instructions, which the processor reads from the memory to cause the chip to perform the method as described in the first aspect.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a schematic diagram of a training and prediction system;
FIG. 2 is a diagram of an example of a difficult sample in an application scenario;
FIG. 3 is a schematic diagram of a training apparatus provided herein;
FIG. 4 is a schematic flow chart of a training method provided herein;
FIG. 5 is a schematic diagram of a convolutional neural network;
FIG. 6 is an exemplary diagram of reference feature information for each type of sample in an application scenario;
FIG. 7 is an exemplary illustration of a distribution of data from a first sample set to a second sample set in an application scenario;
FIG. 8 is a schematic flow chart of a training method provided in the present application in an application scenario;
FIG. 9 is a schematic diagram of a chip structure provided in the present application;
fig. 10 is a schematic structural diagram of a computing device provided in the present application.
Detailed Description
The terminology used in the description of the embodiments section of the present application is for the purpose of describing particular embodiments of the present application only and is not intended to be limiting of the present application.
First, some terms related to the present application are explained.
Loss Function: the loss function is used to estimate how inconsistent the predicted value f(x) of a model is with the true value y, and is typically a non-negative real-valued function. The smaller the value of the loss function, the better the robustness of the model; the loss function is generally used to adjust the network's learning direction. For example, in a 5-class problem, if a picture truly belongs to class 4, its label may be (0,0,0,1,0); if the prediction result of the model is f(x) = (0.1, 0.15, 0.05, 0.6, 0.1), the value of the loss function is −log(0.6). If the threshold of the loss function is −log(0.9), the model still needs further training, and the network's learning direction is adjusted through the loss function until a final model with good performance is obtained. The above loss function formula is used only for illustration; the application does not limit the specific formula of the loss function.
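The −log(0.6) example can be checked directly with a standard cross-entropy computation (the function name is illustrative; the formula is the usual one for a one-hot label):

```python
import math

def cross_entropy(y_true, y_pred):
    # Cross-entropy between a one-hot label and a predicted
    # probability vector; only the true class's term is non-zero.
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

y_true = (0, 0, 0, 1, 0)                  # class 4 in a 5-class problem
y_pred = (0.1, 0.15, 0.05, 0.6, 0.1)

loss = cross_entropy(y_true, y_pred)      # -log(0.6), about 0.511
threshold = -math.log(0.9)                # about 0.105
# loss exceeds the threshold, so the model still needs further training
```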
Feature Extraction (Feature Extraction): a method of transforming a measurement to emphasize that the measurement has a representative characteristic.
Back Propagation (BP): a neural network may use the back propagation algorithm during training to correct the parameters of the initial neural network model, so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, an error loss arises when the input signal is propagated forward to the output; the parameters of the initial neural network model are updated by propagating the error loss information (such as the value of the loss function) backward, so that the error loss converges. The back propagation algorithm is a backward pass dominated by the error loss, aimed at obtaining the optimal parameters of the neural network model, such as its weight matrices.
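The convergence behaviour described above can be illustrated with a one-parameter model trained by gradient descent (a toy sketch, not the patent's method; the learning rate and data are arbitrary):

```python
def loss(w, x, y):
    # Squared error of a one-weight model f(x) = w * x
    return (w * x - y) ** 2

def grad(w, x, y):
    # Analytic gradient of the loss with respect to w
    return 2 * x * (w * x - y)

w, x, y, lr = 0.0, 1.0, 2.0, 0.1
for _ in range(50):
    w -= lr * grad(w, x, y)   # propagate the error loss back into w
# The error loss converges: w approaches 2.0 and the loss approaches 0
```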
Next, an application scenario related to the present application will be explained.
Artificial Intelligence (AI): the method is a theory, method, technology and application system for simulating, extending and expanding human intelligence by using a digital computer or a computing device controlled by the digital computer, sensing the environment, acquiring knowledge and obtaining the best result by using the knowledge. The application scenarios of artificial intelligence are very wide, such as face recognition, vehicle recognition, pedestrian re-recognition, data processing application, and the like.
An AI model is a set of mathematical methods for implementing AI. An AI model needs to be trained with a large number of samples before it acquires prediction capability. For example, to train a model for classifying spam, in the training phase a sample set containing many mails labeled as spam and many labeled as non-spam may be used to train a neural network; the network continuously captures the association between these mails and their labels and adjusts its structural parameters accordingly. In the prediction phase, the neural network can then classify new, unlabeled mails as spam or not. It is to be understood that the above description is intended to be illustrative, and not restrictive.
The structure of a training and prediction system for an AI model is explained below. As shown in fig. 1, fig. 1 is an architecture diagram of an AI model training and prediction system. The system 1000 is a system architecture commonly used in the AI field and includes: the training device 200, the execution device 100, the database 130, the client device 140, and the data collection device 150. The various components of the system 1000 may be interconnected by a network, which may be a wired network, a wireless network, or a combination thereof.
the training device 200 may be a physical server, such as an X86 server, an ARM server, or the like, or may be a Virtual Machine (VM) implemented based on a general physical server and combining Network Function Virtualization (NFV) technology, where the VM refers to a complete computer system having a complete hardware system function and operating in a completely isolated environment, such as a Virtual Machine in cloud data, and the present application is not limited specifically.
The training device 200 is configured to train the model to be trained using the sample set in the database 130, obtain a target model, and send the target model to the execution device 100. Specifically, during training, the training device 200 may compare the output data of the model to be trained with the label of the sample data and continuously adjust the structural parameters of the model according to the comparison result, until the difference between the output data of the model and the label of the sample data is smaller than a certain threshold, thereby completing the training and obtaining the target model. The model to be trained and the target model may be any AI model, such as the neural network model for classifying spam in the above example, an image classification model, a semantic recognition model, and the like, which is not limited in the present application. The sample sets maintained in the database 130 are not necessarily all from the data collection device 150 and may also be received from other devices. The database 130 may be a local database, or a database on a cloud or other third-party platform, which is not specifically limited in this application.
The execution device 100 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality/virtual reality, a vehicle-mounted terminal, or the like, and may also be a server or a cloud device, and the application is not particularly limited.
The execution apparatus 100 is used to implement various functions according to the target model trained by the training apparatus 200. Specifically, in fig. 1, a user may input data to the execution apparatus 100 through the client apparatus 140, and predict the input data using the target model to obtain an output result. The execution device 100 may return the output result to the client device 140, so that the user can view the result output by the execution device 100, where the specific presentation form may be a specific manner such as display, sound, and action; the execution device 100 may also store the output result as a new sample in the database 130, so that the training device 200 may use the new sample to readjust the structural parameters of the target model, thereby improving the performance of the target model.
For example, the client device 140 is a mobile phone terminal, the execution device 100 is a cloud device, the trained target model is a semantic recognition model, the user can input text data to be recognized to the execution device through the client device 140, the execution device 100 performs semantic recognition on the text data to be recognized through the target model, and returns a semantic recognition result to the client device 140, so that the user can view the semantic recognition result through the user device (mobile phone terminal).
It should be noted that fig. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the position relationship between the devices, modules, etc. shown in the diagram does not constitute any limitation, for example, in fig. 1, the database 130 is an external memory with respect to the training device 200, in other cases, the database 130 may be placed in the training device 200, and the present application is not limited in particular.
In summary, the implementation of various applications in the AI field depends on AI models; different functions, such as classification, recognition, and detection, are implemented by AI models, and an AI model needs to be trained in advance using a sample set before being deployed on an execution device. When training an AI model with the sample data of a sample set, the effect of difficult samples (hard samples) tends to be greater than that of simple samples. Difficult samples are samples that an AI model finds hard to distinguish, and can be divided into two types. One type consists of blurred, overexposed samples with unclear outlines; no matter what algorithm or initialization parameters the AI model adopts, it may mispredict these samples. The other type consists of samples that are very similar to other samples, making them hard for the AI model to distinguish; these are difficult samples only for the current AI model, not for all AI models. For example, as shown in fig. 2, when training an AI model for identifying the pet dog "doll", samples 1, 3, and 5 labeled "cookies" in fig. 2 are hard to distinguish from the outline and shape of a "doll" and are therefore difficult samples. When training an AI model for identifying "cats", the samples 1, 3, and 5 with the "cookies" label in fig. 2 are easily distinguished from "cats" and are therefore not difficult samples. It should be understood that fig. 2 is for illustration only and should not be construed as a specific limitation.
In the training process of an AI model, even a large number of simple samples can hardly improve the prediction accuracy of the model by much, whereas hard samples often bring a large improvement in prediction accuracy. Therefore, how to screen hard samples out of a large number of training samples so as to perform intensive training of the AI model has long been a major concern for researchers.
In general, hard samples can be obtained by manual labeling or machine labeling. Manually labeling hard samples is not only a project that wastes manpower and time; its labeling precision also cannot be guaranteed, due to personal cognitive bias, work fatigue, and other reasons. Moreover, a computing device obtains sample features by examining each pixel, so some samples that do not look similar to the human eye may nevertheless be hard samples for the AI model. The precision of manually labeled hard samples is therefore poor.
Although machine labeling of hard samples is simple, convenient, and fast, its labeling precision is also poor. If only wrongly predicted samples are labeled as hard samples, many hard samples will be missed, because correctly predicted samples may also be hard samples. For example, suppose the label of sample A is (0, 1), indicating that the sample belongs to class 2, and inputting sample A into the classification model M1 yields the prediction vector (0.4, 0.6); that is, the classification result of M1 indicates that sample A belongs to class 2, and the classification is correct. However, the difference between the prediction vector (0.4, 0.6) and the sample label (0, 1) is large, so the value of the loss function is also large: sample A is classified correctly yet remains difficult for the classification model M1 to distinguish, and therefore belongs to the hard samples. Taking only wrongly classified samples as hard samples therefore gives poor labeling precision. If, instead, samples with a large loss function value are taken as hard samples, some simple samples may be wrongly labeled as hard samples. As noted above, the loss function measures the degree of inconsistency between the predicted value and the actual value of the model, and there are many possible reasons for such inconsistency: the sample may really be a hard sample, but the selected model structure or training method may also be defective while the sample itself is not hard to distinguish. Therefore, taking samples with a large loss function value as hard samples also gives poor labeling precision.
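The sample A case above can be checked numerically. The sketch below is illustrative only (cross entropy is chosen here as an example loss function); it shows that a correctly classified sample can still produce a large loss value:

```python
import math

def cross_entropy(label, pred):
    """Cross-entropy between a one-hot label and a prediction vector."""
    eps = 1e-12  # guard against log(0)
    return -sum(y * math.log(p + eps) for y, p in zip(label, pred))

label_a = (0, 1)           # sample A's label: it belongs to class 2
pred_a = (0.4, 0.6)        # prediction vector of model M1: class 2 wins, so correct
pred_simple = (0.05, 0.95) # a confidently predicted simple sample, for comparison

loss_a = cross_entropy(label_a, pred_a)           # large despite being correct
loss_simple = cross_entropy(label_a, pred_simple) # small
```

Both predictions are correct in the sense that class 2 scores highest, yet sample A's loss is roughly ten times larger, which is why labeling only wrongly predicted samples as hard samples misses cases like sample A.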
In summary, in the training process of an AI model, manually screening hard samples wastes manpower and time, the precision with which computing devices label hard samples is poor, and this difficulty in labeling hard samples creates a bottleneck in the training precision of AI models.
In order to solve the problem that hard samples are difficult to label, which creates a bottleneck in the training precision of AI models, the present application provides a training device 200. The training device 200 may be applied to the AI model training and prediction system shown in fig. 1. As shown in fig. 3, the training device 200 may include an obtaining unit 210, a determining unit 220, an adjusting unit 230, a database 140, a database 150, and a training unit 240.
The obtaining unit 210 is configured to obtain a first sample set, where the first sample set includes a plurality of samples.
The determination unit 220 is configured to determine a difficult weight distribution of the samples in the first set of samples.
Here, the higher the difficulty weight of a sample, the harder the sample is for the model to be trained; the lower the difficulty weight, the simpler the sample. The difficulty weight distribution of samples refers to the ratio of the numbers of samples corresponding to each difficulty weight. For example, if sample set A contains 1000 samples with difficulty weight 1, 2000 samples with difficulty weight 2, and 3000 samples with difficulty weight 3, then the difficulty weight distribution of the samples in sample set A is 1:2:3. It should be understood that the above example is merely illustrative and not intended to be limiting.
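As a minimal sketch of this bookkeeping (sample set A and its weights are the hypothetical values from the example above), the distribution can be computed by counting the samples per difficulty weight and reducing the counts to a ratio:

```python
from collections import Counter
from functools import reduce
from math import gcd

def difficulty_distribution(weights):
    """Count samples per difficulty weight and reduce the counts to a ratio."""
    counts = Counter(weights)
    divisor = reduce(gcd, counts.values())
    return {w: n // divisor for w, n in sorted(counts.items())}

# Sample set A from the example: 1000, 2000, and 3000 samples of weights 1, 2, 3
sample_set_a = [1] * 1000 + [2] * 2000 + [3] * 3000
distribution = difficulty_distribution(sample_set_a)  # {1: 1, 2: 2, 3: 3}, i.e. 1:2:3
```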
In a specific implementation, the determining unit 220 may determine the difficulty weight distribution of the samples in the first sample set using a feature extraction model in the database 150. Specifically, the determining unit 220 may perform feature extraction on each sample in the first sample set using a feature extraction model in the database 150 to obtain the feature information of each sample, then determine the reference feature information of each class according to the feature information of the samples in that class, and finally determine the difficulty weight corresponding to each sample according to the similarity between the feature information of the sample and the reference feature information of its class. For example, the determining unit 220 may input the first sample set into the feature extraction model in the database 150 to obtain a feature vector for each sample, then use the average vector of the feature vectors of each class of samples as the reference feature information of that class, and finally determine the difficulty weight corresponding to each sample according to the similarity or distance between the sample's feature vector and the average vector of the corresponding class.
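The per-sample computation performed by the determining unit 220 can be sketched as follows; the binning of similarity into discrete weights and all numeric values are illustrative assumptions, not part of the embodiment:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def difficulty_weights(samples_by_class, num_levels=3):
    """For each class: average its feature vectors into a reference vector, then
    map each sample's similarity to that reference onto a 1..num_levels
    difficulty weight (low similarity -> high weight -> hard sample)."""
    weights = {}
    for cls, vecs in samples_by_class.items():
        dim = len(vecs[0])
        reference = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
        for idx, v in enumerate(vecs):
            sim = cosine(v, reference)
            level = min(num_levels, int((1 - max(sim, 0)) * num_levels) + 1)
            weights[(cls, idx)] = level
    return weights

# two typical "dog" feature vectors and one outlier (a hard sample)
w = difficulty_weights({"dog": [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]})
```

The outlier sample, whose feature vector is far from the class average, receives a higher difficulty weight than the two typical samples.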
The adjusting unit 230 is configured to adjust the difficulty weight distribution of the first sample set according to the difficulty weight of each sample and the task target of the model to be trained, so as to obtain a second sample set.
In an embodiment, the adjusting unit 230 may determine, according to a task target of the model to be trained, a target difficulty weight distribution that a sample set used for training the model to be trained should reach, and then increase or decrease the number of samples in the first sample set according to the difficulty weight distribution of the samples in the first sample set, or change some samples in the first sample set to obtain a second sample set, so that the difficulty weight distribution of the samples in the second sample set is equal to or similar to the target difficulty weight distribution.
For example, suppose the first sample set has 3 difficulty weights (α1, α2, and α3, respectively), and the difficulty weight distribution of the first sample set is α1:α2:α3 = 1:2:3. The adjusting unit 230 may determine, according to the difficulty level of the task target of the model to be trained, that the target difficulty weight distribution should be α1:α2:α3 = 1:1:1, and then adjust the first sample set: it may reduce the number of samples whose difficulty weight is α2 or α3, or increase the number of samples whose difficulty weight is α1, so that the difficulty weight distribution of the first sample set becomes 1:1:1, thereby obtaining the second sample set. It should be understood that the above example is for illustrative purposes only, and the present application does not limit the number of difficulty weights.
It can be understood that when training a model to be trained that realizes a simple task target, for example face recognition in an indoor gate scene, the second sample set used during training may contain more samples with small difficulty weights, so that a large number of simple samples are used for training and a small number of hard samples assist the training; this improves the training speed while still achieving the task target. Conversely, when training a model to be trained that realizes a complex task target, such as face recognition in an outdoor video surveillance scene, the second sample set used during training may contain more samples with large difficulty weights, so that a large number of hard samples are used for training and a small number of simple samples assist the training; the model to be trained can then concentrate more on learning the hard samples, which improves its training precision in a targeted manner and achieves the purpose of reinforced learning.
The training unit 240 is configured to train the model to be trained by using the second sample set, so as to obtain a trained target model.
In a specific implementation, before the training unit 240 trains the model to be trained using the second sample set, the weight parameters of the loss function of the model to be trained may be adjusted according to the difficulty weight of each sample in the second sample set; then, when the model to be trained is trained using the second sample set, back-propagation supervised training is performed on the model according to the loss function, so as to obtain the target model. In the loss function of the model to be trained, the difficulty weight of each sample of the second sample set is directly proportional to the value of the loss function, so hard samples with large difficulty weights have a greater influence on the loss function. The AI model can thus concentrate more on learning the features of hard samples and is more inclined to update its parameters using hard samples, thereby achieving the purpose of intensive training of the model to be trained on hard samples and improving its performance.
It should be noted that the positional relationship between the devices and units shown in fig. 3 is not intended to be limiting, for example, in fig. 3, the database 130 is an external memory with respect to the training device 200, and in other cases, the database 130 may be disposed in the training device 200; database 140 and database 150 are internal memory to training device 200, and in other cases, database 140 and/or database 150 may be located in external memory.
In summary, the training apparatus 200 provided in this embodiment of the present application may determine the difficulty weight distribution of the samples in the first sample set before training the model to be trained, adjust the first sample set according to the task target of the model to be trained and the difficulty weight distribution to obtain the second sample set, and finally train the model to be trained using the second sample set. In this way, in the process of training the model to be trained, the training device 200 can select a suitable number of hard samples for training in combination with the complexity of the task target and the difficulty weight of each sample, thereby solving the problem that hard samples are difficult to label and create a bottleneck in the training precision of AI models, and improving the training precision of the AI model.
The following describes in detail the training method provided by the present application, which is applied to the training apparatus 200 in the embodiment of fig. 3. As shown in fig. 4, the method may include the steps of:
S210: the training device 200 obtains a first sample set, wherein the first sample set comprises a plurality of samples.
The sample may be any form of sample, such as an image sample, a text sample, a voice sample, a biometric data (e.g., fingerprint, iris) sample, and so on. The first sample set may include samples of multiple categories: for example, one category may consist entirely of "cookie" images, one category of images of the same face at various angles, and one category of images of vehicles of the same model at different angles and in different scenes; the categories may be determined according to the task target of the model to be trained. For example, if the task of the model to be trained is face recognition, face images of the same person can be classified into one category, such as Xiao Ming's face photos as category 1 and Xiao Gang's face photos as category 2. It should be understood that the above examples are illustrative only and are not to be construed as a particular limitation.
S220: the training apparatus 200 determines a difficulty weight distribution for the samples in the first sample set.
In an embodiment, after feature extraction is performed on each sample by the feature extraction model, the difficulty weight of each sample is determined according to the extracted feature information, and the difficulty weight distribution of the samples in the first sample set is then obtained. Specifically, each sample of the first sample set may be input into a feature extraction model to obtain the feature information of each sample, where each sample corresponds to one category. The reference feature information of the multiple categories of samples in the first sample set is then determined according to the feature information of each sample, where each category of samples includes at least one sample of the same category. The difficulty weight corresponding to each sample is determined based on the similarity between the sample's feature information and the reference feature information of the corresponding category, and the difficulty weight distribution of the samples in the first sample set is obtained from the difficulty weights of the individual samples. Step S220 is described below in steps S221 to S224.
S230: the training device 200 adjusts the first sample set according to the task objective of the model to be trained and the difficult weight distribution of the samples in the first sample set, and obtains a second sample set.
The task target of the model to be trained includes one or more of: the application scene of the model after training, the event type to be realized after training, and the training precision target of the model. For example, a face recognition model for a video surveillance application scene and a face recognition model for a mobile phone unlocking application scene require different target difficulty weight distributions of samples during training; models for face recognition events and for clothing recognition events require different target difficulty weight distributions during training; and models to be trained with a low training precision target and with a high training precision target require different target difficulty weight distributions. It should be understood that the above examples are illustrative only and are not to be construed as a particular limitation.
It can be understood that when training a model that realizes a simple task target, for example face recognition in an indoor gate scene, the second sample set used during training may contain more samples with small difficulty weights, so that a large number of simple samples are used for training and a small number of hard samples assist the training; this improves the training speed while still achieving the task target. Conversely, when training a model that realizes a complex task target, such as face recognition in an outdoor video surveillance scene, the second sample set used during training may contain more samples with large difficulty weights, so that a large number of hard samples are used for training and a small number of simple samples assist the training; the model to be trained can then concentrate more on learning the hard samples, which improves its training precision in a targeted manner and achieves the purpose of reinforced learning.
In a specific implementation, the training device 200 may maintain a correspondence library, which stores the correspondences between a plurality of task targets and a plurality of target difficulty weight distributions. After the training device 200 determines the difficulty weight distribution of the first sample set in the database 130, it may determine the target difficulty weight distribution corresponding to the task target of the model to be trained according to the task target and the correspondence library, and adjust the difficulty weight distribution of the first sample set according to the difference between the difficulty weight distribution of the first sample set and the target difficulty weight distribution, so as to obtain the second sample set for training the model to be trained. It should be noted that the correspondence library may be stored in an internal memory of the training device 200 or in an external memory of the training device 200, as determined by the processing and storage capabilities of the training device; the present application is not limited in this respect.
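A minimal sketch of such a correspondence library follows; the task names and ratios are hypothetical placeholders, not values taken from the embodiment:

```python
# Hypothetical correspondence library: task target -> target difficulty weight
# distribution, expressed here as a (hard, simple) ratio.
CORRESPONDENCE_LIBRARY = {
    "face_recognition/indoor_gate": (2, 3),           # simple scene: favor simple samples
    "face_recognition/outdoor_surveillance": (3, 2),  # complex scene: favor hard samples
}

def target_distribution(task_target):
    """Look up the target difficulty weight distribution for a task target."""
    return CORRESPONDENCE_LIBRARY[task_target]

ratio = target_distribution("face_recognition/outdoor_surveillance")  # (3, 2)
```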
In an embodiment, when the first sample set is adjusted according to the target difficulty weight distribution, the number of samples in the first sample set may be increased or decreased, or the difficulty weights of some samples in the first sample set may be changed, to obtain the second sample set. For example, suppose the task target is face recognition in an outdoor video surveillance scene, and it is determined from the correspondence library that the target difficulty weight distribution required by this task target is hard samples : simple samples = 3:2, where a hard sample is a sample whose difficulty weight α is higher than a first threshold h1 and a simple sample is one whose difficulty weight α is lower than a second threshold h2. Suppose the first sample set P1 contains 10000 samples, of which 3000 are hard samples and 7000 are simple samples; that is, the difficulty weight distribution of the first sample set is hard samples : simple samples = 3:7. When adjusting the difficulty weight distribution of the first sample set, the 3000 hard samples can be expanded into 6000 hard samples by data enhancement, 4000 simple samples can be randomly selected from the 7000 simple samples, and the 6000 hard samples and 4000 simple samples form the second sample set P2, whose difficulty weight distribution is hard samples : simple samples = 3:2.
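The P1 to P2 adjustment in this example can be sketched as follows; the `augment` placeholder stands in for data enhancement, and the seed is only for reproducibility:

```python
import random

def rebalance(hard, simple, target_hard=3, target_simple=2,
              augment=lambda s: ("augmented", s)):
    """Double the hard samples via data enhancement, then randomly subsample
    the simple samples so that the set reaches the target hard:simple ratio."""
    hard_out = hard + [augment(s) for s in hard]             # 3000 -> 6000
    n_simple = len(hard_out) * target_simple // target_hard  # 6000 * 2 / 3 = 4000
    simple_out = random.sample(simple, n_simple)
    return hard_out, simple_out

random.seed(0)
p1_hard = [f"hard_{i}" for i in range(3000)]
p1_simple = [f"simple_{i}" for i in range(7000)]    # P1 is 3:7
p2_hard, p2_simple = rebalance(p1_hard, p1_simple)  # P2 is 6000:4000 = 3:2
```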
When the first sample set is adjusted according to the target difficulty weight distribution, the difficulty weight distribution of the samples in the second sample set obtained after adjustment may be equal to the target difficulty weight distribution or may approximate it. Here, approximating the target difficulty weight distribution means that the difference between the difficulty weight distribution of the second sample set and the target difficulty weight distribution is less than a third threshold h3, for example h3 = 0.2. Taking the above example again, the target difficulty weight distribution is hard samples : simple samples = 3:2 = 1.5, and the difficulty weight distribution of the first sample set is hard samples : simple samples = 3:7. After the first sample set is adjusted, the difficulty weight distribution of the resulting second sample set may also be 8:5 = 1.6; the difference between the difficulty weight distribution of the second sample set and the target difficulty weight distribution is then 1.6 - 1.5 = 0.1, which is smaller than the third threshold h3 = 0.2. It should be understood that the above example is illustrative only and is not to be construed as a particular limitation.
As another example, as shown in the left histogram of fig. 7, the training apparatus 200 counts the distribution of difficulty weights over the entire first sample set based on the difficulty weight of each sample: the number of samples with difficulty weight α1 = 1 is 3000, the number with difficulty weight α2 = 2 is 2500, the number with difficulty weight α3 = 3 is 2000, the number with difficulty weight α4 = 4 is 1000, and the number with difficulty weight α5 = 5 is 500; that is, the difficulty weight distribution of the first sample set is α1:α2:α3:α4:α5 = 6:5:4:2:1. Assume the target difficulty weight distribution required by the current task target is as shown in the right histogram of fig. 7, i.e. α1:α2:α3:α4:α5 = 25:25:20:18:16: the number of samples with difficulty weight α = 1 is 2500, the number with α = 2 is 2500, the number with α = 3 is 2000, the number with α = 4 is 1800, and the number with α = 5 is 1600. The numbers of samples with difficulty weights α = 4 and α = 5 are then insufficient, so samples with difficulty weights α = 4 and α = 5 may be increased, and the difficulty weight distribution of the first sample set is finally adjusted as shown in the right histogram of fig. 7, thereby obtaining the second sample set. It should be understood that fig. 7 is for illustration only and should not be construed as a particular limitation.
In a specific implementation, increasing the number of samples in the first sample set, or changing the difficulty weights of some samples in the first sample set, may be implemented by data enhancement. Data enhancement may consist of randomly perturbing some hard samples or simple samples to obtain more hard samples or simple samples, where the random perturbation includes adding noise points, changing illumination information, changing environment information (such as weather, background, or time), and so on. Data enhancement may also consist of inputting some of the hard samples or simple samples into a generative adversarial network (GAN) to obtain more hard samples or simple samples. A GAN may include a discrimination network and a generation network, where the generation network is used to generate a picture from the input data, and the discrimination network is used to decide whether an input picture is a real picture. In the training process of the GAN, the goal of the generation network is to generate pictures as realistic as possible, so that the output of the discrimination network is "real", while the goal of the discrimination network is to discriminate as accurately as possible, i.e., to judge the pictures generated by the generation network as fake. The two networks form a dynamic game, and the trained GAN can finally generate pictures that "pass for real", so as to obtain more hard samples or simple samples.
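The random-perturbation variant of data enhancement can be sketched as below, here on a grayscale image represented as a 2-D list of pixel values in [0, 1]; the perturbation magnitudes are arbitrary illustrative choices:

```python
import random

def perturb(image, noise_prob=0.02, max_brightness_shift=0.1):
    """Return a new sample: add random noise points and shift the illumination."""
    shift = random.uniform(-max_brightness_shift, max_brightness_shift)
    perturbed = []
    for row in image:
        new_row = []
        for px in row:
            if random.random() < noise_prob:
                px = random.random()                        # noise point
            new_row.append(min(1.0, max(0.0, px + shift)))  # illumination change
        perturbed.append(new_row)
    return perturbed

hard_sample = [[0.5] * 8 for _ in range(8)]
more_hard_samples = [perturb(hard_sample) for _ in range(4)]  # 4 new hard samples
```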
S240: the training apparatus 200 trains the model to be trained using the second sample set.
In an embodiment, before the model to be trained is trained using the second sample set, the weight parameters of the loss function of the model to be trained are adjusted according to the difficulty weights of the samples in the second sample set. Then, when the model to be trained is trained using the second sample set, back-propagation supervised training can be performed on the model according to the loss function to obtain the trained model. In the loss function, the difficulty weight of each sample is directly proportional to the value of the loss function, so after a sample with a larger difficulty weight is input into the model to be trained, the resulting loss function value is larger; performing back-propagation supervised training with this loss function makes the model to be trained more inclined to update its parameters using the hard samples.
Specifically, if the loss function commonly used for the task target of the model to be trained is Loss0, the formula of the loss function Loss1 of the model to be trained may be as follows:
Loss1 = α_i · Loss0    (1)
therefore, the difficult samples with large difficult weights have larger influence on the loss function, when the model to be trained is subjected to back propagation supervised training by using the loss function, the model to be trained can be more concentrated on learning the characteristics of the difficult samples, and the difficult samples are more prone to be utilized for parameter updating, so that the aim of performing enhanced training on the difficult samples by the model to be trained is fulfilled, and the characteristic expression capability of the model on the difficult samples is further improved. It should be understood that equation 3 is for illustration only, and the formula of the Loss function Loss1 of the model to be trained may be other Loss1 and αiThe formula is in direct proportion, and the application is not limited in particular.
For example, if Loss0 takes the cross-entropy form:

Loss0 = -(1/m) · Σ_{i=1..m} Σ_{j=1..n} y_{ij} · log f_j(x_i; w, b)    (2)
where w and b are parameters of the model to be trained, x is the input data, y is the output data, m is the number of input data, and n is the number of classes into which the model to be trained classifies; for example, if the model to be trained is a five-class model, then n = 5. During the training process of the model to be trained, the formula of Loss1 may then be:

Loss1 = -(1/m) · Σ_{i=1..m} α_i · Σ_{j=1..n} y_{ij} · log f_j(x_i; w, b)    (3)
it should be understood that the above formula is only for illustration, and the specific formula of Loss0 may adopt any one of various Loss formulas existing in the industry, such as a mean square error Loss function, a cross entropy Loss function, etc., and the present application is not limited in particular.
The specific process by which the training apparatus 200 determines the difficulty weight distribution of the samples in the first sample set in step S220 above is described in detail below. It can be divided into the following steps:
step S221: and inputting each sample of the first sample set into a feature extraction model, and obtaining feature information of each sample, wherein each sample corresponds to one category.
In a specific implementation, the feature information of each sample extracted by the feature extraction model may specifically be a feature vector or a feature matrix. To facilitate understanding of the present application, the following description uniformly takes feature information in the form of a feature vector as an example. A feature vector is a numerical feature of the sample expressed in vector form, which can represent the sample's features effectively; in general, a feature vector is a multi-dimensional vector, such as a 512-dimensional or 1024-dimensional vector, and the specific dimension of the vector is not limited in the present application. It should be noted that a feature extraction model is used to extract a certain type of features of a sample, and different feature extraction models extract different feature vectors for the same sample: a feature extraction model for extracting face attributes can extract features of sample A such as eyes, nose, and mouth, while a feature extraction model for extracting vehicle attributes can extract features of sample A such as wheels and steel materials. Therefore, the feature extraction model may be determined according to the task target of the model to be trained: if the model to be trained is a face recognition network, the feature extraction model used in step S221 is one that extracts face attribute features, and if the model to be trained is a vehicle recognition network, the feature extraction model used in step S221 is one that extracts vehicle attribute features.
It can be understood that the feature vectors obtained after a simple sample and a hard sample are input into the feature extraction model are different: the quality of the feature vector extracted from a simple sample is good, while that extracted from a hard sample is poor. The quality of a feature vector depends on its ability to distinguish image samples of different classes. Good features should be rich in information and unaffected by noise and various transformations, so that the class to which the sample belongs can be quickly obtained after the feature vector is input into a classifier; conversely, feature information of poor quality is deficient, and the class to which the sample belongs is difficult to determine after its feature vector is input into the classifier. For example, with a feature extraction model for extracting face attributes, when feature extraction is performed on a simple sample it can easily be determined that the sample includes eye, nose, and mouth features, whereas for a hard sample it is difficult to determine whether it includes such features. The feature vectors of simple samples should therefore be similar to each other, while the feature vectors of hard samples differ from those of simple samples.
In a specific implementation, the feature extraction model in the database 150, used to extract the feature information of samples, may be an AI model trained before step S210. The feature extraction model may use any sample feature extraction model already available in the industry, such as the histogram of oriented gradients (HOG) feature descriptor used in target detection, a local binary pattern (LBP), the convolutional layers of a convolutional neural network, and so on; the present application is not limited in this respect. In addition, the sources of the sample set may include mobile phones or surveillance cameras, local offline data, public Internet data, and the like; the present application is not particularly limited.
The feature extraction model is exemplified below by taking a convolutional neural network as an example.
A convolutional neural network (CNN) is a deep neural network with a convolutional structure and a deep learning architecture, where a deep learning architecture refers to performing learning at multiple levels of different degrees of abstraction through machine learning algorithms. As a deep learning architecture, a CNN is a feed-forward artificial neural network in which individual neurons respond to overlapping regions of the image input to it. As shown in fig. 5, the convolutional neural network (CNN) 300 may include an input layer 310, a convolutional layer/pooling layer 320, and a neural network layer 330, where the pooling layer is an optional network layer.
(1) Convolutional layer/pooling layer 320: as shown in FIG. 5, convolutional layer/pooling layer 320 may comprise layers such as 321-326, in one implementation, 321 is a convolutional layer, 322 is a pooling layer, 323 is a convolutional layer, 324 is a pooling layer, 325 is a convolutional layer, and 326 is a pooling layer; in another implementation, 321, 322 are convolutional layers, 323 are pooling layers, 324, 325 are convolutional layers, and 326 are pooling layers. I.e., the output of a convolutional layer may be used as input to a subsequent pooling layer, or may be used as input to another convolutional layer to continue the convolution operation.
Taking convolutional layer 321 as an example, convolutional layer 321 may include a plurality of convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator is essentially a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually slid over the input image pixel by pixel (or two pixels by two pixels, depending on the value of the stride) in the horizontal direction, so as to extract a specific feature from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension (Depth Dimension) of the weight matrix is the same as the depth dimension of the input image, and the weight matrix extends to the entire depth of the input image during the convolution operation. Thus, convolving with a single weight matrix produces a convolved output with a single depth dimension. In most cases, however, not a single weight matrix is used; instead, a plurality of weight matrices of the same dimension are applied, and the outputs of the weight matrices are stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features in the image: for example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and yet another weight matrix is used to blur unnecessary noise points in the image. The dimensions of the multiple weight matrices are the same, so the dimensions of the feature maps they extract are also the same, and the extracted feature maps of the same dimension are combined to form the output of the convolution operation, yielding a final feature vector.
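For illustration, the sliding-window convolution described above, with multiple weight matrices whose outputs are stacked along the depth dimension, can be sketched in Python as follows. The kernels and image here are hypothetical examples, not values from this application:

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Slide one weight matrix over a single-channel image (no padding)."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# Two illustrative kernels: one responds to vertical edges, one blurs.
edge_kernel = np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]])
blur_kernel = np.full((3, 3), 1. / 9.)

image = np.arange(36, dtype=float).reshape(6, 6)
# Each kernel yields one feature map; stacking them forms the depth dimension.
feature_maps = np.stack([conv2d_single(image, k) for k in (edge_kernel, blur_kernel)])
print(feature_maps.shape)  # (2, 4, 4)
```

With a 6x6 image and 3x3 kernels at stride 1, each feature map is 4x4, and the two maps stack into a depth-2 output, mirroring how multiple weight matrices of the same dimension combine.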
The weight values in these weight matrices need to be obtained through a large amount of training in practical applications. Each weight matrix formed by the trained weight values can extract specific information from an input image to generate a feature vector, and the feature vector is then input to the neural network layer for classification, thereby helping the convolutional neural network 300 make correct predictions.
When the convolutional neural network 300 has multiple convolutional layers, the initial convolutional layers (e.g., 321) tend to extract more general features, which may also be referred to as low-level features. As the depth of the convolutional neural network 300 increases, the later convolutional layers (e.g., 326) extract more complex features, such as features with high-level semantics; features with higher-level semantics are more suitable for the problem to be solved.
(2) Pooling layer: since it is often desirable to reduce the number of training parameters, a pooling layer often needs to be periodically introduced after a convolutional layer. In layers 321-326 illustrated by 320 in fig. 5, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain a smaller-sized image. The average pooling operator may calculate the pixel values in the image over a particular range to produce an average. The max pooling operator may take the pixel with the largest value in a particular range as the result of max pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel point in the image output by the pooling layer represents an average value or a maximum value of the corresponding sub-region of the image input to the pooling layer.
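The average and maximum pooling operators described above can be sketched as follows; the 4x4 image is a hypothetical example:

```python
import numpy as np

def pool2d(image, size=2, stride=2, mode="max"):
    """Downsample a single-channel image with max or average pooling."""
    oh = (image.shape[0] - size) // stride + 1
    ow = (image.shape[1] - size) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = reduce_fn(
                image[i * stride:i * stride + size, j * stride:j * stride + size])
    return out

image = np.array([[1., 2., 5., 6.],
                  [3., 4., 7., 8.],
                  [9., 8., 3., 2.],
                  [7., 6., 1., 0.]])
print(pool2d(image, mode="max"))   # [[4. 8.] [9. 3.]]
print(pool2d(image, mode="mean"))  # [[2.5 6.5] [7.5 1.5]]
```

Each output pixel is the maximum or average of the corresponding 2x2 sub-region of the input, exactly as described for the pooling operators above.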
(3) The neural network layer 330:
after processing by the convolutional layer/pooling layer 320, the convolutional neural network 300 is not yet sufficient to output the required output information, because, as described above, the convolutional layer/pooling layer 320 only extracts features and reduces the parameters brought by the input image. To generate the final output information (class information or other relevant information as needed), the convolutional neural network 300 needs to use the neural network layer 330 to generate one output, or a set of outputs whose number equals the number of required classes. Therefore, the neural network layer 330 may include a plurality of hidden layers (331, 332 to 33n shown in fig. 5) and an output layer 340, and the parameters included in the hidden layers may be obtained by pre-training on training data related to the specific task type.
After the hidden layers in the neural network layer 330, the last layer of the whole convolutional neural network 300 is the output layer 340. The output layer 340 has a loss function similar to the categorical cross entropy, which is specifically used for calculating the prediction error. Once the forward propagation of the whole convolutional neural network 300 (i.e., the propagation from 310 to 340 in fig. 5) is completed, back propagation (i.e., the propagation from 340 to 310 in fig. 5) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 300 and the error between the result output by the convolutional neural network 300 through the output layer and the ideal result.
In summary, the input layer 310 and the convolutional layer/pooling layer 320 are used to extract sample features to obtain feature vectors of samples, and the neural network layer 330 is used to classify input images according to the feature vectors extracted by the convolutional layer/pooling layer 320, so that the feature extraction model required by the present application can be simply understood as a convolutional neural network including only the convolutional layer/pooling layer 320 and not including the neural network layer 330. It should be understood that the above examples are illustrative only and are not to be construed as being particularly limiting.
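The idea summarized above, a feature extraction model that keeps only the convolutional/pooling stages and outputs a flattened feature vector, with no neural network layer 330, can be sketched minimally in NumPy. The kernels here are random and hypothetical, not the actual trained model of this application:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(image, kernel):
    """Valid convolution of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    out = np.empty((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(image, size=2):
    """Non-overlapping max pooling."""
    h, w = image.shape[0] // size, image.shape[1] // size
    return image[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

def extract_features(image, kernels):
    """Conv/pool stages only -- no classifier head -- flattened to one vector."""
    maps = [max_pool(np.maximum(conv2d(image, k), 0.0)) for k in kernels]  # ReLU + pool
    return np.concatenate([m.ravel() for m in maps])

kernels = [rng.standard_normal((3, 3)) for _ in range(4)]
image = rng.standard_normal((10, 10))
vec = extract_features(image, kernels)
print(vec.shape)  # conv: 8x8, pool: 4x4, 4 kernels -> (64,)
```

The output is the kind of per-sample feature vector that the later steps compare against a reference feature vector.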
Step S222: and determining reference characteristic information of multiple types of samples in the first sample set according to the characteristic information of each sample, wherein each type of sample comprises at least one sample with the same type.
For example, assuming that the number of samples in the first class of samples is n, and the feature information of each sample in this class is the feature vectors B1, B2, …, Bn, then the reference feature information of the class may be the average vector A of the n vectors, or may be the vector Bj (j ∈ {1, 2, …, n}) closest to the average vector A among the n vectors. Similarly, reference vectors of the other classes of samples can be obtained. When the reference feature information is expressed in vector form, it is also called the reference feature vector. For example, if the feature information of each sample is a 512-dimensional feature vector, the multi-dimensional feature vectors obtained in step S221 can be mapped to a 2D space and plotted in a planar rectangular coordinate system as coordinate points, and the reference feature information of each class of samples may then be as shown in fig. 6. It should be understood that fig. 6 is only used for illustration. The reference feature information of each class of samples may also be obtained by mapping the feature vectors of the class to the 2D space and then taking the feature vector corresponding to the point in the most densely distributed area as the reference feature information of the class; the method for determining the reference feature information is not limited in this application.
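The two options above for the reference feature information of one class, the average vector A or the member vector closest to A, can be sketched as follows; the feature vectors are hypothetical:

```python
import numpy as np

def reference_vector(feats, use_closest=False):
    """Reference feature of one class: the mean vector, or the member closest to it."""
    mean = feats.mean(axis=0)
    if not use_closest:
        return mean
    dists = np.linalg.norm(feats - mean, axis=1)
    return feats[np.argmin(dists)]

# Hypothetical class of n = 4 samples with 3-dimensional feature vectors.
feats = np.array([[1.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0],
                  [1.1, -0.1, 0.0],
                  [4.0, 4.0, 4.0]])   # an outlying, difficult sample
print(reference_vector(feats))                    # mean: [1.75 1.   1.  ]
print(reference_vector(feats, use_closest=True))  # member nearest the mean
```

Note that a single outlier shifts the mean noticeably, which is one reason the closest-member variant can be preferable as a reference.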
Step S223: and determining the difficulty weight corresponding to each sample based on the similarity between the characteristic information of each sample and the reference characteristic information of the corresponding category.
The greater the similarity between the feature information of a sample and the reference feature information of the corresponding class, the smaller the difficulty weight of the sample; that is, the similarity and the difficulty weight are inversely proportional. It can be understood that, in the case where the feature information is a feature vector, the difficulty weight of each sample can be determined according to the distance between the feature vector of the sample and the reference feature vector of the corresponding class: the greater the distance, the smaller the similarity between the feature vector of the sample and the reference feature vector of the corresponding class, and the greater the difficulty weight of the sample; that is, the distance and the difficulty weight are directly proportional.
For example, if the feature vectors obtained after each sample in the first class of samples is input into the feature extraction model are B1, B2, …, Bn, and the reference feature vector is the vector A, then the difficulty weight of feature vector B1 may be determined according to the distance between feature vector B1 and the reference feature vector A, the difficulty weight of feature vector B2 according to the distance between feature vector B2 and the reference feature vector A, …, and the difficulty weight of feature vector Bn according to the distance between feature vector Bn and the reference feature vector A. By analogy, the difficulty weight of each sample can be determined from the distance between the feature vector of each sample and the reference feature vector of the corresponding class.
In a specific implementation, the distance between the feature vector of a sample and the reference feature vector may be a Cosine Distance, a Euclidean Distance, a Manhattan Distance, a Chebyshev Distance, a Minkowski Distance, and the like, and the similarity between the feature information of a sample and the reference feature information may be a Cosine Similarity, an Adjusted Cosine Similarity, a Pearson Correlation Coefficient, a Jaccard Similarity Coefficient, and the like; the present application is not particularly limited.
For example, suppose a class of samples has a reference feature vector A and feature vectors Bi ∈ {B1, B2, …, Bn}. Then the formula for the distance Di (here the cosine distance) between the reference feature vector A and a feature vector Bi may be:

Di = 1 − (A · Bi) / (‖A‖ × ‖Bi‖), i = 1, 2, …, n    (4)

The distance Di between the feature vector Bi of each sample and the reference feature vector A can be determined based on equation (4). It should be understood that the above equation (4) is only for illustration and should not be construed as a specific limitation.
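A minimal sketch of the cosine distance in equation (4), assuming the common definition D = 1 − cosine similarity; the vectors are hypothetical examples:

```python
import numpy as np

def cosine_distance(a, b):
    """D = 1 - cos(a, b); a larger distance means the sample is farther
    from its class reference vector."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

A = np.array([1.0, 0.0])          # reference feature vector of the class
B = [np.array([1.0, 0.0]),        # identical direction -> distance 0
     np.array([0.0, 1.0]),        # orthogonal         -> distance 1
     np.array([-1.0, 0.0])]       # opposite           -> distance 2
print([cosine_distance(A, b) for b in B])  # [0.0, 1.0, 2.0]
```

The distance is 0 for an identical direction and grows as the sample's feature vector diverges from the reference, matching the inverse relationship between distance and similarity described above.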
Referring to the embodiment of fig. 5, the feature extraction model for extracting the sample features includes a plurality of weight matrices for extracting specific features; each weight matrix can extract specific colors, specific edge information, and so on. For simple samples, the weight matrices can extract the specific colors, specific edge information, and the like well, and the feature vectors extracted from different simple samples are very similar. For difficult samples, the weight matrices may not be able to extract the specific colors, specific edge information, and the like, so the feature vectors extracted from difficult samples are far from those extracted from simple samples. In this way, the difficulty degree of each sample can be well determined from the distance between the feature vector extracted from the sample and the reference feature vector: the greater the distance between the feature vector of the sample and the reference feature vector, the smaller the similarity between them, the more the sample tends to be a difficult sample, and the greater the difficulty weight; conversely, the smaller the distance, the greater the similarity, the more the sample tends to be a simple sample, and the smaller the difficulty weight. Therefore, the formula for the difficulty weight αi may be:
αi = T × Di, i = 1, 2, …, n    (5)
where T is a constant greater than 1. It should be understood that the above equation (5) is only used for illustration; the formula for the difficulty weight α may be any other formula in which the difficulty weight α is directly proportional to the distance D, and the present application is not particularly limited.
Similarly, if the difficulty weight α of a sample is determined according to the similarity S between the feature information of the sample and the reference feature information, the formula for the difficulty weight may be:
αi = T − Si, i = 1, 2, …, n    (6)
It should be understood that the above equation (6) is only used for illustration; the formula for the difficulty weight α may be any other formula in which the difficulty weight α is inversely proportional to the similarity S, and the present application is not particularly limited.
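Equations (5) and (6) can be sketched together as follows; the distances, similarities, and the value T = 2 are hypothetical:

```python
import numpy as np

def difficulty_weights_from_distance(distances, T=2.0):
    """Equation (5): alpha_i = T * D_i -- weight grows with distance."""
    return T * np.asarray(distances)

def difficulty_weights_from_similarity(similarities, T=2.0):
    """Equation (6): alpha_i = T - S_i -- weight shrinks as similarity grows."""
    return T - np.asarray(similarities)

D = [0.1, 0.5, 1.8]     # hypothetical distances to the reference vector
S = [0.9, 0.5, -0.8]    # corresponding cosine similarities (S = 1 - D here)
print(difficulty_weights_from_distance(D))    # [0.2 1.  3.6]
print(difficulty_weights_from_similarity(S))  # [1.1 1.5 2.8]
```

In both forms the most distant (least similar) sample receives the largest difficulty weight, as the proportionality relations above require.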
In one embodiment, in equations (5) and (6), the constant T in the difficulty weight αi may be a tunable constant. Specifically, in the initial stage of training the model to be trained, T may be a larger constant, so that the difficulty weights of the difficult samples are higher, the value of the loss function is larger, and the learning center of gravity of the model to be trained is biased toward the difficult samples. At the end of training the model to be trained, T may be made appropriately small, because the AI model tends to converge at this time and it may no longer be necessary to favor the more time-consuming difficult samples, thereby increasing the training speed.
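The tunable constant T described above can be implemented, for example, as a simple linear annealing schedule. The start and end values and the linear form are assumptions for illustration, not specified by this application:

```python
def t_schedule(epoch, total_epochs, t_start=3.0, t_end=1.0):
    """Linearly anneal the constant T from a large value early in training
    (emphasizing difficult samples) down toward a small value near convergence."""
    frac = epoch / max(total_epochs - 1, 1)
    return t_start + frac * (t_end - t_start)

print([t_schedule(e, 5) for e in range(5)])  # [3.0, 2.5, 2.0, 1.5, 1.0]
```

Other monotonically decreasing schedules (step decay, exponential decay) would serve the same purpose of shifting emphasis away from difficult samples as the model converges.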
Step S224: from the difficulty weight of each sample in the first sample set, a difficulty weight distribution of the samples in the first sample set is obtained.
It can be understood that the feature extraction model is used to extract the feature vector of each sample and the reference vector of each class of samples in the sample set, and the difficulty weight of each sample is then determined according to the similarity or distance between the feature vector of the sample and the reference vector of the corresponding class. The difficulty weight distribution of the first sample set is therefore obtained based on the features of the samples themselves, is independent of the structure of the model to be trained and the training method, can well reflect the difficulty degree of each sample, and achieves high labeling precision for difficult samples, thereby solving the problem that difficult samples are hard to label and thus bottleneck the training precision of the AI model.
In an embodiment, after the training apparatus obtains the difficulty weight distribution of the samples in the first sample set, the difficulty weight distribution of the first sample set may also be stored in the database 130. In this way, after the difficulty weight distributions of many sample sets have been stored in the database 130, if the training apparatus needs to train an AI model, then after determining the target difficulty weight distribution according to the task target of the AI model to be trained, it may directly obtain from the database 130 a sample set close to the target difficulty weight distribution to train the AI model to be trained. For example, suppose the database 130 stores 3 sample sets, namely sample sets X1, X2, and X3, together with the difficulty weight distribution Y1 of sample set X1 being 1:1, the difficulty weight distribution Y2 of sample set X2 being 1:2, and the difficulty weight distribution Y3 of sample set X3 being 1:5. The training apparatus 200 may obtain the target difficulty weight distribution Y0 corresponding to the task target of the model to be trained according to the task target and the correspondence described above, and then obtain from the database 130 the sample set whose difficulty weight distribution is closest to the target difficulty weight distribution Y0, namely the sample set X3. In this way, the training apparatus 200 may directly select a sample set that is the same as or similar to the target difficulty weight distribution as the second sample set, without performing step S230 to adjust the difficulty weight distribution, and train the AI model to be trained, thereby further improving the training speed of the AI model.
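The lookup described above, choosing from the database the sample set whose difficulty weight distribution is closest to the target distribution Y0, can be sketched as follows. Ratios are expressed as difficult/non-difficult numbers, and the target value of 1:4 is a hypothetical Y0, since the example does not give its exact value:

```python
def closest_sample_set(target_ratio, stored):
    """Pick the stored sample set whose hard:easy ratio is nearest the target.
    Ratios are expressed as hard/easy, e.g. 1:5 -> 0.2."""
    return min(stored, key=lambda name: abs(stored[name] - target_ratio))

# Ratios from the example: X1 = 1:1, X2 = 1:2, X3 = 1:5.
stored = {"X1": 1 / 1, "X2": 1 / 2, "X3": 1 / 5}
print(closest_sample_set(1 / 4, stored))  # X3 is nearest a hypothetical Y0 of 1:4
```

Any distance between distributions (here just the absolute difference of scalar ratios) could be substituted when the distributions have more structure.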
In summary, the present application provides a model training method, which may determine the difficult weight distribution of samples in a first sample set before training a model to be trained, then adjust the first sample set according to a task target of the model to be trained and the difficult weight distribution to obtain a second sample set, and finally train the model to be trained by using the second sample set. In this way, the training device 200 can select a proper number of difficult samples to train in combination with the complexity of the task target of the model to be trained and the difficult weight of each sample in the process of training the model to be trained, thereby solving the problem that the difficult samples are difficult to label and cause the bottleneck in the training precision of the AI model, and improving the training precision of the AI model.
The training method provided by the present application is illustrated below with reference to fig. 8. As shown in fig. 8, assuming that a task target of a current model to be trained is face recognition in an outdoor video surveillance scene, the task scene is a more complex task scene, a first sample set used for training the model to be trained includes two types of samples, the first type of sample is a face image of ID1 (for example, a face image of a person Ann at each angle) including samples X11 to X14, and the second type of sample is a face image of ID2 (for example, a face image of a person Lisa at each angle), including samples X21 to X24, where a total of 8 samples are included. In this application scenario, as shown in fig. 8, the training method provided by the present application includes the following steps:
step 1, inputting each sample in each type of samples of the first sample set into a feature extraction model, and obtaining a feature vector of each sample. The feature extraction model is used for extracting human face features. As shown in fig. 8, feature vectors a11 to a14 can be obtained by inputting samples X11 to X14 to the feature extraction model, and feature vectors a21 to a24 can be obtained by inputting samples X21 to X24 to the feature extraction model. Specifically, reference may be made to step S221 in the foregoing description, which is not described herein again.
And 2, determining the reference feature vector of each class of samples in the first sample set. The reference feature vector of each class of samples may be the average of the feature vectors of the class, or the feature vector closest to that average, or the feature vector corresponding to the point in the most densely distributed area after the feature vectors of the class are mapped to the 2D space. Fig. 8 illustrates the case of the feature vector closest to the average, such as the reference feature vector a14 and the reference feature vector a21 shown in fig. 8. Specifically, reference may be made to step S222 in the foregoing description, which is not repeated herein.
And 3, determining the distance between each feature vector and the reference feature vector of the corresponding class. As shown in fig. 8, the distance D11 between feature vectors a14 and a11, the distance D12 between feature vectors a14 and a12, and the distance D13 between feature vectors a14 and a13 may be calculated, with the distance between feature vector a14 and itself being 0; similarly, in the second class of samples, the distance D21 between feature vectors a21 and a22, the distance D22 between feature vectors a21 and a23, and the distance D23 between feature vectors a21 and a24 may be calculated, with the distance between feature vector a21 and itself being 0. The distance may be any one of the cosine distance, the Euclidean distance, the Manhattan distance, the Chebyshev distance, and the Minkowski distance described above, and the present application is not particularly limited. This step can refer to step S223 and its optional steps in the foregoing, which are not described herein again.
And 4, determining the difficulty weight α of each sample of the first sample set, and obtaining the difficulty weight distribution of the first sample set. The formula for the difficulty weight can be that of equation (5), i.e., α11 = T × D11, α12 = T × D12, and so on; the difficulty weight of each of the 8 samples may thus be obtained as shown in fig. 8, where the samples whose difficulty weight is greater than the first threshold h1 are indicated in dark color, i.e., the difficulty weights of sample X11 and sample X22 are above the threshold. This step can refer to step S224 and its optional steps in the foregoing, which are not described herein again.
And 5, determining the target difficulty weight of the model to be trained according to the task target of the model to be trained, and adjusting the difficulty weight distribution of the first sample set according to the target difficulty weight to obtain a second sample set. As shown in fig. 8, the difficulty weight distribution of the first sample set is difficult samples : non-difficult samples = 1:3, and it is assumed that the target difficulty weight distribution corresponding to the task target is difficult samples : non-difficult samples = 3:1. Since the first sample set has only two difficult samples, namely X11 and X22, the difficult samples need to be extended by a data augmentation method, so that the number ratio of the extended difficult samples (6) to the non-difficult samples (2) reaches 3:1, thereby obtaining a second sample set for training the model to be trained. This step can refer to step S230 and its optional steps in the foregoing, which are not described herein again.
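The rebalancing in step 5 can be sketched as follows, assuming (per the fig. 8 counts) that two non-difficult samples are kept and the difficult samples are replicated through a hypothetical augmentation transform until the 3:1 ratio is reached; the sample names and the augmentation are placeholders:

```python
import random

def rebalance(hard, easy, target_hard_per_easy=3, augment=None, seed=0):
    """Reach a hard:easy ratio by replicating (augmenting) hard samples.
    `augment` is a hypothetical transform (flip, crop, noise, ...) applied to copies."""
    rng = random.Random(seed)
    augment = augment or (lambda s: s + "_aug")  # placeholder augmentation
    hard = list(hard)
    need = target_hard_per_easy * len(easy)
    while len(hard) < need:
        hard.append(augment(rng.choice(hard)))
    return hard, list(easy)

# The fig. 8 example: 2 difficult samples, 2 non-difficult samples kept.
hard, easy = rebalance(["X11", "X22"], ["X12", "X21"])
print(len(hard), len(easy))  # 6 2 -> the 3:1 target ratio
```

In practice the augmentation would be real image transforms (rotation, cropping, color jitter) rather than string tagging, and easy samples could alternatively be subsampled to reach the same ratio.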
And 6, training the model to be trained by using the second sample set. The loss function of the model to be trained can be as shown in equation (3). The loss function increases the influence of the difficult samples with large difficulty weights during the training of the model to be trained, so that the model to be trained can focus on learning the features of the difficult samples and is more inclined to update its parameters using the difficult samples, thereby achieving the effect of strengthening training on the difficult samples. Moreover, the constant T in the difficulty weight can be set to a higher value in the initial training stage, so that the influence of the difficult samples on the training of the model to be trained is maximized, and T can then be set to a lower value in the final training stage, when the model to be trained tends to converge, to avoid spending more time on the difficult samples and thereby improve the training speed. This step can refer to step S240 and its optional steps in the foregoing, which are not described herein again.
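The effect of scaling the loss by the difficulty weight, as described above, can be sketched with a difficulty-weighted cross entropy. This is an illustrative stand-in, not the actual equation (3) of this application, and the probabilities and weights are hypothetical:

```python
import numpy as np

def weighted_cross_entropy(probs, labels, alphas):
    """Per-sample cross entropy scaled by the difficulty weight alpha_i:
    harder samples contribute more to the loss, so gradients favor them."""
    probs = np.asarray(probs)
    losses = -np.log(probs[np.arange(len(labels)), labels])
    return float(np.mean(np.asarray(alphas) * losses))

probs = np.array([[0.9, 0.1],    # easy sample, confidently correct
                  [0.6, 0.4]])   # harder sample, less confident
labels = [0, 0]
print(weighted_cross_entropy(probs, labels, alphas=[1.0, 1.0]))  # unweighted baseline
print(weighted_cross_entropy(probs, labels, alphas=[1.0, 3.0]))  # hard sample up-weighted
```

Up-weighting the harder sample raises the total loss it contributes, which in gradient-based training shifts the parameter updates toward fitting that sample, the direct proportionality between difficulty weight and loss value described in the text.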
In the above training method, the feature extraction model is used to extract the features of each sample in the first sample set; the reference feature vector of each class of samples is determined according to the feature vectors extracted from the samples of that class; the difficulty weight of each sample is determined according to the distance between the feature vector of the sample and the reference feature vector of its class; the difficulty weight distribution of the first sample set is adjusted according to the difficulty weights; and the model to be trained is trained using the second sample set obtained after the difficulty weight distribution is adjusted. In this way, the training device 200 can select a proper number of difficult samples for training in combination with the complexity of the task target of the model to be trained and the difficulty weight of each sample, which solves the problem that difficult samples are hard to label and thus bottleneck the training precision of the AI model, and improves the training precision of the AI model. Moreover, the weight parameter of the loss function of the model to be trained is adjusted according to the difficulty weight distribution of the second sample set, and the value of the loss function is directly proportional to the difficulty weight: the larger the difficulty weight of a sample, the larger the loss function value obtained by training the model to be trained with that sample, so that the model to be trained can concentrate more on learning the features of difficult samples, thereby strengthening the training on difficult samples and further improving the prediction precision of the AI model.
The method of the embodiments of the present application is described above in detail, and in order to better implement the above-mentioned aspects of the embodiments of the present application, the following also provides related apparatuses for implementing the above-mentioned aspects.
Fig. 9 is a hardware structure of a chip provided in an embodiment of the present application, where the chip includes a neural network processor 900. The chip may be provided in the training apparatus 200 described above to complete the training work of the training unit 240 and the feature extraction work of the extraction module 211. The algorithms for each layer in the convolutional neural network shown in fig. 5 can be implemented in the chip shown in fig. 9.
It should be noted that the Neural-Network Processing Unit (NPU) 900 may be mounted on a main CPU (Host CPU) 800 as a coprocessor; the main CPU 800 allocates tasks and, like a manager, is responsible for determining which data needs to be processed by an NPU core, issuing instructions to trigger the NPU 900 to process the data. The NPU 900 may also be integrated into a CPU, such as the Kirin 970, or may be provided as a separate chip. The core part of the NPU 900 is the arithmetic circuit 903, and the arithmetic circuit 903 is controlled by the controller 904 to extract matrix data from memory and perform multiplication operations, such as the convolution operation in the embodiment of fig. 5.
In some implementations, the arithmetic circuit 903 includes a plurality of processing units (PEs) therein. In some implementations, the operational circuit 903 is a two-dimensional systolic array. The arithmetic circuit 903 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 903 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 902 and buffers it in each PE of the arithmetic circuit. The arithmetic circuit takes the matrix A data from the input memory 901 and performs a matrix operation with matrix B, and the partial results or final results of the obtained matrix are stored in an Accumulator 908.
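The accumulate-partial-results behavior described above can be illustrated with NumPy: summing rank-1 partial products over the shared dimension reproduces the full matrix product. The shapes and values are hypothetical, and this is a functional sketch of accumulation, not the actual PE dataflow of the systolic array:

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)  # stands in for data from the input memory
B = np.ones((3, 4))               # stands in for data from the weight memory
acc = np.zeros((2, 4))            # stands in for the accumulator

# Each step contributes one partial product; accumulating all of them
# over the shared dimension k yields the complete matrix product A @ B.
for k in range(A.shape[1]):
    acc += np.outer(A[:, k], B[k, :])

print(np.allclose(acc, A @ B))  # True
```

The point is only that partial results summed in an accumulator converge to the final matrix result, which is what the Accumulator 908 stores.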
The unified memory 906 is used to store input data as well as output data. The weight data is directly transferred to the weight memory 902 through a Direct Memory Access Controller (DMAC) 905. The input data is also carried into the unified memory 906 by the DMAC.
A Bus Interface Unit (BIU) 910 is configured to interact with the memory Unit access controller 905 and the Instruction Fetch memory (IFB) 909 through an Advanced eXtensible Interface (AXI) Bus protocol.
The bus interface unit 910 is used for the instruction fetch memory 909 to fetch instructions from the external memory, and is also used for the storage unit access controller 905 to fetch the original data of the input matrix a or the weight matrix B from the external memory.
The storage unit access controller 905 is mainly used to transfer input data in the external memory to the unified memory 906 or to transfer weight data to the weight memory 902 or to transfer input data to the input memory 901.
The vector calculation unit 907 includes a plurality of operation processing units and, if necessary, further processes the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and magnitude comparison. It is mainly used for non-convolution/fully-connected (FC) layer computations in the neural network, such as Pooling, Batch Normalization, Local Response Normalization, and the like.
In some implementations, the vector calculation unit 907 can store the processed output vectors to the unified memory 906. For example, the vector calculation unit 907 may apply a non-linear function to the output of the arithmetic circuit 903, such as a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 907 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuit 903, e.g., for use in subsequent layers in the neural network.
An Instruction Fetch memory (Instruction Fetch Buffer)909 connected to the controller 904 and configured to store instructions used by the controller 904; the controller 904 is configured to call the instruction cached in the instruction fetch memory 909 to implement controlling the operation process of the operation accelerator.
Generally, the unified Memory 906, the input Memory 901, the weight Memory 902, and the instruction fetch Memory 909 are On-chip memories (On-chip memories). The external memory is private to the NPU hardware architecture. The external Memory may be a Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), a High Bandwidth Memory (HBM), or other readable and writable Memory.
Fig. 10 is a hardware structure diagram of a computing device provided in the present application. The computing device 1000 may be the training device 200 in the embodiments of fig. 2-10. As shown in fig. 10, the computing device 1000 includes: a processor 1010, a communication interface 1020, a memory 1030, and a neural network processor 1050. The processor 1010, the communication interface 1020, the memory 1030, and the neural network processor 1050 may be connected to each other via an internal bus 1040, or may communicate with each other via other means such as wireless transmission. In the embodiment of the present application, the bus 1040 may be, for example, a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The bus 1040 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 10, but this does not mean that there is only one bus or one type of bus.
The processor 1010 may be formed of at least one general-purpose processor, such as a Central Processing Unit (CPU), or a combination of a CPU and a hardware chip. The hardware chip may be an Application-Specific Integrated Circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a Field-Programmable Gate Array (FPGA), Generic Array Logic (GAL), or any combination thereof. The processor 1010 executes various types of digitally stored instructions, such as software or firmware programs stored in the memory 1030, which enable the computing device 1000 to provide a wide variety of services.
The memory 1030 is configured to store program code, and the processor 1010 controls its execution to perform the processing steps of the training apparatus 200 in any of the embodiments of fig. 2-8. The program code may include one or more software modules, such as the software modules provided in the embodiment shown in fig. 3: an obtaining unit, a determining unit, an adjusting unit, and a training unit. The obtaining unit may be configured to obtain a first sample set; the determining unit may be configured to determine a difficulty weight distribution of the first sample set; the adjusting unit may be configured to adjust the difficulty weight distribution of the first sample set according to the difficulty weight corresponding to each sample of the first sample set and a task target of a model to be trained, to obtain a second sample set; and the training unit may be configured to train the model to be trained using the second sample set. These modules may be specifically configured to perform steps S210-S230 and their optional steps, and steps 1-6 and their optional steps of the foregoing method, and may also be configured to perform other steps performed by the training apparatus described in the embodiments of fig. 2-8; details are not repeated here.
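The division into an obtaining unit, a determining unit, an adjusting unit, and a training unit can be sketched as plain functions. This is a minimal illustration of the data flow between the units, not the patented implementation; the `Sample` type, the threshold-based adjustment, and the placeholder `train` function are all assumptions for the example.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Sample:
    data: object
    label: str
    difficulty: float = 0.0  # filled in by the determining unit

def obtain(samples: List[Sample]) -> List[Sample]:
    """Obtaining unit: return the first sample set."""
    return list(samples)

def determine(samples: List[Sample], score: Callable[[Sample], float]) -> List[Sample]:
    """Determining unit: assign a difficulty weight to every sample."""
    for s in samples:
        s.difficulty = score(s)
    return samples

def adjust(samples: List[Sample], min_difficulty: float) -> List[Sample]:
    """Adjusting unit: shift the difficulty distribution toward the task
    target -- here simply by dropping samples easier than a threshold."""
    return [s for s in samples if s.difficulty >= min_difficulty]

def train(samples: List[Sample]) -> int:
    """Training unit placeholder: consume the second sample set."""
    return len(samples)  # stands in for a real fit() call

first_set = obtain([Sample("x0", "cat"), Sample("x1", "dog"), Sample("x2", "cat")])
second_set = adjust(determine(first_set, lambda s: 0.9 if s.label == "cat" else 0.1), 0.5)
n_trained = train(second_set)
```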
It should be noted that this embodiment may be implemented by a general-purpose physical server, for example, an ARM server or an x86 server, or by a virtual machine that is built on such a server using Network Functions Virtualization (NFV) technology. A virtual machine is a complete computer system that is emulated by software, has complete hardware system functionality, and runs in a fully isolated environment. This application is not specifically limited in this regard.
The neural network processor 1050 may be configured to obtain an inference model based on the training program and the sample data in the memory 1030, so as to perform at least a portion of the methods discussed herein. The hardware structure of the neural network processor 1050 may be as shown in fig. 9.
The Memory 1030 may include a Volatile Memory (Volatile Memory), such as a Random Access Memory (RAM); the Memory 1030 may also include a Non-Volatile Memory (Non-Volatile Memory), such as a Read-Only Memory (ROM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, HDD), or a Solid-State Drive (SSD); memory 1030 may also include combinations of the above. The memory 1030 may store the first sample set and/or the second sample set, and the memory 1030 may store program codes, and may specifically include program codes for performing other steps described in the embodiments of fig. 2 to fig. 8, which are not described herein again.
The communication interface 1020 may be an internal interface (for example, a Peripheral Component Interconnect Express (PCIe) bus interface), a wired interface (for example, an Ethernet interface), or a wireless interface (for example, a cellular network interface or a wireless LAN interface), and is used to communicate with other devices or modules.
It should be noted that fig. 10 is only one possible implementation manner of the embodiment of the present application, and in practical applications, the computing device may further include more or less components, which is not limited herein. For the content that is not shown or not described in the embodiment of the present application, reference may be made to the related explanation in the embodiment described in fig. 2 to fig. 8, and details are not described here.
It should be understood that the computing device shown in fig. 10 may also be a computer cluster of at least one server, and the application is not particularly limited.
Embodiments of the present application also provide a computer-readable storage medium, which stores instructions that, when executed on a processor, implement the method flows shown in fig. 2-8.
Embodiments of the present application also provide a computer program product, and when the computer program product runs on a processor, the method flows shown in fig. 2 to 8 are implemented.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes at least one computer instruction. When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device such as a server or data center integrating at least one usable medium. The usable medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a Digital Video Disc (DVD)), or a semiconductor medium.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions that can readily be conceived by a person skilled in the art within the technical scope of the invention shall fall within its protection scope. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (13)

1. A method of training, the method comprising:
obtaining a first set of samples, the first set of samples comprising a plurality of samples;
determining a difficulty weight distribution for samples in the first set of samples;
adjusting the first sample set according to a task target of a model to be trained and the difficulty weight distribution of the samples in the first sample set to obtain a second sample set;
and training the model to be trained by utilizing the second sample set.
2. The method according to claim 1, wherein the task goals of the model to be trained comprise one or more of an application scenario of the model to be trained after completion of training, an event type to be realized after completion of training of the model to be trained, and a training precision goal of the model to be trained.
3. The method according to claim 1 or 2, wherein the adjusting the first sample set according to the task goal of the model to be trained and the difficulty weight distribution of the samples in the first sample set to obtain a second sample set comprises:
determining target difficulty weight distribution which is required to be achieved by the sample set for training the model to be trained according to the task target of the model to be trained and the difficulty weight distribution of the samples in the first sample set;
and increasing or decreasing the number of samples in the first sample set, or changing the difficulty weight of a part of samples in the first sample set to obtain a second sample set, wherein the difficulty weight distribution of the samples in the second sample set is equal to or approximate to the target difficulty weight distribution.
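The adjustment described in claim 3, increasing or decreasing sample counts until the distribution approximates a target, can be sketched as a simple resampling routine. This is a minimal illustration under assumed simplifications (a single easy/hard split defined by a threshold, and a target expressed as the desired share of hard samples); the function name and parameters are illustrative, not from the patent.

```python
import random

def resample_to_target(samples, difficulty_weights, hard_ratio_target,
                       hard_threshold=0.5, seed=0):
    """Increase or decrease sample counts so that the share of 'hard'
    samples (weight >= hard_threshold) approximates the target ratio."""
    rng = random.Random(seed)
    hard = [s for s, w in zip(samples, difficulty_weights) if w >= hard_threshold]
    easy = [s for s, w in zip(samples, difficulty_weights) if w < hard_threshold]
    if not hard or not easy:
        return hard + easy  # nothing to rebalance
    # Easy-sample count needed so that hard / (hard + easy) ~= target.
    n_easy = max(0, round(len(hard) * (1.0 - hard_ratio_target) / hard_ratio_target))
    if n_easy <= len(easy):
        easy = easy[:n_easy]  # decrease the number of easy samples
    else:
        # increase by duplicating randomly chosen easy samples
        easy = easy + [rng.choice(easy) for _ in range(n_easy - len(easy))]
    return hard + easy
```

A fuller implementation could instead change the difficulty weight of part of the samples, as the claim also allows.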
4. The method of any of claims 1-3, wherein determining the difficulty weight distribution for the samples in the first set of samples comprises:
inputting each sample of the first sample set into a feature extraction model, and obtaining feature information of each sample, wherein each sample corresponds to a category;
determining reference characteristic information of multiple classes of samples in the first sample set according to the characteristic information of each sample, wherein each class of samples comprises at least one sample with the same class;
determining a difficulty weight corresponding to each sample based on the similarity between the characteristic information of each sample and the reference characteristic information of the corresponding category;
obtaining a difficulty weight distribution of the samples in the first sample set according to the difficulty weight of each sample in the first sample set.
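The difficulty weight of claim 4, derived from the similarity between a sample's feature information and its class's reference feature information, can be sketched as follows. This assumes the reference feature of a class is the mean of its members' feature vectors and uses cosine similarity; both are illustrative choices, not necessarily those of the patent.

```python
import numpy as np

def difficulty_weights(features, labels):
    """Difficulty weight per sample: 1 - cosine similarity between the
    sample's feature vector and the reference (mean) feature of its class."""
    feats = np.asarray(features, dtype=float)
    # Reference feature information per class: the mean feature vector.
    refs = {c: feats[[i for i, l in enumerate(labels) if l == c]].mean(axis=0)
            for c in set(labels)}
    weights = []
    for f, l in zip(feats, labels):
        ref = refs[l]
        cos = float(f @ ref) / (np.linalg.norm(f) * np.linalg.norm(ref) + 1e-12)
        weights.append(1.0 - cos)  # less similar to its class -> harder
    return weights
```

The resulting list of per-sample weights is the difficulty weight distribution of the sample set.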
5. The method according to any one of claims 1 to 4, wherein prior to training the model to be trained using the second set of samples, the method further comprises:
and adjusting the weight parameter of the loss function of the model to be trained according to the difficulty weight distribution of the samples in the second sample set.
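One simple way to realize the loss-weight adjustment of claim 5 is to scale each sample's loss term by its difficulty weight. The sketch below shows this for cross-entropy; the per-sample scaling scheme is an assumption for illustration, not the patent's specific parameterization.

```python
import math

def difficulty_weighted_loss(probs, targets, weights):
    """Cross-entropy where each sample's term is scaled by its difficulty
    weight, so harder samples contribute more to the training signal."""
    total = 0.0
    for p, t, w in zip(probs, targets, weights):
        total += -w * math.log(p[t])  # weighted negative log-likelihood
    return total / len(targets)
```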
6. A training apparatus, the apparatus comprising:
an obtaining unit configured to obtain a first sample set, where the first sample set includes a plurality of samples;
a determining unit for determining a difficulty weight distribution of samples in the first set of samples;
the adjusting unit is used for adjusting the first sample set according to a task target of a model to be trained and the difficulty weight distribution of the samples in the first sample set to obtain a second sample set;
and the training unit is used for training the model to be trained by utilizing the second sample set.
7. The apparatus according to claim 6, wherein the task goal of the model to be trained comprises one or more of an application scenario of the model to be trained after completion of training, an event type to be realized after completion of training of the model to be trained, and a training precision goal of the model to be trained.
8. The apparatus according to claim 6 or 7,
the adjusting unit is specifically configured to:
determining target difficulty weight distribution which is required to be achieved by the sample set for training the model to be trained according to the task target of the model to be trained and the difficulty weight distribution of the samples in the first sample set;
and increasing or decreasing the number of samples in the first sample set, or changing the difficulty weight of a part of samples in the first sample set to obtain a second sample set, wherein the difficulty weight distribution of the samples in the second sample set is equal to or approximate to the target difficulty weight distribution.
9. The apparatus according to any one of claims 6 to 8,
the determining unit is specifically configured to:
inputting each sample of the first sample set into a feature extraction model, and obtaining feature information of each sample, wherein each sample corresponds to a category;
determining reference characteristic information of multiple classes of samples in the first sample set according to the characteristic information of each sample, wherein each class of samples comprises at least one sample with the same class;
determining a difficulty weight corresponding to each sample based on the similarity between the characteristic information of each sample and the reference characteristic information of the corresponding category;
obtaining a difficulty weight distribution of the samples in the first sample set according to the difficulty weight of each sample in the first sample set.
10. The apparatus according to any of the claims 6 to 9, wherein, prior to training the model to be trained using the second set of samples, the training unit is further configured to: adjust the weight parameter of the loss function of the model to be trained according to the difficulty weight distribution of the samples in the second sample set.
11. A computer-readable storage medium comprising instructions that, when executed on a computing device, cause the computing device to perform the method of any of claims 1 to 5.
12. A computing device comprising a processor and a memory, the processor executing code in the memory to perform the method of any of claims 1 to 5.
13. A computer program product comprising a computer program that, when read and executed by a computing device, causes the computing device to perform the method of any of claims 1 to 5.
CN202010462418.XA 2020-05-27 2020-05-27 Training method, device, equipment and computer readable storage medium Pending CN113743426A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010462418.XA CN113743426A (en) 2020-05-27 2020-05-27 Training method, device, equipment and computer readable storage medium
PCT/CN2021/091597 WO2021238586A1 (en) 2020-05-27 2021-04-30 Training method and apparatus, device, and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010462418.XA CN113743426A (en) 2020-05-27 2020-05-27 Training method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN113743426A true CN113743426A (en) 2021-12-03

Family

ID=78723784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010462418.XA Pending CN113743426A (en) 2020-05-27 2020-05-27 Training method, device, equipment and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN113743426A (en)
WO (1) WO2021238586A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114666882B (en) * 2022-04-25 2024-01-02 浙江省通信产业服务有限公司 Power control method, device, base station and storage medium
CN116503923B (en) * 2023-02-16 2023-12-08 深圳市博安智控科技有限公司 Method and device for training face recognition model

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6127778B2 (en) * 2013-06-28 2017-05-17 富士通株式会社 Model learning method, model learning program, and model learning apparatus
CN108229555B (en) * 2017-12-29 2019-10-25 深圳云天励飞技术有限公司 Sample weights distribution method, model training method, electronic equipment and storage medium
CN109816092B (en) * 2018-12-13 2020-06-05 北京三快在线科技有限公司 Deep neural network training method and device, electronic equipment and storage medium
CN109840588B (en) * 2019-01-04 2023-09-08 平安科技(深圳)有限公司 Neural network model training method, device, computer equipment and storage medium
CN110516737B (en) * 2019-08-26 2023-05-26 南京人工智能高等研究院有限公司 Method and device for generating image recognition model
CN111582365B (en) * 2020-05-06 2022-07-22 吉林大学 Junk mail classification method based on sample difficulty

Also Published As

Publication number Publication date
WO2021238586A1 (en) 2021-12-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination