WO2021238586A1 - Training method, apparatus, device, and computer-readable storage medium - Google Patents


Info

Publication number
WO2021238586A1
WO2021238586A1 (PCT/CN2021/091597)
Authority
WO
WIPO (PCT)
Prior art keywords
sample
sample set
model
samples
trained
Application number
PCT/CN2021/091597
Other languages
English (en)
French (fr)
Inventor
张梦阳
王兵
周宇飞
郑宜海
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2021238586A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/21 — Design or setup of recognition systems or techniques; extraction of features in feature space; blind source separation
    • G06F 18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 — Classification techniques
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/045 — Combinations of networks
    • G06N 3/08 — Learning methods
    • G06N 3/084 — Backpropagation, e.g. using gradient descent

Definitions

  • This application relates to the field of Artificial Intelligence (AI), and in particular to a training method, apparatus, device, and computer-readable storage medium.
  • AI models have been widely used in video and image processing, speech recognition, natural language processing, and other related fields.
  • AI models usually need to be trained with a large number of samples.
  • Among training samples, hard-sample images are often more useful than simple-sample images. Hard samples are samples that are difficult for the model to distinguish; specifically, they can be blurry, overexposed, or otherwise unclear samples, or samples that closely resemble samples of other classes. Even a large number of simple samples can hardly improve the prediction accuracy of an AI model significantly, whereas hard samples often improve it greatly.
  • This application provides a training method, device, and computer-readable storage medium to address the problem that hard samples are difficult to label, which makes the training accuracy of AI models a bottleneck.
  • In a first aspect, a training method is provided, comprising the following steps: obtain a first sample set; determine the difficulty weight distribution of the samples in the first sample set; adjust the first sample set according to the task objective of the model to be trained and the difficulty weight distribution, obtaining a second sample set; and finally train the model to be trained using the second sample set.
  • In this way, the training device 200 can select an appropriate number of hard samples for training, based on the complexity of the task objective of the model to be trained and the difficulty weight of each sample. This resolves the bottleneck in training accuracy caused by the difficulty of labeling hard samples, and thereby improves the training accuracy of the model to be trained.
  • The task objective of the model to be trained includes one or more of: the application scenario of the model after training, the type of event the model needs to handle after training, and the training accuracy target of the model.
  • the model to be trained is an AI model, such as a neural network model.
  • Different models to be trained have task objectives of different difficulty.
  • For a model with a simple task objective, the second sample set used in training can contain more low-weight samples: a large number of simple samples are used for training, supplemented by a smaller number of hard samples, which increases training speed while still achieving the task objective. Conversely, for a model with a complex task objective, such as face recognition in an outdoor video-surveillance scene, the second sample set used in training can contain more high-weight hard samples.
  • The training device used to train the model can maintain a correspondence database that stores correspondences between multiple task objectives and multiple target difficulty weight distributions. After determining the difficulty weight distribution of the first sample set, the training device can look up the target difficulty weight distribution corresponding to the task objective of the model in the correspondence database, and then adjust the difficulty weight distribution of the first sample set according to both distributions to obtain the second sample set used for training. The difficulty weight distribution of the samples in the adjusted second sample set may be equal to, or approximately equal to, the target difficulty weight distribution.
  • With this implementation, when adjusting the difficulty weight distribution of the first sample set, the target difficulty weight distribution is first determined from the task objective of the model to be trained, and the first sample set is then adjusted toward it. The resulting second sample set is therefore better suited to training the model, and the training accuracy of the model can be improved in a targeted way.
  • In a possible implementation, each sample in the first sample set is input into a feature extraction model to obtain the feature information of each sample; the reference feature information of each class of samples in the first sample set is then determined from the feature information of the individual samples; finally, the difficulty weight of each sample is determined from the similarity between its feature information and the reference feature information of its class, yielding the difficulty weight distribution of the samples in the first sample set.
  • The feature extraction model extracts the feature information of a sample and can be an AI model trained before the first sample set is obtained. It can be any model available in the industry for extracting sample features, for example the Histogram of Oriented Gradients (HOG) descriptor used for object detection, Local Binary Patterns (LBP), or the convolutional layers of a convolutional neural network; this application does not specifically limit the choice.
  • The samples may come from mobile phones or surveillance cameras, local offline data, public Internet data, etc.; this application does not specifically limit the source.
  • The feature information extracted by the feature extraction model may be a feature vector or a feature matrix. Suppose a class contains n samples, and feeding each of them into the feature extraction model yields the feature vectors B1, B2, ..., Bn. The reference feature information of the class can then be the average vector A of these n feature vectors; or the feature vector Bj (j ≤ n) that is closest to A; or, after the feature vectors of the class are mapped into a 2D space, the feature vector corresponding to a point in the most densely populated region. This application does not limit the method for determining the reference feature information.
  • The difficulty weight of each sample can be determined from the distance between its feature vector and the reference feature vector of its class: the greater the distance, the lower the similarity between the two, and the greater the difficulty weight of the sample. In other words, the difficulty weight is directly proportional to the distance and inversely related to the similarity.
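The patent leaves the exact metric open; as a minimal sketch (assuming the mean-vector reference feature and Euclidean distance, with all function names hypothetical), the difficulty weights for one class of samples might be computed like this:

```python
import math

def mean_vector(vectors):
    """Reference feature of a class: the average of its feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def difficulty_weights(class_vectors):
    """Distance to the class's reference feature serves as the difficulty
    weight: the farther a sample lies from its class centroid, the harder
    it is assumed to be."""
    ref = mean_vector(class_vectors)
    return [euclidean(v, ref) for v in class_vectors]

# Three samples of one class; the last lies far from the other two.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights = difficulty_weights(feats)
assert max(weights) == weights[2]  # the outlier gets the largest weight
```

A similarity measure such as cosine similarity could be used instead, with the weight decreasing as similarity increases, matching the inverse relationship described above.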
  • With this implementation, the feature extraction model extracts the feature information of each sample and the reference feature information of each class, and the difficulty weight of each sample is obtained from the similarity or distance between the two. Because the difficulty weight is derived from the characteristics of the sample itself, independent of the structure of the model being trained and the training method used, it reflects the difficulty of the sample well. The labeling accuracy for hard samples is therefore high, resolving the bottleneck in AI model training accuracy caused by the difficulty of labeling hard samples.
  • The above method may further include the following step: adjusting the weight parameters of the loss function of the model to be trained according to the difficulty weight distribution of the samples in the second sample set.
  • The loss function Loss1 of the model to be trained can weight each sample's loss by the sample's difficulty weight: the greater the difficulty weight of a sample, the greater the value of the loss function obtained. Using this loss function for back-propagation supervised training makes the model more inclined to use hard samples when updating its parameters, so it focuses more on learning the characteristics of hard samples. This achieves targeted, intensive training on hard samples and improves the model's ability to express their features.
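The specific formula of Loss1 is not limited by the application; one plausible instantiation (a hypothetical sketch, not the patent's formula) scales a per-sample cross-entropy by the sample's difficulty weight:

```python
import math

def weighted_cross_entropy(pred, label, difficulty_weight):
    """Per-sample cross-entropy scaled by the sample's difficulty weight,
    so harder samples contribute more to the total loss and hence to the
    gradients used in back-propagation."""
    ce = -sum(y * math.log(p) for y, p in zip(label, pred) if y > 0)
    return difficulty_weight * ce

# An easy sample (confident, weight 1) vs. a hard sample (borderline, weight 3).
easy = weighted_cross_entropy([0.1, 0.9], [0, 1], difficulty_weight=1.0)
hard = weighted_cross_entropy([0.4, 0.6], [0, 1], difficulty_weight=3.0)
assert hard > easy  # the hard, heavily weighted sample dominates the loss
```

During training, gradients of this loss are larger for high-weight samples, which is exactly the "more inclined to update parameters from hard samples" behavior described above.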
  • In a second aspect, a training device is provided, comprising: an acquisition unit for acquiring a first sample set that includes a plurality of samples; a determining unit for determining the difficulty weight distribution of the samples in the first sample set; an adjustment unit for adjusting the first sample set according to the task objective of the model to be trained and the difficulty weight distribution, to obtain a second sample set; and a training unit for training the model to be trained using the second sample set.
  • The task objective of the model to be trained includes one or more of: the application scenario of the model after training, the type of event the model needs to handle after training, and the training accuracy target of the model.
  • The adjustment unit is specifically configured to determine, according to the task objective of the model to be trained and the difficulty weight distribution of the samples in the first sample set, the target difficulty weight distribution that the training sample set should reach, and then to increase or decrease the number of samples in the first sample set, or change the difficulty weights of some samples, to obtain a second sample set whose difficulty weight distribution is equal or approximately equal to the target difficulty weight distribution.
  • The determining unit is specifically configured to: input each sample of the first sample set into the feature extraction model to obtain its feature information, where each sample belongs to a category; determine the reference feature information of each class of samples (each class containing at least one sample of the same category) from the feature information of the individual samples; determine the difficulty weight of each sample from the similarity between its feature information and the reference feature information of its class; and obtain the difficulty weight distribution of the first sample set from the difficulty weights of its samples.
  • Before the second sample set is used to train the model, the adjustment unit is further configured to adjust the weight parameters of the loss function of the model according to the difficulty weight distribution of the samples in the second sample set.
  • In a further aspect, a computer program product is provided, including a computer program; when the computer program is read and executed by a computing device, the method described in the first aspect is implemented.
  • In a further aspect, a computer-readable storage medium is provided, including instructions that, when run on a computing device, cause the computing device to implement the method described in the first aspect.
  • In a further aspect, a computing device is provided, including a processor and a memory; the processor executes code in the memory to implement the method described in the first aspect.
  • In a further aspect, a chip is provided, including a memory and a processor, the memory being coupled to the processor and the processor including a modem processor. The memory stores computer program code including computer instructions; the processor reads the computer instructions from the memory so that the chip executes the method described in the first aspect.
  • Figure 1 is a schematic structural diagram of a training and prediction system;
  • Figure 2 is an example diagram of hard samples in an application scenario;
  • Figure 3 is a schematic structural diagram of a training device provided by the present application;
  • Figure 4 is a schematic flowchart of a training method provided by the present application;
  • Figure 5 is a schematic structural diagram of a convolutional neural network;
  • Figure 6 is an example diagram of the reference feature information of each class of samples in an application scenario;
  • Figure 7 is an example diagram of the data distribution of the first sample set and the second sample set in an application scenario;
  • Figure 8 is a schematic flowchart of the training method provided by this application in an application scenario;
  • Figure 9 is a schematic structural diagram of a chip provided by the present application;
  • Figure 10 is a schematic structural diagram of a computing device provided by the present application.
  • A loss function estimates the degree of inconsistency between the model's predicted value f(x) and the true value y; it is usually a non-negative real-valued function, and the smaller its value, the better the robustness of the model. While the loss remains large the model needs further training, and the learning direction of the network can be adjusted through the loss function to obtain a model with good final performance.
  • the above-mentioned loss function formula is only used for illustration, and this application does not limit the specific formula of the loss function.
  • Feature extraction: a method of transforming a measurement value to highlight its representative characteristics.
  • During training, a neural network can use the back-propagation (BP) algorithm to modify the parameters of the initial neural network model so that its reconstruction error loss becomes smaller and smaller. Specifically, forward propagation of the input signal to the output produces an error loss, and the parameters of the initial model are updated by propagating the error-loss information (such as the value of the loss function) backward, so that the error loss converges. The back-propagation algorithm is thus an error-loss-driven backward pass whose aim is to obtain optimal neural network parameters, such as the weight matrices.
  • An AI model is a collection of mathematical methods for achieving Artificial Intelligence (AI).
  • a large number of samples can be used to train the AI model to enable the trained AI model to obtain predictive capabilities.
  • For example, to train a spam classifier, a sample set of emails marked with multiple "spam" and "not spam" labels is first prepared in the training phase, and the neural network is trained on it. The network continuously captures the relationships between these emails and their labels and adjusts its structural parameters accordingly. After training, the network can determine whether new, unlabeled emails are spam. It should be understood that this example is illustrative and does not constitute a specific limitation.
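The spam example above can be sketched in miniature. This is a toy word-count classifier (all names and the training data are invented for illustration; a real system would use a neural network as the text describes):

```python
from collections import Counter

def train(samples):
    """samples: list of (text, label). Count word frequencies per label,
    the 'connections between emails and labels' the network would learn."""
    counts = {}
    for text, label in samples:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts

def predict(counts, text):
    """Score each label by how often it has seen the email's words."""
    words = text.lower().split()
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    return max(scores, key=scores.get)

model = train([
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("please review the project report", "not spam"),
])
assert predict(model, "claim your free prize") == "spam"
assert predict(model, "monday project meeting") == "not spam"
```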
  • Figure 1 is an architecture diagram of an AI model training and prediction system. The system 100 is a commonly used system architecture in the AI field and includes: a training device 200, an execution device 100, a database 130, a client device 140, and a data collection device 150. The components of the system 100 can be connected to each other through a network, which can be wired, wireless, or a mixture of the two.
  • The training device 200 can be a physical server, such as an x86 server or an ARM server, or a virtual machine (VM) built on a general-purpose physical server using Network Functions Virtualization (NFV) technology. A virtual machine is a complete software-simulated computer system with full hardware functionality, running in a fully isolated environment, such as a virtual machine in a cloud data center; this application does not specifically limit this.
  • The training device 200 is configured to train the model to be trained using the sample set in the database 130, obtain the target model, and send it to the execution device 100. Specifically, during training the training device 200 can compare the output of the model with the labels of the sample data and continuously adjust the model's structural parameters according to the comparison, until the difference between the output and the labels is less than a certain threshold, thereby completing the training and obtaining the target model.
  • The model to be trained and the target model here can be any AI model, such as the neural network used to classify spam in the example above, an image classification model, or a semantic recognition model.
  • The sample sets maintained in the database 130 need not all come from the data collection device 150; they may also be received from other devices. The database 130 may be a local database, or a database on the cloud or another third-party platform; this application does not specifically limit this.
  • The execution device 100 may be a terminal, such as a mobile phone, tablet computer, notebook computer, augmented/virtual reality device, or vehicle-mounted terminal, or a server or cloud device; this application does not specifically limit this.
  • The execution device 100 implements various functions using the target model trained by the training device 200. Specifically, in Figure 1, a user can input data to the execution device 100 through the client device 140; the execution device 100 uses the target model to make predictions on the input data and returns the output to the client device 140 for the user to view. The output can be presented as a display, sound, action, or other concrete form. The execution device 100 can also store the output as new samples in the database 130, so that the training device 200 can use the new samples to re-adjust the structural parameters of the target model and improve its performance.
  • For example, if the client device 140 is a mobile phone, the execution device 100 is a cloud device, and the trained target model is a semantic recognition model, the user can input text data to be recognized through the client device 140; the execution device 100 performs semantic recognition on the text using the target model and returns the result to the client device 140, where the user can view it.
  • Figure 1 is only a schematic system architecture provided by an embodiment of this application; the positional relationships between the devices and modules shown do not constitute any limitation. For example, in Figure 1 the database 130 is external memory relative to the training device 200, but in other cases the database 130 can also be placed inside the training device 200; this application does not specifically limit this.
  • The realization of various applications in the AI field depends on AI models: different functions, such as classification, recognition, and detection, are implemented through them. An AI model must be pre-trained with a sample set before it can be used on the execution device.
  • During training, hard samples are often more effective than simple samples. Hard samples are samples that the AI model finds difficult to distinguish. They fall into two categories: samples that are blurry, overexposed, or have unclear outlines, which cause wrong predictions regardless of the model's algorithm or initialization parameters; and samples so similar to samples of other classes that the model has difficulty telling them apart.
  • Hard samples can be labeled manually or by machine. Manual labeling not only wastes manpower and time; its accuracy also cannot be guaranteed, because of personal cognitive bias, work fatigue, and similar factors. Moreover, computing devices derive sample features by examining individual pixels, so samples that do not look similar to the human eye may still be hard samples for an AI model, making manual labeling of hard samples inaccurate.
  • Machine labeling commonly treats misclassified samples, or samples with a large loss value, as hard samples, but both heuristics are unreliable. For example, suppose the label of sample A is (0, 1) and the classification model M1 outputs the prediction vector (0.4, 0.6). The classification result says sample A belongs to the second class, which is correct, yet the difference between the prediction vector (0.4, 0.6) and the label (0, 1) is large, and the value of the loss function is also large. Sample A is thus a correctly classified sample that is nevertheless difficult for M1, i.e., a hard sample; treating only incorrectly classified samples as hard samples therefore gives poor labeling accuracy. Conversely, if samples with larger loss values are taken as hard samples, some simple samples may be incorrectly labeled as hard.
  • The loss function estimates the degree of inconsistency between the model's predicted value and the true value, but such inconsistency can have many causes: the sample may indeed be a hard sample, or the chosen model structure or training method may be defective while the sample itself is not hard at all. Treating samples with larger loss values as hard samples therefore also yields poor labeling accuracy.
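The sample A scenario can be checked numerically. Assuming a standard cross-entropy loss (the patent does not fix the loss formula), a correctly classified sample can still carry a much larger loss than a confidently classified one:

```python
import math

def cross_entropy(pred, label):
    return -sum(y * math.log(p) for y, p in zip(label, pred) if y > 0)

def predicted_class(pred):
    return max(range(len(pred)), key=pred.__getitem__)

label = (0, 1)            # sample A belongs to the second class
confident = (0.01, 0.99)  # an easy, confidently classified sample
borderline = (0.4, 0.6)   # sample A's prediction from model M1

# Both predictions pick the correct class ...
assert predicted_class(confident) == predicted_class(borderline) == 1
# ... yet the borderline sample's loss is far larger.
assert cross_entropy(borderline, label) > 10 * cross_entropy(confident, label)
```

This illustrates why "misclassified" misses hard samples like A, while "large loss" cannot tell whether the sample or the model is at fault.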
  • To address this, the present application provides a training device 200, which can be applied to the AI model training and prediction system shown in Figure 1; its structure is shown in Figure 3.
  • The training device 200 may include an acquisition unit 210, a determining unit 220, an adjustment unit 230, a training unit 240, a database 140, and a database 150.
  • the obtaining unit 210 is configured to obtain a first sample set, where the first sample set includes a plurality of samples.
  • the determining unit 220 is used to determine the difficulty weight distribution of the samples in the first sample set.
  • The difficulty weight distribution of a sample set is the ratio of the numbers of samples having each difficulty weight. For example, if sample set A contains 1000 samples with difficulty weight 1, 2000 samples with difficulty weight 2, and 3000 samples with difficulty weight 3, the difficulty weight distribution of the samples in sample set A is 1:2:3. It should be understood that this example is only illustrative and does not constitute a specific limitation.
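Computing such a ratio is a simple counting exercise; a small sketch (function name invented for illustration) reproduces the 1:2:3 example:

```python
from collections import Counter
from functools import reduce
from math import gcd

def difficulty_weight_distribution(weights):
    """Ratio of sample counts per difficulty weight, reduced to lowest terms."""
    counts = Counter(weights)
    ordered = [counts[w] for w in sorted(counts)]
    g = reduce(gcd, ordered)
    return ":".join(str(c // g) for c in ordered)

# 1000 samples of weight 1, 2000 of weight 2, 3000 of weight 3
sample_weights = [1] * 1000 + [2] * 2000 + [3] * 3000
assert difficulty_weight_distribution(sample_weights) == "1:2:3"
```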
  • The determining unit 220 can use the feature extraction model in the database 150 to determine the difficulty weight distribution of the samples in the first sample set. Specifically, it performs feature extraction on each sample in the first sample set to obtain its feature information, determines the reference feature information of each class from the feature information of the samples in that class, and finally determines the difficulty weight of each sample from the similarity between its feature information and the reference feature information of its class. For example, the determining unit 220 can input the first sample set into the feature extraction model to obtain the feature vector of each sample, take the average vector of each class's feature vectors as that class's reference feature information, and determine each sample's difficulty weight from the similarity or distance between its feature vector and the average vector of its class.
  • The adjustment unit 230 is configured to adjust the difficulty weight distribution of the first sample set according to the difficulty weight of each sample and the task objective of the model to be trained, obtaining the second sample set. Specifically, the adjustment unit 230 can first determine, from the task objective, the target difficulty weight distribution that the training sample set should reach, and then increase or decrease the number of samples in the first sample set, or change some of its samples, so that the difficulty weight distribution of the second sample set is equal or approximately equal to the target distribution. For a simple task objective, the second sample set can contain more low-weight samples, training mainly on a large number of simple samples with a smaller number of hard samples as support, which improves training speed while achieving the task objective; conversely, for a complex task objective, such as face recognition in an outdoor video-surveillance scene, the second sample set can contain more high-weight hard samples.
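The adjustment step can be sketched as resampling toward a target ratio. This is a rough illustration only (the function name, the dict-based target format, and the random-duplication strategy are all assumptions, not the patent's method):

```python
import random
from collections import defaultdict

def adjust_to_target(samples, target_ratio, seed=0):
    """Resample per difficulty weight so the counts approximate the target
    ratio, e.g. {1: 3, 2: 1} for a simple task that mostly needs easy
    samples. samples: list of (sample, difficulty_weight)."""
    rng = random.Random(seed)
    by_weight = defaultdict(list)
    for sample, weight in samples:
        by_weight[weight].append(sample)
    # Largest unit count that every weight's pool can supply without reuse.
    unit = min(len(by_weight[w]) // r for w, r in target_ratio.items())
    adjusted = []
    for w, r in target_ratio.items():
        pool = by_weight[w]
        adjusted += [(rng.choice(pool), w) for _ in range(unit * r)]
    return adjusted

first_set = ([(f"s{i}", 1) for i in range(100)] +
             [(f"h{i}", 2) for i in range(100)])
second_set = adjust_to_target(first_set, {1: 3, 2: 1})
n_easy = sum(1 for _, w in second_set if w == 1)
n_hard = sum(1 for _, w in second_set if w == 2)
assert n_easy == 3 * n_hard  # distribution now matches the 3:1 target
```

In practice the adjustment unit could equally drop surplus samples, duplicate scarce hard samples, or re-weight samples in place, as the surrounding text describes.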
  • the training unit 240 is configured to use the second sample set to train the model to be trained to obtain a trained target model.
  • Before the training unit 240 uses the second sample set to train the model, it can first adjust the weight parameters of the model's loss function according to the difficulty weight of each sample in the second sample set, and then perform back-propagation supervised training according to that loss function to obtain the target model. The difficulty weight of each sample in the second sample set is directly proportional to the value of the loss function, so hard samples with large difficulty weights have a greater impact on the loss; the AI model therefore focuses more on learning the characteristics of hard samples and is more inclined to use them for parameter updates, achieving targeted, intensive training on hard samples and improving the performance of the model to be trained.
  • the positional relationship between the device and the unit shown in FIG. 3 does not constitute any limitation.
  • the database 130 is an external memory relative to the training device 200.
• the database 130 may also be placed inside the training device 200;
  • the database 140 and the database 150 are internal memory relative to the training device 200.
  • the database 140 and/or the database 150 can also be placed in an external memory.
• the training device 200 may first determine the difficulty weight distribution of the samples in the first sample set before training the model to be trained, then adjust the first sample set according to the task target of the model to be trained and the above difficulty weight distribution to obtain the second sample set, and finally use the second sample set to train the model to be trained.
• in this way, the training device 200 can combine the complexity of the task target of the model to be trained with the difficulty weight of each sample, and select an appropriate number of difficult samples for training, which solves the problem that difficult samples are hard to label and therefore cause a bottleneck in the training accuracy of the AI model, thereby improving the training accuracy of the AI model.
  • the method may include the following steps:
  • the training device 200 obtains a first sample set, where the first sample set includes multiple samples.
  • the sample can be any form of sample, such as an image sample, a text sample, a voice sample, a biological data (for example, fingerprint, iris) sample, and so on.
• the first sample set can include samples of multiple categories. For example, one category may consist entirely of "cookie" images, another category may consist of images of the same face from various angles, and another category may consist of images of the same vehicle model from different angles and in different scenes; the categories can be determined specifically according to the task target of the model to be trained.
  • the face images of the same person can be divided into one category, for example, category 1 is Xiao Ming's face photos, and category 2 is Xiao Gang's face photos . It should be understood that the above examples are only for illustration and cannot constitute a specific limitation.
  • the training device 200 determines the difficulty weight distribution of the samples in the first sample set.
• in a specific implementation, each sample of the first sample set can be input into the feature extraction model to obtain the feature information of each sample, where each sample corresponds to a category; then, based on the similarity between the feature information of each sample and the reference feature information of the corresponding category, the difficulty weight corresponding to each sample is determined; finally, according to the difficulty weight of each sample in the first sample set, the difficulty weight distribution of the samples in the first sample set is obtained. Step S220 will be described in detail in steps S221 to S224 below.
  • the training device 200 adjusts the first sample set to obtain the second sample set according to the task target of the model to be trained and the difficulty weight distribution of the samples in the first sample set.
• the task objectives of the model to be trained include one or more of: the application scenario of the model after training is completed, the type of event the model needs to handle after training is completed, and the training accuracy target of the model. For example, a face recognition model for the video surveillance scenario and a face recognition model for the mobile phone unlocking scenario require different target difficulty weight distributions of samples during training; identity recognition and clothing recognition are different event types, and the models trained for them require different target difficulty weight distributions; models trained toward a low training accuracy target and toward a high training accuracy target likewise require different target difficulty weight distributions.
  • the above examples are only for illustration and cannot constitute a specific limitation.
• if the model is trained toward a simple task goal, such as face recognition for mobile phone unlocking, the second sample set used in training can contain more samples with small difficulty weights; training mainly with a large number of simple samples, supplemented by a smaller number of difficult samples, can improve the training speed while achieving the task goal. Conversely, if the model is trained toward a complex task goal, such as face recognition in outdoor video surveillance scenarios, the second sample set used in training can contain more samples with large difficulty weights; training mainly with a large number of difficult samples, supplemented by a smaller number of simple samples, enables the model to focus on learning the difficult samples and improves the training accuracy of the model to be trained, so as to achieve the purpose of intensive training.
• the training device 200 may maintain a correspondence library, which stores the correspondences between multiple task targets and multiple target difficulty weight distributions. In this way, after determining the difficulty weight distribution of the first sample set in the database 130, the training device 200 can determine the target difficulty weight distribution corresponding to the task target of the model to be trained according to the above correspondence library, and then, according to the difference between the difficulty weight distribution of the first sample set and the target difficulty weight distribution, adjust the difficulty weight distribution of the first sample set to obtain the second sample set for training the model to be trained.
  • the foregoing correspondence library may be stored in the internal memory of the training device 200 or in the external memory of the training device 200, which may be specifically determined by the processing and storage capabilities of the training device, and is not specifically limited in this application.
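A minimal sketch of such a correspondence library, assuming it is a simple mapping from task target to a target difficulty-weight distribution (the task names and proportions below are invented for illustration; the patent does not specify the library's data structure):

```python
# Hypothetical correspondence library: task target -> target difficulty weight
# distribution, expressed here as the fraction of samples per difficulty bucket.
# All keys and numbers are illustrative, not taken from the patent.
CORRESPONDENCE_LIBRARY = {
    # simple task goal: mostly samples with small difficulty weights
    "phone_unlock_face_recognition": {1: 0.6, 2: 0.2, 3: 0.1, 4: 0.05, 5: 0.05},
    # complex task goal: mostly samples with large difficulty weights
    "outdoor_surveillance_face_recognition": {1: 0.1, 2: 0.1, 3: 0.2, 4: 0.3, 5: 0.3},
}

def target_distribution(task_target):
    """Look up the target difficulty weight distribution for a task target."""
    return CORRESPONDENCE_LIBRARY[task_target]
```

The lookup result would then be compared with the measured distribution of the first sample set to decide which buckets need more or fewer samples.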
• in a specific implementation, the number of samples in the first sample set may be increased or decreased, or the difficulty weights of some samples in the first sample set may be changed, to obtain the second sample set.
  • the first sample set P1 contains 10,000 samples, and the number of difficult samples is 3000.
  • the difficulty weight distribution of the samples in the second sample set obtained after adjustment may be equal to the target difficulty weight distribution, or it may be similar to the target difficulty weight distribution .
• for example, if the required number of samples with difficulty weight λ of 5 is 1600, the numbers of samples with difficulty weight λ of 4 and difficulty weight λ of 5 in the first sample set may at this time be insufficient.
• Data enhancement can mean randomly perturbing some difficult samples or simple samples to obtain more difficult samples or simple samples, where the random perturbation includes adding noise points, changing lighting information, changing environmental information (such as weather, background, or time), and so on.
  • Data enhancement can also mean that after some difficult samples or simple samples are input into Generative Adversarial Networks (GAN), more difficult samples or simple samples are obtained.
• a GAN can include a discriminative network and a generative network. The generative network is used to generate pictures based on input data, and the discriminative network is used to judge whether an input picture is a real picture. The goal of the generative network is to generate pictures that are as realistic as possible, so that the discriminative network judges its output to be real; the goal of the discriminative network is to judge as accurately as possible, that is, to judge the pictures generated by the generative network as fake. The two networks form a dynamic "game" process, and the trained GAN can finally generate pictures that look genuine, so as to obtain more difficult samples or simple samples.
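The adjustment step can be sketched as follows, assuming samples are lists of pixel values and using small random noise as a stand-in for the lighting/weather/background perturbations described above (the sample representation, noise model, and function names are all illustrative assumptions, not the patent's implementation):

```python
import random

def bucket_shortfall(current_counts, target_counts):
    """Per-difficulty-bucket number of samples still needed (negative = surplus)."""
    return {w: target_counts[w] - current_counts.get(w, 0) for w in target_counts}

def perturb(sample, noise=0.05, rng=random):
    """Random disturbance: add small noise to each pixel value, an assumed
    stand-in for the lighting/weather/background changes described above."""
    return [p + rng.uniform(-noise, noise) for p in sample]

def augment(samples, needed, rng=random):
    """Create `needed` extra samples by randomly perturbing existing ones."""
    return [perturb(rng.choice(samples), rng=rng) for _ in range(needed)]
```

For instance, if a bucket needs 700 more difficult samples, `augment(difficult_samples, 700)` would fill the gap with perturbed copies.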
  • the training device 200 uses the second sample set to train the model to be trained.
• before using the second sample set to train the model to be trained, the weight parameters of the loss function of the model to be trained are adjusted according to the difficulty weights of the samples in the second sample set; then, when the second sample set is used to train the model, back-propagation supervised training can be performed according to the above loss function to obtain the trained target model.
• in the loss function, the difficulty weight of each sample is proportional to the value of the loss function; therefore, after a sample with a larger difficulty weight is input into the model to be trained, the value of the resulting loss function is larger. Using this loss function to perform back-propagation supervised training of the model to be trained makes the model more inclined to use difficult samples for parameter updates.
  • the formula of the loss function Loss1 of the model to be trained can be as follows:
• where w and b are the parameters of the model to be trained, x is the input data, y is the output data, m is the number of input data, and n is the number of categories of the model to be trained.
  • the formula of Loss1 can be:
  • Loss0 can use any of various Loss formulas existing in the industry, such as a mean square error loss function, a cross entropy loss function, etc., which are not specifically limited in this application.
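The patent's exact Formula (3) is an image and is not reproduced in this text; the sketch below assumes the simple per-sample form Loss1 = λ_i · Loss0, with cross-entropy as the industry-standard Loss0 mentioned above (this scaling form is an assumption consistent with the text, not the patent's verbatim formula):

```python
import math

def cross_entropy(probs, label):
    """Loss0: standard cross-entropy for one sample's predicted probabilities."""
    return -math.log(probs[label])

def weighted_loss(batch, difficulty_weights):
    """Loss1 sketch: scale each sample's base loss by its difficulty weight
    lambda_i, so harder samples contribute more to the gradient.
    `batch` is a list of (predicted_probs, true_label) pairs."""
    total = 0.0
    for (probs, label), lam in zip(batch, difficulty_weights):
        total += lam * cross_entropy(probs, label)
    return total / len(batch)
```

Because λ_i multiplies the per-sample loss, back-propagation through this loss naturally pushes the model's parameter updates toward fitting the high-weight (difficult) samples.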
• the specific process by which the training device 200 determines the difficulty weight distribution of the samples in the first sample set in the foregoing step S220 is described in detail below. This step can be divided into the following sub-steps:
  • Step S221 Input each sample of the first sample set to the feature extraction model to obtain feature information of each sample, where each sample corresponds to a category.
• the feature information of each sample extracted by the feature extraction model may specifically be a feature vector or a feature matrix. To facilitate a better understanding of the present application, the following takes the feature information being a feature vector as an example for illustration.
  • the feature vector is the numerical feature of the sample expressed in the form of a vector, which can effectively characterize the feature of the sample.
  • the feature vector is a multi-dimensional vector, such as a 512-dimensional vector or a 1024-dimensional vector.
  • the specific dimension of the vector is not limited.
  • the feature extraction model is used to extract a certain type of feature of the sample. Different feature extraction models extract different feature vectors for the same sample.
• for example, a feature extraction model used to extract face attributes can extract facial features of sample A, such as the eyes, nose, and mouth, while a feature extraction model used to extract vehicle attributes can extract features of sample A such as the wheels and steel material. Therefore, the feature extraction model can be determined according to the task objective of the model to be trained: if the model to be trained is a face recognition network, the feature extraction model used in step S221 is one used to extract facial attribute features; if the model to be trained is a vehicle recognition network, the feature extraction model used in step S221 is one used to extract vehicle attribute features. It should be understood that the above example is only for illustration and does not constitute a specific limitation.
• generally, the feature vectors obtained after a simple sample and a difficult sample are input into the feature extraction model are different. The quality of the feature vector extracted from a simple sample is high, while the quality of the feature vector extracted from a difficult sample is poor. The quality of a feature vector depends on its ability to distinguish different types of image samples: a good feature should be informative and not affected by noise or a series of transformations. After high-quality feature information is input into a classifier, the category to which the sample belongs can be quickly obtained; after poor-quality feature information is input into a classifier, it is difficult to determine the category to which the sample belongs. For example, a feature extraction model used to extract face attributes can easily extract the features of the eyes, nose, and mouth from a simple sample, but may fail to extract these features from a difficult sample. Therefore, the feature vectors of simple samples should be similar to one another, while the feature vectors of difficult samples differ from those of simple samples.
• the feature extraction model in the database 150 is used to extract the feature information of samples; it can be an AI model trained before step S210. The feature extraction model can also be any of the feature extractors already available in the industry, for example, the Histogram of Oriented Gradients (HOG) descriptor used for object detection, the Local Binary Pattern (LBP), the convolutional layers of a convolutional neural network, and so on; this application does not specifically limit it.
  • the source of the aforementioned sample set may include mobile phones or surveillance cameras, local offline data, Internet public data, etc., which are not specifically limited in this application.
  • the following takes a convolutional neural network as an example to illustrate the feature extraction model.
• a Convolutional Neural Network (Convolutional Neuron Network, CNN) is a deep neural network with a convolutional structure and a deep learning architecture. A deep learning architecture refers to learning at multiple levels of abstraction through machine learning algorithms. As a deep learning architecture, a CNN is a feed-forward artificial neural network in which each neuron responds to overlapping regions in the input image. As shown in FIG. 5, a convolutional neural network (CNN) 300 may include an input layer 310, a convolutional layer/pooling layer 320, and a neural network layer 330, where the pooling layer is an optional network layer.
• as an example, the convolutional layer/pooling layer 320 may include layers 321 to 326. In one example, layer 321 is a convolutional layer, layer 322 is a pooling layer, layer 323 is a convolutional layer, layer 324 is a pooling layer, layer 325 is a convolutional layer, and layer 326 is a pooling layer. In another example, layers 321 and 322 are convolutional layers, layer 323 is a pooling layer, layers 324 and 325 are convolutional layers, and layer 326 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolutional layer 321 can include many convolution operators.
  • the convolution operator is also called a kernel. Its function in image processing is equivalent to a filter that extracts specific information from the input image matrix.
• the convolution operator can essentially be a weight matrix, which is usually predefined. In the process of performing convolution on an image, the weight matrix is usually slid over the input image one pixel after another in the horizontal direction (or two pixels after two pixels, depending on the value of the stride), so as to complete the work of extracting specific features from the image.
• the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; the weight matrix extends over the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolution output with a single depth dimension. In most cases, however, a single weight matrix is not used; instead, multiple weight matrices of the same dimensions are applied, and the outputs of the weight matrices are stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features from the image: for example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract specific colors of the image, and yet another weight matrix is used to eliminate unwanted noise in the image.
• the multiple weight matrices have the same dimensions, so the feature maps extracted by them also have the same dimensions; the extracted feature maps of the same dimensions are then combined to form the output of the convolution operation, obtaining the final feature vector.
  • weight values in these weight matrices need to be obtained through a lot of training in practical applications.
• each weight matrix formed by the weight values obtained through training can extract specific information from the input image to generate a feature vector; the feature vector is then input into the neural network layer for classification processing, helping the convolutional neural network 300 make correct predictions.
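The sliding weight-matrix operation described above can be sketched in plain Python; this is a toy single-channel, single-kernel version for illustration (real convolutional layers also stack multiple kernels along the depth dimension, as the text explains):

```python
def conv2d(image, kernel, stride=1):
    """Slide a weight matrix (kernel) over a 2-D image, stride pixels at a time,
    summing the element-wise products at each position.
    `image` and `kernel` are lists of lists of numbers."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(0, len(image) - kh + 1, stride):
        row = []
        for j in range(0, len(image[0]) - kw + 1, stride):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    acc += image[i + a][j + b] * kernel[a][b]
            row.append(acc)
        out.append(row)
    return out
```

With a stride of 2, the kernel jumps two pixels at a time and the output shrinks accordingly, matching the stride behavior described above.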
• when the convolutional neural network 300 has multiple convolutional layers, the initial convolutional layer (such as 321) often extracts more general features, which can also be called low-level features; as the depth of the convolutional neural network 300 increases, the features extracted by the later convolutional layers (for example, 326) become more and more complex, such as high-level semantic features, and features with higher-level semantics are more suitable for the problem to be solved.
• pooling layer: since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be periodically introduced after a convolutional layer. Among the layers 321 to 326 illustrated by 320 in FIG. 5, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. In image processing, the sole purpose of the pooling layer is to reduce the spatial size of the image.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value.
  • the maximum pooling operator can take the pixel with the largest value within a specific range as the result of the maximum pooling.
  • the operators in the pooling layer should also be related to the size of the image.
  • the size of the image output after processing by the pooling layer can be smaller than the size of the image of the input pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
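The average and maximum pooling operators can be sketched as follows, again as a toy single-channel version (the `size` parameter is the pooling window over non-overlapping regions, an illustrative simplification):

```python
def pool2d(image, size=2, mode="max"):
    """Average or maximum pooling over non-overlapping size x size windows.
    Each output pixel is the max (or mean) of the corresponding sub-region,
    so the output image is smaller than the input, as described above."""
    out = []
    for i in range(0, len(image) - size + 1, size):
        row = []
        for j in range(0, len(image[0]) - size + 1, size):
            window = [image[i + a][j + b] for a in range(size) for b in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out
```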
• after processing by the convolutional layer/pooling layer 320, the convolutional neural network 300 is not yet able to output the required output information, because, as mentioned above, the convolutional layer/pooling layer 320 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the required class information or other related information), the convolutional neural network 300 needs the neural network layer 330 to generate the output of one or a group of required classes. Therefore, the neural network layer 330 may include multiple hidden layers (331, 332 to 33n as shown in FIG. 5) and an output layer 340, where the parameters contained in the multiple hidden layers can be obtained by pre-training based on the relevant training data of specific task types.
  • the output layer 340 has a loss function similar to the classification cross entropy, which is specifically used to calculate the prediction error.
• in this application, the input layer 310 and the convolutional layer/pooling layer 320 are used to extract sample features to obtain the feature vector of a sample, and the neural network layer 330 is used to classify the input image according to the feature vector extracted by the convolutional layer/pooling layer 320.
  • the feature extraction model required by this application can be simply understood as a convolutional neural network that only includes the convolutional layer/pooling layer 320 and does not include the neural network layer 330. It should be understood that the above examples are only for illustration and cannot constitute a specific limitation.
  • Step S222 According to the feature information of each sample, determine the reference feature information of the multiple types of samples in the first sample set, where each type of sample includes at least one sample of the same category.
• assuming that a type of sample has n feature vectors B1 to Bn, the reference feature information of this type of sample can be the average vector A of these n vectors, or the vector Bj among the n vectors that is closest to the above average vector A, where j ≤ n; similarly, the reference feature information of the other types of samples can be obtained.
• when the reference feature information is expressed in the form of a vector, the reference feature information is also called a reference feature vector.
• for ease of understanding, assuming that the feature information of each sample is a 512-dimensional feature vector, the multi-dimensional feature vectors obtained in step S221 can be mapped into a 2D space and drawn in a rectangular coordinate system in the form of coordinate points; the reference feature information of each type of sample can then be as shown in FIG. 6. It should be understood that FIG. 6 is only used as an example.
• the reference feature information of each type of sample can also be determined by mapping the feature vectors of each type of sample into the 2D space and taking the feature vector corresponding to the point in the most densely distributed area as the reference feature information of that type of sample.
  • this application does not limit the method for determining the reference feature information.
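As an illustration of step S222, here is a minimal sketch of computing the reference feature vector of one category, either as the average vector A itself or as the vector Bj closest to A (function names and vector values are illustrative; the density-based 2D-mapping variant is not sketched):

```python
import math

def mean_vector(vectors):
    """Average vector A of the n feature vectors B1..Bn of one category."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def euclidean(u, v):
    """Euclidean distance between two vectors of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def reference_vector(vectors, use_closest=False):
    """Reference feature vector: the mean A, or the Bj closest to A."""
    a = mean_vector(vectors)
    if not use_closest:
        return a
    return min(vectors, key=lambda v: euclidean(v, a))
```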
  • Step S223 Determine the difficulty weight corresponding to each sample based on the similarity between the feature information of each sample and the reference feature information of the corresponding category.
• the difficulty weight of each sample can be determined according to the distance between the feature vector of the sample and the reference feature vector of the corresponding category: the greater the distance between them, the smaller the similarity between the feature vector of the sample and the reference feature vector of the corresponding category, and the greater the difficulty weight of the sample; that is to say, the distance and the difficulty weight are in a proportional relationship.
• for example, the difficulty weight of feature vector B1 can be determined according to the distance between B1 and the reference feature vector A, the difficulty weight of feature vector B2 according to the distance between B2 and A, ..., and the difficulty weight of feature vector Bn according to the distance between Bn and A. In this way, the difficulty weight of each sample can be determined according to the distance between its feature vector and the reference feature vector of the corresponding category.
• the distance between the feature vector of a sample and the reference feature vector can be the Cosine Distance, Euclidean Distance, Manhattan Distance, Chebyshev Distance, Minkowski Distance, and so on; the similarity between the feature information of a sample and the reference feature information can be the Cosine Similarity, Adjusted Cosine Similarity, Pearson Correlation Coefficient, Jaccard Coefficient, and so on; this application does not make specific restrictions.
• the distance Di between the feature vector Bi of each sample and the reference feature vector A can be determined by Formula (4). It should be understood that the above Formula (4) is only used for illustration and does not constitute a specific limitation.
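Formula (4) itself is an image and is not reproduced in this text; the sketch below uses the cosine distance, the first of the distances listed above, as one admissible choice of Di (any of the other listed distances would work equally well):

```python
import math

def cosine_distance(u, v):
    """One admissible choice for the distance D_i between a sample's feature
    vector B_i and the reference vector A: 1 minus the cosine similarity.
    Identical directions give 0; orthogonal vectors give 1."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)
```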
• the principle of using distance to represent the difficulty weight of a sample is as follows: the feature extraction model used to extract sample features includes multiple weight matrices for extracting specific features, and each weight matrix can extract specific colors, specific edge information, and so on. For simple samples, the weight matrices can extract these specific features, so the feature vectors obtained from different simple samples are very similar; for difficult samples, the weight matrices may fail to extract the specific colors, edge information, and so on, so the feature vector extracted from a difficult sample differs greatly from that extracted from a simple sample. In this way, by determining the distance between the feature vector extracted from each sample and the reference feature vector, the degree of difficulty of the sample can be well determined.
  • the formula for the difficulty weight ⁇ i can be:
• where T is a constant greater than 1. It should be understood that the above Formula (5) is only used for illustration; the formula of the difficulty weight λ may be any other formula in which the difficulty weight λ is in a proportional relationship with the distance D, which is not specifically limited in this application.
  • the formula of the difficulty weight can be:
  • the constant T in the difficulty weight ⁇ i can be an adjustable constant.
• in the initial stage of training the model to be trained, T can be a relatively large constant, so that the difficulty weights of difficult samples are higher, the value of the loss function is larger, and the learning focus of the model to be trained is biased toward difficult samples; in the later stage of training, T can be appropriately reduced, because the AI model has by then tended to converge and it is no longer necessary to favor the more time-consuming difficult samples, thereby increasing the training speed.
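Formula (5) is likewise an image and not reproduced here; the sketch below assumes one simple form consistent with the text's stated constraints (λ grows with the distance D, and the constant T > 1 is adjustable over training) — it is not the patent's verbatim formula:

```python
def difficulty_weight(distance, t=2.0):
    """Assumed form lambda_i = 1 + (T - 1) * D_i: increases with the distance
    D_i, and a larger T > 1 stretches the weights of difficult samples.
    The patent's exact Formula (5) is not reproduced; this is a sketch."""
    assert t > 1
    return 1.0 + (t - 1.0) * distance

def t_schedule(epoch, total_epochs, t_start=4.0, t_end=1.5):
    """Linearly decay T over training, as suggested above: large early, to
    focus learning on difficult samples, smaller late once the model has
    tended to converge. t_start/t_end values are illustrative."""
    frac = epoch / max(1, total_epochs - 1)
    return t_start + (t_end - t_start) * frac
```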
  • Step S224 According to the difficulty weight of each sample in the first sample set, obtain the difficulty weight distribution of the samples in the first sample set.
• in summary, the feature extraction model is used to extract the feature vector of each sample in the sample set and the reference feature vector of each type of sample, and the difficulty weight of each sample is then determined based on the similarity or distance between the feature vector of each sample and the reference feature vector of the corresponding category. The difficulty weight distribution of the first sample set obtained in this way is based on the characteristics of the samples themselves and has nothing to do with the structure of the model to be trained or the training method used; it reflects the difficulty of the samples well, and the labeling accuracy is very high, which solves the problem that the training accuracy of the AI model encounters a bottleneck because difficult samples are hard to label.
  • the difficulty weight distribution of the first sample set may also be stored in the database 130.
  • the database 130 stores 3 sample sets, namely sample sets X1, X2, and X3.
  • this application provides a model training method, which can determine the difficulty weight distribution of samples in the first sample set before training the model to be trained, and then according to the task target of the model to be trained and the above difficulty weight distribution, Adjust the first sample set to obtain the second sample set, and finally use the second sample set to train the model to be trained.
• in this way, the training device 200 can combine the complexity of the task target of the model to be trained with the difficulty weight of each sample to select an appropriate number of difficult samples for training, which solves the problem that difficult samples are hard to label and thus cause a bottleneck in the training accuracy of the AI model, thereby improving the training accuracy of the AI model.
  • the training method provided in the present application will be described with an example in conjunction with FIG. 8 below.
  • the first sample set used to train the model to be trained includes two types of samples.
• the first type of sample consists of face images of ID1 (for example, face images of the person Ann at various angles), including samples X11 to X14; the second type of sample consists of face images of ID2 (for example, face images of the person Lisa at various angles), including samples X21 to X24, for a total of 8 samples.
  • the training method provided by this application includes the following steps:
  • Step 1 Input each sample of each type of sample in the first sample set into a feature extraction model to obtain a feature vector of each sample.
  • the feature extraction model is used to extract facial features.
  • inputting samples X11 to X14 into the feature extraction model can obtain feature vectors A11 to A14, and inputting samples X21 to X24 into the feature extraction model to obtain feature vectors A21 to A24.
• for details, please refer to step S221 of the foregoing content, which will not be repeated here.
  • Step 2 Determine the reference feature vector of each type of sample in the first sample set.
• the reference feature vector of each type of sample can be the average of the feature vectors of that type of sample, or the feature vector closest to the average, or, after the feature vectors of each type of sample are mapped into the 2D space, the feature vector corresponding to the point in the most densely distributed area; this application does not limit the method for determining the reference feature information.
  • FIG. 8 illustrates the feature vector closest to the average value as an example, such as the reference feature vector A14 and the reference feature vector A21 shown in FIG. 8. For details, please refer to step S222 of the foregoing content, which will not be repeated here.
  • Step 3 Determine the distance between each feature vector and the reference feature vector of the corresponding category.
• specifically, the distance D11 between feature vectors A14 and A11, the distance D12 between A14 and A12, and the distance D13 between A14 and A13 can be calculated, with the distance between A14 and itself being 0; in the same way, the distance D21 between feature vectors A21 and A22, the distance D22 between A21 and A23, and the distance D23 between A21 and A24 can be calculated, with the distance between A21 and itself being 0.
• the distance may be any of the cosine distance, Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, and so on in the foregoing content, which is not specifically limited in this application.
• for details, reference may be made to step S223 and its optional steps in the foregoing content, which will not be repeated here.
  • Step 4 Determine the difficulty weight ⁇ of each sample in the first sample set, and obtain the difficulty weight distribution of the first sample set.
  • In FIG. 8, samples whose difficulty weight exceeds the first threshold h1 are shown in dark color; that is, the difficulty weights of samples X11 and X22 are higher than the threshold.
  • For details, refer to step S224 and its optional steps above, which will not be repeated here.
  • Step 5 Determine the target difficulty weight of the model to be trained according to the task target of the model to be trained, and adjust the difficulty weight distribution of the first sample set according to the target difficulty weight to obtain the second sample set.
  • For details, refer to step S230 and its optional steps above, which will not be repeated here.
  • Step 6 Use the second sample set to train the model to be trained.
  • The loss function of the model to be trained can be as shown in formula 3. This loss function increases the influence of difficult samples with large difficulty weights on the loss during training, so that the model to be trained can concentrate more on learning the characteristics of difficult samples and is more inclined to use difficult samples for parameter updates, thereby achieving the purpose of reinforced training.
  • The constant T in the difficulty weight can be set to a higher value in the early stage of training, so that the influence of difficult samples on training is at its highest; T can then be set to a lower value near the end of training, when the model has tended to converge and no longer needs to favor the more time-consuming difficult samples, thereby increasing the training speed.
  • For details, refer to step S240 and its optional steps above, which will not be repeated here.
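Steps 4–6 above can be sketched as follows. The exact expression of the difficulty weight α and the role of the constant T are not given in this excerpt, so the form `alpha = 1 + T * d` used below is only an assumption, chosen so that a larger T amplifies the influence of distant (difficult) samples, consistent with the annealing behavior described above:

```python
# A minimal sketch of the difficulty-weighted loss Loss1 = alpha_i * Loss0.
# ASSUMPTION: alpha_i = 1 + T * d_i, where d_i is the distance of sample i
# to its class reference vector and T is the constant annealed from a high
# value early in training to a low value near the end.

def difficulty_weight(distance, temperature):
    """Assumed weight: grows with distance; T scales the hard-sample emphasis."""
    return 1.0 + temperature * distance

def weighted_loss(base_loss, distance, temperature):
    """Loss1 = alpha_i * Loss0 (formula 3 in the text)."""
    return difficulty_weight(distance, temperature) * base_loss

# Early in training (large T) a hard sample dominates the loss;
# late in training (small T) all samples are weighted almost equally.
early = weighted_loss(base_loss=0.5, distance=2.0, temperature=5.0)
late = weighted_loss(base_loss=0.5, distance=2.0, temperature=0.1)
```

With these assumed numbers, the same hard sample contributes a loss of 5.5 early in training but only 0.6 late in training, illustrating why lowering T speeds up the final phase.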
  • The above training method uses the feature extraction model to perform feature extraction on each sample in the first sample set, determines the reference feature vector of each class according to the feature vectors extracted from the samples of that class, determines the difficulty weight of each sample according to the distance between its feature vector and the reference feature vector of its class, adjusts the difficulty weight distribution of the first sample set according to these difficulty weights, and finally trains the model to be trained with the second sample set obtained after the adjustment.
  • In this way, the training device 200 can combine the complexity of the task target of the model to be trained with the difficulty weight of each sample and select an appropriate number of difficult samples for training.
  • FIG. 9 is a hardware structure of a chip provided by an embodiment of the application, and the chip includes a neural network processor 50.
  • The chip can be set in the training device 200 described above to complete the training work of the training unit 240 and the feature extraction work of the extraction module 211.
  • the algorithms of each layer in the convolutional neural network as shown in FIG. 5 can all be implemented in the chip as shown in FIG. 9.
  • The Neural-network Processing Unit (NPU) 900 can be mounted as a coprocessor onto the main CPU (Host CPU) 800, which assigns tasks to it.
  • The main CPU 800 acts like a manager, determining which data needs to be processed by the NPU cores and issuing instructions to trigger the NPU 900 to process that data.
  • The NPU 900 can also be integrated into the CPU, as in the Kirin 970, or used as a separate chip.
  • the core part of the NPU 900 is the arithmetic circuit 903.
  • the arithmetic circuit 903 is controlled by the controller 904 to extract matrix data from the memory and perform multiplication operations, such as the convolution operation in the embodiment of FIG. 5.
  • the arithmetic circuit 903 includes multiple processing units (Process Engine, PE). In some implementations, the arithmetic circuit 903 is a two-dimensional systolic array. The arithmetic circuit 903 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 903 is a general-purpose matrix processor.
  • the arithmetic circuit fetches the data corresponding to matrix B from the weight memory 902 and caches it on each PE in the arithmetic circuit.
  • The arithmetic circuit fetches matrix A data from the input memory 901, performs a matrix operation between matrix A and matrix B, and stores the partial or final result in the accumulator 908.
  • the unified memory 906 is used to store input data and output data.
  • the weight data is directly transferred to the weight memory 902 through the direct memory access controller (DMAC) 905.
  • the input data is also transferred to the unified memory 906 through the DMAC.
  • The bus interface unit (BIU) 910 is used to interact with the storage unit access controller 905 and the instruction fetch buffer (IFB) 909 through the AXI (Advanced eXtensible Interface) bus protocol.
  • the bus interface unit 910 is used for the instruction fetch memory 909 to obtain instructions from the external memory, and is also used for the storage unit access controller 905 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • The storage unit access controller 905 is mainly used to transfer input data from the external memory to the unified memory 906, transfer weight data to the weight memory 902, or transfer input data to the input memory 901.
  • the vector calculation unit 907 includes multiple arithmetic processing units, and if necessary, further processes the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on.
  • the vector calculation unit 907 can store the processed output vector in the unified buffer 906.
  • the vector calculation unit 907 may apply a nonlinear function to the output of the arithmetic circuit 903, such as a vector of accumulated values, to generate an activation value.
  • the vector calculation unit 907 generates a normalized value, a combined value, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 903, for example for use in a subsequent layer in a neural network.
  • the instruction fetch buffer (Instruction Fetch Buffer) 909 connected to the controller 904 is used to store instructions used by the controller 904; the controller 904 is used to call the instructions buffered in the instruction fetch memory 909 to control the working process of the arithmetic accelerator.
  • the unified memory 906, the input memory 901, the weight memory 902, and the fetch memory 909 are all on-chip memories.
  • The external memory is a memory external to the NPU hardware architecture.
  • the external memory may be a double data rate synchronous dynamic random access memory (Double Data Rate Synchronous Dynamic Random Access Memory, DDR SDRAM), a high bandwidth memory (High Bandwidth Memory, HBM) or other readable and writable memory.
  • FIG. 10 is a schematic diagram of the hardware structure of a computing device provided by the present application.
  • The computing device 1000 may be the training device 200 in the embodiments of FIG. 2 to FIG. 8.
  • the computing device 1000 includes a processor 1010, a communication interface 1020, a memory 1030, and a neural network processor 1050.
  • the processor 1010, the communication interface 1020, the memory 1030, and the neural network processor 1050 may be connected to each other through an internal bus 1040, and may also communicate through other means such as wireless transmission.
  • the embodiment of the present application takes the connection via a bus 1040 as an example.
  • the bus 1040 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus.
  • The bus 1040 can be divided into an address bus, a data bus, a control bus, and the like. For ease of presentation, only one thick line is used in FIG. 10, but this does not mean that there is only one bus or one type of bus.
  • the processor 1010 may be composed of at least one general-purpose processor, such as a central processing unit (CPU), or a combination of a CPU and a hardware chip.
  • the aforementioned hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (Programmable Logic Device, PLD), or a combination thereof.
  • the above-mentioned PLD may be a complex programmable logic device (Complex Programmable Logic Device, CPLD), a field programmable logic gate array (Field-Programmable Gate Array, FPGA), a general array logic (Generic Array Logic, GAL), or any combination thereof.
  • the processor 1010 executes various types of digital storage instructions, such as software or firmware programs stored in the memory 1030, which enables the computing device 1000 to provide a wide variety of services.
  • the memory 1030 is used to store program codes, which are controlled by the processor 1010 to execute, so as to execute the processing steps of the training device 200 in any one of the embodiments in FIG. 2 to FIG. 8.
  • the program code may include one or more software modules.
  • The one or more software modules may be the software modules provided in the embodiment shown in FIG. 3: the acquisition unit may be used to acquire the first sample set; the determination unit may be used to determine the difficulty weight distribution of the first sample set; the adjustment unit may be used to adjust the difficulty weight distribution of the first sample set according to the difficulty weight corresponding to each sample in the first sample set and the task target of the model to be trained, thereby obtaining the second sample set; and the training unit may be used to train the model to be trained using the second sample set. Specifically, these modules can be used to perform steps S210-S230 and their optional steps, steps 1-6 and their optional steps of the foregoing method, and other steps performed by the training device described in the embodiments of FIGS. 2-8, which will not be repeated here.
  • This embodiment can be implemented by a general physical server, for example, an ARM server or an X86 server, or by a virtual machine based on a general physical server combined with NFV technology. A virtual machine refers to a complete, software-simulated computer system with complete hardware system functions that runs in a completely isolated environment, which is not specifically limited in this application.
  • The neural network processor 1050 may be used to obtain an inference model by using the training program and sample data in the memory 1030, so as to execute at least a part of the methods discussed herein. For the hardware structure of the neural network processor 1050, refer to FIG. 9.
  • The memory 1030 may include a volatile memory, such as a random access memory (Random Access Memory, RAM); the memory 1030 may also include a non-volatile memory, such as a read-only memory (Read-Only Memory, ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 1030 may also include a combination of the above types.
  • The memory 1030 may store the first sample set and/or the second sample set, and the memory 1030 may store program codes, which may specifically include program codes for executing other steps described in the embodiments of FIGS. 2-8 and will not be repeated here.
  • The communication interface 1020 may be a wired interface (such as an Ethernet interface), an internal interface (such as a Peripheral Component Interconnect express (PCIe) bus interface), or a wireless interface (for example, a cellular network interface or a wireless local area network interface) to communicate with other devices or modules.
  • FIG. 10 is only a possible implementation of the embodiment of the present application.
  • the computing device may also include more or fewer components, which is not limited here.
  • For content that is not shown or described in the embodiment of the present application, please refer to the relevant descriptions in the embodiments of FIG. 2 to FIG. 8, which will not be repeated here.
  • the computing device shown in FIG. 10 may also be a computer cluster composed of at least one server, which is not specifically limited in this application.
  • The embodiment of the present application also provides a computer-readable storage medium that stores instructions; when the instructions run on a processor, the method flows shown in FIGS. 2 to 8 are implemented.
  • the embodiment of the present application also provides a computer program product.
  • When the computer program product runs on a processor, the method flows shown in FIGS. 2 to 8 are implemented.
  • The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • the above-mentioned embodiments may be implemented in the form of a computer program product in whole or in part.
  • the computer program product includes at least one computer instruction.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or Digital Subscriber Line (DSL)) or wireless means (for example, infrared, radio, or microwave).
  • The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates at least one available medium.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, a high-density digital video disc (Digital Video Disc, DVD)), or a semiconductor medium.
  • the semiconductor medium may be an SSD.

Abstract

This application provides a training method, apparatus, and related device. Before training the model to be trained, the method first determines the difficulty weight distribution of the samples in a first sample set, then adjusts the first sample set according to the task target of the model to be trained and the above difficulty weight distribution to obtain a second sample set, and finally trains the model to be trained using the second sample set. With the training method provided by this application, an appropriate number of hard samples can be selected for training in combination with the complexity of the task target of the model to be trained and the difficulty weight of each sample, which solves the problem that hard samples are difficult to label and thus the training accuracy of the model to be trained hits a bottleneck, so that the training accuracy of the model to be trained is improved.

Description

A training method, apparatus, device, and computer-readable storage medium
This application claims priority to the Chinese patent application No. 202010462418.X, entitled "A training method, apparatus, device, and computer-readable storage medium", filed with the China National Intellectual Property Administration on May 27, 2020, the entire content of which is incorporated herein by reference.
Technical Field
This application relates to the field of Artificial Intelligence (AI), and in particular to a training method, apparatus, device, and computer-readable storage medium.
Background
With the continuous development of science and technology, AI models have been widely applied in fields such as video and image processing, speech recognition, and natural language processing. An AI model usually needs to be trained with a large number of samples, and when training an AI model, hard samples often contribute more than simple samples. A hard sample is a sample that the model finds difficult to discriminate; specifically, it may be a blurry, overexposed sample with unclear contours, or a sample that is very similar to other samples. During the learning process of an AI model, even a large number of simple samples can hardly bring a substantial improvement to the prediction accuracy of the AI model, whereas hard samples often do.
However, during the training of an AI model, manually screening hard samples wastes manpower and time, while the accuracy of labeling hard samples with a computing device is poor. The fact that hard samples are difficult to label causes a bottleneck in the training accuracy of AI models.
Summary
This application provides a training method, apparatus, device, and computer-readable storage medium, which are used to solve the current problem that hard samples are difficult to label, causing a bottleneck in the training accuracy of AI models.
In a first aspect, a training method is provided, and the method includes the following steps:
obtaining a first sample set; after determining the difficulty weight distribution of the samples in the first sample set, adjusting the first sample set according to the task target of the model to be trained and the difficulty weight distribution of the samples in the first sample set to obtain a second sample set; and finally training the model to be trained using the second sample set.
By implementing the method described in the first aspect, before the model to be trained is trained, the difficulty weight distribution of the samples in the first sample set is first determined; the first sample set is then adjusted according to the task target of the model to be trained and the above difficulty weight distribution to obtain a second sample set; and the model to be trained is finally trained using the second sample set. In this way, during training, the training device 200 can combine the complexity of the task target of the model to be trained with the difficulty weight of each sample and select an appropriate number of hard samples for training, which solves the problem that hard samples are difficult to label and thus the training accuracy of the model to be trained hits a bottleneck, so that the training accuracy of the model to be trained is improved.
In a possible implementation, the task target of the model to be trained includes one or more of the application scenario of the model after training, the event type to be implemented by the model after training, and the training accuracy target of the model. The model to be trained is an AI model, for example, a neural network model.
By implementing the above implementation: task targets of different models to be trained differ in difficulty. When training a model for a simple task target, such as face recognition for an indoor gate scenario, the second sample set used for training can contain more samples with small difficulty weights; training with a large number of simple samples and a smaller number of hard samples for auxiliary training can improve the training speed while achieving the task target. Conversely, when training a model for a complex task target, such as face recognition in an outdoor video surveillance scenario, the second sample set used for training can contain more samples with large difficulty weights; training with a large number of hard samples and a smaller number of simple samples for auxiliary training allows the model to focus more on learning from hard samples and improves its training accuracy in a targeted manner, thereby achieving the purpose of reinforced training.
In a possible implementation, when the first sample set is adjusted according to the task target of the model to be trained and the difficulty weight distribution of the samples in the first sample set to obtain the second sample set, the target difficulty weight distribution that the sample set used for training the model should reach may first be determined according to the task target of the model and the difficulty weight distribution of the samples in the first sample set; the number of samples in the first sample set is then increased or decreased, or the difficulty weights of some samples in the first sample set are changed, to obtain the second sample set, where the difficulty weight distribution of the samples in the second sample set is equal to or approximates the target difficulty weight distribution.
In a specific implementation, the training device used for training the model to be trained may maintain a correspondence library that stores the correspondences between multiple task targets and multiple target difficulty weight distributions. In this way, after the training device determines the difficulty weight distribution of the first sample set, it can determine the target difficulty weight distribution corresponding to the task target of the model to be trained according to that task target and the correspondence library, and then adjust the difficulty weight distribution of the first sample set according to the gap between the difficulty weight distribution of the first sample set and the target difficulty weight distribution, to obtain the second sample set used for training the model.
It should be noted that, when the first sample set is adjusted according to the target difficulty weight distribution, the difficulty weight distribution of the samples in the resulting second sample set may be equal to the target difficulty weight distribution, or may approximate it. Approximating the target difficulty weight distribution means that the difference between the difficulty weight distribution of the second sample set and the target difficulty weight distribution is smaller than a third threshold h3. For example, if the third threshold h3 = 0.2 and, as in the example above, the target difficulty weight distribution is hard samples : simple samples = 3:2 = 1.5 and the difficulty weight distribution of the first sample set is hard samples : simple samples = 3:7, then after adjusting the first sample set, the difficulty weight distribution of the obtained second sample set may also be 8:5 = 1.6, where the difference between the difficulty weight distribution of the second sample set and the target difficulty weight distribution, 1.6 - 1.5 = 0.1, is smaller than the third threshold h3 = 0.2. It should be understood that the above example is for illustration only and does not constitute a specific limitation.
By implementing the above implementation, when the difficulty weight distribution of the first sample set is adjusted, the target difficulty weight distribution is determined according to the task target of the model to be trained, and the difficulty weight distribution of the first sample set is then adjusted according to the target difficulty weight distribution. The second sample set obtained in this way is more suitable for training the model to be trained and can improve its training accuracy in a targeted manner, achieving the purpose of reinforced training.
In a possible implementation, when determining the difficulty weight distribution of the samples in the first sample set, each sample of the first sample set may first be input into a feature extraction model to obtain the feature information of each sample; the reference feature information of each of the multiple classes of samples in the first sample set is then determined according to the feature information of each sample; the difficulty weight corresponding to each sample is determined based on the similarity between the feature information of each sample and the reference feature information of the corresponding class, thereby obtaining the difficulty weight distribution of the samples in the first sample set.
In a specific implementation, the feature extraction model is used to extract the feature information of samples and may be an AI model trained before the first sample set is obtained. The feature extraction model may be any of the existing AI models used in the industry for extracting sample features, such as the Histogram of Oriented Gradient (HOG) feature descriptor for object detection, the Local Binary Pattern (LBP), the convolutional layers of a convolutional neural network, and so on, which is not specifically limited in this application. Moreover, the sources of the above sample sets may include mobile phones or surveillance cameras, local offline data, public Internet data, and so on, which is not specifically limited in this application.
The feature information of each sample extracted by the feature extraction model may specifically be a feature vector or a feature matrix. Assuming that the number of samples in the first class is n and the feature information obtained after each sample of this class is input into the feature extraction model is the feature vectors B1, B2, ..., Bn, then the reference feature information of this class may be the average vector A of these n feature vectors, or the feature vector Bj (j ∈ n) among the n feature vectors that is closest to the average vector A; alternatively, after the feature vectors of each class are mapped to 2D space, the feature vector corresponding to the point in the most densely distributed area may be determined as the reference feature information of this class. This application does not limit the method for determining the reference feature information.
It should be noted that, when the feature information is a feature vector, the difficulty weight of each sample can be determined according to the distance between its feature vector and the reference feature vector of the corresponding class. The larger the distance between the feature vector of a sample and the reference feature vector of the corresponding class, the smaller the similarity between them and the larger the difficulty weight of the sample; that is, distance is directly proportional to the difficulty weight, and similarity is inversely proportional to it.
By implementing the above implementation, the feature extraction model is used to extract the feature information of each sample in the sample set and the reference feature information of each class, and the difficulty weight of each sample is then determined according to the similarity or distance between the feature information of each sample and the reference feature information of the corresponding class. The difficulty weight distribution of the first sample set obtained in this way is based on the features of the samples themselves and is independent of the structure of the model being trained and the training method used, so it reflects the difficulty of the samples well and the hard samples are labeled with high accuracy, thereby solving the problem that the training accuracy of AI models hits a bottleneck because hard samples are difficult to label.
In a possible implementation, before the model to be trained is trained with the second sample set, the above method may further include the following step: adjusting the weight parameters of the loss function of the model to be trained according to the difficulty weight distribution of the samples in the second sample set.
For example, if the loss function commonly used for the task target of the model to be trained is Loss0 and the weight parameter of a sample is α_i, then the formula of the loss function Loss1 of the model to be trained may be as follows:
Loss1 = α_i · Loss0
By implementing the above implementation, the larger the difficulty weight of a sample, the larger the loss function value obtained after the sample is input into the model to be trained. Using this loss function to perform back-propagation supervised training on the model makes the model more inclined to update its parameters with hard samples and to concentrate on learning the features of hard samples, thereby achieving the purpose of reinforced training on hard samples and improving the model's ability to express the features of hard samples.
In a second aspect, a training apparatus is provided, and the apparatus includes: an obtaining unit, configured to obtain a first sample set, where the first sample set includes multiple samples; a determination unit, configured to determine the difficulty weight distribution of the samples in the first sample set; an adjustment unit, configured to adjust the first sample set according to the task target of the model to be trained and the difficulty weight distribution of the samples in the first sample set to obtain a second sample set; and a training unit, configured to train the model to be trained using the second sample set.
In a possible implementation, the task target of the model to be trained includes one or more of the application scenario of the model after training, the event type to be implemented by the model after training, and the training accuracy target of the model.
In a possible implementation, the adjustment unit is specifically configured to determine, according to the task target of the model to be trained and the difficulty weight distribution of the samples in the first sample set, the target difficulty weight distribution that the sample set used for training the model should reach; and to increase or decrease the number of samples in the first sample set, or change the difficulty weights of some samples in the first sample set, to obtain the second sample set, where the difficulty weight distribution of the samples in the second sample set is equal to or approximates the target difficulty weight distribution.
In a possible implementation, the determination unit is specifically configured to input each sample of the first sample set into a feature extraction model to obtain the feature information of each sample, where each sample corresponds to one class; to determine the reference feature information of each of the multiple classes of samples in the first sample set according to the feature information of each sample, where each class includes at least one sample of the same class; to determine the difficulty weight corresponding to each sample based on the similarity between the feature information of each sample and the reference feature information of the corresponding class; and to obtain the difficulty weight distribution of the samples in the first sample set according to the difficulty weight of each sample in the first sample set.
In a possible implementation, before the model to be trained is trained with the second sample set, the adjustment unit is further configured to adjust the weight parameters of the loss function of the model to be trained according to the difficulty weight distribution of the samples in the second sample set.
In a third aspect, a computer program product is provided, including a computer program; when the computer program is read and executed by a computing device, the method described in the first aspect is implemented.
In a fourth aspect, a computer-readable storage medium is provided, including instructions; when the instructions run on a computing device, the computing device implements the method described in the first aspect.
In a fifth aspect, a computing device is provided, including a processor and a memory; the processor executes code in the memory to implement the method described in the first aspect.
In a sixth aspect, a chip is provided, including a memory and a processor; the memory is coupled to the processor, the processor includes a modem processor, the memory is configured to store computer program code including computer instructions, and the processor reads the computer instructions from the memory so that the chip performs the method described in the first aspect.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below.
FIG. 1 is a schematic structural diagram of a training and prediction system;
FIG. 2 is an example diagram of hard samples in an application scenario;
FIG. 3 is a schematic structural diagram of a training device provided by this application;
FIG. 4 is a schematic flowchart of a training method provided by this application;
FIG. 5 is a schematic structural diagram of a convolutional neural network;
FIG. 6 is an example diagram of the reference feature information of each class of samples in an application scenario;
FIG. 7 is an example diagram of the data distribution of the first sample set and the data distribution of the second sample set in an application scenario;
FIG. 8 is a schematic flowchart of the training method provided by this application in an application scenario;
FIG. 9 is a schematic structural diagram of a chip provided by this application;
FIG. 10 is a schematic structural diagram of a computing device provided by this application.
Detailed Description
The terms used in this part of this application are only used to explain specific embodiments of this application and are not intended to limit this application.
First, some terms involved in this application are explained.
Loss function: a loss function is used to estimate the degree of inconsistency between the predicted value f(x) of a model and the true value y, and is usually a non-negative real-valued function. The smaller the value of the loss function, the better the robustness of the model; the loss function is generally used to adjust the learning direction of the network. For example, in a 5-class classification problem where an input image belongs to class 4, the true value of the image may be y = (0, 0, 0, 1, 0); if the prediction result of the model is f(x) = (0.1, 0.15, 0.05, 0.6, 0.1), the value of the loss function is -log(0.6). If the threshold of the loss function value is -log(0.9), the model still needs further training; by adjusting the learning direction of the network through the loss function, a model with good final performance can be obtained. The above loss function formula is for illustration only, and this application does not limit the specific formula of the loss function.
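The 5-class example above can be checked numerically. This minimal sketch follows the -log form of the loss used in the example:

```python
import math

y = (0, 0, 0, 1, 0)                # ground truth: class 4 of 5
f_x = (0.1, 0.15, 0.05, 0.6, 0.1)  # model prediction f(x)

# The example's loss is -log of the probability assigned to the true class.
true_class = y.index(1)
loss = -math.log(f_x[true_class])  # -log(0.6)

threshold = -math.log(0.9)         # the example's loss threshold
needs_more_training = loss > threshold
```

Since -log(0.6) ≈ 0.51 exceeds the threshold -log(0.9) ≈ 0.105, the model in the example indeed needs further training.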
Feature extraction: a method of transforming a measurement so as to highlight its representative features.
Back propagation: a neural network can use the back propagation (BP) algorithm to correct the parameters of the initial neural network model during training, so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, forward propagation of the input signal to the output produces an error loss, and the parameters of the initial model are updated by back-propagating the error loss information (such as the value of the loss function), so that the error loss converges. The back propagation algorithm is a back propagation movement dominated by the error loss and aims to obtain the optimal parameters of the neural network model, such as the weight matrices.
Next, the application scenarios involved in this application are explained.
Artificial Intelligence (AI): the theory, methods, technologies, and application systems that use digital computers, or computing devices controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. The application scenarios of artificial intelligence are very wide, such as face recognition, vehicle recognition, pedestrian re-identification, data processing applications, and so on.
An AI model is a set of mathematical methods for realizing AI. An AI model can be trained with a large number of samples so that the trained model acquires predictive ability. For example, to train a model for classifying spam, in the training stage a sample set labeled with multiple spam labels and multiple non-spam labels can be used to train a neural network; the neural network continuously captures the relationship between these emails and the labels to adjust and refine its structural parameters, and then in the prediction stage the neural network can classify new, unlabeled emails as spam or not. It should be understood that the above example is for illustration and does not constitute a specific limitation.
The structure of the training and prediction system of an AI model is explained below. As shown in FIG. 1, FIG. 1 is an architecture diagram of an AI model training and prediction system. The system 100 is a system architecture commonly used in the AI field and includes: a training device 200, an execution device 100, a database 130, a client device 140, and a data collection device 150. The components of the system 100 can be connected to each other through a network, which may be a wired network, a wireless network, or a mixture of the two. Among them:
The training device 200 may be a physical server, such as an X86 server or an ARM server, or a virtual machine (VM) implemented on a general physical server combined with Network Functions Virtualization (NFV) technology. A virtual machine refers to a complete, software-simulated computer system with complete hardware system functions that runs in a completely isolated environment, such as a virtual machine in a cloud data center, which is not specifically limited in this application.
The training device 200 is configured to train the model to be trained using the sample set in the database 130 to obtain the target model, and to send it to the execution device 100. Specifically, during training, the training device 200 may compare the output data of the model to be trained with the labels of the sample data, and continuously adjust the structural parameters of the model according to the comparison result until the difference between the output data and the labels of the sample data is smaller than a certain threshold, thereby completing the training of the model and obtaining the target model. The model to be trained and the target model here may be any kind of AI model, such as the neural network model for classifying spam in the above example, an image classification model, a semantic recognition model, and so on, which is not specifically limited in this application. The sample set maintained in the database 130 does not necessarily all come from the data collection device 150; it may also be received from other devices. The database 130 may be a local database or a database on the cloud or another third-party platform, which is not specifically limited in this application.
The execution device 100 may be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, an augmented reality/virtual reality device, or a vehicle-mounted terminal, or may be a server, a cloud device, or the like, which is not specifically limited in this application.
The execution device 100 is configured to implement various functions according to the target model trained by the training device 200. Specifically, in FIG. 1, a user can input data to the execution device 100 through the client device 140, and the target model is used to make predictions on the input data to obtain output results. The execution device 100 may return the output result to the client device 140 for the user to view; the specific presentation form may be display, sound, action, or the like. The execution device 100 may also store the output result as a new sample in the database 130, so that the training device 200 can use the new sample to readjust the structural parameters of the target model and improve its performance.
For example, if the client device 140 is a mobile phone terminal, the execution device 100 is a cloud device, and the trained target model is a semantic recognition model, the user can input text data to be recognized to the execution device through the client device 140; the execution device 100 performs semantic recognition on the text data through the target model and returns the result to the client device 140, so that the user can view the semantic recognition result on the user device (the mobile phone terminal).
It is worth noting that FIG. 1 is only a schematic diagram of a system architecture provided by an embodiment of this application, and the positional relationships between the devices, components, modules, and the like shown in the figure do not constitute any limitation. For example, in FIG. 1, the database 130 is an external memory relative to the training device 200; in other cases, the database 130 may also be placed in the training device 200, which is not specifically limited in this application.
In summary, the implementation of various applications in the AI field relies on AI models, which implement different functions such as classification, recognition, and detection. An AI model needs to be pre-trained with a sample set before it can be deployed in the execution device. When training an AI model with the sample data of a sample set, hard samples often contribute more than simple samples. Hard samples are samples that are difficult for the AI model to discriminate, and they fall into two categories: one category consists of blurry, overexposed samples with unclear contours, which will cause prediction errors regardless of the algorithm or initialization parameters of the AI model; the other category consists of samples that are very similar to other samples and therefore difficult for the AI model to distinguish. Samples of the latter category are hard samples only for the current AI model, not for all AI models. For example, as shown in FIG. 2, when training an AI model for recognizing the "Chihuahua" pet dog, sample 1, sample 3, and sample 5 labeled "cookie" in FIG. 2 are difficult to distinguish from the contour and shape of a "Chihuahua" and are therefore hard samples. When training an AI model for recognizing "cats", sample 1, sample 3, and sample 5 labeled "cookie" in FIG. 2 are well distinguishable from "cats" and are therefore not hard samples. It should be understood that FIG. 2 is for illustration only and does not constitute a specific limitation.
During the training of an AI model, even a large number of simple samples can hardly bring a substantial improvement to the prediction accuracy of the model, whereas hard samples often do. Therefore, how to screen out hard samples from a large number of training samples for reinforced training of the AI model has always been a problem of great concern to researchers.
Generally speaking, hard samples can be obtained by manual labeling or machine labeling. Manually labeling hard samples is not only a project that wastes manpower and time, but the labeling accuracy also cannot be guaranteed due to personal cognitive bias, work fatigue, and other reasons. Moreover, a computing device obtains sample features by examining each pixel, so samples that do not look similar to the human eye may still be hard samples for an AI model, resulting in poor accuracy of manually labeled hard samples.
Machine labeling of hard samples, though simple and fast, has poor labeling accuracy. If only mispredicted samples are labeled as hard samples, many hard samples will be missed, because correctly predicted samples may also be hard samples. For example, sample A has the label (0, 1), indicating that it belongs to class 2. If the prediction vector of sample A after being input into classification model M1 is (0.4, 0.6), the classification result of model M1 shows that sample A belongs to class 2 and is correct, but the gap between the prediction vector (0.4, 0.6) and the sample label (0, 1) is large, and the value of the loss function is also large; although sample A is correctly classified, it is still a sample that model M1 finds difficult to discriminate, i.e., a hard sample. Therefore, taking misclassified samples as hard samples yields poor labeling accuracy. If samples with large loss function values are taken as hard samples, some simple samples may also be wrongly labeled as hard samples. As mentioned above, the loss function is used to estimate the degree of inconsistency between the predicted value of the model and the true value, and there are many reasons for this inconsistency: the sample may indeed be a hard sample, or the chosen model structure or training method may be flawed while the sample itself is not difficult to discriminate. Therefore, taking samples with large loss function values as hard samples also yields poor labeling accuracy.
In summary, during the training of AI models, manually screening hard samples wastes manpower and time, while the accuracy of labeling hard samples with computing devices is poor. The fact that hard samples are difficult to label causes a bottleneck in the training accuracy of AI models.
To solve the above problem that hard samples are difficult to label and thus the training accuracy of AI models hits a bottleneck, this application provides a training device 200, which can be applied to the AI model training and prediction system shown in FIG. 1. As shown in FIG. 3, the training device 200 may include an obtaining unit 210, a determination unit 220, an adjustment unit 230, a database 140, a database 150, and a training unit 240.
The obtaining unit 210 is configured to obtain a first sample set, where the first sample set includes multiple samples.
The determination unit 220 is configured to determine the difficulty weight distribution of the samples in the first sample set.
The higher the difficulty weight of a sample, the more it is a hard sample for the model to be trained; the lower its difficulty weight, the more it is a simple sample. The difficulty weight distribution of samples refers to the ratio of the number of samples corresponding to each difficulty weight. For example, if sample set A contains 1000 samples with difficulty weight 1, 2000 samples with difficulty weight 2, and 3000 samples with difficulty weight 3, the difficulty weight distribution of the samples in sample set A is 1:2:3. It should be understood that the above example is for illustration only and does not constitute a specific limitation.
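The distribution in the sample-set-A example can be computed as follows. The per-sample weight list below is fabricated to match the stated counts (1000, 2000, and 3000 samples with weights 1, 2, and 3):

```python
from collections import Counter
from functools import reduce
from math import gcd

# Hypothetical per-sample difficulty weights for sample set A.
weights = [1] * 1000 + [2] * 2000 + [3] * 3000

# Count samples per difficulty weight, then reduce the counts
# to the smallest integer ratio, as in the 1:2:3 example above.
counts = Counter(weights)
ordered = [counts[w] for w in sorted(counts)]   # [1000, 2000, 3000]
g = reduce(gcd, ordered)
distribution = tuple(c // g for c in ordered)   # (1, 2, 3)
```

The resulting ratio (1, 2, 3) matches the stated difficulty weight distribution 1:2:3.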
In a specific implementation, the determination unit 220 may use the feature extraction model in the database 150 to determine the difficulty weight distribution of the samples in the first sample set. Specifically, the determination unit 220 may use the feature extraction model in the database 150 to perform feature extraction on each sample in the first sample set to obtain the feature information of each sample, then determine the reference feature information of each class according to the feature information of each sample in that class, and finally determine the difficulty weight of each sample according to the similarity between its feature information and the reference feature information of the corresponding class. For example, the determination unit 220 may input the first sample set into the feature extraction model in the database 150 to obtain the feature vector of each sample, take the average vector of the feature vectors of each class as the reference feature information of that class, and finally determine the difficulty weight of each sample according to the similarity or distance between its feature vector and the average vector of the corresponding class.
The adjustment unit 230 is configured to adjust the difficulty weight distribution of the first sample set according to the difficulty weight of each sample and the task target of the model to be trained, to obtain a second sample set.
In an embodiment, the adjustment unit 230 may first determine, according to the task target of the model to be trained, the target difficulty weight distribution that the sample set used for training the model should reach, and then, according to the difficulty weight distribution of the samples in the first sample set, increase or decrease the number of samples in the first sample set, or change some samples in the first sample set, to obtain the second sample set, so that the difficulty weight distribution of the samples in the second sample set is equal to or approximates the target difficulty weight distribution.
For example, if the first sample set contains samples with three difficulty weights (α1, α2, and α3) and its difficulty weight distribution is α1:α2:α3 = 1:2:3, the adjustment unit 230 may first determine, according to the difficulty of the task target of the model to be trained, that the target difficulty weight distribution is α1:α2:α3 = 1:1:1, and then adjust the first sample set by reducing the number of samples with difficulty weights α2 and α3, or by increasing the samples with difficulty weight α1, so that the difficulty weight distribution of the first sample set becomes 1:1:1, thereby obtaining the second sample set. It should be understood that the above example is for illustration only, and this application does not limit the number of difficulty weights.
Understandably, when training a model for a simple task target, such as face recognition for an indoor gate scenario, the second sample set used for training can contain more samples with small difficulty weights; in this way, training with a large number of simple samples and a smaller number of hard samples for auxiliary training can improve the training speed while achieving the task target. Conversely, when training a model for a complex task target, such as face recognition in an outdoor video surveillance scenario, the second sample set used for training can contain more samples with large difficulty weights; in this way, training with a large number of hard samples and a smaller number of simple samples for auxiliary training allows the model to focus more on learning from hard samples and improves its training accuracy in a targeted manner, thereby achieving the purpose of reinforced training.
The training unit 240 is configured to train the model to be trained using the second sample set, to obtain a trained target model.
In a specific implementation, before the training unit 240 trains the model with the second sample set, it may first adjust the weight parameters of the loss function of the model according to the difficulty weight of each sample in the second sample set, and then, when training the model with the second sample set, perform back-propagation supervised training on the model according to the loss function to obtain the target model. In the loss function of the model to be trained, the difficulty weight of each sample in the second sample set is directly proportional to the value of the loss function, so that hard samples with large difficulty weights have a greater influence on the loss function; the AI model can thus concentrate more on learning the features of hard samples and is more inclined to update its parameters with hard samples, thereby achieving the purpose of reinforced training on hard samples and improving the performance of the model to be trained.
It should be noted that the positional relationships between the devices and units shown in FIG. 3 do not constitute any limitation. For example, in FIG. 3, the database 130 is an external memory relative to the training device 200; in other cases, the database 130 may also be placed in the training device 200. The database 140 and the database 150 are internal memories relative to the training device 200; in other cases, the database 140 and/or the database 150 may also be placed in an external memory.
In summary, the training device 200 provided by the embodiments of this application can, before training the model to be trained, first determine the difficulty weight distribution of the samples in the first sample set, then adjust the first sample set according to the task target of the model to be trained and the above difficulty weight distribution to obtain the second sample set, and finally train the model using the second sample set. In this way, during training, the training device 200 can combine the complexity of the task target of the model with the difficulty weight of each sample and select an appropriate number of hard samples for training, which solves the problem that hard samples are difficult to label and thus the training accuracy of AI models hits a bottleneck, so that the training accuracy of AI models is improved.
The training method provided by this application is described in detail below. The method is applied to the training device 200 in the embodiment of FIG. 3. As shown in FIG. 4, the method may include the following steps:
S210: The training device 200 obtains a first sample set, where the first sample set includes multiple samples.
A sample may be of any form, such as an image sample, a text sample, a speech sample, a biological data (e.g., fingerprint, iris) sample, and so on. The first sample set may include samples of multiple classes; for example, one class may consist entirely of "cookie" images, one class may consist entirely of images of the same face from various angles, and one class may consist entirely of images of the same vehicle model from different angles and in different scenes. The first sample set can be classified according to the task target of the model to be trained. For example, if the task target is face recognition, face images of the same person can be grouped into one class, e.g., class 1 is Xiao Ming's face photos and class 2 is Xiao Gang's face photos. It should be understood that the above examples are for illustration only and do not constitute a specific limitation.
S220: The training device 200 determines the difficulty weight distribution of the samples in the first sample set.
In an embodiment, after feature extraction is performed on each sample by a feature extraction model, the difficulty weight of each sample can be determined according to the extracted feature information, and the difficulty weight distribution of the samples in the first sample set is then obtained. Specifically, each sample of the first sample set may be input into the feature extraction model to obtain the feature information of each sample, where each sample corresponds to one class; the reference feature information of each of the multiple classes of samples in the first sample set is then determined according to the feature information of each sample, where each class includes at least one sample of the same class; the difficulty weight corresponding to each sample is determined based on the similarity between the feature information of each sample and the reference feature information of the corresponding class; and the difficulty weight distribution of the samples in the first sample set is obtained according to the difficulty weight of each sample. Step S220 will be described in steps S221-S224 below.
S230: The training device 200 adjusts the first sample set according to the task target of the model to be trained and the difficulty weight distribution of the samples in the first sample set, to obtain a second sample set.
The task target of the model to be trained includes one or more of the application scenario of the model after training, the event type to be implemented by the model after training, and the training accuracy target of the model. For example, a face recognition model for a video surveillance scenario and a face recognition model for a mobile-phone-unlocking scenario require different target difficulty weight distributions of training samples; the event type of identity recognition and the event type of clothing recognition require different target difficulty weight distributions when training the model; and models with low and high training accuracy targets also require different target difficulty weight distributions during training. It should be understood that the above examples are for illustration only and do not constitute a specific limitation.
Understandably, when training a model for a simple task target, such as face recognition for an indoor gate scenario, the second sample set used for training can contain more samples with small difficulty weights; in this way, training with a large number of simple samples and a smaller number of hard samples for auxiliary training can improve the training speed while achieving the task target. Conversely, when training a model for a complex task target, such as face recognition in an outdoor video surveillance scenario, the second sample set used for training can contain more samples with large difficulty weights; in this way, training with a large number of hard samples and a smaller number of simple samples for auxiliary training allows the model to focus more on learning from hard samples and improves its training accuracy in a targeted manner, thereby achieving the purpose of reinforced training.
In a specific implementation, the training device 200 may maintain a correspondence library that stores the correspondences between multiple task targets and multiple target difficulty weight distributions. In this way, after the training device 200 determines the difficulty weight distribution of the first sample set in the database 130, it can determine the target difficulty weight distribution corresponding to the task target of the model to be trained according to that task target and the correspondence library, and then adjust the difficulty weight distribution of the first sample set according to the gap between the difficulty weight distribution of the first sample set and the target difficulty weight distribution, to obtain the second sample set used for training the model. It should be noted that the correspondence library may be stored in the internal memory of the training device 200 or in its external memory, which may be determined by the processing and storage capabilities of the training device and is not specifically limited in this application.
In an embodiment, when the first sample set is adjusted according to the target difficulty weight distribution, the number of samples in the first sample set may be increased or decreased, or the difficulty weights of some samples in the first sample set may be changed, to obtain the second sample set. For example, suppose the task target is face recognition in an outdoor video surveillance scenario and the target difficulty weight distribution required for this task target, determined from the correspondence library, is hard samples : simple samples = 3:2, where hard samples are samples whose difficulty weight α is higher than a first threshold h1 and simple samples are samples whose difficulty weight α is lower than a second threshold h2. Suppose also that the first sample set P1 contains 10000 samples, of which 3000 are hard samples and 7000 are simple samples, i.e., the difficulty weight distribution of the first sample set is hard samples : simple samples = 3:7. When adjusting the difficulty weight distribution of the first sample set, the 3000 hard samples can be expanded to 6000 hard samples by data augmentation, and 4000 simple samples can be randomly selected from the 7000 simple samples; these 6000 hard samples and 4000 simple samples form the second sample set P2, whose difficulty weight distribution is hard samples : simple samples = 3:2.
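The count adjustment in the P1 → P2 example can be sketched as follows. Doubling the hard samples by augmentation and randomly subsampling the easy ones is the specific strategy described in the example; anchoring the easy-sample count to the augmented hard-sample count is an assumption of this sketch:

```python
import random

def adjust_counts(n_hard, n_easy, target_hard, target_easy):
    """Return (new_hard, new_easy) matching the target ratio.
    ASSUMPTION: hard samples are doubled by data augmentation, and the
    easy-sample count is derived from the augmented hard-sample count."""
    new_hard = n_hard * 2                            # augment 3000 -> 6000
    new_easy = new_hard * target_easy // target_hard # 6000 * 2 / 3 = 4000
    return new_hard, new_easy

new_hard, new_easy = adjust_counts(3000, 7000, target_hard=3, target_easy=2)

# Randomly keep new_easy of the 7000 simple samples (indices stand in
# for the actual samples here).
easy_pool = list(range(7000))
kept_easy = random.sample(easy_pool, new_easy)
```

The resulting counts (6000 hard, 4000 easy) reproduce the 3:2 target distribution of P2.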
It should be noted that, when the first sample set is adjusted according to the target difficulty weight distribution, the difficulty weight distribution of the samples in the resulting second sample set may be equal to the target difficulty weight distribution, or may approximate it. Approximating the target difficulty weight distribution means that the difference between the difficulty weight distribution of the second sample set and the target difficulty weight distribution is smaller than a third threshold h3. For example, if the third threshold h3 = 0.2 and, as in the example above, the target difficulty weight distribution is hard samples : simple samples = 3:2 = 1.5 and the difficulty weight distribution of the first sample set is hard samples : simple samples = 3:7, then after adjusting the first sample set, the difficulty weight distribution of the obtained second sample set may also be 8:5 = 1.6, where the difference between the difficulty weight distribution of the second sample set and the target difficulty weight distribution, 1.6 - 1.5 = 0.1, is smaller than the third threshold h3 = 0.2. It should be understood that the above example is for illustration only and does not constitute a specific limitation.
As another example, as shown in the bar chart on the left of FIG. 7, after the training device 200 counts the difficulty weight distribution of the entire first sample set based on the difficulty weight of each sample, the number of samples with difficulty weight α1 = 1 is 3000, with α2 = 2 is 2500, with α3 = 3 is 2000, with α4 = 4 is 1000, and with α5 = 5 is 500; that is, the difficulty weight distribution of the first sample set is α1:α2:α3:α4:α5 = 6:5:4:2:1. Assume the target difficulty weight distribution required by the current task target is as shown in the bar chart on the right of FIG. 7, i.e., α1:α2:α3:α4:α5 = 25:25:20:18:16; that is, 2500 samples with difficulty weight 1, 2500 with weight 2, 2000 with weight 3, 1800 with weight 4, and 1600 with weight 5. The numbers of samples with difficulty weights 4 and 5 are insufficient, so samples with difficulty weights 4 and 5 can be added, and the difficulty weight distribution of the first sample set can finally be adjusted to that shown in the bar chart on the right of FIG. 7, thereby obtaining the second sample set. It should be understood that FIG. 7 is for illustration only and does not constitute a specific limitation.
In a specific implementation, increasing the number of samples in the first sample set, or changing the difficulty weights of some samples, can be achieved through data augmentation. Data augmentation may mean randomly perturbing some hard samples or simple samples to obtain more hard samples or simple samples, where the random perturbation includes adding noise points, changing illumination information, changing environment information (such as weather, background, and time), and so on. Data augmentation may also mean inputting some hard samples or simple samples into Generative Adversarial Networks (GAN) to obtain more hard samples or simple samples. A GAN may include a discriminative network and a generative network, where the generative network is used to generate pictures from input data and the discriminative network is used to judge whether an input picture is a real picture. During the training of a GAN, the goal of the generative network is to generate pictures as realistic as possible so that the discriminative network judges them as real, while the goal of the discriminative network is to make judgments as accurate as possible, i.e., to judge the pictures generated by the generative network as fake. The two networks form a dynamic "game" process, and the finally trained GAN can generate pictures that "pass for real", thereby obtaining more hard samples or simple samples.
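A minimal sketch of the random-perturbation form of data augmentation described above, assuming samples are flat lists of numeric feature/pixel values and that additive Gaussian noise stands in for the noise-point perturbation (illumination or background changes would be handled analogously):

```python
import random

def augment_with_noise(sample, noise_std=0.05, seed=None):
    """Random-perturbation augmentation: add Gaussian noise to each value.
    The noise_std value is illustrative, not taken from the text."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, noise_std) for x in sample]

# One hypothetical hard sample expanded into three perturbed copies,
# as in the 3000 -> 6000 hard-sample expansion described earlier.
hard_sample = [0.1, 0.5, 0.9, 0.3]
augmented = [augment_with_noise(hard_sample, seed=i) for i in range(3)]
```

Each perturbed copy keeps the sample's dimensionality but differs slightly from the original, which is what makes the augmented copies usable as additional hard samples.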
S240: The training device 200 trains the model to be trained using the second sample set.
In an embodiment, before the model is trained with the second sample set, the weight parameters of the loss function of the model are adjusted according to the difficulty weights of the samples in the second sample set; then, when the model is trained with the second sample set, back-propagation supervised training can be performed on the model according to the above loss function to obtain the trained model. In this loss function, the difficulty weight of each sample is directly proportional to the value of the loss function; therefore, the larger the difficulty weight of a sample, the larger the loss function value obtained after the sample is input into the model, and using this loss function for back-propagation supervised training makes the model more inclined to update its parameters with hard samples.
Specifically, if the loss function commonly used for the task target of the model to be trained is Loss0, the formula of the loss function Loss1 of the model may be as follows:
Loss1 = α_i · Loss0        (1)
In this way, hard samples with large difficulty weights have a greater influence on the loss function. When the above loss function is used for back-propagation supervised training of the model to be trained, the model can concentrate more on learning the features of hard samples and is more inclined to update its parameters with hard samples, thereby achieving the purpose of reinforced training on hard samples and improving the model's ability to express the features of hard samples. It should be understood that the above formula is for illustration only; the formula of the loss function Loss1 may also be any other formula in which Loss1 is directly proportional to α_i, which is not specifically limited in this application.
For example, if the formula of Loss0 is:
[Formula 2, rendered as an image in the original (reference PCTCN2021091597-appb-000001): the expression of Loss0]
where w and b are the parameters of the model to be trained, x is the input data, y is the output data, m is the number of input data, and n is the number of classes classified by the model; for example, if the model is a five-class classification model, then n = 5. During the training of the model, the formula of Loss1 may be:
[Formula 3, rendered as an image in the original (reference PCTCN2021091597-appb-000002): the expression of Loss1, i.e., Loss0 weighted by the difficulty weight α_i]
It should be understood that the above formulas are for illustration only; the specific formula of Loss0 may be any of the various loss formulas existing in the industry, such as the mean squared error loss function, the cross-entropy loss function, and so on, which is not specifically limited in this application.
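Since Loss0 and Loss1 are rendered as images in the original, the sketch below ASSUMES Loss0 is the standard multi-class cross-entropy, which is consistent with the surrounding definitions of w, b, x, y, m, and n, and weights each sample's term by its difficulty weight α_i as in formula (1):

```python
import math

def cross_entropy(y_true, y_pred):
    """Assumed Loss0 for one sample: -sum_j y_j * log(f(x)_j)."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

def weighted_batch_loss(labels, preds, alphas):
    """Assumed form of Loss1: mean over m samples of alpha_i * Loss0_i."""
    m = len(labels)
    return sum(a * cross_entropy(y, p)
               for y, p, a in zip(labels, preds, alphas)) / m

# Two-class toy batch (m = 2, n = 2); the first sample is a hard sample
# with difficulty weight 3, so its error term dominates the loss.
labels = [(0, 1), (1, 0)]
preds = [(0.2, 0.8), (0.6, 0.4)]
alphas = [3.0, 1.0]
loss = weighted_batch_loss(labels, preds, alphas)
```

With equal weights the two samples would contribute comparably; with α = (3, 1) the hard sample's term is tripled, which is exactly the effect formula (1) is meant to have on back propagation.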
The specific process by which the training device 200 determines the difficulty weight distribution of the samples in the first sample set in step S220 above is described in detail below. This step can be divided into the following steps:
Step S221: Input each sample of the first sample set into the feature extraction model to obtain the feature information of each sample, where each sample corresponds to one class.
In a specific implementation, the feature information of each sample extracted by the feature extraction model may specifically be a feature vector or a feature matrix; for ease of understanding, the following uniformly takes a feature vector as an example of feature information. A feature vector is a numerical feature of a sample expressed in vector form, which can effectively represent the sample's characteristics. Usually a feature vector is multi-dimensional, such as a 512-dimensional or 1024-dimensional vector; this application does not limit the specific dimensionality. It should be noted that a feature extraction model extracts a certain type of feature from a sample, and different feature extraction models extract different feature vectors from the same sample: a feature extraction model for extracting face attributes can extract features such as the eyes, nose, and mouth of sample A, while one for extracting vehicle attributes can extract features such as the wheels and steel material of sample A. Therefore, the feature extraction model can be determined according to the task target of the model to be trained: if the model to be trained is a face recognition network, the feature extraction model used in step S221 is one for extracting face attribute features; if the model to be trained is a vehicle recognition network, the feature extraction model used in this step is one for extracting vehicle attribute features. It should be understood that the above examples are for illustration only and do not constitute a specific limitation.
Understandably, the feature vectors obtained after simple samples and hard samples are input into the feature extraction model are different: the feature vectors extracted from simple samples are of good quality, while those extracted from hard samples are of poor quality. The quality of a feature vector depends on its ability to distinguish image samples of different classes. Good features should be informative and unaffected by noise and a series of transformations, so that the class of the sample can be quickly obtained after the features are input into a classifier; conversely, poor-quality features lack information, and it is difficult to determine the class of the sample after they are input into a classifier. For example, when a feature extraction model for extracting face attributes performs feature extraction on a simple sample, it can easily extract features indicating that the sample contains eyes, a nose, and a mouth, whereas for a hard sample it is difficult to extract features indicating whether the sample contains eyes, a nose, and a mouth. Therefore, the feature vectors of simple samples should be similar to each other, while the feature vectors of hard samples differ from those of simple samples.
In a specific implementation, the feature extraction model in the database 150 is used to extract the feature information of samples and may be an AI model trained before step S210. The feature extraction model may be any of the existing AI models used in the industry for extracting sample features, such as the Histogram of Oriented Gradient (HOG) feature descriptor for object detection, the Local Binary Pattern (LBP), the convolutional layers of a convolutional neural network, and so on, which is not specifically limited in this application. Moreover, the sources of the above sample sets may include mobile phones or surveillance cameras, local offline data, public Internet data, and so on, which is not specifically limited in this application.
下面以卷积神经网络为例,对特征提取模型进行举例说明。
卷积神经网络(Convolutional Neuron Network,CNN)是一种带有卷积结构的深度神经网络,是一种深度学习(Deep Learning)架构,深度学习架构是指通过计算设备学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(Feed-Forward)人工神经网络,该前馈人工神经网络中的各个神经元对输入其中的图像中的重叠区域作出响应。如图5所示,卷积神经网络(CNN)300可以包括输入层310,卷积层/池化层320以及神经网络层330,其中,池化层为可选的网络层。
(1)卷积层/池化层320:如图5所示卷积层/池化层320可以包括如示例321-326层,在一种实现中,321层为卷积层,322层为池化层,323层为卷积层,324层为池化层,325为卷积层,326为池化层;在另一种实现方式中,321、322为卷积层,323为池化层,324、325为卷积层,326为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
以卷积层321为例,卷积层321可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素,这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(Depth Dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用维度相同的多个权重矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化等等,该多个权重矩阵维度相同,经过该多个维度相同的权重矩阵提取后的特征图维度也相同,再将提取到的多个维度相同的特征图合并形成卷积运算的输出,获得最终的特征向量。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以从输入图像中提取特定信息,生成特征向量,再将特征向量输入神经网络层进行分类处理,从而帮助卷积神经网络300进行正确的预测。
当卷积神经网络300有多个卷积层的时候,初始的卷积层(例如321)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络300深度的加深,越往后的卷积层(例如326)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
(2)池化层：由于常常需要减少训练参数的数量，因此卷积层之后常常需要周期性的引入池化层，即如图5中320所示例的321-326各层，可以是一层卷积层后面跟一层池化层，也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中，池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子，以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外，就像卷积层中用权重矩阵的大小应该与图像大小相关一样，池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸，池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
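上述池化操作可以用如下纯Python小例子示意（仅为帮助理解的草图，并非本申请实施例的实现）：4×4的输入经步长为2的2×2池化后，空间尺寸减半为2×2：

```python
def pool2x2(image, op):
    # 对二维矩阵做步长为2的2x2池化，op 为窗口内的聚合函数
    h, w = len(image), len(image[0])
    out = []
    for i in range(0, h, 2):
        row = []
        for j in range(0, w, 2):
            window = [image[i][j], image[i][j + 1],
                      image[i + 1][j], image[i + 1][j + 1]]
            row.append(op(window))
        out.append(row)
    return out

img = [[1, 3, 2, 0],
       [5, 6, 1, 2],
       [7, 2, 9, 4],
       [3, 1, 0, 8]]
max_pooled = pool2x2(img, max)                   # 最大池化算子
avg_pooled = pool2x2(img, lambda w: sum(w) / 4)  # 平均池化算子
```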
(3)神经网络层330:
在经过卷积层/池化层320的处理后,卷积神经网络300还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层320只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或别的相关信息),卷积神经网络300需要利用神经网络层330来生成一个或者一组所需要的类的数量的输出。因此,在神经网络层330中可以包括多层隐含层(如图5所示的331、332至33n)以及输出层340,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到。
在神经网络层330中的多层隐含层之后,也就是整个卷积神经网络300的最后层为输出层340,该输出层340具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络300的前向传播(如图5由310至340的传播为前向传播)完成,反向传播(如图5由340至310的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络300的损失及卷积神经网络300通过输出层输出的结果和理想结果之间的误差。
综上可知,输入层310、卷积层/池化层320用于提取样本特征,获得样本的特征向量,神经网络层330用于根据卷积层/池化层320提取的特征向量对输入图像进行分类,因此,本申请所需的特征提取模型,可以简单理解为只包含卷积层/池化层320、不包含神经网络层330的卷积神经网络。应理解,上述举例仅用于说明,并不能构成具体限定。
步骤S222:根据每个样本的特征信息,确定第一样本集中多类样本的参考特征信息,其中,每类样本包括至少一个类别相同的样本。
举例来说，假设第一类样本中的样本数量为n个，该类样本中每一个样本的特征信息分别为特征向量B 1，B 2，…，B n，那么该类样本的参考特征信息可以是这n个向量的平均向量A，也可以是n个向量中最接近上述平均向量A的一个向量B j，其中，j∈{1,2,…,n}，同理，可以获得其他类别样本的参考特征信息，当参考特征信息用向量的形式表示时，参考特征信息也称为参考特征向量。举例来说，如果每个样本的特征信息是512维的特征向量，那么将步骤S221获得的多维特征向量映射到2D空间，以坐标点的形式绘制在平面直角坐标系中，每类样本的参考特征信息可以如图6所示。应理解，图6仅用于举例说明，每类样本的参考特征信息还可以是将每类样本的特征向量映射到2D空间之后，将分布最密集的区域的点对应的特征向量确定为该类样本的参考特征信息，本申请不对参考特征信息的确定方法进行限定。
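步骤S222中"平均向量"与"最接近平均向量的向量"两种参考特征信息的确定方式，可以用如下示意性的Python片段表示（仅为假设性的草图，并非本申请实施例的实现，示例向量为随意构造）：

```python
def mean_vector(vectors):
    # 同类样本特征向量的平均向量 A
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def nearest_to_mean(vectors):
    # n 个向量中最接近平均向量 A 的一个向量 B_j（此处以欧氏距离衡量）
    a = mean_vector(vectors)
    def dist2(v):
        return sum((x - y) ** 2 for x, y in zip(v, a))
    return min(vectors, key=dist2)

class_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
ref_mean = mean_vector(class_vectors)      # 平均向量作为参考特征向量
ref_near = nearest_to_mean(class_vectors)  # 或取最接近平均向量的样本向量
```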
步骤S223:基于每个样本的特征信息与对应类别的参考特征信息之间的相似度,确定每个样本对应的困难权重。
其中，每个样本的特征信息与对应类别的参考特征信息之间的相似度越大，该样本的困难权重越小，也就是说相似度与困难权重之间呈反比例关系。可以理解的，在特征信息是特征向量的情况下，每个样本的困难权重可以根据每个样本的特征向量与对应类别的参考特征向量之间的距离来确定，每个样本的特征向量与对应类别的参考特征向量之间的距离越大，表示该样本的特征向量与对应类别的参考特征向量之间的相似度越小，该样本的困难权重越大，也就是说距离与困难权重之间呈正比例关系。
举例来说,如果第一类样本中每一个样本输入特征提取模型后获得的特征向量为B 1,B 2,…,B n,参考特征向量为向量A,那么可以根据特征向量B 1与参考特征向量A之间的距离确定特征向量B 1的困难权重,根据特征向量B 2与参考特征向量A之间的距离确定特征向量B 2的困难权重,…,根据特征向量B n与参考特征向量A之间的距离确定特征向量B n的困难权重。以此类推,可以根据每个样本与对应类别的参考特征向量之间的距离,确定每个样本的困难权重。
具体实现中，一个样本的特征向量与参考特征向量之间的距离可以是余弦距离(Cosine Distance)、欧氏距离(Euclidean Distance)、曼哈顿距离(Manhattan Distance)、切比雪夫距离(Chebyshev Distance)、闵可夫斯基距离(Minkowski Distance)等等，一个样本的特征信息与参考特征信息之间的相似度可以是余弦相似度(Cosine Similarity)、调整余弦相似度(Adjusted Cosine Similarity)、皮尔森相关系数(Pearson Correlation Coefficient)、杰卡德相似系数(Jaccard Coefficient)等等，本申请不作具体限定。
举例来说，某类样本的参考特征向量为A，某个样本的特征向量为B i（i=1,2,…,n），那么参考特征向量A与特征向量B i之间的距离公式D i（余弦距离）可以是：
D i=1-(A·B i)/(‖A‖‖B i‖) i=1,2,…,n        (4)
基于公式(4)可以确定每个样本的特征向量B i与参考特征向量A之间的距离D i。应理解，上述公式4仅用于举例说明，并不能构成具体限定。
参考图5实施例可知,用于提取样本特征的特征提取模型包括多个用于提取特定特征的权重矩阵,每一个权重矩阵都可以提取到特定的颜色、特定的边缘信息等等,因此对于简单样本来说,权重矩阵都可以很好的提取到特定的颜色、特定的边缘信息等等,不同的简单样本提取得到的特征向量十分类似;而对于困难样本,权重矩阵可能无法提取到特定的颜色、特定的边缘信息等等,因此困难样本提取得到的特征向量与简单样本提取到的特征向量差距很大。这样,通过确定每一个样本提取的特征向量与参考特征向量之间的距离,可以很好的确定该样本的困难程度,样本的特征向量与参考特征向量之间的距离越大,样本的特征向量与参考特征向量相似度越小,表示该样本越属于困难样本,困难权重也就越大,相反,样本的特征向量与参考特征向量之间的距离越小,样本的特征向量与参考特征向量相似度越大,表示该样本越属于简单样本,困难权重也就越小。因此,困难权重α i的公式可以是:
α i=T×D i i=1,2,…,n        (5)
其中,T为大于1的常量,应理解,上述公式5仅用于说明,困难权重α的公式可以是其他困难权重α与距离D呈正比例关系的公式,本申请不作具体限定。
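公式4与公式5描述的由余弦距离计算困难权重的过程，可以用如下示意性的Python片段表示（仅为草图，并非本申请实施例的实现，T的取值2.0与示例向量均为举例假设）：

```python
import math

def cosine_distance(a, b):
    # 余弦距离 D_i = 1 - cos(A, B_i)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def difficulty_weight(a, b, t=2.0):
    # 公式5的示意实现：alpha_i = T * D_i，T 为大于1的常量
    return t * cosine_distance(a, b)

ref = [1.0, 0.0]
easy = [0.9, 0.1]   # 与参考特征向量相近的简单样本
hard = [0.1, 0.9]   # 偏离参考特征向量的困难样本
w_easy = difficulty_weight(ref, easy)
w_hard = difficulty_weight(ref, hard)
```

距离越大，困难权重越大，与前文"距离与困难权重呈正比例关系"的描述一致。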
同理可知，如果样本的困难权重α是根据样本的特征信息与参考特征信息之间的相似度S来确定的，困难权重的公式可以是：
α i=T-S i i=1,2,…,n        (6)
应理解,上述公式6仅用于说明,困难权重α的公式可以是其他困难权重α与相似度S呈反比例关系的公式,本申请不作具体限定。
在一实施例中,公式5和公式6中,困难权重α i中的常量T可以是可调的常量,具体地,在训练待训练模型的初期阶段,T可以是一个较大的常量,使得困难样本的困难权重更高,损失函数的值更大,待训练模型的学习重心越偏向于困难样本。在训练待训练模型的末期,T可以适当变小,因为此时AI模型已经趋向于收敛,可以不再需要偏向于耗时较多的困难样本,从而提高训练速度。
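上述"训练初期T较大、训练末期T适当变小"的调整策略，可以用一个简单的线性调度草图示意（调度方式与数值均为举例假设，并非本申请限定的调整方式）：

```python
def t_schedule(epoch, total_epochs, t_start=3.0, t_end=1.2):
    # 随训练进度将常量 T 从 t_start 线性衰减到 t_end：
    # 初期 T 较大，困难样本的困难权重更高；末期 T 变小，提高训练速度
    frac = epoch / max(1, total_epochs - 1)
    return t_start + (t_end - t_start) * frac
```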
步骤S224:根据第一样本集中每个样本的困难权重,获得第一样本集中样本的困难权重分布。
可以理解的，用特征提取模型提取样本集内每个样本的特征向量和每类样本的参考特征向量，再根据每个样本的特征向量与对应类别的参考特征向量之间的相似度或者距离，确定每个样本的困难权重，这样获得的第一样本集的困难权重分布是基于样本本身的特征获得的，与待训练模型的结构以及训练使用的方法无关，可以很好地反映出样本的困难程度，困难样本标注的精度很高，从而解决了由于困难样本难以标注使得AI模型的训练精度出现瓶颈的问题。
在一实施例中,训练设备获得第一样本集中样本的困难权重分布之后,可以将第一样本集的困难权重分布也存储在数据库130中,这样,当数据库130中存储了许多样本集的困难权重分布之后,如果训练设备需要训练AI模型时,根据待训练的AI模型的任务目标确定目标困难权重分布之后,可以直接从数据库130中获取接近于目标困难权重分布的样本集对待训练的AI模型进行训练。举例来说,数据库130存储3个样本集,分别为样本集X1、X2以及X3,数据库130还存储有样本集X1的困难权重分布Y1=1:1,样本集X2的困难权重分布Y2=1:2,以及样本集X3的困难权重分布Y3=1:5,训练设备200可以根据待训练模型的任务目标以及前述内容中的对应关系库,获得与该任务目标对应的目标困难权重分布Y0=1:6,然后在数据库130中获取困难权重分布与目标困难权重分布Y0最接近的样本集,也就是样本集X3。这样,训练设备200可以不用执行步骤S230调整困难权重分布,直接选择与目标困难权重分布相同或者相似的样本集作为第二样本集,对待训练的AI模型进行训练,进一步提高AI模型的训练速度。
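上述从数据库130中选取困难权重分布最接近目标困难权重分布的样本集的过程，可以用如下示意性的Python片段表示（样本集名称与比例取自上文举例，"以困难样本占比之差最小"的选取准则为举例假设）：

```python
def hard_fraction(ratio):
    # 困难权重分布以 (困难样本数, 非困难样本数) 的比值表示，
    # 换算为困难样本占比便于比较
    h, e = ratio
    return h / (h + e)

def pick_closest(sample_sets, target_ratio):
    # 从已存储的样本集中，选出困难权重分布最接近目标分布的一个
    target = hard_fraction(target_ratio)
    return min(sample_sets,
               key=lambda item: abs(hard_fraction(item[1]) - target))

# 数据库中 3 个样本集及其困难权重分布（困难:非困难）
stored = [("X1", (1, 1)), ("X2", (1, 2)), ("X3", (1, 5))]
name, ratio = pick_closest(stored, (1, 6))  # 目标困难权重分布 Y0=1:6
```

选中的样本集X3（分布1:5与目标1:6最接近）即可直接作为第二样本集使用。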
综上可知,本申请提供了一种模型训练方法,可以在对待训练模型进行训练之前,先确定第一样本集中样本的困难权重分布,然后根据待训练模型的任务目标和上述困难权重分布,对第一样本集进行调整,获得第二样本集,最后使用第二样本集对待训练模型进行训练。这样,训练设备200在训练待训练模型的过程中,可以结合待训练模型的任务目标的复杂程度和每个样本的困难权重,选择合适数量的困难样本进行训练,解决了困难样本难以标注导致AI模型的训练精度出现瓶颈的问题,使得AI模型的训练精度得到提升。
下面结合图8，对本申请提供的训练方法进行举例说明。如图8所示，假设当前待训练模型的任务目标为室外视频监控场景中的人脸识别，是一个较为复杂的任务场景，用于训练待训练模型的第一样本集包括两类样本，第一类样本是ID1的人脸图像(比如人物Ann在各个角度的人脸图像)，包括样本X11~X14，第二类样本是ID2的人脸图像(比如人物Lisa在各个角度的人脸图像)，包括样本X21~X24，一共有8个样本。在这一应用场景下，如图8所示，本申请提供的训练方法包括以下步骤：
步骤1、将第一样本集的每类样本中的每一个样本输入特征提取模型,获得每个样本的特征向量。其中,特征提取模型用于提取人脸特征。如图8所示,将样本X11~X14输入特征提取模型可以获得特征向量A11~A14,将样本X21~X24输入特征提取模型可以获得特征向量A21~A24。具体可以参考前述内容的步骤S221,这里不展开赘述。
步骤2、确定第一样本集每类样本的参考特征向量。其中,每类样本的参考特征向量可以是每类样本的特征向量的平均值,也可以是最接近该平均值的一个特征向量,还可以是将每类样本的特征向量映射到2D空间之后,将分布最密集的区域的点对应的特征向量确定为该类样本的参考特征信息,本申请不对参考特征信息的确定方法进行限定。图8以最接近平均值的一个特征向量为例进行了说明,比如图8所示的参考特征向量A14和参考特征向量A21。具体可以参考前述内容的步骤S222,这里不展开赘述。
步骤3、确定每个特征向量与对应类别的参考特征向量之间的距离。如图8所示，可以计算特征向量A14与A11之间的距离D11，特征向量A14与A12之间的距离D12，特征向量A14与A13之间的距离D13，特征向量A14与A14之间的距离为0，同理，可以计算出第二类样本中特征向量A21与A22之间的距离D21，特征向量A21与A23之间的距离D22，特征向量A21与A24之间的距离D23，特征向量A21与A21之间的距离为0。其中，距离可以是前述内容中的余弦距离、欧氏距离、曼哈顿距离、切比雪夫距离以及闵可夫斯基距离中的任一种，本申请不作具体限定。该步骤可以参考前述内容中的步骤S223及其可选步骤，这里不展开赘述。
步骤4、确定第一样本集的每个样本的困难权重α,获得第一样本集的困难权重分布。困难权重的公式可以参考公式5,即α 11=T×D11,α 12=T×D12,以此类推,可以获得8个样本中每个样本的困难权重如图8所示,其中,困难权重大于第一阈值h 1的样本用深色表示,也就是样本X11和样本X22的困难权重高于阈值。该步骤可以参考前述内容中的步骤S224及其可选步骤,这里不展开赘述。
步骤5、根据待训练模型的任务目标确定待训练模型的目标困难权重,根据该目标困难权重调整第一样本集的困难权重分布,获得第二样本集。如图8所示,第一样本集的困难权重分布为困难样本:非困难样本=1:3,假设任务目标对应的目标困难权重为困难样本:非困难样本=3:1,但是由于第一样本集困难样本只有两个,也就是X11和X22,因此需要通过数据增量方法对困难样本进行扩充,使得扩充后的困难样本(6个)与非困难样本(2个)之间的数量比达到3:1,从而获得用于训练待训练模型的第二样本集。该步骤可以参考前述内容中的步骤S230及其可选步骤,这里不展开赘述。
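步骤5中"扩充困难样本、下采样非困难样本，使数量比达到3:1"的调整过程，可以用如下示意性的Python片段表示（数据增量此处以复制加后缀模拟，实际可采用前述内容中的各种数据增量方法；函数名与下采样规则均为举例假设）：

```python
def adjust_distribution(hard, easy, target_hard=3, target_easy=1):
    # 下采样非困难样本，再用数据增量（此处以加后缀模拟）扩充困难样本，
    # 使 困难:非困难 达到 target_hard:target_easy
    kept_easy = easy[:len(hard)]                 # 下采样非困难样本
    need = target_hard * len(kept_easy) // target_easy
    augmented = list(hard)
    i = 0
    while len(augmented) < need:                 # 模拟数据增量扩充
        augmented.append(hard[i % len(hard)] + "_aug%d" % i)
        i += 1
    return augmented, kept_easy

hard = ["X11", "X22"]
easy = ["X12", "X13", "X14", "X21", "X23", "X24"]
new_hard, new_easy = adjust_distribution(hard, easy)
# 扩充后：困难样本 6 个，非困难样本 2 个，比例 3:1
```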
步骤6、使用第二样本集对待训练模型进行训练。其中，待训练模型的损失函数可以如公式3所示，该损失函数使得待训练模型的训练过程中，困难权重大的困难样本对损失函数的影响加大，进而使得待训练模型可以更加集中于学习困难样本的特征，更加倾向于利用困难样本进行参数更新，从而达到强化训练的目的。并且，可以在训练初期将困难权重中的常量T设置为较高值，使得困难样本在训练待训练模型时的影响力达到最高，然后在训练末期将困难权重中的常量T设置为较低值，此时待训练模型已经趋向于收敛，可以不再需要偏向于耗时较多的困难样本，从而提高训练速度。该步骤可以参考前述内容中的步骤S240及其可选步骤，这里不展开赘述。
上述训练方法通过使用特征提取模型对第一样本集中的每一个样本进行特征提取，根据第一样本集的每一个样本提取到的特征向量确定同类样本的参考特征向量，再根据同类样本中每一个样本的特征向量与参考特征向量之间的距离，确定每一个样本的困难权重，然后根据该困难权重调整第一样本集的困难权重分布，使用调整困难权重分布后的第二样本集对待训练模型进行训练，这样，训练设备200在训练待训练模型的过程中，可以结合待训练模型的任务目标的复杂程度和每个样本的困难权重，选择合适数量的困难样本进行训练，解决了困难样本难以标注导致的AI模型训练精度出现瓶颈的问题，使得AI模型的训练精度得到提升。并且，由于待训练模型的损失函数的权重参数根据第二样本集的困难权重分布进行了调整，该损失函数的值与困难权重呈正比例关系，样本的困难权重越大，使用该样本训练待训练模型获得的损失函数值越大，使得待训练模型可以更加集中于学习困难样本的特征，从而达到针对困难样本的强化训练效果，进一步提高AI模型的预测精度。
上述详细阐述了本申请实施例的方法,为了便于更好的实施本申请实施例上述方案,相应地,下面还提供用于配合实施上述方案的相关设备。
图9为本申请实施例提供的一种芯片硬件结构，该芯片包括神经网络处理器900。该芯片可以被设置在前述内容中的训练设备200中，用以完成训练单元240的训练工作以及提取模块211的特征提取工作。如图5所示的卷积神经网络中各层的算法均可在如图9所示的芯片中得以实现。
需要说明的,神经网络处理器(Neural-network Processing Unit,NPU)900可以作为协处理器挂载到主CPU(Host CPU)上,由主CPU800分配任务,主CPU800就像管理者,负责判断哪些数据需要由NPU核来执行,从而发出指令触发NPU900进行数据的处理。NPU900还可以集成到CPU,比如麒麟970,也可以作为一个单独的芯片。NPU900的核心部分为运算电路903,通过控制器904控制运算电路903提取存储器中的矩阵数据并进行乘法运算,比如图5实施例中的卷积运算。
在一些实现中,运算电路903内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路903是二维脉动阵列。运算电路903还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路903是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器902中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器901中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(Accumulator)908中。
统一存储器906用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器(Direct Memory Access Controller,DMAC)905被搬运到权重存储器902中。输入数据也通过DMAC被搬运到统一存储器906中。
总线接口单元(Bus Interface Unit,BIU)910用于通过总线协议(例如高级可扩展接口(Advanced eXtensible Interface,AXI)协议)与存储单元访问控制器905和取指存储器(Instruction Fetch Buffer,IFB)909交互。
总线接口单元910,用于供取指存储器909从外部存储器获取指令,还用于存储单元访问控制器905从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
存储单元访问控制器905主要用于将外部存储器中的输入数据搬运到统一存储器906或将权重数据搬运到权重存储器902中或将输入数据搬运到输入存储器901中。
向量计算单元907包括多个运算处理单元,在需要的情况下,对运算电路的输出做进一步处理,如向量乘法,向量加法,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/FC层网络计算,如池化(Pooling),批归一化(Batch Normalization),局部响应归一化(Local Response Normalization)等。
在一些实现中，向量计算单元907能将经过处理的输出向量存储到统一存储器906。例如，向量计算单元907可以将非线性函数应用到运算电路903的输出，例如累加值的向量，用以生成激活值。在一些实现中，向量计算单元907生成归一化的值、合并值，或二者均有。在一些实现中，处理过的输出的向量能够用作到运算电路903的激活输入，例如用于在神经网络中的后续层中的使用。
控制器904连接的取指存储器(Instruction Fetch Buffer)909,用于存储控制器904使用的指令;控制器904,用于调用取指存储器909中缓存的指令,实现控制该运算加速器的工作过程。
一般地,统一存储器906,输入存储器901,权重存储器902以及取指存储器909均为片上存储器(On-chip Memory)。外部存储器私有于该NPU硬件架构。该外部存储器可以为双倍数据率同步动态随机存储器(Double Data Rate Synchronous Dynamic Random Access Memory,DDR SDRAM)、高带宽存储器(High Bandwidth Memory,HBM)或其他可读可写的存储器。
图10是本申请提供的一种计算设备的硬件结构示意图。其中，计算设备1000可以是图2-图8实施例中的训练设备200。如图10所示，计算设备1000包括：处理器1010、通信接口1020、存储器1030以及神经网络处理器1050。其中，处理器1010、通信接口1020、存储器1030以及神经网络处理器1050可以通过内部总线1040相互连接，也可通过无线传输等其他手段实现通信。本申请实施例以通过总线1040连接为例，总线1040可以是外设部件互连标准(Peripheral Component Interconnect,PCI)总线或扩展工业标准结构(Extended Industry Standard Architecture,EISA)总线等。所述总线1040可以分为地址总线、数据总线、控制总线等。为便于表示，图10中仅用一条粗线表示，但并不表示仅有一根总线或一种类型的总线。
所述处理器1010可以由至少一个通用处理器构成，例如中央处理器(Central Processing Unit,CPU)，或者CPU和硬件芯片的组合。上述硬件芯片可以是专用集成电路(Application-Specific Integrated Circuit,ASIC)、可编程逻辑器件(Programmable Logic Device,PLD)或其组合。上述PLD可以是复杂可编程逻辑器件(Complex Programmable Logic Device,CPLD)、现场可编程逻辑门阵列(Field-Programmable Gate Array,FPGA)、通用阵列逻辑(Generic Array Logic,GAL)或其任意组合。处理器1010执行各种类型的数字存储指令，例如存储在存储器1030中的软件或者固件程序，使计算设备1000能够提供多种服务。
所述存储器1030用于存储程序代码，并由处理器1010来控制执行，以执行上述图2-图8中任一实施例中训练设备200的处理步骤。所述程序代码中可以包括一个或多个软件模块。这一个或多个软件模块可以为图3所示实施例中提供的软件模块，如获取单元、确定单元、调整单元和训练单元，其中，获取单元可以用于获取第一样本集，确定单元可以用于确定第一样本集的困难权重分布，调整单元可以用于根据第一样本集的每个样本对应的困难权重以及待训练模型的任务目标，调整第一样本集的困难权重分布获得第二样本集，训练单元可以用于使用第二样本集对待训练模型进行训练，具体可用于执行前述方法的步骤S210-步骤S240及其可选步骤、步骤1-步骤6及其可选步骤，还可以用于执行图2-图8实施例描述的其他由训练设备执行的步骤，这里不再进行赘述。
需要说明的是，本实施例中的计算设备1000可以通过通用的物理服务器实现，例如ARM服务器或者X86服务器，也可以通过基于通用的物理服务器结合NFV技术实现的虚拟机实现，所述虚拟机指通过软件模拟的具有完整硬件系统功能的、运行在一个完全隔离环境中的完整计算机系统，本申请不作具体限定。
神经网络处理器1050可以用于根据存储器1030中的训练程序以及样本数据得到推理模型，以执行本文讨论方法的至少一部分，其中，神经网络处理器1050的硬件结构具体可参考图9。
所述存储器1030可以包括易失性存储器(Volatile Memory),例如随机存取存储器(Random Access Memory,RAM);存储器1030也可以包括非易失性存储器(Non-Volatile Memory),例如只读存储器(Read-Only Memory,ROM)、快闪存储器(Flash Memory)、硬盘(Hard Disk Drive,HDD)或固态硬盘(Solid-State Drive,SSD);存储器1030还可以包括上述种类的组合。存储器1030可以存储有第一样本集和/或第二样本集,存储器1030可以存储有程序代码,具体可以包括用于执行图2-图8实施例描述的其他步骤的程序代码,这里不再进行赘述。
通信接口1020可以为内部接口(例如高速串行计算机扩展总线(Peripheral Component Interconnect express,PCIe)总线接口)、有线接口(例如以太网接口)或无线接口(例如蜂窝网络接口或无线局域网接口)，用于与其他设备或模块进行通信。
需要说明的,图10仅仅是本申请实施例的一种可能的实现方式,实际应用中,所述计算设备还可以包括更多或更少的部件,这里不作限制。关于本申请实施例中未示出或未描述的内容,可参见前述图2-图8所述实施例中的相关阐述,这里不再赘述。
应理解,图10所示的计算设备还可以是至少一个服务器构成的计算机集群,本申请不作具体限定。
本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,当其在处理器上运行时,图2-图8所示的方法流程得以实现。
本申请实施例还提供一种计算机程序产品,当所述计算机程序产品在处理器上运行时,图2-图8所示的方法流程得以实现。
上述实施例，可以全部或部分地通过软件、硬件、固件或其他任意组合来实现。当使用软件实现时，上述实施例可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括至少一个计算机指令。在计算机上加载或执行所述计算机程序指令时，全部或部分地产生按照本发明实施例所述的流程或功能。所述计算机可以为通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中，或者从一个计算机可读存储介质向另一个计算机可读存储介质传输，例如，所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(Digital Subscriber Line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含至少一个可用介质集合的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如，软盘、硬盘、磁带)、光介质(例如，高密度数字视频光盘(Digital Video Disc,DVD))或者半导体介质。半导体介质可以是SSD。
以上所述,仅为本发明的具体实施方式,但本发明的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本发明揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本发明的保护范围之内。因此,本发明的保护范围应以权利要求的保护范围为准。

Claims (13)

  1. 一种训练方法,其特征在于,所述方法包括:
    获取第一样本集,所述第一样本集包括多个样本;
    确定所述第一样本集中样本的困难权重分布;
    根据待训练模型的任务目标和所述第一样本集中样本的困难权重分布,调整所述第一样本集,获得第二样本集;
    利用所述第二样本集,对所述待训练模型进行训练。
  2. 根据权利要求1所述的方法,其特征在于,所述待训练模型的任务目标包括所述待训练模型经训练完成后的应用场景、所述待训练模型经训练完成后需实现的事件类型以及所述待训练模型的训练精度目标中的一种或者多种。
  3. 根据权利要求1或2所述的方法,其特征在于,所述根据待训练模型的任务目标和所述第一样本集中样本的困难权重分布,调整所述第一样本集,获得第二样本集,包括:
    根据所述待训练模型的任务目标和所述第一样本集中样本的困难权重分布,确定用于训练所述待训练模型的样本集应达到的目标困难权重分布;
    增加或减少所述第一样本集中的样本数量,或者,改变所述第一样本集中部分样本的困难权重,获得第二样本集,其中,所述第二样本集中样本的困难权重分布等于或者近似于所述目标困难权重分布。
  4. 根据权利要求1-3任一项所述的方法,其特征在于,所述确定所述第一样本集中样本的困难权重分布包括:
    将所述第一样本集的每个样本输入至特征提取模型,获得所述每个样本的特征信息,其中,所述每个样本对应一个类别;
    根据所述每个样本的特征信息,确定所述第一样本集中的多类样本的参考特征信息,其中,每类样本包括至少一个类别相同的样本;
    基于所述每个样本的特征信息与对应类别的参考特征信息之间的相似度,确定所述每个样本对应的困难权重;
    根据所述第一样本集中每个样本的困难权重,获得所述第一样本集中样本的困难权重分布。
  5. 根据权利要求1至4任一项所述的方法,其特征在于,在利用所述第二样本集,对所述待训练模型进行训练之前,所述方法还包括:
    根据所述第二样本集中样本的困难权重分布,调整所述待训练模型的损失函数的权重参数。
  6. 一种训练装置,其特征在于,所述装置包括:
    获取单元,用于获取第一样本集,所述第一样本集包括多个样本;
    确定单元,用于确定所述第一样本集中样本的困难权重分布;
    调整单元,用于根据待训练模型的任务目标和所述第一样本集中样本的困难权重分布,调整所述第一样本集,获得第二样本集;
    训练单元,用于利用所述第二样本集,对所述待训练模型进行训练。
  7. 根据权利要求6所述的装置,其特征在于,所述待训练模型的任务目标包括所述待训练模型经训练完成后的应用场景、所述待训练模型经训练完成后需实现的事件类型以及所述待训练模型的训练精度目标中的一种或者多种。
  8. 根据权利要求6或7所述的装置,其特征在于,
    所述调整单元具体用于:
    根据所述待训练模型的任务目标和所述第一样本集中样本的困难权重分布,确定用于训练所述待训练模型的样本集应达到的目标困难权重分布;
    增加或减少所述第一样本集中的样本数量,或者,改变所述第一样本集中部分样本的困难权重,获得第二样本集,其中,所述第二样本集中样本的困难权重分布等于或者近似于所述目标困难权重分布。
  9. 根据权利要求6至8任一项所述的装置,其特征在于,
    所述确定单元具体用于:
    将所述第一样本集的每个样本输入至特征提取模型,获得所述每个样本的特征信息,其中,所述每个样本对应一个类别;
    根据所述每个样本的特征信息,确定所述第一样本集中的多类样本的参考特征信息,其中,每类样本包括至少一个类别相同的样本;
    基于所述每个样本的特征信息与对应类别的参考特征信息之间的相似度,确定所述每个样本对应的困难权重;
    根据所述第一样本集中每个样本的困难权重,获得所述第一样本集中样本的困难权重分布。
  10. 根据权利要求6至9任一项所述的装置,其特征在于,在利用所述第二样本集,对所述待训练模型进行训练之前,所述训练单元还用于:根据所述第二样本集中样本的困难权重分布,调整所述待训练模型的损失函数的权重参数。
  11. 一种计算机可读存储介质,其特征在于,包括指令,当所述指令在计算设备上运行时,使得所述计算设备执行如权利要求1至5任一权利要求所述的方法。
  12. 一种计算设备，其特征在于，包括处理器和存储器，所述处理器执行所述存储器中的代码以执行如权利要求1至5任一权利要求所述的方法。
  13. 一种计算机程序产品，其特征在于，包括计算机程序，当所述计算机程序被计算设备读取并执行时，使得所述计算设备执行如权利要求1至5任一权利要求所述的方法。
PCT/CN2021/091597 2020-05-27 2021-04-30 一种训练方法、装置、设备以及计算机可读存储介质 WO2021238586A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010462418.X 2020-05-27
CN202010462418.XA CN113743426A (zh) 2020-05-27 2020-05-27 一种训练方法、装置、设备以及计算机可读存储介质
