CN113807399A - Neural network training method, neural network detection method and neural network detection device - Google Patents

Neural network training method, neural network detection method and neural network detection device

Info

Publication number
CN113807399A
CN113807399A (application CN202110939696.4A); granted as CN113807399B
Authority
CN
China
Prior art keywords
sample
model
data set
training
deviation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110939696.4A
Other languages
Chinese (zh)
Other versions
CN113807399B (en)
Inventor
彭凤超
王超
刘健庄
杨臻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202110939696.4A
Publication of CN113807399A
Application granted
Publication of CN113807399B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a neural network training method and apparatus in the field of artificial intelligence. Multiple neural networks are obtained in combination with knowledge distillation; at least one unlabeled sample is selected for annotation based on the deviations between the outputs of these networks, and the annotated samples are used for retraining, which improves the training efficiency of the model. The method includes: inputting the samples in a data set into a teacher model, a first student model, and a second student model to obtain a first, a second, and a third output result respectively; screening at least one unlabeled sample out of the data set for annotation based on a first deviation and a second deviation; updating the at least one annotated sample into the training set; training the teacher model with the updated training set to obtain a trained teacher model; and performing knowledge distillation on the second student model with the trained teacher model to obtain a trained second student model.

Description

Neural network training method, neural network detection method and neural network detection device
Technical Field
The application relates to the field of artificial intelligence, and in particular to a neural network training method, a neural network detection method, and a neural network detection device.
Background
In the field of artificial intelligence, computer vision technology is widely applied in scenarios such as autonomous driving, mobile terminals, and security. Active learning is a typical data selection strategy for reducing the cost of data annotation: a machine learning algorithm automatically selects a portion of the data in a data set and requests annotation for it. In general, conventional active learning is ineffective in scenarios with substantial noise, and the resulting model produces poor output. How to obtain a model with better output quality is therefore a problem in urgent need of a solution.
Disclosure of Invention
The application provides a neural network training method and apparatus, which obtain multiple neural networks in combination with knowledge distillation, select at least one unlabeled sample for annotation based on the outputs of these networks, and retrain using the annotated samples, thereby improving the output quality of the model.
In view of the above, in a first aspect, the present application provides a neural network training method, including: taking a sample in a data set as input to a teacher model to obtain a first output result, as input to a first student model to obtain a second output result, and as input to a second student model to obtain a third output result, where the samples in the data set carry no labels, the teacher model and the first student model are trained using a training set, the second student model is obtained by knowledge distillation from the teacher model, and the training set includes samples carrying labels; screening at least one unlabeled sample out of the data set based on a first deviation and a second deviation, where the first deviation includes a distance between the first output result and the third output result, and the second deviation includes a distance between the second output result and the third output result; obtaining at least one labeled sample, where the at least one labeled sample is obtained by adding a label to the at least one unlabeled sample, and updating the at least one labeled sample into the training set to obtain an updated training set; and training the teacher model and the first student model using the updated training set to obtain a trained teacher model and a trained first student model, and performing knowledge distillation on the second student model using the trained teacher model to obtain a trained second student model.
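The deviation-based screening step of this aspect can be sketched in code. This is an illustrative reconstruction under stated assumptions: Euclidean distance is used for the two deviations, the two deviations are simply added (the embodiments below also describe a ratio-based fusion), and all function and variable names are hypothetical.

```python
import math

def screen_unlabeled(teacher_out, student1_out, student2_out, k):
    """Select the k unlabeled samples whose combined deviation is largest.

    Each *_out argument is a list of output vectors (one per unlabeled
    sample) from the teacher, the first student, and the distilled second
    student. Returns the indices of the samples to send for annotation.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    scores = []
    for i, (t, s1, s2) in enumerate(zip(teacher_out, student1_out, student2_out)):
        dev1 = dist(t, s2)   # first deviation: teacher vs. second student
        dev2 = dist(s1, s2)  # second deviation: first student vs. second student
        scores.append((dev1 + dev2, i))  # simple additive fusion (an assumption)
    return [i for _, i in sorted(scores, reverse=True)[:k]]
```

A sample on which the distilled second student disagrees with both the teacher and the first student receives a high score and is sent for labeling.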
Therefore, in this embodiment of the application, student models are obtained in multiple ways, namely by training with the training set and by knowledge distillation from the teacher model, so that student models and a teacher model with different output capabilities are obtained. Output results for the unlabeled samples are produced by these models of differing capability, and the samples best suited for model training are screened out based on the differences between the outputs of the different models, which improves the training effect and thus the output quality of the model. Moreover, annotation of uninformative samples is reduced, which lowers the annotation cost.
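One full round of the method above, that is, producing the three model outputs, deviation-based screening, annotation, retraining, and re-distillation, can be summarized as the following high-level sketch. Here `train`, `distill`, `screen`, and `annotate` are assumed stand-in callables for steps the patent leaves to the implementer; all names are illustrative.

```python
def active_learning_round(teacher, student1, student2, unlabeled, train_set,
                          train, distill, screen, annotate, k):
    """One round of active learning combined with knowledge distillation.

    teacher/student1/student2 are callables mapping a sample to an output;
    screen picks k sample indices from the three output lists; annotate
    attaches a label to a sample; train and distill return updated models.
    """
    # Run every unlabeled sample through the three models.
    out_t = [teacher(x) for x in unlabeled]
    out_s1 = [student1(x) for x in unlabeled]
    out_s2 = [student2(x) for x in unlabeled]
    # Screen samples via the deviations between the outputs, then label them.
    picked = screen(out_t, out_s1, out_s2, k)
    train_set = train_set + [annotate(unlabeled[i]) for i in picked]
    # Retrain teacher and first student on the updated training set,
    # then distill the retrained teacher into the second student.
    teacher = train(teacher, train_set)
    student1 = train(student1, train_set)
    student2 = distill(teacher, student2)
    return teacher, student1, student2, train_set
```

The round can be repeated until the annotation budget is exhausted or the second student's output quality is sufficient.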
In a possible embodiment, screening at least one unlabeled sample out of the data set based on the first deviation and the second deviation may include: calculating an uncertainty of each sample in the data set from the first deviation and the second deviation, where the uncertainty indicates the amount of information contained in the sample; and screening at least one unlabeled sample out of the data set according to the uncertainty.
In the embodiments of the present application, the uncertainty of a sample may be calculated from the deviations between the outputs of different models. The difference in learning ability between models can be exploited to effectively distinguish high-noise samples from high-uncertainty samples; the uncertainty of each sample is then evaluated and the samples that need annotation are screened out, so that the model can subsequently learn from the annotated samples and absorb the information they contain.
In one possible embodiment, screening the data set for at least one unlabeled sample based on the uncertainty can include: calculating a diversity metric for the samples in the data set from the second output result, the diversity metric being indicative of the diversity of the samples in the data set relative to the data set; and screening out at least one unlabeled sample from the data set according to the uncertainty and the diversity measure.
Therefore, in this embodiment of the application, the samples to be annotated can be screened with diversity taken into account, so that a sample set with high difficulty and strong scene diversity is selected. This improves the model training effect, reduces the amount of manual annotation, and saves annotation cost.
In one possible embodiment, screening at least one unlabeled sample out of the data set based on the uncertainty and the diversity metric may include: performing weighted fusion of the uncertainty and the diversity metric of each sample in the data set to obtain a score for each sample; and screening at least one unlabeled sample out of the data set according to the score of each sample.
Therefore, in the embodiment of the application, the sample set with high difficulty and strong scene diversity can be screened out by fusing uncertainty and diversity measurement, so that the neural network can learn more effective information, and the training efficiency and the output effect of the model are effectively improved.
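The weighted fusion of uncertainty and diversity into a per-sample score might look like the sketch below. The min-max normalization and the trade-off weight `alpha` are assumptions; the patent describes a weighted fusion but does not fix the weighting or any normalization.

```python
def score_samples(uncertainty, diversity, alpha=0.7):
    """Fuse per-sample uncertainty and diversity into one score per sample.

    uncertainty and diversity are equal-length lists; each is min-max
    normalized to [0, 1] so the weighting is meaningful, then combined as
    alpha * uncertainty + (1 - alpha) * diversity. The highest-scoring
    samples are the ones selected for annotation.
    """
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

    u, d = norm(uncertainty), norm(diversity)
    return [alpha * ui + (1 - alpha) * di for ui, di in zip(u, d)]
```

With `alpha` close to 1 the selection is driven by uncertainty; with `alpha` close to 0 it is driven by diversity.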
In one possible implementation, calculating the uncertainty of a sample in the data set from the first deviation and the second deviation may include: fusing the first deviation and the second deviation according to the ratio of the first deviation to the second deviation to obtain the uncertainty of the sample.
Therefore, in this embodiment of the application, the coefficient for fusing the first deviation and the second deviation can be determined from their ratio, so that different samples yield different uncertainties and the uncertainty of each sample is measured more accurately.
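A ratio-derived fusion might be sketched as below. The patent states that the fusion coefficient is determined from the ratio of the deviations but leaves the exact formula open; the weight `w = dev1 / (dev1 + dev2)` used here is one plausible choice and is an assumption, as is the small `eps` guard against division by zero.

```python
def fuse_deviations(dev1, dev2, eps=1e-8):
    """Fuse the two per-sample deviations into one uncertainty value.

    dev1: distance between teacher output and second-student output.
    dev2: distance between first-student output and second-student output.
    The weight adapts per sample: the larger deviation dominates the fusion.
    """
    w = dev1 / (dev1 + dev2 + eps)
    return w * dev1 + (1 - w) * dev2
```

Because the weight is recomputed per sample, two samples with the same total deviation but different ratios receive different uncertainties, which is the adaptive behavior the paragraph above describes.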
In a possible embodiment, the second output result includes features extracted by the first student model from the samples in the data set, and a first sample is any one sample in the data set; calculating the diversity metric of the samples in the data set from the second output result may include: determining a plurality of inverse nearest neighbors of the first sample in the data set according to the features of the samples in the data set; and calculating the diversity metric of the first sample based on the plurality of inverse nearest neighbors.
Therefore, in the embodiment of the present application, the influence set of the samples can be calculated by calculating the inverse nearest neighbor, and the samples with high representativeness can be selected from the data set, so that the training efficiency of the model can be improved, and the output effect of the model can be improved.
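The inverse-nearest-neighbor computation can be sketched as follows. The patent does not give the exact diversity formula; collecting the inverse nearest neighbors by brute force, as below, and using their count as a representativeness signal is one straightforward reading (function and parameter names are illustrative).

```python
def inverse_nearest_neighbors(features, i, k=1):
    """Indices of samples whose k nearest neighbors include sample i.

    features: list of per-sample feature vectors (e.g. extracted by the
    first student model). A sample with many inverse nearest neighbors
    "influences" a large region of the data set, so the size of this set
    can serve as the diversity metric described above.
    """
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    result = []
    for j, fj in enumerate(features):
        if j == i:
            continue
        # The k nearest neighbors of sample j, excluding j itself.
        order = sorted((m for m in range(len(features)) if m != j),
                       key=lambda m: dist2(fj, features[m]))
        if i in order[:k]:
            result.append(j)
    return result
```

This brute-force version is O(n^2 log n); a production system would use a spatial index, but the definition is the same.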
In one possible embodiment, the samples in the training set are images including lane lines, and the teacher model, the first student model and the second student model are used for detecting the lane line information in the input images.
Therefore, the method and apparatus can be applied to a variety of scenarios, such as lane line detection, object detection, classification, or segmentation tasks, and have strong generalization capability.
In a second aspect, the present application provides a detection method, including: acquiring an input image; and taking the input image as input to a trained second student model and outputting a detection result. The trained second student model is obtained by training the second student model through active learning, and the active learning process includes: taking a sample in a data set as input to a teacher model to obtain a first output result, as input to a first student model to obtain a second output result, and as input to a second student model to obtain a third output result, where the samples in the data set carry no labels, the teacher model and the first student model are trained using a training set, the second student model is obtained by knowledge distillation from the teacher model, and the training set includes samples carrying labels; screening at least one unlabeled sample out of the data set based on a first deviation and a second deviation, where the first deviation includes a distance between the first output result and the third output result, and the second deviation includes a distance between the second output result and the third output result; obtaining at least one labeled sample, where the at least one labeled sample is obtained by adding a label to the at least one unlabeled sample; and training the teacher model and the first student model using the at least one labeled sample to obtain a trained teacher model and a trained first student model, and performing knowledge distillation on the second student model using the trained teacher model to obtain the trained second student model.
In a possible embodiment, during active learning, the at least one unlabeled sample is screened from the data set according to an uncertainty calculated from the first deviation and the second deviation and a diversity metric, where the diversity metric represents the diversity of a sample in the data set relative to the data set and is derived from the second output result.
In one possible embodiment, screening at least one unlabeled sample out of the data set based on the uncertainty and the diversity metric may include: performing weighted fusion of the uncertainty and the diversity metric of each sample in the data set to obtain a score for each sample; and screening at least one unlabeled sample out of the data set according to the score of each sample.
Therefore, in the embodiment of the application, the sample set with high difficulty and strong scene diversity can be screened out by fusing uncertainty and diversity measurement, so that the neural network can learn more effective information, and the training efficiency and the output effect of the model are effectively improved.
In one possible implementation, calculating the uncertainty of a sample in the data set from the first deviation and the second deviation may include: fusing the first deviation and the second deviation according to the ratio of the first deviation to the second deviation to obtain the uncertainty of the sample.
Therefore, in this embodiment of the application, the coefficient for fusing the first deviation and the second deviation can be determined from their ratio, so that different samples yield different uncertainties and the uncertainty of each sample is measured more accurately.
In a possible embodiment, the second output result includes features extracted by the first student model from the samples in the data set, and a first sample is any one sample in the data set; calculating the diversity metric of the samples in the data set from the second output result may include: determining a plurality of inverse nearest neighbors of the first sample in the data set according to the features of the samples in the data set; and calculating the diversity metric of the first sample based on the plurality of inverse nearest neighbors.
Therefore, in the embodiment of the present application, the influence set of the samples can be calculated by calculating the inverse nearest neighbor, and the samples with high representativeness can be selected from the data set, so that the training efficiency of the model can be improved, and the output effect of the model can be improved.
In a third aspect, an embodiment of the present application provides a neural network training device having the function of implementing the neural network training method of the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function.
The neural network training device may specifically include:
the input module is used for taking the samples in the data set as input to a teacher model to obtain a first output result, as input to a first student model to obtain a second output result, and as input to a second student model to obtain a third output result, where the samples in the data set carry no labels, the teacher model and the first student model are trained using a training set, the second student model is obtained by knowledge distillation from the teacher model, and the training set includes samples carrying labels;
the screening module is used for screening out at least one unlabeled sample from the data set through a first deviation and a second deviation, the first deviation comprises a distance between a first output result and a third output result, and the second deviation comprises a distance between the second output result and the third output result;
the acquisition module is used for acquiring at least one labeled sample and updating the at least one labeled sample into the training set to obtain an updated training set, where the at least one labeled sample is obtained by adding a label to at least one unlabeled sample;
and the training module is used for training the teacher model and the first student model by using the updated training set to obtain the trained teacher model and the trained first student model, and performing knowledge distillation on the second student model by using the trained teacher model to obtain the trained second student model.
In a possible implementation, the screening module is specifically configured to: calculate the uncertainty of each sample in the data set from the first deviation and the second deviation, where the uncertainty indicates the amount of information contained in the sample; and screen at least one unlabeled sample out of the data set according to the uncertainty.
In a possible implementation, the screening module is specifically configured to: calculating a diversity metric for the samples in the data set from the second output result, the diversity metric being indicative of the diversity of the samples in the data set relative to the data set; and screening out at least one unlabeled sample from the data set according to the uncertainty and the diversity measure.
In a possible implementation, the screening module is specifically configured to: perform weighted fusion of the uncertainty and the diversity metric of each sample in the data set to obtain a score for each sample; and screen at least one unlabeled sample out of the data set according to the score of each sample.
In a possible implementation, the screening module is specifically configured to: and according to the ratio of the first deviation to the second deviation, fusing the first deviation and the second deviation to obtain the uncertainty of the sample in the data set.
In one possible embodiment, the second output result includes features extracted by the first student model from a sample of the data set, the first sample being any one of the samples in the data set;
in a possible implementation, the screening module is specifically configured to: determining a plurality of inverse nearest neighbors of a first sample in the data set according to characteristics of the samples in the data set; a diversity metric for the first sample is calculated based on the plurality of inverse nearest neighbors.
In one possible embodiment, the samples in the training set are images including lane lines, and the teacher model, the first student model and the second student model are used for detecting the lane line information in the input images.
In a fourth aspect, an embodiment of the present application provides a detection apparatus having the function of implementing the detection method of the second aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function.
The detection device may specifically include:
the input module is used for acquiring an input image;
the output module is used for taking the input image as the input of the trained second student model and outputting a detection result;
the process of obtaining the trained second student model may refer to the steps shown in the foregoing first aspect or any optional implementation manner of the first aspect, and details are not described here.
In a fifth aspect, an embodiment of the present application provides a neural network training device, including: a processor and a memory, wherein the processor and the memory are interconnected by a line, and the processor calls the program code in the memory to execute the processing-related functions of the neural network training method according to any one of the first aspect. Alternatively, the neural network training device may be a chip.
In a sixth aspect, an embodiment of the present application provides a detection apparatus, including: a processor and a memory, wherein the processor and the memory are interconnected by a line, and the processor calls the program code in the memory for executing the processing-related function in the detection method according to any one of the second aspect. Alternatively, the detection means may be a chip.
In a seventh aspect, an embodiment of the present application provides a neural network training device, which may also be referred to as a digital processing chip or a chip. The chip includes a processing unit and a communication interface; the processing unit obtains program instructions through the communication interface, the program instructions are executed by the processing unit, and the processing unit is configured to perform the processing-related functions in the first aspect or any optional implementation of the first aspect.
In an eighth aspect, embodiments of the present application provide a detection apparatus, which may also be referred to as a digital processing chip or a chip, where the chip includes a processing unit and a communication interface, the processing unit obtains program instructions through the communication interface, and the program instructions are executed by the processing unit, and the processing unit is configured to execute functions related to processing in any of the above-described second aspect or the second aspect.
In a ninth aspect, embodiments of the present application provide a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to perform the method in any of the optional implementation manners of the first aspect or the second aspect.
In a tenth aspect, embodiments of the present application provide a computer program product containing instructions, which when run on a computer, cause the computer to perform the method in any of the optional embodiments of the first or second aspects.
Drawings
FIG. 1 is a schematic diagram of an artificial intelligence body framework for use in the present application;
FIG. 2 is a system architecture diagram provided herein;
fig. 3 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of another convolutional neural network structure provided in the embodiments of the present application;
FIG. 5 is a schematic diagram of another system architecture provided herein;
fig. 6 is a schematic flowchart of a neural network training method according to an embodiment of the present disclosure;
FIG. 7 is a schematic flow chart illustrating another neural network training method provided in an embodiment of the present application;
fig. 8 is a schematic flowchart of a training phase in a neural network training method according to an embodiment of the present disclosure;
fig. 9 is a schematic flowchart of a screening stage in a neural network training method according to an embodiment of the present disclosure;
fig. 10 is a schematic diagram of a distribution manner of nearest neighbors according to an embodiment of the present application;
fig. 11 is a schematic flowchart of a detection method according to an embodiment of the present application;
FIG. 12 is a schematic diagram illustrating comparison of output effects provided by embodiments of the present application;
FIG. 13 is a schematic diagram illustrating comparison of output effects provided by embodiments of the present application;
fig. 14 is a schematic structural diagram of a neural network training device according to an embodiment of the present disclosure;
fig. 15 is a schematic structural diagram of a detection apparatus according to an embodiment of the present disclosure;
FIG. 16 is a schematic structural diagram of another neural network training device provided in an embodiment of the present application;
FIG. 17 is a schematic structural diagram of another detecting device provided in the embodiments of the present application;
fig. 18 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The neural network training method provided by the application can be applied to Artificial Intelligence (AI) scenes. AI is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision and reasoning, human-computer interaction, recommendation and search, AI basic theory, and the like.
FIG. 1 shows a schematic diagram of an artificial intelligence body framework that describes the overall workflow of an artificial intelligence system, applicable to the general artificial intelligence field requirements.
The artificial intelligence topic framework described above is set forth below in terms of two dimensions, the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
The "smart information chain" reflects a series of processes from data acquisition onward, for example the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making, and intelligent execution and output. Throughout this process, the data undergoes a refinement from data to information to knowledge to wisdom.
The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure and information (technologies for providing and processing information) to the industrial ecology of the system.
(1) Infrastructure:
the infrastructure provides computing power support for the artificial intelligent system, realizes communication with the outside world, and realizes support through a foundation platform. Communicating with the outside through a sensor; the computing power is provided by an intelligent chip, such as a Central Processing Unit (CPU), a Network Processor (NPU), a Graphic Processor (GPU), an Application Specific Integrated Circuit (ASIC), or a Field Programmable Gate Array (FPGA), or other hardware acceleration chip; the basic platform comprises distributed computing framework, network and other related platform guarantees and supports, and can comprise cloud storage and computing, interconnection and intercommunication networks and the like. For example, sensors and external communications acquire data that is provided to intelligent chips in a distributed computing system provided by the base platform for computation.
(2) Data
The data at the level above the infrastructure represents the data sources of the field of artificial intelligence. The data involves graphics, images, speech, video, and text, as well as internet-of-things data from conventional devices, including service data from existing systems and sensing data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
Machine learning and deep learning can perform symbolic and formalized modeling, extraction, preprocessing, and training of intelligent information on data.
Inference refers to the process by which a computer or intelligent system simulates human reasoning: the machine uses formalized information to reason about and solve problems according to an inference control strategy. Typical functions are searching and matching.
Decision making refers to the process of reaching a decision after reasoning over intelligent information, and generally provides functions such as classification, ranking, and prediction.
(4) General capabilities
After the above-mentioned data processing, further based on the result of the data processing, some general-purpose capabilities can be formed, such as an algorithm or a general-purpose system, for example, translation, text analysis, computer vision processing (e.g., image recognition, object detection, etc.), voice recognition, etc.
(5) Intelligent product and industrial application
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They encapsulate the overall artificial intelligence solution, commercialize intelligent-information decision making, and realize practical applications. The application fields mainly include: smart manufacturing, smart transportation, smart home, smart healthcare, smart security, autonomous driving, smart cities, smart terminals, and the like.
Referring to fig. 2, a system architecture 200 is provided in an embodiment of the present application. The system architecture includes a database 230 and a client device 240. The data collection device 260 is used to collect data and store it in the database 230, and the training module 202 generates the target model/rule 201 based on the data maintained in the database 230. How the training module 202 obtains the target model/rule 201 based on the data will be described in more detail below, and the target model/rule 201 is a neural network trained in the following embodiments of the present application, and refer to the following description in fig. 6 to fig. 13.
The calculation module may include the training module 202, and the target model/rule output by the training module 202 may be applied in different systems or devices. In fig. 2, the execution device 210 is configured with a transceiver 212, which may be a wireless transceiver, an optical transceiver, a wired interface (such as an I/O interface), or the like, for data interaction with external devices. A "user" may input data to the transceiver 212 through the client device 240. For example, in the following embodiments of the present application, the client device 240 may transmit a target task to the execution device 210 to request that the execution device train a neural network, and may transmit a database used for training to the execution device 210.
The execution device 210 may call data, code, etc. from the data storage system 250 and may store data, instructions, etc. in the data storage system 250.
The calculation module 211 processes the input data using the target model/rule 201. Specifically, the calculation module 211 is configured to iteratively distill the second student model multiple times with the teacher model in conjunction with an active learning process, where each iteration may include: taking samples in a data set as input to the teacher model to obtain a first output result, as input to a first student model to obtain a second output result, and as input to a second student model to obtain a third output result, where the samples in the data set do not carry labels, the teacher model and the first student model are obtained by training on a training set, the second student model is obtained by knowledge distillation from the teacher model, and the training set includes samples carrying labels; screening out at least one unlabeled sample from the data set through a first deviation and a second deviation, where the first deviation includes a distance between the first output result and the third output result, and the second deviation includes a distance between the second output result and the third output result; obtaining at least one labeled sample by adding labels to the at least one unlabeled sample, and updating the at least one labeled sample into the training set to obtain an updated training set; and training the teacher model and the first student model with the updated training set to obtain a trained teacher model and a trained first student model, and performing knowledge distillation on the second student model with the trained teacher model to obtain a trained second student model.
Finally, the transceiver 212 returns the trained neural network to the client device 240 for deployment in the client device 240 or other device.
Further, the training module 202 may derive corresponding target models/rules 201 based on different data for different tasks to provide better results to the user.
In the case shown in fig. 2, the data input into the execution device 210 may be determined from input data of a user, who may, for example, operate in an interface provided by the transceiver 212. Alternatively, the client device 240 may automatically input data to the transceiver 212 and obtain the result; if automatic data input by the client device 240 requires the user's authorization, the user may set the corresponding permissions in the client device 240. The user can view the result output by the execution device 210 at the client device 240, and the specific presentation form may be display, sound, action, and the like. The client device 240 may also act as a data collector and store collected data associated with the target task in the database 230.
It should be noted that fig. 2 is only an exemplary schematic diagram of a system architecture provided by an embodiment of the present application, and a positional relationship between devices, modules, and the like shown in the diagram does not constitute any limitation. For example, in FIG. 2, the data storage system 250 is an external memory with respect to the execution device 210, and in other scenarios, the data storage system 250 may be disposed in the execution device 210.
The training, distillation, or updating processes referred to herein may be performed by the training module 202. It will be appreciated that the training process of a neural network is essentially a way of learning to control a spatial transformation, and more specifically, of learning a weight matrix. The purpose of training a neural network is to make its output as close as possible to an expected value. Therefore, by comparing the predicted value of the current network with the expected value, the weight vector of each layer of the neural network can be updated according to the difference between the two (of course, the weight vectors are usually initialized before the first update, that is, parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the values of the weights in the weight matrix are adjusted to reduce the predicted value, and adjustment continues until the value output by the neural network approaches or equals the expected value. Specifically, the difference between the predicted value and the expected value of the neural network may be measured by a loss function (loss function) or an objective function (objective function). Taking the loss function as an example, the higher the output value (loss) of the loss function, the larger the difference; training a neural network can then be understood as the process of reducing this loss as much as possible.
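The weight-update idea described above can be illustrated with a minimal sketch: a single hypothetical linear layer trained by gradient descent on a mean-squared-error loss. The data, learning rate, and layer shape below are illustrative assumptions, not part of the embodiments.

```python
import numpy as np

# Minimal illustration (not from the embodiments): a single linear layer
# trained so that its predicted value approaches the expected value, with
# the difference measured by a mean-squared-error loss function.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))              # 32 samples, 4 features (assumed)
true_w = np.array([1.0, -2.0, 0.5, 3.0])  # expected values follow this rule
y = X @ true_w

w = np.zeros(4)   # weights initialized (pre-configured) before the first update
lr = 0.1
losses = []
for step in range(100):
    pred = X @ w                       # predicted value of the current network
    loss = np.mean((pred - y) ** 2)    # higher loss means a larger difference
    losses.append(loss)
    grad = 2.0 * X.T @ (pred - y) / len(y)
    w -= lr * grad                     # adjust weights to reduce the difference
```

Training is then simply the process of driving this loss down until the output approaches the expected value, as the paragraph above describes.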
As shown in fig. 2, the target model/rule 201 is obtained by training by the training module 202. In this embodiment of the present application, the target model/rule 201 may be a student model mentioned in the present application. Of course, the target model/rule 201 may also be understood as a trained student model at the stage of training the teacher model.
The neural network (e.g., teacher model or student model) referred to in the present application may include various types, such as Deep Neural Network (DNN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), residual neural network or other neural network, and the like.
The neural network provided in the present application is described below by taking a convolutional neural network (CNN) as an example.
CNN is a deep neural network with a convolutional structure. It is a deep learning (deep learning) architecture; deep learning refers to performing multiple levels of learning at different abstraction levels through machine learning algorithms. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons respond to overlapping regions in the image input to it. A convolutional neural network includes a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor may be viewed as a filter, and the convolution process may be viewed as convolving an input image or a convolved feature plane (feature map) with a trainable filter. A convolutional layer is a neuron layer that performs convolution processing on the input signal in a convolutional neural network. In a convolutional layer, one neuron may be connected to only some of its neighboring neurons. A convolutional layer usually contains several feature planes, and each feature plane may be composed of a number of neural units arranged in a rectangle. Neural units of the same feature plane share weights, and the shared weights are a convolution kernel. Sharing weights may be understood as meaning that the way image information is extracted is independent of location. The underlying principle is that the statistics of one part of an image are the same as those of the other parts, so image information learned in one part can also be used in another part; the same learned image information can be used at all positions on the image. In the same convolutional layer, multiple convolution kernels may be used to extract different image information; generally, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
The convolutional neural network can adopt a Back Propagation (BP) algorithm to correct the size of parameters in the initial model in the training process, so that the reconstruction error loss of the model is smaller and smaller. Specifically, error loss occurs when an input signal is transmitted in a forward direction until the input signal is output, and parameters in an initial super-resolution model are updated by reversely propagating error loss information, so that the error loss is converged. The back propagation algorithm is an error-loss dominated back propagation motion aimed at obtaining optimal model parameters, such as weight matrices.
As shown in fig. 3, Convolutional Neural Network (CNN)100 may include an input layer 110, a convolutional/pooling layer 120, where the pooling layer is optional, and a neural network layer 130.
As shown in FIG. 3, the convolutional layer/pooling layer 120 may include layers 121-126. In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer. In another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
Taking convolutional layer 121 as an example, convolutional layer 121 may include a plurality of convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is usually slid over the input image one pixel at a time in the horizontal direction (or two pixels at a time, and so on, depending on the value of the stride), so as to extract a specific feature from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension (depth dimension) of the weight matrix is the same as the depth dimension of the input image, and during the convolution operation the weight matrix extends through the entire depth of the input image. Thus, convolving with a single weight matrix produces a convolved output with a single depth dimension; in most cases, however, a single weight matrix is not used. Instead, multiple weight matrices of the same dimensions are applied, and the output of each weight matrix is stacked to form the depth dimension of the convolved image. Different weight matrices may be used to extract different features in the image, for example, one weight matrix to extract image edge information, another to extract a particular color of the image, and yet another to blur unwanted noise in the image. The multiple weight matrices have the same dimensions, the feature maps extracted by them also have the same dimensions, and the extracted feature maps of the same dimensions are combined to form the output of the convolution operation.
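The stacking of kernel outputs into a depth dimension can be sketched with a naive convolution (the image size, kernel count, and stride values below are arbitrary assumptions for illustration):

```python
import numpy as np

def conv2d(image, kernels, stride=1):
    # image: (H, W, C); kernels: (K, kh, kw, C). The depth of each kernel
    # matches the depth of the input image, as noted above.
    H, W, C = image.shape
    K, kh, kw, _ = kernels.shape
    oh = (H - kh) // stride + 1
    ow = (W - kw) // stride + 1
    out = np.zeros((oh, ow, K))   # stacking K outputs forms the depth dimension
    for k in range(K):
        for i in range(oh):
            for j in range(ow):
                patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw, :]
                out[i, j, k] = np.sum(patch * kernels[k])
    return out

img = np.random.default_rng(1).normal(size=(8, 8, 3))      # assumed 8x8 RGB-like input
kernels = np.random.default_rng(2).normal(size=(4, 3, 3, 3))  # four 3x3 kernels
out1 = conv2d(img, kernels)             # shape (6, 6, 4): depth = number of kernels
out2 = conv2d(img, kernels, stride=2)   # shape (3, 3, 4): a larger stride shrinks H and W
```

Note how the output depth equals the number of weight matrices applied, while the stride controls the horizontal and vertical step of the sliding window.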
Generally, the weight values in the weight matrix need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can extract information from the input image, thereby helping the convolutional neural network 100 to perform correct prediction.
When the convolutional neural network 100 has multiple convolutional layers, the initial convolutional layers (e.g., 121) tend to extract more general features, which may also be referred to as low-level features. As the depth of the convolutional neural network 100 increases, the features extracted by the later convolutional layers (e.g., 126) become more and more complex, such as features with high-level semantics; features with higher semantics are more applicable to the problem to be solved.
A pooling layer:
since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be periodically introduced after a convolutional layer. That is, in the layers 121-126 illustrated by 120 in fig. 3, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image into images of smaller size. The average pooling operator may compute the pixel values in the image over a particular range to produce an average value. The max pooling operator may take the pixel with the largest value in a particular range as the result of max pooling. In addition, just as the size of the weight matrix used in a convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average value or maximum value of the corresponding sub-region of the image input to the pooling layer.
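The two pooling operators described above can be sketched as follows (a single-channel image and non-overlapping 2x2 windows are assumed for simplicity):

```python
import numpy as np

def pool2d(image, size=2, mode="max"):
    # Each output pixel represents the max (or average) of the corresponding
    # size-by-size sub-region of the input, so the spatial size shrinks.
    H, W = image.shape
    oh, ow = H // size, W // size
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
out_max = pool2d(img, 2, "max")    # 2x2 output: [[5, 7], [13, 15]]
out_avg = pool2d(img, 2, "mean")   # 2x2 output: [[2.5, 4.5], [10.5, 12.5]]
```

A 4x4 input becomes a 2x2 output, illustrating the reduction in spatial size with no trainable parameters involved.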
The neural network layer 130:
after processing by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet sufficient to output the required output information. This is because, as described above, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. However, to generate the final output information (the required class information or other relevant information), the convolutional neural network 100 needs to use the neural network layer 130 to generate one output or a group of outputs whose number equals the number of required classes. Accordingly, the neural network layer 130 may include a plurality of hidden layers (131, 132 to 13n as shown in fig. 3) and an output layer 140.
After the hidden layers in the neural network layer 130 comes the output layer 140, the last layer of the whole convolutional neural network 100. The output layer 140 has a loss function similar to categorical cross entropy, which is specifically used for calculating the prediction error. Once the forward propagation of the whole convolutional neural network 100 is completed (in fig. 3, propagation from 110 to 140 is the forward propagation), backward propagation (in fig. 3, propagation from 140 to 110 is the backward propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
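A minimal sketch of such an output layer is given below: a generic softmax cross-entropy loss and the gradient with respect to the class scores that starts backward propagation. The specific loss is assumed for illustration; the text above only says the loss is similar to categorical cross entropy.

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    # forward pass of the output layer: softmax over the class scores,
    # then cross entropy as the prediction error
    e = np.exp(logits - logits.max())   # subtract max for numerical stability
    probs = e / e.sum()
    loss = -np.log(probs[label])
    # gradient with respect to the logits: this is the error signal that
    # backward propagation pushes through earlier layers to update
    # their weight values and biases
    grad = probs.copy()
    grad[label] -= 1.0
    return loss, grad

loss, grad = softmax_cross_entropy(np.array([2.0, 1.0, 0.1]), label=0)
```

The gradient components sum to zero, and the component for the correct class is negative, pushing its score up during the update.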
It should be noted that the convolutional neural network 100 shown in fig. 3 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models, for example, as shown in fig. 4, a plurality of convolutional layers/pooling layers are parallel, and the features extracted respectively are all input to the overall neural network layer 130 for processing.
Referring to fig. 5, the present application further provides a system architecture 300. The execution device 210 is implemented by one or more servers, optionally in cooperation with other computing devices, such as: data storage, routers, load balancers, and the like; the execution device 210 may be disposed on one physical site or distributed across multiple physical sites. The execution device 210 may use data in the data storage system 250 or call program code in the data storage system 250 to implement the steps of the neural network training method corresponding to fig. 6-11 below.
The user may operate respective user devices (e.g., local device 301 and local device 302) to interact with the execution device 210. Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, gaming console, and so forth.
Each user's local device may interact with the execution device 210 via a communication network of any communication mechanism/standard, such as a wide area network, a local area network, a peer-to-peer connection, etc., or any combination thereof. In particular, the communication network may include a wireless network, a wired network, or a combination of the two. The wireless network includes but is not limited to: a fifth generation mobile communication technology (5th-Generation, 5G) system, a long term evolution (LTE) system, a global system for mobile communication (GSM) or code division multiple access (CDMA) network, a wideband code division multiple access (WCDMA) network, wireless fidelity (WiFi), Bluetooth, the Zigbee protocol, radio frequency identification (RFID), long range (LoRa) wireless communication, near field communication (NFC), or a combination of any one or more of these. The wired network may include a fiber optic communication network or a network of coaxial cables, among others.
In another implementation, one or more aspects of the execution device 210 may be implemented by each local device, e.g., the local device 301 may provide local data or feedback calculations for the execution device 210.
It is noted that all of the functions of the execution device 210 may also be performed by a local device. For example, the local device 301 may implement the functions of the execution device 210 and provide services to its own user, or provide services to a user of the local device 302.
In general, knowledge distillation can transfer the knowledge of one network to another network, and the two networks can be homogeneous or heterogeneous. The approach is to first train a teacher network (also called a teacher model), and then use the output of the teacher network to train the student network (also called a student model). When performing knowledge distillation, a pre-trained complex network can be used to train another simple network, so that the simple network can have the same or similar data processing capability as the complex network.
Knowledge distillation makes it quick and convenient to obtain small networks. For example, a complex network model can be trained on a large amount of data on a cloud server or an enterprise server, and knowledge distillation can then be used to obtain a small model that realizes the same function, which is compressed and migrated to a small device (such as a mobile phone or a smart band). For another example, by collecting data from a large number of users on smart bands, performing complex and time-consuming network training on a cloud server to obtain a user behavior recognition model, and then compressing and migrating the model to the smart band as a small carrier, the model can be trained quickly while protecting user privacy, improving the user experience.
Furthermore, in some scenarios, the accuracy of the detection results of the neural network is very important.
For example, lane line detection is a key technology for automatic driving: it detects lane markings and road signs to serve downstream modules such as planning/control and localization. Common lane line detection methods mainly rely on deep-learning-based detection algorithms, so lane line model training depends on large-scale annotated data. Because manual labeling of large-scale data consumes huge time and economic costs, automatically screening the data and selecting the most valuable data is very important. Active learning is a typical data selection strategy used to reduce the cost of data annotation; it means that part of the data is automatically selected from a data set by an automatic machine learning algorithm and submitted for annotation. The commonly used active learning strategy is to select representative, information-rich data samples according to the uncertainty and diversity of the samples and have them manually annotated. The higher the uncertainty of a sample, the poorer the model's ability to discriminate that sample, and the richer the information the sample may contain. Diversity mainly describes whether a sample is representative, and prevents samples from being repeated during sample selection. However, conventional active learning strategies bring little benefit in the lane line detection task and cannot achieve the goal of screening data efficiently.
The main reasons include the following two aspects. On one hand, because lane line annotation involves a large amount of occlusion, the annotated samples suffer from noise; if noise data is added to the training samples, the performance of the model deteriorates. Existing uncertainty-based active learning strategies cannot effectively model annotation noise, that is, they cannot effectively distinguish noise from high uncertainty, so the selected samples contain a large amount of noise data. On the other hand, the data screened by existing sample-diversity-based methods cannot measure the diversity of lane line data, because lane line samples have unique characteristics such as sparsity and long, narrow shapes.
Therefore, the active-learning-based neural network training method provided in the present application combines knowledge distillation and active learning, and measures the uncertainty or diversity of samples through the differences between the output results of several different neural networks, thereby effectively screening more useful samples for learning, improving the training efficiency of the model, and improving the output effect of the model.
Based on the system architecture or the neural network provided in fig. 1 to 5, the following describes the neural network training method provided in the present application in detail.
First, some concepts related to the present application are explained for easy understanding.
Active learning (active learning): active learning refers to automatically selecting part of the data from a data set through an automatic machine learning algorithm and requesting experts to label it manually, or using a pre-trained neural network to output labels; in statistics it is also called query learning or optimal experimental design. By designing a reasonable query function, active learning continuously selects data from the unlabeled data, has it labeled, and adds it to the training set. An effective active learning data selection strategy can effectively reduce the training cost and improve the recognition capability of the model.
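One round of the select-label-add loop described above can be sketched as follows. The query function, the labeling oracle, and the data here are all hypothetical placeholders standing in for a real query function and a human annotator.

```python
import numpy as np

def active_learning_round(unlabeled, query_fn, oracle_fn, train_set, k=2):
    # score every unlabeled sample with the query function, pick the top k,
    # request labels for them, and move them into the training set
    scores = np.array([query_fn(x) for x in unlabeled])
    picked = np.argsort(scores)[::-1][:k]
    for i in picked:
        train_set.append((unlabeled[i], oracle_fn(unlabeled[i])))
    remaining = [x for j, x in enumerate(unlabeled) if j not in set(picked)]
    return remaining, train_set

# Hypothetical usage: model outputs near 0.5 are treated as most uncertain.
unlabeled = [0.9, 0.1, 0.5, 0.95]
query_fn = lambda x: 1 - abs(x - 0.5) * 2   # uncertainty score, highest near 0.5
oracle_fn = lambda x: int(x > 0.5)          # stand-in for manual annotation
remaining, train_set = active_learning_round(unlabeled, query_fn, oracle_fn, [], k=2)
```

Repeating such rounds grows the training set with the samples the query function deems most valuable, which is the cost-reduction mechanism the definition above describes.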
Uncertainty (uncertainty): in the field of active learning, uncertainty refers to the richness of a data sample relative to the information learned by the model, that is, the amount of information contained in the sample. In general, the amount of information contained in a sample may be understood as the amount of data or the number of data types contained in the sample. For example, taking a sample in a lane line detection scenario, the higher the uncertainty of the sample, the more information the sample includes, such as a greater number of lane lines, a more complex background, or more varied lane lines. For another example, taking a sample in a target detection scenario, the higher the uncertainty of the sample, the greater the number, variety, or complexity of the objects included in the sample.
Diversity (diversity): in the field of active learning, diversity refers to the diversity of samples relative to the original resource pool. Specifically, each sample can generally be classified into a category or a scene, and diversity can be understood as how representative a sample is of its corresponding category or scene. For example, if the data set includes images corresponding to 10 scenes, the diversity of a sample may be understood as whether the sample can serve as a representative sample of its corresponding scene, or the probability that it can.
Knowledge distillation (knowledge distillation): knowledge distillation is a model compression method. It is a training method based on the teacher-student network idea, which extracts (Distill) the knowledge (Knowledge) contained in a trained teacher model into a student model.
The neural network training method provided by the present application can be applied to various scenarios for training a neural network, and the following exemplary scenarios are some practical scenarios as examples to illustrate application scenarios of the neural network training method provided by the present application, but not limited thereto.
Scene one, target detection in unmanned vehicle sensing system
Target detection in an unmanned vehicle sensing system includes pedestrian detection, vehicle detection, traffic sign detection, lane line detection, and the like. The detection network is required to have high accuracy to ensure safety during driving, a sufficiently fast response speed to ensure accurate control of the vehicle, and, for deployment in an embedded system, a small model size and high energy efficiency. Through the method provided in the present application, a model with a smaller structure and better output effect obtained through knowledge distillation can be deployed in the vehicle, thereby improving the accuracy of lane detection and the driving safety of the vehicle.
Scene two, cloud platform everything detection
Target detection is the service with the largest demand on the cloud platform. In the face of a newly submitted user service data set or an update to service data, an efficient network structure needs to be searched out quickly to complete service delivery.
Referring to fig. 6, a schematic flow chart of a neural network training method provided in the present application is as follows.
It should be noted that, in the active learning-based neural network training method provided by the present application, one or more iterations may be performed, and a teacher model or a student model and the like may be trained in each iteration process until a model meeting a preset condition is obtained, and the following description will be given by taking only one iteration process as an example.
601. And respectively taking the samples in the data set as the input of the teacher model, the first student model and the second student model to obtain a first output result, a second output result and a third output result.
The data set includes a plurality of unlabeled samples, that is, samples that do not carry labels, such as images, speech, or text without labels. The samples in the data set may be used as inputs to the teacher model, the first student model, and the second student model, respectively, to obtain a first output result, a second output result, and a third output result.
It should be understood that all or a portion of the samples in the data set may be used as input to the teacher model, the first student model, and the second student model, and that any one of the samples (e.g., referred to as the first sample) is exemplary in this application.
Specifically, the teacher model and the student models may be configured to perform target detection, classification task, voice recognition, image recognition or segmentation task, and the like, so that the output results of the teacher model, the first student model and the second student model may also include a target detection result, a classification result, a voice recognition result, an image recognition result, a foreground segmentation result, a background segmentation result, and the like, and may be specifically adjusted according to an actual application scenario. The present application exemplarily illustrates an example in which output results of the teacher model and the student model are expressed as vectors.
Alternatively, the first student model and the second student model may have the same or close network structures, such as the same or close number of network layers, the same or close number of base units included in each network layer, and so on. And the output effect of the teacher model is superior to that of the student model, for example, the output precision of the teacher model is higher than that of the student model. Accordingly, the model structure of the teacher model is also typically larger than the student model, and the model complexity is higher than the student model.
In a possible implementation manner, before step 601 (i.e., during the previous iteration), the initial teacher model and the initial first student model may each be trained with a training set to obtain the teacher model and the first student model, and the initial second student model may be subjected to knowledge distillation using the trained teacher model; the distillation may also be performed while the teacher model is being trained, yielding the second student model. The training set may include a plurality of samples carrying labels to enable fully supervised learning.
It should be understood that, in the embodiments of the present application, the difference between the data set and the training set is that samples in the data set do not carry labels while samples in the training set do. A label may be added manually or produced by a pre-trained neural network, as determined by the actual application scenario; the present application does not limit the manner of adding labels.
602. And screening out at least one unmarked sample from the data set through the first deviation and the second deviation.
After the output results of the teacher model, the first student model, and the second student model are obtained, at least one unlabeled sample can be screened out of the data set according to the deviations calculated between the output results; for example, samples with a larger deviation may be selected for labeling. Here, the deviation between output results can be understood as the degree to which the student model has learned the sample, or the size of the amount of information the sample contains.
Specifically, the distance between the first output result and the third output result may be calculated to obtain the first deviation, and the distance between the second output result and the third output result to obtain the second deviation; in other words, the deviation between model outputs is measured by a distance. The distance may be a Euclidean distance, a Mahalanobis distance, or another distance measure of the difference between vectors, selected according to the actual application scenario; the present application does not limit this.
In one possible implementation, an uncertainty of the samples may be calculated from the first deviation and the second deviation. The uncertainty indicates the size of the amount of information included in a sample, the amount of information referring to the quantity or categories of data the sample contains, and the like. At least one unlabeled sample is then screened from the data set according to the uncertainty; for example, samples with high uncertainty can be selected. Therefore, in the embodiment of the application, the neural network can be trained by distillation, and the deviations between the outputs of the independently trained network and the distilled network can jointly represent the uncertainty of a sample, yielding a more accurate uncertainty, making it easier to screen the samples the model most needs to learn, and improving the overall learning efficiency of the model.
Optionally, the manner of calculating the uncertainty may include: and fusing the first deviation and the second deviation according to the ratio of the first deviation and the second deviation to obtain the uncertainty of the sample. Specifically, the ratio between the first deviation and the second deviation may be used to determine a coefficient for fusing the first deviation and the second deviation, and the specific fusion manner may include summation, weighted summation, or averaging, and may specifically be selected according to the actual application scenario.
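As a minimal sketch of the above, assuming Euclidean distance between output vectors and a ratio-weighted sum as the fusion (the application only states that the fusion coefficient is derived from the ratio of the two deviations, so the concrete weighting here is an illustrative assumption):

```python
import math

def deviation(out_a, out_b):
    """Euclidean distance between two model output vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(out_a, out_b)))

def uncertainty(d1, d2, eps=1e-8):
    """Fuse two deviations into an uncertainty score.

    The weighting below (ratio of the larger to the smaller deviation)
    is an illustrative assumption, not the patent's exact formula.
    """
    ratio = max(d1, d2) / (min(d1, d2) + eps)
    return ratio * (d1 + d2)

# toy outputs of teacher / first student / second (distilled) student
teacher_out, student_out, distilled_out = [1.0, 0.0], [0.9, 0.1], [0.5, 0.5]
d1 = deviation(teacher_out, distilled_out)   # first deviation
d2 = deviation(student_out, distilled_out)   # second deviation
u = uncertainty(d1, d2)
```

Samples whose fused value is largest would then be the first candidates for labeling.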
In a possible implementation manner, a diversity metric of each sample can be calculated from the second output result, so that at least one unlabeled sample can be screened out of the data set according to the diversity for subsequent labeling and training. The neural network can thereby learn data with higher diversity, improving the output effect of the model.
Optionally, taking the first sample as an example, the second output result may include the features extracted from the sample by the first student model, and the diversity metric may be calculated as follows: determine a plurality of reverse nearest neighbors of the sample from its features, and then calculate the diversity metric of the first sample relative to the data set from those reverse nearest neighbors. Therefore, in the embodiment of the present application, the diversity of a sample can be calculated through the reverse nearest neighbors of its features, which indicates the sample's diversity more accurately, so that more useful samples can be selected for labeling according to the diversity.
In one possible embodiment, the uncertainty and the diversity metric may be combined to screen at least one unlabeled sample from the data set. Therefore, in the embodiment of the application, unlabeled samples can be screened by combining the diversity and uncertainty of the samples; for example, samples with higher diversity can be selected for subsequent labeling and learning, which increases the diversity of the information the model learns and improves the output effect of the model.
Specifically, the uncertainty and the diversity metric of each sample may be fused with weights to obtain a score for each sample, and then at least one unlabeled sample may be selected from the data set based on the scores; for example, samples whose score exceeds a predetermined value may be selected. Therefore, in the embodiment of the application, samples can be screened by combining the uncertainty and diversity metrics, so that representative, high-uncertainty samples are screened out while high-noise samples are distinguished. This facilitates subsequent labeling and training, lets the neural network learn more diverse data, and improves the output effect of the model.
603. And acquiring at least one marked sample, and updating the at least one marked sample into the training set to obtain an updated training set.
After at least one unlabeled sample is screened from the dataset, a label may be added to the at least one unlabeled sample, thereby obtaining at least one labeled sample in a one-to-one correspondence.
The method for adding the label to the unlabeled sample specifically includes manually labeling the unlabeled sample, or outputting the label of each sample through a pre-trained neural network.
And after obtaining at least one labeled sample, updating the at least one labeled sample to a training set to obtain an updated training set.
604. And training the teacher model and the first student model by using the updated training set to obtain a trained teacher model and a trained first student model, and performing knowledge distillation on the second student model by using the trained teacher model to obtain a trained second student model.
After the updated training set is obtained, the teacher model and the first student model are trained respectively by using the updated training set, and the teacher model and the first student model trained in the current iteration are obtained.
After the trained teacher model is obtained, or while the teacher model is being trained, knowledge distillation may be performed on the second student model using the teacher model to obtain the trained second student model. Various distillation approaches are possible: for example, knowledge distillation of the student model using the teacher model's intermediate layers, using the teacher model's output results, or combining the outputs of the teacher model's intermediate layers and output layer, and so on.
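The combined variant above can be sketched as a single distillation loss. This is a hedged illustration in plain Python (the weighting coefficient alpha is a hypothetical parameter; the application does not prescribe the exact combination):

```python
import math

def mse(xs, ys):
    """Mean squared error between intermediate-layer features."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_entropy(p_teacher, p_student, eps=1e-12):
    """Cross entropy of the student's output against the teacher's output."""
    return -sum(pt * math.log(ps + eps) for pt, ps in zip(p_teacher, p_student))

def kd_loss(feat_t, feat_s, out_t, out_s, alpha=0.5):
    """Combined distillation: intermediate-layer MSE + output-layer CE."""
    return alpha * mse(feat_t, feat_s) + (1 - alpha) * cross_entropy(out_t, out_s)
```

The second student model's parameters would then be updated by back propagation on this loss value, as described above.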
605. If the preset condition is met, go to step 606; otherwise, return to step 601.
After the teacher model, the first student model, and the second student model are trained using the updated training set, it may be determined whether the training result satisfies the preset condition. If it does, the finally trained model is output; if it does not, iteration may continue, i.e., step 601 is executed again.
The preset condition may specifically include one or more of the following: the number of iterations reaches a preset number, the duration of iterative training reaches a preset duration, the change in the output precision of the second student model is smaller than a preset value, or the number of samples in the training set reaches a preset number, selected according to the actual application scenario.
606. And outputting the final model.
After it is determined that the preset condition is met, the finally trained model can be output; for example, the trained teacher model, the trained second student model, and the like can be output, or a new model can be trained using the final training set and the trained new model output. The trained teacher model may, for example, also be deployed in a client; the present application does not limit this, and a matching model can be selected according to the actual application scenario.
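The overall iteration flow of steps 601-606 can be sketched as an outer loop (all helper names here are hypothetical placeholders for the steps described above, not functions defined by the application):

```python
def active_learning_loop(train, screen, label, stop_condition, max_iters=10):
    """Outer loop of steps 601-606: train, screen, label, retrain."""
    models = train()                     # initial training on the labeled set
    for _ in range(max_iters):
        if stop_condition(models):       # step 605: check the preset condition
            break
        picked = screen(models)          # step 602: screen unlabeled samples
        label(picked)                    # step 603: label them, grow the set
        models = train()                 # step 604: retrain + distill
    return models                        # step 606: output the final model

# toy run: `train` bumps a counter standing in for model quality; stop at 3
state = {"n": 0}
def toy_train():
    state["n"] += 1
    return state["n"]

final = active_learning_loop(toy_train, lambda m: [m], lambda p: None,
                             lambda m: m >= 3)
```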
Therefore, in this implementation of the application, knowledge distillation is combined with active learning to select samples that contain more information or have not yet been learned for labeling, so that a training set with high learning difficulty and high scene diversity can be screened out, improving the training effect and the output effect of the model. Moreover, because samples better suited for learning are screened out, the labeling of useless samples is reduced, which reduces labeling cost.
The foregoing describes a flow of the neural network training method provided by the present application, and for convenience of understanding, the following describes the neural network training method provided by the present application in more detail with reference to specific application scenarios.
The following description is given by taking a lane line detection scenario as an example, and it should be understood that the lane line detection mentioned below may be replaced by other tasks, such as target detection, classification task, segmentation task, etc., and may be specifically adjusted according to an application scenario.
First, the flow of another neural network training method provided in the present application can refer to fig. 7.
The general flow is first described for ease of understanding.
The neural network training method provided by the application can be divided into a training stage and a screening stage.
In the training phase, the Student model (i.e., the first student model), Student-KD (i.e., the second student model), and Teacher (i.e., the teacher model) are trained using the training set. Student and Teacher are trained independently: their detection losses are computed, and the parameters of each model are updated by back propagation based on the loss values. Student-KD is trained by knowledge distillation using the training set and the trained Teacher: the loss between the outputs of the intermediate or output layers of Teacher and Student-KD is calculated, and the parameters of Student-KD are updated based on the loss value to obtain the trained Student-KD.
In the screening stage, unlabeled samples are used as the input of Student, Student-KD, and Teacher respectively to obtain lane line detection results, where the Student output may also include features extracted from the input image. The distance between the output results of Student and Student-KD can be calculated to obtain distance 1, and the distance between the output results of Student-KD and Teacher to obtain distance 2; the uncertainty is then calculated from distance 1 and distance 2. The diversity of each sample relative to the data pool can be calculated from the features in the Student output. The samples are then scored by combining uncertainty and diversity, one or more unlabeled samples are screened out for labeling, the training set is updated with the labeled samples, and the next iteration is performed with the updated training set until a model meeting the conditions is obtained, the number of iterations reaches a preset number, the number of added labeled samples reaches a preset number, or the like, as selected according to the actual application scenario.
The training phase and the screening phase are each exemplified below.
First, training phase
The lane line model training stage involves 3 models in total: the Student model can use the PointLaneNet model with a resnet122 backbone network (backbone); the Teacher model uses the PointLaneNet model structure with a senet154 backbone; and the Student-KD model may have a structure identical or close to that of the Student model.
As shown in FIG. 8, the labeled data are input into the Student model for independent training to obtain the Student model parameters, and the labeled data, i.e., the samples in the training set, are input into the Teacher model for independent training to obtain the Teacher model parameters. Finally, the Student-KD model is trained: by constructing a loss value between Student-KD and the Teacher model, the richer knowledge contained in the Teacher model is transferred to the Student-KD model.
For example, the distillation process may be accomplished using a characteristic layer distillation and an output layer distillation. And the characteristic layer distillation is to guide the updating of the characteristic layer of the student model according to the output result of the characteristic layer of the teacher model, and the output layer distillation is to guide the updating of the student model according to the output result of the output layer of the teacher model. Therefore, the output result of the characteristic layer of the student model is closer to the output result of the middle layer of the teacher model, the output result of the output layer of the student model is closer to the output result of the output layer of the teacher model, and knowledge distillation is achieved.
The feature-layer distillation helps the student network learn the teacher network's feature encoding of the foreground. Specifically, a region-of-interest feature alignment (ROI Align) module, as used in two-stage detectors, can extract the foreground from the output features of the FPN and filter background noise. In the distillation process, the loss function may use the mean square error (MSE), for example expressed as:
L_feat = (1/Ne) Σ_{c,w,h} ( f_adap(F^S)_{c,w,h} − F^T_{c,w,h} )²
where F^S and F^T represent the features of the student model and the teacher model respectively, f_adap is the feature adaptive mapping function, L, W, H and C denote the hierarchy and dimensions of the features, and Ne = C × W × H. After the loss value is calculated, the student model can be updated by back propagation based on the loss value, thereby obtaining an updated student model.
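Under these definitions, the feature-layer distillation loss can be sketched as follows (NumPy; the identity function stands in for f_adap, which in practice would be a learned adaptive mapping):

```python
import numpy as np

def feature_distill_loss(f_student, f_teacher, adapt=lambda f: f):
    """MSE feature-distillation loss normalized by Ne = C * W * H.

    f_student, f_teacher: arrays of shape (C, W, H); `adapt` plays the
    role of the adaptive mapping f_adap (identity here for illustration).
    """
    c, w, h = f_teacher.shape
    n_e = c * w * h
    diff = adapt(f_student) - f_teacher
    return float(np.sum(diff ** 2) / n_e)

ft = np.zeros((2, 3, 3))
fs = np.ones((2, 3, 3))
loss = feature_distill_loss(fs, ft)  # each of the 18 entries differs by 1
```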
For the output-layer distillation, a cross-entropy loss Lcls can be employed to construct the loss value for the classification task, and for the positioning task a positioning loss Lloc carrying an attention-like mechanism can be used to more effectively migrate the teacher network's information about object locations, expressed as:
Lcls = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} y^T_{i,c} · log( y^S_{i,c} )

Lloc = (1/N) Σ_{i=1}^{N} conf^T_i · Dist( t^T_i , t^S_i )

where N represents the number of candidate regions, y^T_{i,c} represents the score of category c predicted for the i-th candidate region by the teacher network, y^S_{i,c} represents the corresponding score predicted by the student network, and C represents the number of categories. t^T_i represents the positioning information predicted for the i-th candidate region by the teacher network (e.g., an offset in the xyz direction from the anchor point), and conf^T_i represents the confidence predicted by the teacher model for the i-th candidate region on its class.
Second, screening stage
In the screening stage, the three trained models are used to run inference on the unlabeled data. For each sample, the deviation (distance) between Student and Student-KD and the deviation (distance) between Student-KD and Teacher can be calculated, and the uncertainty is calculated from them according to a formula.
An influence Set (influence Set) for each sample is constructed for the unlabeled data, and then sample diversity (diversity) can be calculated. By combining the uncertainty and diversity of the samples, samples with rich information content can be selected from the unlabeled data set.
The manner in which the uncertainty and diversity are calculated is described in detail below.
1. Degree of uncertainty
First, the deviation between the output results of two models can be defined as follows. Given models M1 and M2, their inference deviation on a sample p is defined by:

D(M1, M2, p) = Dist(l1, l2)

where M1(p) is the inference result of M1 and M2(p) is the inference result of M2; l1 is the information of a lane line obtained by inference (e.g., lane line position, length, etc.), l2 is the lane line in M2(p) whose spatial distance to l1 is the closest, and Dist is the Euclidean distance between l1 and l2.
In the screening stage in fig. 7, unlabeled data are input into the models obtained by training in the training stage; for example, for a sample p, the detection results of the 3 models can be obtained respectively. Define DSS as the deviation between the outputs of Student-KD and Student, and DST as the deviation between the outputs of Student-KD and Teacher.
The uncertainty of p is expressed as:
Uncr(p) = f(DSS, DST), a fusion of DSS and DST whose fusion coefficient is determined by the ratio of the two deviations.
according to DSSAnd DSTThe data is divided into a plurality of classes, as shown in table 1:
Figure BDA0003214352790000163
TABLE 1
(1) DSS small, DST small: all 3 models can predict the sample accurately, so the sample is a simple sample and does not need manual labeling.
(2) DSS small, DST large: there is a large gap between the Teacher model and Student-KD, indicating that the Student-KD model did not learn this sample well from the Teacher model, possibly because the knowledge is difficult or the data is noisy. A small DSS indicates that the sample is not a noisy sample; if it were noisy, DSS would also be relatively large. The sample therefore contains valuable information, and manual labeling is preferred.
(3) DSS large, DST small: a small DST indicates that the sample contains knowledge that is easily learned, but a large DSS indicates that the Teacher model likely passed wrong knowledge to the Student-KD model. The sample therefore also contains valuable information, and manual labeling is preferred.
(4) DSS large, DST large: the inference results of the 3 models differ greatly from one another. One reason may be that the sample itself is noisy and not valuable; another is that the sample is too complex, so none of the 3 models can predict it accurately. For these 2 reasons, this sample has low labeling priority.
It should be noted that "large" and "small" above are relative, and the specific division can be determined according to the actual application scenario; for example, if the DSS of sample 1 is 0.1 and the DSS of sample 2 is 0.2, the DSS of sample 1 can be regarded as small and the DSS of sample 2 as large.
2. Diversity
As shown in fig. 9, the Student model is used to extract features for each sample in the unlabeled sample set; the features can be represented as a vector or a matrix. Then, for each sample, its Reverse Nearest Neighbors (RNN) are queried in the unlabeled data set SU to construct its Influence Set.
As shown in fig. 10, point p is the nearest neighbor of 5 nearby points, which are therefore p's RNNs. Given a sample p, the unlabeled sample set SU, and the currently selected sample set Q, where Q ⊆ SU, the sample diversity metric is defined as:
Div(p|Q,SU)=|RNN(p)-Q|
Here, the reverse nearest neighbor can be understood as follows: given a data set P and a query point q, the set RNN(q) of samples p ∈ P whose nearest neighbor (Nearest Neighbor, NN) is q is called the influence set of q (Influence Set). It can be calculated by the following formula:

RNN(q) = { p ∈ P | NN(p) = q }
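The influence-set construction and the diversity metric can be sketched as follows (brute-force nearest-neighbor search over feature vectors; a real implementation would use an index structure):

```python
import math

def nn(p_idx, points):
    """Index of the nearest neighbor of points[p_idx] (excluding itself)."""
    best, best_d = None, float("inf")
    for j, q in enumerate(points):
        if j == p_idx:
            continue
        d = math.dist(points[p_idx], q)
        if d < best_d:
            best, best_d = j, d
    return best

def influence_set(q_idx, points):
    """RNN(q): indices of all points whose nearest neighbor is q."""
    return {i for i in range(len(points)) if i != q_idx and nn(i, points) == q_idx}

def diversity(p_idx, selected, points):
    """Div(p | Q, SU) = |RNN(p) - Q|."""
    return len(influence_set(p_idx, points) - selected)

# toy feature vectors: point 2 attracts the others as nearest neighbor
pts = [(0.0, 0.0), (1.0, 0.0), (0.9, 0.1), (5.0, 5.0)]
div = diversity(2, {0}, pts)
```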
the sample may then be scored in combination with the uncertainty and diversity metric values, which may be expressed as:
score=Uncr+βDiv
where β is a weight coefficient and score is the score of the sample.
The first k samples with the highest score value may be selected for labeling, the specific labeling manner may be manual labeling, or labeling may be performed by a trained neural network, and the value of k may be determined according to an actual application scenario, for example, if low-cost training needs to be implemented, a lower value of k may be set.
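The scoring and top-k selection above can be sketched as:

```python
def select_top_k(uncertainties, diversities, k, beta=1.0):
    """score = Uncr + beta * Div; return indices of the k highest-scoring samples."""
    scores = [u + beta * d for u, d in zip(uncertainties, diversities)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# toy values: sample 1 scores 1.6, sample 0 scores 1.4, sample 2 scores 0.5
picked = select_top_k([0.9, 0.1, 0.5], [1, 3, 0], k=2, beta=0.5)
```

The chosen indices would then be sent for labeling; a smaller k trades labeling cost against per-iteration progress.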
After k samples are screened out and added into the training set, the training stage can be executed again until the number of the added and labeled samples reaches the set number or a neural network with an output effect reaching the expected value is obtained, and the like.
Therefore, the embodiment of the present application provides a knowledge-distillation-based neural network training method in which samples are evaluated with a strategy combining uncertainty and diversity, and samples rich in information are then selected. A sample set with high difficulty and strong scene diversity can thereby be selected, which improves the model training effect, reduces the amount of manual labeling, and saves labeling cost. Moreover, based on knowledge distillation, the uncertainty of a sample is calculated from the differences in learning ability of different models (Student, Student-KD, and Teacher), so that high-noise samples can be effectively distinguished from high-uncertainty samples, further refining the uncertainty evaluation. In addition, a diversity calculation method based on the Influence Set is provided that replaces the traditional clustering algorithm; the diversity index of each sample is obtained by calculation, and highly representative samples can be selected from the data set.
In addition, the present application also provides a detection method, as shown in fig. 11, the detection method may include:
1101. an input image is acquired.
If the target model (i.e., the trained second student model) is deployed in the execution device 210 shown in fig. 5, the user may send the input image to the execution device through the local device 301 or 302. Alternatively, after the training device outputs the target model, the target model may be sent to the local device 301 or 302 for deployment there, and the user may directly provide the input image to the local device 301 or 302 to obtain the output result.
1102. And taking the input image as the input of the trained second student model, and outputting a detection result.
The detection result may include information of the detected object, and specifically may include a position (e.g., expressed as coordinates) of the object in the input image or a category of the object.
The process of obtaining the trained second student model may refer to the processes described in fig. 6 to fig. 10, and is not described herein again.
In the embodiment of the application, a knowledge-distillation-based neural network training method is provided: samples are evaluated with a strategy combining uncertainty and diversity, and samples rich in information are then selected, so that a sample set with high difficulty and strong scene diversity can be chosen. This improves the model training effect, reduces the amount of manual labeling, and saves labeling cost. Moreover, based on knowledge distillation, the uncertainty of a sample is calculated from the differences in learning ability of different models (Student, Student-KD, and Teacher), so that high-noise samples can be effectively distinguished from high-uncertainty samples, further refining the uncertainty evaluation. In addition, an Influence-Set-based method is provided that replaces the traditional clustering algorithm; the diversity index of each sample is obtained by calculation, and highly representative samples can be selected from the data set. Therefore, the output result of the neural network trained by the neural network training method provided by the application is more accurate, the method is suitable for more types of samples, and its generalization capability is strong.
The foregoing method flow provided by the present application is described in detail, and for convenience of understanding, the following describes an exemplary effect achieved by the method provided by the present application with reference to a specific application scenario.
For example, under the same amount of manual labeling, the output effects of models obtained by several common approaches can be compared on two public data sets, CULane and LLAMAS.
First, some common active learning approaches are introduced:
random (rand): a baseline random selection strategy.
Encopy (Ent): and selecting a sample with a high entropy value for marking.
Ensemble (Ens): and (3) reasoning the unlabeled data sets respectively by using the student and the teacher dual models, and selecting 2 samples with the maximum model reasoning result deviation for manual labeling.
ACD: the method is specially designed for target detection, and cross entropy of a sample is evaluated by using characteristic space information.
5. LLoss: and adding a head (head) in the output part of the model, specifically evaluating the loss of the sample, and selecting the sample with larger loss for labeling.
BADGE: samples were evaluated in combination with uncertainty (uncertainty was assessed using a sample inverse gradient) and diversity (sample diversity was assessed using the KMeans + + method).
The output effects of the models can be seen in fig. 12; the model obtained by the neural network training method provided in the present application clearly performs better.
Specifically, to verify the validity of the uncertainty evaluation, several strategies can be compared. The 1st strategy is method 3 above, abbreviated Ens; the 2nd strategy is the knowledge-distillation-based uncertainty evaluation strategy described herein (part of the neural network training method provided by the present application), abbreviated KD-only. The test results are shown in fig. 13: the uncertainty evaluation method provided by the application performs better than the Ens and Rand methods and can effectively select valuable samples from the unlabeled data set.
Using KD-only as the uncertainty evaluation method and the k-means++ strategy as the diversity evaluation method, the two are combined as a comparative experiment group, abbreviated KD+KM. As can be seen from fig. 13, compared with this k-means++-based group, the complete method provided by the present application still improves performance.
In addition, the foregoing modes 2-5 consider only the uncertainty of samples and not their diversity, so the selected samples are difficult but poorly diverse and drawn from a single scene. The foregoing mode 6 considers the uncertainty and diversity of samples simultaneously, but evaluates them from different angles. For uncertainty, mode 6 evaluates a sample by the magnitude of the back-propagated gradient it causes: the larger the gradient, the richer the sample's information. However, since the data are unlabeled, the calculated gradient value is only an estimate; it is less accurate and cannot faithfully represent the uncertainty of the sample. For diversity, mode 6 first trains a k-means++ classifier and then runs inference on the unlabeled data to divide them into different categories. However, since the diversity of samples is difficult to capture with a fixed set of categories, the selected samples are not necessarily representative.
Therefore, the neural network training method provided by the application introduces knowledge distillation to obtain a Student model with a smaller structure and high output precision, and calculates the uncertainty of samples from the differences in learning ability of different models (Student, Student-KD, and Teacher), so that high-noise samples can be effectively distinguished from high-uncertainty samples, further refining the uncertainty evaluation. In addition, an Influence-Set-based method is provided that replaces the traditional clustering algorithm; the diversity index of each sample is obtained by calculation, and highly representative samples can be selected from the data set, improving training efficiency and the output accuracy of the model. The method can adapt to various scenes and has strong generalization capability.
The method flow provided by the present application is described in detail in the foregoing, and the apparatus provided by the present application is described below with reference to the foregoing method flow.
First, referring to fig. 14, the present application provides a neural network training device, including:
an input module 1401, configured to use a sample in the data set as an input of a teacher model to obtain a first output result, use the sample in the data set as an input of a first student model to obtain a second output result, and use the sample in the data set as an input of a second student model to obtain a third output result, where the sample in the data set does not carry a label, the teacher model and the first student model are obtained by using a training set for training, the second student model is obtained by using the teacher model for knowledge distillation, and the training set includes a sample carrying a label;
a screening module 1402, configured to screen out at least one unlabeled sample from the dataset by a first deviation and a second deviation, where the first deviation includes a distance between a first output result and a third output result, and the second deviation includes a distance between the second output result and the third output result;
an obtaining module 1403, configured to obtain at least one labeled sample, and update the at least one labeled sample to a training set to obtain an updated training set, where the at least one labeled sample is obtained by adding a label to at least one unlabeled sample;
and the training module 1404 is used for training the teacher model and the first student model by using the updated training set to obtain a trained teacher model and a trained first student model, and performing knowledge distillation on the second student model by using the trained teacher model to obtain a trained second student model.
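A minimal functional sketch of the flow across these modules is given below, with toy stand-in models. The weight matrices, the Euclidean distance, and the simple additive fusion of the two deviations are assumptions made for illustration; the patent fuses the deviations according to their ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
W_t, W_s, W_kd = (rng.normal(size=(4, 2)) for _ in range(3))

def teacher(x):    return np.tanh(x @ W_t)     # first output result
def student(x):    return np.tanh(x @ W_s)     # second output result
def student_kd(x): return np.tanh(x @ W_kd)    # third output result (distilled)

def select_unlabeled(samples, k):
    """Rank unlabeled samples by the two deviations and keep the top k."""
    scores = []
    for x in samples:
        dev1 = np.linalg.norm(teacher(x) - student_kd(x))   # first deviation
        dev2 = np.linalg.norm(student(x) - student_kd(x))   # second deviation
        scores.append(dev1 + dev2)    # simple fusion for illustration only
    order = np.argsort(scores)[::-1]  # highest-scoring first
    return [samples[i] for i in order[:k]]

unlabeled = [rng.normal(size=4) for _ in range(20)]
picked = select_unlabeled(unlabeled, 3)   # samples to send for labeling
```

After the picked samples are labeled and merged into the training set, the teacher and first student would be retrained and the second student re-distilled, closing the loop described by modules 1403 and 1404.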
In a possible implementation, the screening module 1402 is specifically configured to: calculate the uncertainty of the samples in the data set according to the first deviation and the second deviation, the uncertainty indicating the amount of information included in a sample in the data set; and screen out at least one unlabeled sample from the data set according to the uncertainty.
In a possible implementation, the screening module 1402 is specifically configured to: calculating a diversity metric for the samples in the data set from the second output result, the diversity metric being indicative of the diversity of the samples in the data set relative to the data set; and screening out at least one unlabeled sample from the data set according to the uncertainty and the diversity measure.
In a possible implementation, the screening module 1402 is specifically configured to: carrying out weighted fusion on the uncertainty and the diversity measurement of each sample in the data set to obtain the score of each sample; and screening out at least one unmarked sample from the data set according to the score of each sample.
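The weighted fusion and score-based screening just described can be sketched as follows. The weight values are assumptions; the patent leaves them unspecified.

```python
def fuse_scores(uncertainties, diversities, w_u=0.7, w_d=0.3):
    """Weighted fusion of per-sample uncertainty and diversity into one score."""
    return [w_u * u + w_d * d for u, d in zip(uncertainties, diversities)]

def screen_top_k(samples, scores, k):
    """Keep the k samples with the highest fused score."""
    order = sorted(range(len(samples)), key=scores.__getitem__, reverse=True)
    return [samples[i] for i in order[:k]]

scores = fuse_scores([0.9, 0.1, 0.5], [0.2, 0.8, 0.5])
chosen = screen_top_k(["a", "b", "c"], scores, 1)   # "a" wins on uncertainty
```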
In a possible implementation, the screening module 1402 is specifically configured to: and according to the ratio of the first deviation to the second deviation, fusing the first deviation and the second deviation to obtain the uncertainty of the sample in the data set.
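One plausible reading of this ratio-based fusion is sketched below: deviations of comparable size suggest genuine uncertainty, while an extreme ratio (for example, a large teacher/KD gap paired with a tiny student/KD gap) suggests label noise, so the ratio damps such samples. The exact fusion formula is not disclosed in this text; this particular form is an assumption.

```python
def uncertainty_from_deviations(dev1, dev2, eps=1e-8):
    """Fuse the first and second deviations according to their ratio.

    The ratio lies in (0, 1]: it is 1 when the deviations are comparable and
    shrinks toward 0 as they diverge, damping likely-noisy samples.
    """
    ratio = min(dev1, dev2) / (max(dev1, dev2) + eps)
    return ratio * (dev1 + dev2)

balanced = uncertainty_from_deviations(1.0, 1.0)    # comparable deviations
skewed   = uncertainty_from_deviations(10.0, 0.1)   # extreme ratio, likely noise
```

Under this reading, `balanced` scores far above `skewed` even though the skewed pair has the larger raw deviation, which is exactly the noise/uncertainty separation the description attributes to the method.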
In one possible embodiment, the second output result includes features extracted by the first student model from the samples of the data set, and a first sample is any one of the samples in the data set;
in a possible implementation, the screening module 1402 is specifically configured to: determining a plurality of inverse nearest neighbors of a first sample in the data set according to characteristics of the samples in the data set; a diversity metric for the first sample is calculated based on the plurality of inverse nearest neighbors.
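A brute-force sketch of the inverse (reverse) nearest-neighbor computation on the features extracted by the first student model is given below. The feature values and the neighborhood size `k` are assumptions, and using the inverse-nearest-neighbor count as the diversity metric is one plausible choice rather than the patent's disclosed formula.

```python
import numpy as np

def inverse_nearest_neighbors(feats, i, k=1):
    """Indices j whose k nearest neighbors (excluding j itself) include sample i."""
    rnn = []
    for j in range(len(feats)):
        if j == i:
            continue
        d = np.linalg.norm(feats - feats[j], axis=1)  # distances from sample j
        d[j] = np.inf                                 # exclude j itself
        if i in np.argsort(d)[:k]:
            rnn.append(j)
    return rnn

def diversity_metric(feats, i, k=1):
    """A sample that is the nearest neighbor of many others is representative."""
    return len(inverse_nearest_neighbors(feats, i, k))

feats = np.array([[0.0, 0.0], [0.1, 0.0], [-0.1, 0.0], [5.0, 5.0]])
central = diversity_metric(feats, 0)   # nearest neighbor of two other samples
outlier = diversity_metric(feats, 3)   # nobody's nearest neighbor
```

The brute-force pass costs O(n^2) distance evaluations; a spatial index would be needed at data-set scale, but the selection logic is the same.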
In one possible embodiment, the samples in the training set are images including lane lines, and the teacher model, the first student model and the second student model are used for detecting the lane line information in the input images.
Referring to fig. 15, the present application provides a schematic structural diagram of a detection apparatus, including:
an input module 1501 for acquiring an input image;
an output module 1502, configured to take the input image as an input of the trained second student model, and output a detection result;
the process of obtaining the trained second student model may refer to the steps shown in fig. 6 to fig. 13, and is not described herein again.
Referring to fig. 16, a schematic structural diagram of another neural network training device provided in the present application is as follows.
The neural network training device may include a processor 1601 and a memory 1602. The processor 1601 and the memory 1602 are interconnected by a line. The memory 1602 has stored therein program instructions and data.
The memory 1602 stores program instructions and data corresponding to the steps in fig. 6-13.
The processor 1601 is configured to perform the method steps performed by the neural network training device shown in any one of the foregoing embodiments of fig. 6 to 13.
Optionally, the neural network training device may further comprise a transceiver 1603 for receiving or transmitting data.
Also provided in embodiments of the present application is a computer-readable storage medium, which stores a program that, when executed on a computer, causes the computer to perform the steps in the method described in the foregoing embodiments shown in fig. 6 to 13.
Alternatively, the aforementioned neural network training device shown in fig. 16 is a chip.
Referring to fig. 17, a schematic structural diagram of another detecting device provided in the present application is shown as follows.
The detection apparatus may include a processor 1701 and a memory 1702. The processor 1701 and the memory 1702 are interconnected by wires. Among other things, memory 1702 has stored therein program instructions and data.
The memory 1702 stores program instructions and data corresponding to the steps of fig. 11 described above.
The processor 1701 is configured to execute the method steps performed by the detection apparatus shown in fig. 11.
Optionally, the detection apparatus may further include a transceiver 1703 for receiving or transmitting data.
Also provided in embodiments of the present application is a computer-readable storage medium, which stores a program that, when executed on a computer, causes the computer to perform the steps in the method described in the foregoing embodiments shown in fig. 6 to 13.
Alternatively, the detection device shown in fig. 17 described above is a chip.
The embodiment of the present application further provides a neural network training device, which may also be referred to as a digital processing chip or a chip, where the chip includes a processing unit and a communication interface, the processing unit obtains program instructions through the communication interface, and the program instructions are executed by the processing unit, and the processing unit is configured to execute the method steps executed by the neural network training device shown in any one of the foregoing fig. 6 to fig. 13.
The embodiment of the present application further provides a detection apparatus, which may also be referred to as a digital processing chip or a chip, where the chip includes a processing unit and a communication interface, the processing unit obtains program instructions through the communication interface, and the program instructions are executed by the processing unit, and the processing unit is configured to execute the method steps executed by the detection apparatus shown in fig. 11.
The embodiment of the application also provides a digital processing chip. Circuits and one or more interfaces for implementing the functions of the aforementioned processor 1601 are integrated in the digital processing chip. When a memory is integrated in the chip, the digital processing chip may perform the method steps of any one or more of the preceding embodiments. When no memory is integrated in the chip, the digital processing chip may be connected to an external memory through a communication interface, and implements the actions performed by the neural network training device in the above embodiments according to the program code stored in the external memory.
Also provided in embodiments of the present application is a computer program product, which when run on a computer, causes the computer to perform the steps of the method as described in the foregoing embodiments shown in fig. 6 to 13.
The neural network training device provided by the embodiment of the application can be a chip, and the chip comprises: a processing unit, which may be for example a processor, and a communication unit, which may be for example an input/output interface, a pin or a circuit, etc. The processing unit may execute the computer executable instructions stored in the storage unit to enable the chip in the server to execute the neural network training method or the neural network detection method described in the embodiments shown in fig. 6 to 13. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, and the like, and the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a Random Access Memory (RAM), and the like.
Specifically, the aforementioned processing unit or processor may be a central processing unit (CPU), a network processor (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
Referring to fig. 18, fig. 18 is a schematic structural diagram of a chip according to an embodiment of the present disclosure, where the chip may be represented as a neural network processor NPU 180, and the NPU 180 is mounted on a main CPU (Host CPU) as a coprocessor, and the Host CPU allocates tasks. The core portion of the NPU is an arithmetic circuit 1803, and the controller 1804 controls the arithmetic circuit 1803 to extract matrix data in the memory and perform multiplication.
In some implementations, the arithmetic circuit 1803 includes multiple processing units (PEs) inside. In some implementations, the operational circuitry 1803 is a two-dimensional systolic array. The arithmetic circuit 1803 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 1803 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 1802 and buffers it in each PE of the arithmetic circuit. The arithmetic circuit takes the matrix A data from the input memory 1801, performs a matrix operation with matrix B, and stores the partial or final result of the matrix in an accumulator 1808.
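The data flow just described can be sketched functionally as follows, collapsing the systolic timing into a simple rank-1-update loop: B is pre-loaded (weight memory), columns of A stream in (input memory), and each beat's partial result is summed into the accumulator. This is an illustration of the arithmetic, not of the hardware's actual scheduling.

```python
import numpy as np

def systolic_matmul(A, B):
    """Compute C = A @ B the way the accumulator sees it: one rank-1
    partial result per beat, summed into the accumulator."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    acc = np.zeros((n, m))                 # accumulator 1808
    for t in range(k):                     # one column of A meets one row of B
        acc += np.outer(A[:, t], B[t, :])  # partial result of this beat
    return acc

A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)
C = systolic_matmul(A, B)
```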
The unified memory 1806 is used for storing input data and output data. The weight data is transferred to the weight memory 1802 through a direct memory access controller (DMAC) 1805. The input data is also carried into the unified memory 1806 by the DMAC.
A bus interface unit (BIU) 1810 is used for the interaction of the AXI bus with the DMAC and the instruction fetch buffer (IFB) 1809.
The bus interface unit 1810 is used for the instruction fetch memory 1809 to obtain instructions from the external memory, and is also used for the storage unit access controller 1805 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1806, to transfer weight data to the weight memory 1802, or to transfer input data to the input memory 1801.
The vector calculation unit 1807 includes a plurality of operation processing units and, when necessary, further processes the output of the operation circuit, for example through vector multiplication, vector addition, exponential operation, logarithmic operation, or magnitude comparison. It is mainly used for non-convolution/fully-connected layer computation in the neural network, such as batch normalization, pixel-level summation, and up-sampling of feature planes.
In some implementations, the vector calculation unit 1807 can store the processed output vector to the unified memory 1806. For example, the vector calculation unit 1807 may apply a linear function and/or a non-linear function to the output of the arithmetic circuit 1803, such as linear interpolation of the feature planes extracted by the convolutional layers, or application of a non-linear function to a vector of accumulated values to generate activation values. In some implementations, the vector calculation unit 1807 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 1803, for example for use in a subsequent layer of the neural network.
An instruction fetch buffer 1809 connected to the controller 1804, configured to store instructions used by the controller 1804;
the unified memory 1806, the input memory 1801, the weight memory 1802, and the instruction fetch memory 1809 are all On-Chip memories. The external memory is private to the NPU hardware architecture.
The operation of each layer in the recurrent neural network can be performed by the operation circuit 1803 or the vector calculation unit 1807.
Where any of the aforementioned processors may be a general purpose central processing unit, microprocessor, ASIC, or one or more integrated circuits configured to control the execution of the programs of the methods of fig. 6-13, as described above.
It should be noted that the above-described embodiments of the apparatus are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiments of the apparatus provided in the present application, the connection relationship between the modules indicates that there is a communication connection therebetween, and may be implemented as one or more communication buses or signal lines.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus the necessary general-purpose hardware, and certainly can also be implemented by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components, and the like. Generally, functions performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structure implementing the same function may take many forms, such as an analog circuit, a digital circuit, or a dedicated circuit. For the present application, however, implementation by a software program is the preferable implementation in most cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments of the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or a data center integrated with one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), among others.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Finally, it should be noted that: the above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application.

Claims (24)

1. A neural network training method, comprising:
the method comprises the steps of taking a sample in a data set as an input of a teacher model to obtain a first output result, taking the sample in the data set as an input of a first student model to obtain a second output result, taking the sample in the data set as an input of a second student model to obtain a third output result, wherein the sample in the data set does not carry a label, the teacher model and the first student model are obtained by training through a training set, the second student model is obtained by performing knowledge distillation through the teacher model, and the training set comprises a sample carrying a label;
screening out at least one unlabeled sample from the dataset by a first deviation and a second deviation, the first deviation comprising a distance between the first output result and the third output result, the second deviation comprising a distance between the second output result and the third output result;
obtaining at least one marked sample, and updating the at least one marked sample into the training set to obtain an updated training set, wherein the at least one marked sample is obtained by adding a label to the at least one unmarked sample;
and training the teacher model and the first student model by using the updated training set to obtain a trained teacher model and a trained first student model, and performing knowledge distillation on the second student model by using the trained teacher model to obtain a trained second student model.
2. The method of claim 1, wherein the screening out at least one unlabeled sample from the dataset by a first bias and a second bias comprises:
calculating an uncertainty of a sample in the data set from the first deviation and the second deviation, the uncertainty being indicative of a magnitude of an amount of information included in the sample in the data set;
and screening the at least one unlabeled sample from the dataset according to the uncertainty.
3. The method of claim 2, wherein said screening said at least one unlabeled sample from said dataset according to said uncertainty comprises:
calculating a diversity metric for a sample in the dataset from the second output result, the diversity metric being indicative of the diversity of the sample in the dataset with respect to the dataset;
screening the at least one unlabeled sample from the dataset according to the uncertainty and the diversity metric.
4. The method of claim 3, wherein the second output result includes features extracted by the first student model from samples of the data set, the first sample being any one of the samples of the data set;
said computing a diversity metric for samples in said dataset from said second output result, comprising:
determining a plurality of inverse nearest neighbors of the first sample in the data set according to characteristics of samples in the data set;
calculating a diversity metric for the first sample from the plurality of inverse nearest neighbors.
5. The method of claim 3 or 4, wherein said screening said at least one unlabeled sample from said dataset according to said uncertainty and said diversity measure comprises:
performing weighted fusion on the uncertainty and the diversity metric of each sample in the data set to obtain a score of each sample;
and screening the at least one unlabeled sample from the data set according to the score of each sample.
6. The method according to any one of claims 2-5, wherein said calculating an uncertainty of a sample in said data set from said first deviation and said second deviation comprises:
and fusing the first deviation and the second deviation according to the ratio of the first deviation and the second deviation to obtain the uncertainty of the sample in the data set.
7. The method of any one of claims 1-6, wherein the samples in the training set are images including lane lines, and wherein the teacher model, the first student model, and the second student model are used to detect lane line information in the input images.
8. A method of detection, comprising:
acquiring an input image;
taking the input image as the input of the trained second student model, and outputting a detection result;
the trained second student model is obtained by training the second student model through active learning, and the active learning process comprises the following steps: the method comprises the steps of taking a sample in a data set as an input of a teacher model to obtain a first output result, taking the sample in the data set as an input of a first student model to obtain a second output result, taking the sample in the data set as an input of a second student model to obtain a third output result, wherein the sample in the data set does not carry a label, the teacher model and the first student model are obtained by training through a training set, the second student model is obtained by performing knowledge distillation through the teacher model, and the training set comprises a sample carrying a label; screening out at least one unlabeled sample from the dataset by a first deviation and a second deviation, the first deviation comprising a distance between the first output result and the third output result, the second deviation comprising a distance between the second output result and the third output result; obtaining at least one marked sample, wherein the at least one marked sample is obtained by adding a label to the at least one unmarked sample; training the teacher model and the first student model by using the at least one labeled sample to obtain a trained teacher model and a trained first student model, and performing knowledge distillation on the second student model by using the trained teacher model to obtain the trained second student model.
9. The method of claim 8, wherein during the active learning, the at least one unlabeled sample is selected from the dataset based on an uncertainty representing a magnitude of an amount of information included in a sample in the dataset, the uncertainty being derived based on the first bias and the second bias.
10. The method of claim 9, wherein during the active learning, the at least one unlabeled sample is screened from the dataset according to the uncertainty and a diversity metric, the diversity metric being indicative of a diversity of the sample in the dataset relative to the dataset, the diversity metric being derived from the second output result.
11. A neural network training device, comprising:
the input module is used for taking a sample in a data set as the input of a teacher model to obtain a first output result, taking the sample in the data set as the input of a first student model to obtain a second output result, and taking the sample in the data set as the input of a second student model to obtain a third output result, wherein the sample in the data set does not carry a label, the teacher model and the first student model are obtained by training with a training set, the second student model is obtained by carrying out knowledge distillation with the teacher model, and the training set comprises a sample carrying a label;
a screening module configured to screen out at least one unlabeled sample from the dataset by a first deviation and a second deviation, the first deviation including a distance between the first output result and the third output result, the second deviation including a distance between the second output result and the third output result;
the acquisition module is used for acquiring at least one labeled sample, updating the at least one labeled sample into the training set to obtain an updated training set, and adding a label to the at least one unlabeled sample to obtain the at least one labeled sample;
and the training module is used for training the teacher model and the first student model by using the updated training set to obtain a trained teacher model and a trained first student model, and performing knowledge distillation on the second student model by using the trained teacher model to obtain a trained second student model.
12. The apparatus of claim 11, wherein the screening module is specifically configured to:
calculating an uncertainty of a sample in the data set from the first deviation and the second deviation, the uncertainty being indicative of a magnitude of an amount of information included in the sample in the data set;
and screening the at least one unlabeled sample from the dataset according to the uncertainty.
13. The apparatus of claim 12, wherein the screening module is specifically configured to:
calculating a diversity metric for a sample in the dataset from the second output result, the diversity metric being indicative of the diversity of the sample in the dataset with respect to the dataset;
screening the at least one unlabeled sample from the dataset according to the uncertainty and the diversity metric.
14. The apparatus of claim 13, wherein the second output includes features extracted by the first student model from samples of the data set, a first sample being any one of the samples of the data set;
the screening module is specifically configured to:
determining a plurality of inverse nearest neighbors of the first sample in the data set according to characteristics of samples in the data set;
calculating a diversity metric for the first sample from the plurality of inverse nearest neighbors.
15. The apparatus according to claim 13 or 14, wherein the screening module is specifically configured to:
performing weighted fusion on the uncertainty and the diversity metric of each sample in the data set to obtain a score of each sample;
and screening the at least one unlabeled sample from the data set according to the score of each sample.
16. The apparatus according to any one of claims 12 to 15, wherein the screening module is specifically configured to:
and fusing the first deviation and the second deviation according to the ratio of the first deviation and the second deviation to obtain the uncertainty of the sample in the data set.
17. The apparatus of any of claims 11-16, wherein the samples in the training set are images including lane lines, and wherein the teacher model, the first student model, and the second student model are used to detect lane line information in the input images.
18. A detection device, comprising:
the input module is used for acquiring an input image;
the output module is used for taking the input image as the input of the trained second student model and outputting a detection result;
the trained second student model is obtained by training the second student model through active learning, and the active learning process comprises the following steps: the method comprises the steps of taking a sample in a data set as an input of a teacher model to obtain a first output result, taking the sample in the data set as an input of a first student model to obtain a second output result, taking the sample in the data set as an input of a second student model to obtain a third output result, wherein the sample in the data set does not carry a label, the teacher model and the first student model are obtained by training through a training set, the second student model is obtained by performing knowledge distillation through the teacher model, and the training set comprises a sample carrying a label; screening out at least one unlabeled sample from the dataset by a first deviation and a second deviation, the first deviation comprising a distance between the first output result and the third output result, the second deviation comprising a distance between the second output result and the third output result; obtaining at least one marked sample, wherein the at least one marked sample is obtained by adding a label to the at least one unmarked sample; training the teacher model and the first student model by using the at least one labeled sample to obtain a trained teacher model and a trained first student model, and performing knowledge distillation on the second student model by using the trained teacher model to obtain the trained second student model.
19. The apparatus of claim 18, wherein during the active learning, the at least one unlabeled sample is selected from the data set based on an uncertainty that represents a magnitude of an amount of information included in a sample in the data set, the uncertainty being derived from the first bias and the second bias.
20. The apparatus of claim 19, wherein during the active learning, the at least one unlabeled sample is screened from the dataset according to the uncertainty and a diversity metric, the diversity metric being indicative of a diversity of the sample in the dataset relative to the dataset, the diversity metric being derived from the second output result.
21. A neural network training device comprising a processor coupled to a memory, the memory storing a program, the program instructions stored by the memory when executed by the processor implementing the method of any one of claims 1 to 7.
22. A detection apparatus comprising a processor coupled to a memory, the memory storing a program, the program instructions stored by the memory when executed by the processor implementing the method of any of claims 8 to 10.
23. A computer readable storage medium comprising a program which, when executed by a processing unit, performs the method of any of claims 1 to 7 or 8 to 10.
24. A computer program product comprising a computer program, characterized in that the computer program realizes the method according to any one of claims 1 to 7 or 8 to 10 when executed by a processor.
CN202110939696.4A 2021-08-16 2021-08-16 Neural network training method, neural network detection method and neural network training device Active CN113807399B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110939696.4A CN113807399B (en) 2021-08-16 2021-08-16 Neural network training method, neural network detection method and neural network training device


Publications (2)

Publication Number Publication Date
CN113807399A true CN113807399A (en) 2021-12-17
CN113807399B CN113807399B (en) 2024-01-09

Family

ID=78893813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110939696.4A Active CN113807399B (en) 2021-08-16 2021-08-16 Neural network training method, neural network detection method and neural network training device

Country Status (1)

Country Link
CN (1) CN113807399B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113987309A (en) * 2021-12-29 2022-01-28 深圳红途科技有限公司 Personal privacy data identification method and device, computer equipment and storage medium
CN114492731A (en) * 2021-12-23 2022-05-13 北京达佳互联信息技术有限公司 Training method and device of image processing model and electronic equipment
CN114648683A (en) * 2022-05-23 2022-06-21 天津所托瑞安汽车科技有限公司 Neural network performance improving method and device based on uncertainty analysis
CN115034836A (en) * 2022-08-12 2022-09-09 腾讯科技(深圳)有限公司 Model training method and related device
CN115827876A (en) * 2023-01-10 2023-03-21 中国科学院自动化研究所 Method and device for determining unlabeled text and electronic equipment
CN116091773A (en) * 2023-02-02 2023-05-09 北京百度网讯科技有限公司 Training method of image segmentation model, image segmentation method and device
CN116681123A (en) * 2023-07-31 2023-09-01 福思(杭州)智能科技有限公司 Perception model training method, device, computer equipment and storage medium
CN117892799A (en) * 2024-03-15 2024-04-16 中国科学技术大学 Financial intelligent analysis model training method and system with multi-level tasks as guidance
WO2024109910A1 (en) * 2022-11-26 2024-05-30 华为技术有限公司 Generative model training method and apparatus and data conversion method and apparatus

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180174001A1 (en) * 2016-12-15 2018-06-21 Samsung Electronics Co., Ltd. Method of training neural network, and recognition method and apparatus using neural network
US20200034703A1 (en) * 2018-07-27 2020-01-30 International Business Machines Corporation Training of student neural network with teacher neural networks
US20200293903A1 (en) * 2019-03-13 2020-09-17 Cortica Ltd. Method for object detection using knowledge distillation
CN112819155A (en) * 2021-01-22 2021-05-18 中国人民解放军国防科技大学 Deep neural network model hierarchical compression method and device applied to edge equipment
CN113095475A (en) * 2021-03-02 2021-07-09 华为技术有限公司 Neural network training method, image processing method and related equipment


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Hongli et al.: "Context-adaptive deep perception model on terminal devices with edge-device fusion" (边端融合的终端情境自适应深度感知模型), Journal of Zhejiang University (Engineering Science), vol. 55, no. 4, pages 626 - 638 *


Also Published As

Publication number Publication date
CN113807399B (en) 2024-01-09

Similar Documents

Publication Publication Date Title
CN111797893B (en) Neural network training method, image classification system and related equipment
JP7185039B2 (en) Image classification model training method, image processing method and apparatus, and computer program
CN113807399B (en) Neural network training method, neural network detection method and neural network training device
WO2022083536A1 (en) Neural network construction method and apparatus
US20220092351A1 (en) Image classification method, neural network training method, and apparatus
US20210012198A1 (en) Method for training deep neural network and apparatus
CN110555481B (en) Portrait style recognition method, device and computer readable storage medium
CN112446398B (en) Image classification method and device
CN112990211B (en) Training method, image processing method and device for neural network
CN111507378A (en) Method and apparatus for training image processing model
WO2021022521A1 (en) Method for processing data, and method and device for training neural network model
CN111291809B (en) Processing device, method and storage medium
CN111797983A (en) Neural network construction method and device
CN113570029A (en) Method for obtaining neural network model, image processing method and device
CN110222718B (en) Image processing method and device
CN111339818B (en) Face multi-attribute recognition system
WO2021129668A1 (en) Neural network training method and device
CN113449573A (en) Dynamic gesture recognition method and device
CN113516227B (en) Neural network training method and device based on federal learning
CN111931764A (en) Target detection method, target detection framework and related equipment
CN113011568B (en) Model training method, data processing method and equipment
CN114550053A (en) Traffic accident responsibility determination method, device, computer equipment and storage medium
CN113592060A (en) Neural network optimization method and device
WO2021136058A1 (en) Video processing method and device
WO2022012668A1 (en) Training set processing method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant