CN113191489B - Training method of binary neural network model, image processing method and device - Google Patents

Training method of binary neural network model, image processing method and device

Info

Publication number
CN113191489B
CN113191489B
Authority
CN
China
Prior art keywords
neural network
binary
matrix
weight matrix
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110494162.5A
Other languages
Chinese (zh)
Other versions
CN113191489A (en)
Inventor
刘传建
王云鹤
韩凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202110494162.5A
Publication of CN113191489A
Application granted
Publication of CN113191489B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to image processing technology in the field of computer vision within the field of artificial intelligence, and discloses a training method for a binary neural network model, an image processing method, and an image processing device. The training method comprises the following steps. S1: determining a knowledge distillation framework, where the teacher network is a trained neural network model and the student network is an initial binary neural network model M_0. S2: training the binary neural network model M_j using the (j+1)-th batch of images and a target loss function to obtain a binary neural network model M_{j+1}; the target loss function includes an angle loss term, which describes the difference between the angle between the feature matrix and the weight matrix in the teacher network and the angle between the feature matrix and the weight matrix in the student network. S3: when a preset condition is met, taking the binary neural network model M_{j+1} as the target binary neural network model; otherwise, letting j = j+1 and repeating step S2. According to the embodiments of the application, the prediction accuracy of the binary neural network model can be improved.

Description

Training method of binary neural network model, image processing method and device
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a training method of a binary neural network model, an image processing method and an image processing device.
Background
Computer vision is an integral part of intelligent/autonomous systems in many application fields, such as manufacturing, inspection, document analysis, and medical diagnosis. It studies how to use cameras/camcorders and computers to acquire the data and information we need about a photographed object. Figuratively speaking, the computer is given eyes (the camera/camcorder) and a brain (the algorithm) so that it can recognize, track and measure targets in place of human eyes, thereby perceiving its environment. Because perception can be viewed as extracting information from sensory signals, computer vision can also be viewed as the science of how to make an artificial system "perceive" from images or multidimensional data. Generally, computer vision uses various imaging systems in place of the visual organs to obtain input information, and then uses a computer in place of the brain to process and interpret that information. The ultimate research goal of computer vision is to enable a computer to observe and understand the world visually as a human does and to adapt to its environment autonomously.
Image classification (IC), object detection (OD) and image segmentation (IS) are important problems in high-level visual semantic understanding, and with the rapid development of artificial intelligence technology, these three basic tasks are applied ever more widely in the field of computer vision. Deep convolutional neural networks play an increasingly important role in all three tasks, especially object detection, but a deep convolutional neural network model usually has millions of parameters and requires billions of floating point operations (FLOPs), which limits its deployment on resource-constrained platforms. To enable efficient online inference on embedded devices, a deep convolutional neural network model is currently quantized to obtain a binary neural network model that carries out the above basic computer vision tasks; the parameters of the binary neural network model are determined from the parameters of the deep convolutional neural network model.
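For background, the sketch below shows conventional 1-bit weight quantization in PyTorch (a generic illustration, not the specific scheme claimed in this application): weights are mapped to their signs, and a per-layer scale factor preserves their average magnitude.

```python
import torch

def binarize_weights(w: torch.Tensor):
    # Generic 1-bit quantization: map weights to {-1, +1} by their sign and
    # keep a per-layer scale factor that preserves the average magnitude.
    alpha = w.abs().mean()
    w_bin = torch.sign(w)
    return w_bin, alpha

w = torch.randn(64, 32, 3, 3)        # a full-precision convolution kernel
w_bin, alpha = binarize_weights(w)
# conv(x, w) is then approximated by alpha * conv(sign(x), w_bin).
```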
However, the binary neural network model in the prior art suffers a considerable drop in prediction accuracy compared with the full-precision neural network model.
Disclosure of Invention
The embodiment of the application provides a training method of a binary neural network model, an image processing method and an image processing device, and can effectively improve the prediction precision of the binary neural network model obtained through training.
In a first aspect, the present application provides a training method for a binary neural network model, including: S1: determining a knowledge distillation framework; where the teacher network in the knowledge distillation framework is a trained neural network model, the student network in the knowledge distillation framework is an initial binary neural network model M_0, the teacher network and the student network each comprise N layers of neural networks, and N is a positive integer; S2: training a binary neural network model M_j using the (j+1)-th batch of images and a target loss function to obtain a binary neural network model M_{j+1}; where the binary neural network model M_j is obtained by training on the j-th batch of images, and j is a positive integer; the target loss function includes an angle loss term, which describes the difference between a first angle corresponding to the i-th layer neural network in the teacher network and a second angle corresponding to the i-th layer neural network in the student network; the first angle is obtained based on the weight matrix of the i-th layer neural network in the teacher network and the input matrix of the (j+1)-th batch of images at the i-th layer neural network in the teacher network; the second angle is obtained based on the binary weight matrix of the i-th layer neural network in the student network and the binary input matrix of the (j+1)-th batch of images at the i-th layer neural network in the student network; i is a positive integer less than or equal to N; S3: when a preset condition is met, taking the binary neural network model M_{j+1} as the target binary neural network model; otherwise, letting j = j+1 and repeating step S2.
It should be appreciated that, during each training pass, the model parameters in the student network are adjusted in the direction that minimizes the target loss function value. In other words, the smaller the angle loss term in the target loss function, the smaller the performance gap between the student network and the teacher network.
It can be seen that, in the embodiment of the application, the trained teacher network in the knowledge distillation framework guides the training of the student network, and an angle loss term is designed in the target loss function to update the parameters of the student network. On one hand, this drives the student network's feature extraction results on the input samples toward those of the teacher network; on the other hand, it drives the angle between the binary weight matrix and the binary input matrix in the student network toward the angle between the weight matrix and the input matrix in the teacher network. In short, unlike prior-art training of binary neural network models, which does not consider the angle loss introduced by quantization, introducing the knowledge distillation framework and the angle loss term lets the performance of the trained student network approach that of the teacher network as closely as possible, thereby improving the prediction accuracy of the target binary neural network model obtained by training in the embodiment of the application.
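As a concrete illustration of the angle loss term, a minimal PyTorch sketch for one fully connected layer follows; the cosine-based angle computation and the mean-squared penalty are assumptions, since this section does not fix an exact formula.

```python
import torch
import torch.nn.functional as F

def layer_angles(weight: torch.Tensor, inputs: torch.Tensor) -> torch.Tensor:
    # Angle between each weight row and each input vector of one layer,
    # computed through the cosine similarity.
    cos = F.cosine_similarity(inputs.unsqueeze(1),   # (batch, 1, dim)
                              weight.unsqueeze(0),   # (1, out, dim)
                              dim=2)
    return torch.acos(cos.clamp(-1 + 1e-6, 1 - 1e-6))

def angle_loss(w_teacher, x_teacher, bw_student, bx_student):
    # Penalize the difference between the first angle (teacher side) and the
    # second angle (student side) of the same layer.
    first_angle = layer_angles(w_teacher, x_teacher)
    second_angle = layer_angles(bw_student, bx_student)
    return F.mse_loss(second_angle, first_angle)
```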
In a possible embodiment, the objective loss function further includes a convolution result loss term; the convolution result loss item is used for describing the difference between a first convolution output result of the ith layer of neural network in the teacher network and a second convolution output result of the ith layer of neural network in the student network; the first convolution output result is obtained based on a weight matrix of an ith layer neural network in the teacher network and an input matrix of a j +1 th batch of images in the ith layer neural network in the teacher network; and the second convolution output result is obtained based on a binary weight matrix corresponding to the ith layer of neural network in the student network, a corresponding weight scaling scale factor and a binary input matrix of the j +1 th batch of images in the ith layer of neural network in the student network.
It should be appreciated that the smaller the convolution result loss term, the smaller the target loss function value, indicating that the performance of the student network is closer to that of the teacher network.
It can be seen that, in the embodiment of the application, the convolution result loss term is introduced into the target loss function so that the second convolution output result in the student network is as close as possible to the first convolution output result in the teacher network; that is, the output of each layer of neural network in the student network tracks the output of the corresponding layer in the teacher network. This keeps the predicted values output by the student network close to those of the teacher network, and improves the prediction accuracy of the target binary neural network model obtained after training.
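A minimal sketch of the convolution result loss term for one layer; the mean-squared error and the padding are assumptions, since the section only requires some measure of the difference between the two convolution outputs.

```python
import torch
import torch.nn.functional as F

def conv_result_loss(x_teacher, w_teacher, bx_student, bw_student, alpha):
    # First convolution output result: teacher weight matrix applied to the
    # teacher-side input matrix.
    first = F.conv2d(x_teacher, w_teacher, padding=1)
    # Second convolution output result: binary weight matrix applied to the
    # binary input matrix, rescaled by the weight scaling scale factor.
    second = alpha * F.conv2d(bx_student, bw_student, padding=1)
    return F.mse_loss(second, first)
```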
In a possible embodiment, the objective loss function further includes a weight loss term; the weight loss item is used for describing the difference between the weight matrix of the ith layer neural network in the teacher network and the binary weight matrix of the ith layer neural network in the student network.
It can be seen that, in the embodiment of the application, a weight loss term representing the difference between the weight matrix in the teacher network and the binary weight matrix in the student network is introduced into the target loss function, and the weight loss term, the convolution result loss term and the angle loss term jointly train the student network. Introducing these three performance measures into the target loss function improves the performance of the target binary neural network model obtained after training as much as possible, so that it approaches the performance of the teacher network as closely as possible.
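Putting the three terms together, the overall target loss function plausibly takes a weighted form such as the following, where the task loss, the balancing coefficients λ1, λ2, λ3 and the choice of squared norms are assumptions not fixed by this section; θ_i denotes the layer angles, Y_i the convolution output results, W_i^T the teacher weight matrices and B_i^S the student binary weight matrices:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{task}}
  \;+\; \lambda_{1}\sum_{i=1}^{N}\bigl(\theta_{i}^{T}-\theta_{i}^{S}\bigr)^{2}
  \;+\; \lambda_{2}\sum_{i=1}^{N}\bigl\|Y_{i}^{T}-Y_{i}^{S}\bigr\|_{2}^{2}
  \;+\; \lambda_{3}\sum_{i=1}^{N}\bigl\|W_{i}^{T}-B_{i}^{S}\bigr\|_{2}^{2}
```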
In one possible embodiment, training the binary neural network model M_j using the (j+1)-th batch of images and the target loss function to obtain the binary neural network model M_{j+1} includes: inputting the (j+1)-th batch of images into the binary neural network model M_j to obtain predicted values of the (j+1)-th batch of images; and updating the parameters in each layer of neural network of the binary neural network model M_j based on the predicted values of the (j+1)-th batch of images, the labels of the (j+1)-th batch of images, and the target loss function, to obtain the binary neural network model M_{j+1}.
It should be understood that once the forward propagation of one training pass ends, the embodiment of the present application obtains the image predicted values from the result of that forward propagation; the loss function value of the training pass is then computed from the image predicted values, the image labels and the target loss function, and the model parameters are updated based on that loss function value.
It can be seen that, in the embodiment of the present application, during each training pass of the binary neural network model, the parameters in the student network can be updated layer by layer using the predicted values of the images used in that pass, the labels of those images, and the angle loss term, convolution result loss term and weight loss term contained in the target loss function. In summary, by comparing the student network's predicted values with the image labels on one hand, and comparing the corresponding model parameters of the student and teacher networks (the weight loss term in the target loss function) and the intermediate computation results of the models (the angle loss term and the convolution result loss term) on the other, the trained target binary neural network model comes closer to the teacher network in both feature extraction and prediction, that is, the model prediction accuracy is improved.
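A minimal sketch of one such training pass in PyTorch; the model interfaces (networks returning per-layer statistics alongside predictions) and the name `target_loss_fn` are illustrative assumptions, not taken from this application.

```python
import torch

def train_step(student, teacher, images, labels, target_loss_fn, optimizer):
    # Forward propagation: the student returns its predicted values plus the
    # per-layer quantities needed by the angle, convolution-result and
    # weight loss terms; the teacher is frozen and only provides guidance.
    preds, student_stats = student(images)
    with torch.no_grad():
        _, teacher_stats = teacher(images)
    # Back propagation: the target loss combines the prediction error against
    # the labels with the three distillation terms.
    loss = target_loss_fn(preds, labels, student_stats, teacher_stats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # updates e.g. probability matrices and scale factors
    return loss.item()
```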
In a possible embodiment, inputting the (j+1)-th batch of images into the binary neural network model M_j to obtain the predicted values of the (j+1)-th batch of images includes: P1: obtaining a binary weight matrix of the i-th layer neural network based on the reference weight matrix and the probability matrix corresponding to the i-th layer neural network in the binary neural network model M_j; P2: obtaining a second convolution output result of the i-th layer neural network from the binary input matrix and the binary weight matrix of the (j+1)-th batch of images at the i-th layer neural network; where an element at any position in the probability matrix represents the probability that the element at that position in the binary weight matrix takes the value of the element at the same position in the reference weight matrix; P3: letting i = i+1, repeating steps P1-P2, and obtaining the predicted values of the (j+1)-th batch of images based on the second convolution output result of the N-th layer neural network.
It can be seen that, in the embodiment of the present application, in the forward propagation of each layer of neural network during each training pass, the binary weight matrix of the layer is determined from the reference weight matrix and the probability matrix corresponding to that layer, and the second convolution output result of the layer is then computed from its binary input matrix and binary weight matrix; when the forward propagation reaches the N-th layer neural network, the image predicted values output by the model in this training pass can be computed from the second convolution output result of the N-th layer. In the subsequent back propagation, the loss function value of this training pass is computed from the second and first convolution output results obtained in the forward propagation, the image predicted values and the image labels, and the model parameters are then adjusted according to that loss function value, ensuring that an optimal target binary neural network model is obtained.
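Expressed as code, the forward propagation of steps P1-P3 might look as follows; `model.layers`, `sample_binary_weights`, the padding and the final pooling are assumed structure, not taken from this application.

```python
import torch
import torch.nn.functional as F

def forward_propagate(model, x):
    for layer in model.layers:
        # P1: derive the layer's binary weight matrix from its reference
        #     weight matrix and probability matrix.
        b_w = layer.sample_binary_weights()
        # P2: binarize the layer input and compute the second convolution
        #     output result (scaled by the weight scaling scale factor).
        x_bin = torch.sign(x)
        x = layer.alpha * F.conv2d(x_bin, b_w, padding=1)
    # P3: after the N-th layer, the predicted values are read off the last
    #     second convolution output result.
    return x.mean(dim=(2, 3))   # e.g. global pooling into class scores
```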
In a possible embodiment, the reference weight matrix includes a first reference weight matrix and a second reference weight matrix, and the probability matrix includes a first probability matrix and a second probability matrix; obtaining a binary weight matrix of the i-th layer neural network based on the reference weight matrix and the probability matrix corresponding to the i-th layer neural network in the binary neural network model M_j includes: determining the element at any position in the target binary weight matrix based on the first probability value, in the first probability matrix, corresponding to the element at that position in the first reference weight matrix and the second probability value, in the second probability matrix, corresponding to the element at that position in the second reference weight matrix; where an element at any position in the first probability matrix represents the probability that the element at that position in the binary weight matrix takes the value of the element at the same position in the first reference weight matrix, and an element at any position in the second probability matrix represents the probability that the element at that position in the binary weight matrix takes the value of the element at the same position in the second reference weight matrix.
It can be seen that, in the embodiment of the present application, the reference weight matrix in each layer of the student network includes a first reference weight matrix and a second reference weight matrix, and the probability matrix in each layer includes a first probability matrix and a second probability matrix, where the first probability matrix corresponds to the first reference weight matrix and the second probability matrix corresponds to the second reference weight matrix. Based on the probability values of the elements at the same position in the first and second reference weight matrices, the element at that position in either the first or the second reference weight matrix is selected as the element of the binary weight matrix; the binary weight matrix of each layer in the current training pass is obtained according to this rule. The second convolution output result of each layer and the target loss function value of each training pass are then computed from the binary weight matrix, which keeps the training procedure correct and ultimately yields an optimal binary neural network model.
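A minimal sketch of this element-wise selection rule; deriving the two probability matrices from one set of logits through a sigmoid is an illustrative assumption, since the application only states that the two probabilities govern which reference element is taken.

```python
import torch

def sample_binary_weights(ref1, ref2, logits):
    # First/second probability matrices: the probability that each element of
    # the binary weight matrix takes the corresponding element of the first
    # or the second reference weight matrix.
    p1 = torch.sigmoid(logits)
    p2 = 1.0 - p1
    # Take the element of whichever reference matrix is more probable.
    return torch.where(p1 >= p2, ref1, ref2)
```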
In a possible embodiment, obtaining the second convolution output result of the i-th layer neural network according to the binary input matrix and the binary weight matrix of the (j+1)-th batch of images at the i-th layer neural network includes: performing a convolution operation on the binary input matrix of each image in the (j+1)-th batch at the i-th layer neural network and the binary weight matrix to obtain a reference feature matrix of each image; and scaling the reference feature matrix of each image with the weight scaling scale factor of the i-th layer neural network to obtain the second convolution output result.
It can be seen that, in the embodiment of the application, the reference feature matrix of each image is obtained from the binary input matrix and the binary weight matrix in each layer of neural network, and the reference feature matrix of each image is then scaled with the weight scaling scale factor of that layer to obtain the second convolution output result. The second convolution output result represents the features of the input image; designing the convolution result loss term in the target loss function therefore lets the trained target binary neural network model retain as much of the teacher network's feature extraction ability as possible, which improves the prediction accuracy of the target binary neural network model.
In one possible embodiment, the parameter includes at least one of a probability matrix or a weight scaling scale factor.
It can be seen that, in the embodiment of the present application, at least one of the probability matrix or the weight scaling scale factor in each layer of the neural network is updated during back propagation in each training pass, so that the binary weight matrix of each layer is updated in the forward propagation of the next pass. The image predicted values and the loss function value output by the model in training are then obtained based on the updated binary weight matrix and/or weight scaling scale factor, and the model parameters are further adjusted based on the resulting loss function value, ensuring that a student network whose performance is close to that of the teacher network is obtained.
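The following sketch collects the pieces above into a single student layer whose trainable parameters are exactly the probability matrix (stored as logits) and the weight scaling scale factor. The fixed ±1 reference matrices and the hard element-wise selection are illustrative simplifications; a practical implementation would additionally need a straight-through estimator so that gradients reach the logits and the binarized input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryConvLayer(nn.Module):
    def __init__(self, out_ch, in_ch, k):
        super().__init__()
        # Reference weight matrices, kept fixed in this illustration.
        self.register_buffer("ref1", torch.ones(out_ch, in_ch, k, k))
        self.register_buffer("ref2", -torch.ones(out_ch, in_ch, k, k))
        # Trainable parameters: probability matrix (logits) and scale factor.
        self.logits = nn.Parameter(torch.zeros(out_ch, in_ch, k, k))
        self.alpha = nn.Parameter(torch.ones(1))

    def forward(self, x):
        b_w = torch.where(torch.sigmoid(self.logits) >= 0.5,
                          self.ref1, self.ref2)        # binary weight matrix
        x_bin = torch.sign(x)                          # binary input matrix
        ref_features = F.conv2d(x_bin, b_w, padding=1) # reference feature matrix
        return self.alpha * ref_features               # second convolution output
```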
In a second aspect, the present application provides a model training method, where the model includes a teacher network and a student network, the teacher network is a trained neural network model, the student network is a binary neural network model, the teacher network and the student network each comprise N layers of neural networks, and N is a positive integer, the method including: training the binary neural network model using the teacher network and a target loss function; where the target loss function includes an angle loss term, which describes the difference between a first angle corresponding to the i-th layer neural network in the teacher network and a second angle corresponding to the i-th layer neural network in the student network; the first angle is obtained based on the weight matrix of the i-th layer neural network in the teacher network and the input matrix at the i-th layer neural network in the teacher network; the second angle is obtained based on the binary weight matrix of the i-th layer neural network in the student network and the binary input matrix at the i-th layer neural network in the student network; i is a positive integer less than or equal to N; and repeating this step until an iteration termination condition is met, to obtain a target binary neural network model.
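A compact sketch of this outer loop, reusing the illustrative `train_step` from the first aspect; the fixed step budget standing in for the iteration termination condition is an assumption.

```python
def train_until_converged(student, teacher, loader, target_loss_fn,
                          optimizer, max_steps=10_000):
    for step, (images, labels) in enumerate(loader):
        train_step(student, teacher, images, labels,
                   target_loss_fn, optimizer)
        if step + 1 >= max_steps:   # assumed iteration termination condition
            break
    return student                  # the target binary neural network model
```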
In one possible embodiment, the target loss function further includes a convolution result loss term; the convolution result loss item is used for describing the difference between a first convolution output result of the ith layer of neural network in the teacher network and a second convolution output result of the ith layer of neural network in the student network; the first convolution output result is obtained based on a weight matrix of an ith layer of neural network in the teacher network and an input matrix of the ith layer of neural network in the teacher network; and the second convolution output result is obtained based on a binary weight matrix corresponding to the ith layer of neural network in the student network, a corresponding weight scaling scale factor and a binary input matrix in the ith layer of neural network in the student network.
In one possible embodiment, the objective loss function further includes a weight loss term; the weight loss item is used for describing the difference between the weight matrix of the ith layer neural network in the teacher network and the binary weight matrix of the ith layer neural network in the student network.
In a possible implementation, the training of the binary neural network model using the teacher network and the target loss function includes: inputting the training image into a binary neural network model to obtain a predicted value of the training image; and updating parameters in the binary neural network model based on the predicted values of the training images, the labels of the training images and the target loss function.
In a possible implementation, the inputting the training image into the binary neural network model to obtain the predicted value of the training image includes: p1: obtaining a binary weight matrix of the ith layer of neural network based on a reference weight matrix and a probability matrix corresponding to the ith layer of neural network in the binary neural network model; wherein, any position element in the probability matrix is used for representing the probability value of the element at any position in the binary weight matrix from the element at any position in the reference weight matrix; p2: obtaining a second convolution output result of the ith layer of neural network according to the binary weight matrix and a binary input matrix of the training image in the ith layer of neural network;
P3: letting i = i+1, repeating steps P1-P2, and obtaining the predicted value of the training image based on the second convolution output result of the N-th layer neural network.
In one possible embodiment, the reference weight matrix comprises a first reference weight matrix and a second reference weight matrix, and the probability matrix comprises a first probability matrix and a second probability matrix; obtaining a binary weight matrix of the ith layer of neural network based on a reference weight matrix and a probability matrix corresponding to the ith layer of neural network in the binary neural network model, wherein the binary weight matrix comprises: determining an element at any position in the target binary weight matrix based on a corresponding first probability value of any position element in the first reference weight matrix in the first probability matrix and a corresponding second probability value of any position element in the second reference weight matrix in the second probability matrix; wherein, any position element in the first probability matrix is used for representing the probability value of the element at any position in the binary weight matrix taking the element at any position in the first reference weight matrix; any position element in the second probability matrix is used for representing the probability value of the element at any position in the binary weight matrix taking the element at any position in the second reference weight matrix.
In a possible implementation manner, the obtaining a second convolution output result of the ith layer neural network according to the binary weight matrix and a binary input matrix of the training image in the ith layer neural network includes: performing convolution operation on the binary weight matrix and a binary input matrix of the training image in the ith layer of neural network to obtain a reference characteristic matrix of the training image; and scaling the reference characteristic matrix of the training image by using the weight scaling scale factor of the ith layer of neural network to obtain a second convolution output result.
In one possible embodiment, the parameter includes at least one of a probability matrix or a weight scaling scale factor.
It should be understood that beneficial effects of the embodiments in the second aspect may correspond to the description with reference to the corresponding embodiments in the first aspect, and are not described herein again.
In a third aspect, the present application provides an image processing method, including: acquiring an image to be processed; and performing image processing on the image to be processed using a target binary neural network model to obtain a predicted value of the image to be processed. The target binary neural network model is obtained through K training passes, and in the (j+1)-th of the K training passes: a binary neural network model M_j is trained using the (j+1)-th batch of images and a target loss function to obtain a binary neural network model M_{j+1}; the binary neural network model M_j is the student network in a knowledge distillation framework; the teacher network in the knowledge distillation framework is a trained neural network model, the teacher network and the student network each comprise N layers of neural networks, and N is a positive integer; the target loss function contains an angle loss term; K is a positive integer, and j is an integer greater than or equal to zero and less than or equal to K. The angle loss term describes the difference between a first angle corresponding to the i-th layer neural network in the teacher network and a second angle corresponding to the i-th layer neural network in the student network; the first angle is obtained based on the weight matrix corresponding to the i-th layer neural network in the teacher network and the input matrix of the (j+1)-th batch of images at the i-th layer neural network in the teacher network; the second angle is obtained based on the binary weight matrix corresponding to the i-th layer neural network in the student network and the binary input matrix of the (j+1)-th batch of images at the i-th layer neural network in the student network; i is a positive integer less than or equal to N.
It can be seen that, in the embodiment of the application, since the method in the first aspect introduces a knowledge distillation framework during training and introduces a corresponding angle loss term into a target loss function, the model accuracy of the target binary neural network model obtained by training through the method in the first aspect is greatly improved compared with the existing binary neural network model; meanwhile, compared with a teacher network, the binary neural network occupies a smaller storage space and is lighter, so that the binary neural network is more suitable for being used on embedded equipment and has a wider application prospect.
In one possible embodiment, the image processing includes at least one of image classification, object detection, or image segmentation.
It can be seen that the method in the embodiment of the present application can be used in any task of image classification, target detection and image segmentation, and the image processing method in the embodiment of the present application can be applied to the three tasks to improve the image processing effect, that is, the model has good universality.
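A hypothetical deployment of the trained target binary neural network model for image classification; the file names, preprocessing and output interpretation are illustrative assumptions.

```python
import torch
from torchvision import transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

target_model = torch.load("target_binary_model.pt")  # assumed saved after training
target_model.eval()

image = preprocess(Image.open("to_process.jpg"))     # image to be processed
with torch.no_grad():
    pred = target_model(image.unsqueeze(0))          # predicted value of the image
print(pred.argmax(dim=1))                            # e.g. the predicted class
```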
In a fourth aspect, the present application provides an image processing method, including: acquiring an image to be processed; and performing image processing on the image to be processed using a target binary neural network model to obtain a predicted value of the image to be processed. The target binary neural network model is obtained by training an initial binary neural network model M_0 in a knowledge distillation framework with a target loss function; the initial binary neural network model M_0 is the student network in the knowledge distillation framework, and the teacher network in the knowledge distillation framework is a trained neural network model. The target loss function includes an angle loss term, which describes the difference between the angle between the feature matrix and the weight matrix in the teacher network and the angle between the feature matrix and the weight matrix in the student network.
It can be seen that, in the embodiment of the application, since the method in the first aspect introduces a knowledge distillation framework during training and introduces a corresponding angle loss term into the target loss function, the model accuracy of the target binary neural network model trained by the method in the first aspect is greatly improved over existing binary neural network models; at the same time, compared with the neural network model in the teacher network, the model parameters of the binary neural network occupy less storage space and the model is more lightweight, so the binary neural network has good application prospects on embedded devices.
In a fifth aspect, the present application provides a training apparatus for a binary neural network model, the apparatus comprising: a determination unit for performing step S1, a training unit for performing step S2, and a decision unit for performing step S3. Step S1: determining a knowledge distillation framework; where the teacher network in the knowledge distillation framework is a trained neural network model, the student network is an initial binary neural network model M_0, the teacher network and the student network each comprise N layers of neural networks, and N is a positive integer. Step S2: training a binary neural network model M_j using the (j+1)-th batch of images and a target loss function to obtain a binary neural network model M_{j+1}; where the binary neural network model M_j is obtained by training on the j-th batch of images, and j is a positive integer; the target loss function includes an angle loss term, which describes the difference between a first angle corresponding to the i-th layer neural network in the teacher network and a second angle corresponding to the i-th layer neural network in the student network; the first angle is obtained based on the weight matrix of the i-th layer neural network in the teacher network and the input matrix of the (j+1)-th batch of images at the i-th layer neural network in the teacher network; the second angle is obtained based on the binary weight matrix of the i-th layer neural network in the student network and the binary input matrix of the (j+1)-th batch of images at the i-th layer neural network in the student network; i is a positive integer less than or equal to N. Step S3: when a preset condition is met, taking the binary neural network model M_{j+1} as the target binary neural network model; otherwise, letting j = j+1 and repeating step S2.
In a possible embodiment, the objective loss function further includes a convolution result loss term; the convolution result loss item is used for describing the difference between a first convolution output result of the ith layer of neural network in the teacher network and a second convolution output result of the ith layer of neural network in the student network; the first convolution output result is obtained based on a weight matrix of an ith layer of neural network in the teacher network and an input matrix of a j +1 th batch of images in the ith layer of neural network in the teacher network; and the second convolution output result is obtained based on a binary weight matrix corresponding to the ith layer of neural network in the student network, a corresponding weight scaling scale factor and a binary input matrix of the j +1 th batch of images in the ith layer of neural network in the student network.
In a possible embodiment, the objective loss function further includes a weight loss term; the weight loss item is used for describing the difference between the weight matrix of the neural network at the ith layer in the teacher network and the binary weight matrix of the neural network at the ith layer in the student network.
In a possible embodiment, the training unit is specifically configured to: input the (j+1)-th batch of images into the binary neural network model M_j to obtain predicted values of the (j+1)-th batch of images; and update the parameters of each layer of neural network of the binary neural network model M_j based on the predicted values of the (j+1)-th batch of images, the labels of the (j+1)-th batch of images, and the target loss function, to obtain the binary neural network model M_{j+1}.
In one possible embodiment, in the aspect of inputting the (j+1)-th batch of images into the binary neural network model M_j to obtain the predicted values of the (j+1)-th batch of images, the training unit is specifically configured to: P1: obtain a binary weight matrix of the i-th layer neural network based on the reference weight matrix and the probability matrix corresponding to the i-th layer neural network in the binary neural network model M_j; P2: obtain a second convolution output result of the i-th layer neural network from the binary input matrix and the binary weight matrix of the (j+1)-th batch of images at the i-th layer neural network; where an element at any position in the probability matrix represents the probability that the element at that position in the binary weight matrix takes the value of the element at the same position in the reference weight matrix; P3: letting i = i+1, repeat steps P1-P2, and obtain the predicted values of the (j+1)-th batch of images based on the second convolution output result of the N-th layer neural network.
In a possible embodiment, the reference weight matrix includes a first reference weight matrix and a second reference weight matrix, and the probability matrix includes a first probability matrix and a second probability matrix; in the aspect of obtaining a binary weight matrix of the i-th layer neural network based on the reference weight matrix and the probability matrix corresponding to the i-th layer neural network in the binary neural network model M_j, the training unit is specifically configured to: determine the element at any position in the target binary weight matrix based on the first probability value, in the first probability matrix, corresponding to the element at that position in the first reference weight matrix and the second probability value, in the second probability matrix, corresponding to the element at that position in the second reference weight matrix; where an element at any position in the first probability matrix represents the probability that the element at that position in the binary weight matrix takes the value of the element at the same position in the first reference weight matrix, and an element at any position in the second probability matrix represents the probability that the element at that position in the binary weight matrix takes the value of the element at the same position in the second reference weight matrix.
In a possible implementation manner, in the aspect of obtaining the second convolution output result of the i-th layer neural network according to the binary input matrix and the binary weight matrix of the j + 1-th batch of images in the i-th layer neural network, the training unit is specifically configured to: respectively performing convolution operation on a binary input matrix and a binary weight matrix of each image in the (j + 1) th batch of images in the ith layer of neural network to obtain a reference characteristic matrix of each image; and scaling the reference characteristic matrix of each image by using the weight scaling scale factor of the ith layer of neural network to obtain a second convolution output result.
In one possible embodiment, the parameter includes at least one of a probability matrix or a weight scaling scale factor.
In a sixth aspect, the present application provides a model training apparatus, the model including a teacher network and a student network, the teacher network being a trained neural network model, the student network being a binary neural network model, the teacher network and the student network respectively including N layers of neural networks, N being a positive integer, the apparatus comprising: the training unit is used for training the binary neural network model by utilizing the teacher network and the target loss function; the target loss function comprises an angle loss item, and the angle loss item is used for describing the difference between a first angle corresponding to the ith layer of neural network in the teacher network and a second angle corresponding to the ith layer of neural network in the student network; the first angle is obtained based on a weight matrix of an ith layer of neural network in the teacher network and an input matrix in the ith layer of neural network in the teacher network; the second angle is obtained based on a binary weight matrix of an ith layer of neural network in the student network and a binary input matrix of the ith layer of neural network in the student network; i is a positive integer less than or equal to N; and the decision unit is used for repeatedly executing the steps until an iteration termination condition is met to obtain a target binary neural network model.
In one possible embodiment, the target loss function further includes a convolution result loss term; the convolution result loss item is used for describing the difference between a first convolution output result of the ith layer of neural network in the teacher network and a second convolution output result of the ith layer of neural network in the student network; the first convolution output result is obtained based on a weight matrix of an ith layer of neural network in the teacher network and an input matrix of the ith layer of neural network in the teacher network; and the second convolution output result is obtained based on a binary weight matrix corresponding to the ith layer of neural network in the student network, a corresponding weight scaling scale factor and a binary input matrix in the ith layer of neural network in the student network.
In one possible embodiment, the target loss function further includes a weight loss term; the weight loss item is used for describing the difference between the weight matrix of the ith layer neural network in the teacher network and the binary weight matrix of the ith layer neural network in the student network.
In one possible embodiment, in the training of the binary neural network model using the teacher network and the target loss function, the training unit is specifically configured to: inputting the training image into a binary neural network model to obtain a predicted value of the training image; and updating parameters in the binary neural network model based on the predicted values of the training images, the labels of the training images and the target loss function.
In a possible implementation manner, in the aspect that the training image is input into the binary neural network model to obtain the predicted value of the training image, the training unit is specifically configured to: p1: obtaining a binary weight matrix of the ith layer of neural network based on a reference weight matrix and a probability matrix corresponding to the ith layer of neural network in the binary neural network model; wherein, any position element in the probability matrix is used for representing the probability value of the element at any position in the binary weight matrix from the element at any position in the reference weight matrix; p2: obtaining a second convolution output result of the ith layer of neural network according to the binary weight matrix and a binary input matrix of the training image in the ith layer of neural network; p3: and (5) enabling i = i +1, repeating the steps P1-P2, and obtaining a predicted value of the training image based on a second convolution output result of the Nth-layer neural network.
In one possible embodiment, the reference weight matrix comprises a first reference weight matrix and a second reference weight matrix, and the probability matrix comprises a first probability matrix and a second probability matrix; in the aspect of obtaining a binary weight matrix of an ith layer neural network based on a reference weight matrix and a probability matrix corresponding to the ith layer neural network in a binary neural network model, the training unit is specifically configured to: determining an element at any position in the target binary weight matrix based on a corresponding first probability value of any position element in the first reference weight matrix in the first probability matrix and a corresponding second probability value of any position element in the second reference weight matrix in the second probability matrix; wherein, any position element in the first probability matrix is used for representing the probability value of the element at any position in the binary weight matrix taking the element at any position in the first reference weight matrix; any position element in the second probability matrix is used for representing the probability value of the element at any position in the binary weight matrix taking the element at any position in the second reference weight matrix.
In a possible implementation manner, in the aspect that the second convolution output result of the i-th layer neural network is obtained according to the binary weight matrix and the binary input matrix of the training image in the i-th layer neural network, the training unit is specifically configured to: performing convolution operation on the binary weight matrix and a binary input matrix of the training image in the ith layer of neural network to obtain a reference characteristic matrix of the training image; and scaling the reference characteristic matrix of the training image by using the weight scaling scale factor of the ith layer of neural network to obtain a second convolution output result.
In one possible embodiment, the parameter includes at least one of a probability matrix or a weight scaling scale factor.
In a seventh aspect, the present application provides an image processing apparatus, comprising: an acquisition unit for acquiring an image to be processed; and a processing unit for performing image processing on the image to be processed using a target binary neural network model to obtain a predicted value of the image to be processed. The target binary neural network model is obtained through K training passes, and in the (j+1)-th of the K training passes: a binary neural network model M_j is trained using the (j+1)-th batch of images and a target loss function to obtain a binary neural network model M_{j+1}; the binary neural network model M_j is the student network in a knowledge distillation framework; the teacher network in the knowledge distillation framework is a trained neural network model, the teacher network and the student network each comprise N layers of neural networks, and N is a positive integer; the target loss function contains an angle loss term; K is a positive integer, and j is an integer greater than or equal to zero and less than or equal to K. The angle loss term describes the difference between a first angle corresponding to the i-th layer neural network in the teacher network and a second angle corresponding to the i-th layer neural network in the student network; the first angle is obtained based on the weight matrix corresponding to the i-th layer neural network in the teacher network and the input matrix of the (j+1)-th batch of images at the i-th layer neural network in the teacher network; the second angle is obtained based on the binary weight matrix corresponding to the i-th layer neural network in the student network and the binary input matrix of the (j+1)-th batch of images at the i-th layer neural network in the student network; i is a positive integer less than or equal to N.
In one possible embodiment, the image processing includes at least one of image classification, object detection, or image segmentation.
In an eighth aspect, the present application provides an image processing apparatus, comprising: an acquisition unit for acquiring an image to be processed; and a processing unit for performing image processing on the image to be processed using a target binary neural network model to obtain a predicted value of the image to be processed. The target binary neural network model is obtained by training an initial binary neural network model M_0 in a knowledge distillation framework with a target loss function; the initial binary neural network model M_0 is the student network in the knowledge distillation framework, and the teacher network in the knowledge distillation framework is a trained neural network model. The target loss function includes an angle loss term, which describes the difference between the angle between the feature matrix and the weight matrix in the teacher network and the angle between the feature matrix and the weight matrix in the student network.
In a ninth aspect, the present application provides a model training apparatus comprising a processor and a memory, the memory being configured to store program instructions, the processor being configured to invoke the program instructions to perform the method of any one of the first or second aspects.
In a tenth aspect, the present application provides a chip system, which includes a processor and a memory; wherein the memory is used for storing a target binary neural network model and program instructions; the target binary neural network model is obtained by training based on the method of any one of the first aspect and the second aspect; a processor for reading the program instructions to invoke a target binary neural network model to perform the method of any one of the third or fourth aspects.
The chip may be a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
In an eleventh aspect, the present application provides a terminal device, where the terminal device includes the chip system of the tenth aspect and a discrete device coupled to the chip system; the terminal device includes a car, a camera, a computer, a mobile phone, or a wearable device.
In a twelfth aspect, the present application provides a computer-readable storage medium storing program code for execution by an apparatus, the program code comprising instructions for performing the method of any of the first, second, third or fourth aspects described above.
In a thirteenth aspect, the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any of the first, second, third or fourth aspects described above.
Drawings
The drawings used in the embodiments of the present application are described below.
Fig. 1 is a schematic structural diagram of a system architecture according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a backbone network according to an embodiment of the present application;
fig. 3 is a schematic diagram of a chip hardware structure according to an embodiment of the present disclosure;
FIG. 4 is a block diagram of another system architecture according to an embodiment of the present application;
fig. 5 is a schematic diagram of a convolution operation process in a binary neural network according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of the structure of a knowledge distillation framework provided by the examples of the present application;
FIG. 7 is a flowchart illustrating a training method of a binary neural network model according to an embodiment of the present disclosure;
FIG. 8 is a schematic flow chart diagram illustrating another model training method provided in embodiments of the present application;
FIGS. 9-A through 9-C are schematic diagrams of extracted feature distributions of different network models provided by embodiments of the present application;
FIG. 10 is a schematic diagram illustrating an angle between a feature matrix and a weight matrix in an embodiment of the present application;
fig. 11 is a schematic flowchart of an image processing method according to an embodiment of the present application;
FIG. 12 is a schematic flowchart of another image processing method according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a training apparatus for a binary neural network model according to an embodiment of the present disclosure;
FIG. 14 is a schematic diagram of a model training apparatus according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
FIG. 16 is a schematic hardware configuration diagram of a model training apparatus according to an embodiment of the present application;
fig. 17 is a schematic diagram of a hardware structure of an image processing apparatus according to an embodiment of the present application.
Detailed Description
The embodiments of the present application will be described below with reference to the drawings.
The method and the device can be applied to basic processing tasks in computer vision such as image classification, image segmentation, target detection and the like, for example, picture detection, album management, video recording, safe cities, human-computer interaction and other scenes needing image processing.
It should be understood that the image in the embodiment of the present application may be a still image (or referred to as a still picture) or a dynamic image (or referred to as a dynamic picture), for example, the image in the present application may be a video or a dynamic picture, or the image in the present application may also be a still picture or a photo. For convenience of description, the present application collectively refers to a still image or a moving image as an image in the following embodiments.
The method of the embodiment of the application can be specifically applied to photo album management and target detection scenes, and the two scenes are described in detail below.
Managing the photo album:
A large number of images may be stored in the album of a user's terminal device such as a mobile phone, for example images obtained by taking photos, capturing screenshots, or downloading from the network. When the user needs to find a desired image among this large amount of image data, the method in the embodiment of the present application can classify the images in the album and store different types of images in different directories, such as an animal category, a landscape category, a person category, and so on; the animal category can be further subdivided into subclasses, for example by identifying the specific type of animal in an image and assigning the image to the corresponding subclass.
In this way, the user can quickly and accurately locate the category of the image to be found, which saves the user's time and improves the user experience.
Target detection:
Target detection is to find objects of interest in an image and determine their positions and sizes. For example, if a user wants to find images containing cats in the album of a terminal device, the method in the embodiments of the present application may be adopted to identify all images containing cats on the device, so that the user can select among them.
Therefore, the method in the embodiments of the present application can accurately detect targets in images, thereby screening out images containing the objects that interest the user and improving the user experience.
It should be understood that the above-described album management and object detection are only two specific scenarios to which the method of the embodiment of the present application is applied, and the method of the embodiment of the present application is not limited to the above two scenarios, and the method of the embodiment of the present application can be applied to any scenario requiring image processing, for example, image segmentation. Alternatively, the method in the embodiment of the present application may also be similarly applied to other fields, for example, speech recognition, natural language processing, and the like, which is not limited in the embodiment of the present application.
The method provided by the application is described from the model training side and the model application side as follows:
The training method of the binary neural network model provided in the embodiments of the present application involves computer vision processing and may specifically be applied to data processing methods such as data training, machine learning, and deep learning. It performs symbolic and formalized intelligent information modeling, extraction, preprocessing, and training on training data (such as the to-be-processed images in this application), finally yielding the trained target binary neural network model. Furthermore, the image processing method provided in the embodiments of the present application may use this trained target binary neural network model: input data (such as a to-be-processed image) is fed into the model to obtain output data (such as the predicted value of the to-be-processed image). It should be noted that the training method and the image processing method of the binary neural network model provided in the embodiments of the present application are inventions based on the same concept and can be understood as two parts of one system, or two stages of an overall process, such as a model training stage and a model application stage.
The embodiments of the present application relate to a large number of related applications of neural networks, and in order to better understand the solution of the embodiments of the present application, the following first introduces related terms and concepts of the neural networks and the computer vision field to which the embodiments of the present application may relate.
(1) Image classification
Image classification determines which category of object is contained in the to-be-processed image or video.
(2) Target detection
Target detection identifies all objects of interest in a given to-be-processed image and determines their categories and locations. Because objects vary in appearance, shape, and posture, and imaging is affected by factors such as illumination and occlusion, target detection is one of the most challenging problems in the field of computer vision.
(3) Image segmentation
Image segmentation is divided into instance segmentation and scene segmentation; it mainly determines which target or object each pixel in the to-be-processed image belongs to.
(4) Neural network
A neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes x_s and an intercept of 1 as inputs, and its output may be:

$$ h_{W,b}(x) = f(W^{T}x) = f\Big(\sum_{s=1}^{n} W_s x_s + b\Big) $$

where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which introduces a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field may be a region composed of several neural units.
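As a concrete illustration of the neural unit above (a minimal sketch with arbitrary example values, not code from the patent):

```python
import numpy as np

def sigmoid(z):
    # Activation function f, introducing the nonlinearity described above.
    return 1.0 / (1.0 + np.exp(-z))

# A single neural unit with n = 3 inputs x_s, weights W_s, and bias b.
x = np.array([0.5, -1.2, 3.0])   # inputs x_1..x_3 (arbitrary example values)
W = np.array([0.8, 0.1, -0.4])   # weights W_1..W_3
b = 0.2                          # bias (the intercept-1 input times its weight)

h = sigmoid(np.dot(W, x) + b)    # h = f(sum_s W_s * x_s + b)
print(h)
```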
(5) Deep neural network
A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with many hidden layers; there is no special criterion for "many" here. Based on the positions of the different layers, the layers inside a DNN can be divided into three categories: input layer, hidden layers, and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. The layers are fully connected, that is, any neuron of the i-th layer is connected with every neuron of the (i+1)-th layer. Although a DNN appears complex, the work of each layer is not complex; it is simply the following linear relational expression:

$$ \vec{y} = \alpha(W\vec{x} + \vec{b}) $$

where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the bias vector, W is the weight matrix (also called coefficients), and α() is the activation function. Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has many layers, the coefficients W and the bias vectors $\vec{b}$ are numerous. These parameters are defined in a DNN as follows, taking the coefficient W as an example. Assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^{3}_{24}$: the superscript 3 represents the layer in which the coefficient W lies, and the subscripts correspond to the output index 2 in the third layer and the input index 4 in the second layer. In summary, the coefficient from the k-th neuron at layer (L-1) to the j-th neuron at layer L is defined as $W^{L}_{jk}$. Note that the input layer has no W parameter. In a deep neural network, more hidden layers make the network better able to depict complex situations in the real world. Theoretically, a model with more parameters has higher complexity and larger "capacity", which means it can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and its final goal is to obtain the weight matrices (formed by the vectors W of many layers) of all layers of the trained deep neural network.
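A minimal sketch of the per-layer computation $\vec{y} = \alpha(W\vec{x}+\vec{b})$ and the $W^{L}_{jk}$ indexing convention (sizes and values are arbitrary, not from the patent):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# A 3-layer DNN: input layer (4 units) -> hidden layer (5 units) -> output (2 units).
rng = np.random.default_rng(0)
W2 = rng.standard_normal((5, 4))  # W2[j, k]: coefficient from unit k of layer 1 to unit j of layer 2
b2 = np.zeros(5)
W3 = rng.standard_normal((2, 5))  # W3[1, 3] corresponds to W^3_{24} in 0-based indexing
b3 = np.zeros(2)

x = rng.standard_normal(4)        # input vector
h = relu(W2 @ x + b2)             # each layer computes y = alpha(W x + b)
y = relu(W3 @ h + b3)
print(y)
```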
(6) Convolutional neural network
A convolutional neural network (CNN) is a deep neural network with a convolutional structure. A convolutional neural network includes a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor may be viewed as a filter, and the convolution process may be viewed as convolving a trainable filter with an input image or a convolved feature plane (feature map). A convolutional layer is a layer of neurons in a convolutional neural network that performs convolution processing on the input signal. In a convolutional layer, a neuron may be connected to only some of the neurons in adjacent layers. A convolutional layer usually contains several feature planes, and each feature plane may be composed of some rectangularly arranged neural units. Neural units in the same feature plane share weights, and the shared weights are the convolution kernel. Sharing weights can be understood as meaning that the way image information is extracted is independent of location. The underlying principle is that the statistics of one part of an image are the same as those of the other parts, so image information learned in one part can also be used in another part; the same learned image information can be used at all positions on the image. In the same convolutional layer, multiple convolution kernels can be used to extract different image information; generally, the more convolution kernels, the richer the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
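A minimal sketch of the shared-weight convolution described above: one 3×3 kernel (weight matrix) is slid over the whole image, so the same weights extract the same feature at every location (stride 1, no padding; the kernel is an arbitrary example):

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    # Slide one shared weight matrix (convolution kernel) over the image;
    # the same weights are applied at every location ("shared weights").
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            patch = image[r*stride:r*stride+kh, c*stride:c*stride+kw]
            out[r, c] = np.sum(patch * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
edge_kernel = np.array([[1., 0., -1.]] * 3)  # one kernel responding to vertical edges
print(conv2d_single(image, edge_kernel).shape)  # (4, 4) feature map
```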
(7) Loss function
In the process of training a deep neural network, the output of the network is expected to be as close as possible to the value that is truly desired. Therefore, the weight vector of each layer can be updated according to the difference between the predicted value of the current network and the truly desired target value (of course, there is usually an initialization process before the first update, i.e., parameters are preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and adjustment continues until the deep neural network can predict the truly desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value". This is the purpose of loss functions or objective functions, which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
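A minimal sketch using a squared-error loss as one concrete choice of "difference between the predicted value and the target value" (the patent does not fix a particular loss here):

```python
import numpy as np

def squared_error_loss(predicted, target):
    # A simple loss function: the larger the output (loss), the larger the
    # gap between the prediction and the truly desired target value.
    return float(np.mean((predicted - target) ** 2))

target = np.array([1.0, 0.0])
print(squared_error_loss(np.array([0.9, 0.2]), target))  # small loss: close prediction
print(squared_error_loss(np.array([0.1, 0.9]), target))  # large loss: poor prediction
```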
(8) Back propagation algorithm
A convolutional neural network can use the back propagation (BP) algorithm to correct the values of the parameters in the initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, the input signal is passed forward until the output produces an error loss, and the parameters of the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation movement dominated by the error loss, aimed at obtaining the optimal parameters of the model, such as the weight matrices.
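A minimal sketch of one back-propagation/gradient-descent step for a single linear layer with a squared-error loss (all values illustrative):

```python
import numpy as np

# One linear layer y = W x with squared-error loss L = ||y - t||^2 / 2.
rng = np.random.default_rng(1)
W = rng.standard_normal((2, 3))
x = rng.standard_normal(3)
t = np.array([1.0, -1.0])
lr = 0.05

for step in range(5):
    y = W @ x                      # forward pass until the output
    loss = 0.5 * np.sum((y - t) ** 2)
    grad_W = np.outer(y - t, x)    # backward pass: dL/dW from the output error
    W -= lr * grad_W               # update parameters so the error loss converges
    print(step, round(loss, 4))
```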
(9) Pixel value
The pixel value of an image may be a red-green-blue (RGB) color value, and the pixel value may be a long integer representing a color. For example, a pixel value may be 256 × Red + 100 × Green + 76 × Blue, where Blue represents the blue component, Green the green component, and Red the red component. In each color component, a smaller value means lower luminance and a larger value means higher luminance. For a grayscale image, the pixel value may be a grayscale value.
(10) Entropy
Entropy can express the certainty of a thing: the higher the certainty, the lower the entropy, and vice versa. For a classification task, if the confidence of the classification result of a picture is closer to 0 or 1, the entropy of the classification result is lower; the closer the confidence is to 0.5, the higher the entropy, which represents an uncertain classification result.
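For reference, the notion of entropy invoked here is the standard Shannon entropy (an addition for clarity, not text from the patent). For a binary classification with confidence $p$:

$$ H(p) = -p\log_2 p - (1-p)\log_2(1-p) $$

so $H(0.01) \approx 0.08$ (a near-certain result, low entropy) while $H(0.5) = 1$ (a maximally uncertain result, highest entropy).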
(11) Knowledge distillation
Knowledge distillation is a common method for model compression. In a teacher-student framework, the complex feature-expression "knowledge" learned by a teacher network with strong learning ability is distilled out and transferred to a student network with fewer parameters and weaker learning ability.
(12) Binary neural network
A binary neural network is a neural network in which the neural network parameters (weights) and the convolution outputs activated by nonlinear functions (activations) are represented using only the values 1 and -1. Compared with a full-precision neural network, a binary neural network saves a large amount of memory and computation, which facilitates deploying the model on resource-constrained devices.
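A minimal sketch of the sign-based binarization commonly used in binary neural networks (note that the patent's own scheme, based on reference weight matrices and probability matrices, is described later):

```python
import numpy as np

def binarize(x):
    # Map every entry to +1 or -1; by convention, 0 is sent to +1 here.
    return np.where(x >= 0, 1.0, -1.0)

w = np.array([[0.7, -0.2], [0.0, -1.3]])   # full-precision weights
w_b = binarize(w)                          # binary weights in {-1, +1}
print(w_b)   # [[ 1. -1.] [ 1. -1.]]
# Each binary weight needs only 1 bit instead of 32, and multiplications
# reduce to sign flips, which is why memory and computation are saved.
```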
(13) Teacher network
The teacher network in the embodiments of the present application may be a neural network in which the neural network parameters (weights) and the convolution outputs activated by nonlinear functions (activations) use data types such as full precision (32-bit floating point), half precision (16-bit floating point), or ordinary integers (8-bit, 4-bit, or 2-bit integers).
The following describes a system architecture provided by the embodiments of the present application.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a system architecture 100 according to an embodiment of the present disclosure. As shown in the system architecture 100, the data acquisition device 160 is configured to acquire training data, which in this embodiment includes image data with labels, where the labels of the images may be categories corresponding to the images, or categories corresponding to objects in the images, or categories corresponding to each pixel of the images, and the categories are represented mathematically as a multi-dimensional vector.
After the training data is collected, the data collection device 160 stores the training data into the database 130, and the training device 120 trains to obtain the target model 101 (i.e., the target binary neural network model in the embodiment of the present application) based on the training data maintained in the database 130.
In the following, how the training device 120 obtains the target model 101 based on the training data will be described in more detail in a later embodiment. The target model 101 can be used to implement the image processing method provided in the embodiments of the present application: the to-be-processed image is input into the target model 101 after relevant preprocessing, and the predicted value of the to-be-processed image is obtained. The target model 101 in the embodiments of the present application may specifically be a target binary neural network model; in the embodiments provided in this application, the target binary neural network model is obtained by training the initial binary neural network model M_0. It should be noted that, in practical applications, the training data maintained in the database 130 is not necessarily all acquired by the data acquisition device 160 and may also be received from other devices. It should also be noted that the training device 120 does not necessarily train the target model 101 entirely based on the training data maintained in the database 130; it may also obtain training data from the cloud or elsewhere for model training.
The target model 101 obtained by training with the training device 120 may be applied to different systems or devices, for example the execution device 110 shown in fig. 1. The execution device 110 may be a terminal, such as a mobile phone, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, or it may be a server or a cloud device. In fig. 1, the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices; a user can input data to the I/O interface 112 through a client device 140, and in the embodiments of the present application the input data may include various image or video data.
When the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculation or other related processing, the execution device 110 may call data, code, and the like in the data storage system 150 for the corresponding processing, and may store the data, instructions, and the like obtained by that processing into the data storage system 150.
Finally, the I/O interface 112 returns a processing result, such as the predicted value of the to-be-processed image obtained as described above (i.e., the class label of the to-be-processed image, or the object identified from the to-be-processed image, or the result of segmenting the to-be-processed image) to the client apparatus 140, thereby providing it to the user.
It should be noted that the training device 120 may generate corresponding target models 101 for different targets or different tasks based on different training data, and the corresponding target models 101 may be used to achieve the targets or complete the tasks, so as to provide the user with the required results.
In the case shown in fig. 1, the user may manually specify the input data, which may be operated through an interface provided by the I/O interface 112. Alternatively, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data to obtain authorization from the user, the user may set the corresponding permissions in the client device 140. The user can view the result output by the execution device 110 at the client device 140, and the specific presentation form can be display, sound, action, and the like. The client device 140 may also serve as a data collection terminal, collecting input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data, and storing the new sample data in the database 130. Of course, the input data inputted to the I/O interface 112 and the output result outputted from the I/O interface 112 as shown in the figure may be directly stored in the database 130 as new sample data by the I/O interface 112 without being collected by the client device 140.
It should be noted that fig. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the position relationship between the devices, modules, etc. shown in the diagram does not constitute any limitation, for example, in fig. 1, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may be disposed in the execution device 110.
As shown in fig. 1, the target model 101 is obtained by training with the training device 120. In the embodiments of the present application, the target model 101 may be a target binary neural network model obtained by training with the training method of the binary neural network model of the embodiments of the present application. Specifically, the target binary neural network model provided in the embodiments of the present application may be a convolutional neural network or another neural network with similar functions; this scheme does not particularly limit this.
As described in the introduction of the basic concept, the convolutional neural network is a deep neural network with a convolutional structure, and is a Deep Learning (DL) architecture, and the deep learning architecture refers to performing multiple levels of learning at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to images input thereto.
As shown in fig. 2, convolutional Neural Network (CNN) 200 may include an input layer 210, a convolutional/pooling layer 220 (where the pooling layer is optional), and a neural network layer 230.
Convolutional layer/pooling layer 220:
Convolutional layers:
The convolutional layer/pooling layer 220 shown in fig. 2 may comprise layers 221-226, for example: in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
The internal operation of a convolutional layer will be described below by taking convolutional layer 221 as an example.
The convolutional layer 221 may include many convolution operators, also called kernels. In image processing, a convolution operator acts as a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride), to extract a specific feature from the image. The size of the weight matrix should be related to the size of the image. Note that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends through the entire depth of the input image. Therefore, convolving with a single weight matrix produces a convolution output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns) are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where the dimension is determined by the "multiple" mentioned above. Different weight matrices can be used to extract different features from the image: for example, one weight matrix extracts image edge information, another extracts a particular color of the image, and yet another blurs unwanted noise in the image. The multiple weight matrices have the same size (rows × columns), so the feature maps they extract also have the same size, and the extracted feature maps of the same size are combined to form the output of the convolution operation.
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 can make correct prediction.
When the convolutional neural network 200 has multiple convolutional layers, the initial convolutional layers (e.g., 221) tend to extract more general features, which may also be called low-level features. As the depth of the convolutional neural network 200 increases, the later convolutional layers (e.g., 226) extract increasingly complex features, such as features with high-level semantics; features with higher semantics are more suitable for the problem to be solved.
A pooling layer:
Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be periodically introduced after a convolutional layer. In the layers 221-226 exemplified by the convolutional layer/pooling layer 220 in FIG. 2, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. In image processing, the only purpose of a pooling layer is to reduce the spatial size of the image. The pooling layer may comprise an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain a smaller image. The average pooling operator computes the average of the pixel values within a certain range of the image as the result of average pooling. The maximum pooling operator takes the pixel with the largest value within a certain range as the result of maximum pooling. In addition, just as the size of the weight matrix in a convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the output image represents the average or maximum value of the corresponding sub-region of the input image.
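A minimal sketch of the two pooling operators (window size and input are arbitrary):

```python
import numpy as np

def pool2d(image, size=2, mode="max"):
    # Downsample the feature map: each output pixel is the max (or average)
    # of the corresponding size x size sub-region of the input.
    oh, ow = image.shape[0] // size, image.shape[1] // size
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            patch = image[r*size:(r+1)*size, c*size:(c+1)*size]
            out[r, c] = patch.max() if mode == "max" else patch.mean()
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(fmap, mode="max"))   # 2x2 output, spatial size halved
print(pool2d(fmap, mode="avg"))
```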
The neural network layer 230:
after processing by convolutional layer/pooling layer 220, convolutional neural network 200 is not sufficient to output the required output information. Because, as previously described, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (required class information or other relevant information), the convolutional neural network 200 needs to generate one or a set of the required number of classes of outputs using the neural network layer 230. Therefore, a plurality of hidden layers (231, 232 to 23n shown in fig. 2) and an output layer 240 may be included in the neural network layer 230, and parameters included in the plurality of hidden layers may be obtained by pre-training according to related training data of a specific task type, for example, the task type may include image recognition, target detection, image classification, image super-resolution reconstruction, and the like.
After the hidden layers in the neural network layer 230, the last layer of the whole convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to the categorical cross-entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the whole convolutional neural network 200 is completed (propagation in the direction from 210 to 240 in fig. 2 is forward propagation), back propagation (propagation in the direction from 240 to 210 in fig. 2 is back propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
It should be noted that the convolutional neural network 200 shown in fig. 2 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models.
A hardware structure of a chip provided in an embodiment of the present application is described below.
Fig. 3 is a hardware structure of a chip according to an embodiment of the present invention, where the chip includes a neural network processor 50. The chip may be provided in the execution device 110 as shown in fig. 1 to complete the calculation work of the calculation module 111. The chip may also be disposed in the training apparatus 120 as shown in fig. 1, to complete the training work of the training apparatus 120 and output the target model 101. The algorithms for the various layers in the convolutional neural network shown in fig. 2 can all be implemented in a chip as shown in fig. 3.
The neural network processor NPU 50 is mounted as a coprocessor on a Host CPU (Host CPU), and tasks are allocated by the Host CPU. The core portion of the NPU is an arithmetic circuit 503, and the controller 504 controls the arithmetic circuit 503 to extract data in a memory (weight memory or input memory) and perform arithmetic.
In some implementations, the arithmetic circuit 503 includes a plurality of processing units (PEs) therein. In some implementations, the operational circuitry 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 503 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 502 and buffers it in each PE of the arithmetic circuit. The arithmetic circuit takes the data of matrix A from the input memory 501, performs a matrix operation with matrix B, and stores the partial or final result of the matrix in the accumulator 508.
The vector calculation unit 507 may further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 507 may be used for network calculation of non-convolution/non-FC layers in a neural network, such as Pooling (Pooling), batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector calculation unit 507 can store the processed output vector to the unified memory 506. For example, the vector calculation unit 507 may apply a non-linear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit 507 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuitry 503, for example for use in subsequent layers in a neural network.
The unified memory 506 is used to store input data as well as output data.
A storage unit access controller 505 (direct memory access controller, DMAC) is used to transfer input data in the external memory to the input memory 501 and/or the unified memory 506, to store weight data from the external memory into the weight memory 502, and to store data in the unified memory 506 into the external memory.
A Bus Interface Unit (BIU) 510, configured to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 509 through a bus.
An instruction fetch buffer 509 connected to the controller 504 for storing instructions used by the controller 504.
The controller 504 is configured to call the instructions cached in the instruction fetch memory 509 to control the working process of the operation accelerator.
Generally, the unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch memory 509 are on-chip memories, the external memory is a memory outside the NPU, and the external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a High Bandwidth Memory (HBM), or other readable and writable memories.
The operations of the layers in the convolutional neural network shown in fig. 2 may be performed by the operation circuit 503 or the vector calculation unit 507.
The training device 120 in fig. 1 described above can perform the steps of the method for training the binary neural network model in the embodiment of the present application, the execution device 110 in fig. 1 can perform the steps of the image processing method (e.g., image classification, image segmentation, and object detection) in the embodiment of the present application, the neural network model shown in fig. 2 and the chip shown in fig. 3 can also be used to perform the steps of the image processing method in the embodiment of the present application, and the chip shown in fig. 3 can also be used to perform the steps of the method for training the binary neural network model in the embodiment of the present application.
As shown in fig. 4, fig. 4 is a schematic structural diagram of a system architecture 300 according to an embodiment of the present disclosure. The system architecture includes a local device 301, a local device 302, an execution device 210, and a data storage system 250, where the local device 301 and the local device 302 are connected to the execution device 210 through a communication network.
The execution device 210 may be implemented by one or more servers. Optionally, the execution device 210 may cooperate with other computing devices, such as data storage devices, routers, and load balancers. The execution device 210 may be arranged on one physical site or distributed across multiple physical sites. The execution device 210 may use the data in the data storage system 250 or call the program code in the data storage system 250 to implement the training method of the binary neural network model or the image processing methods (such as image super-resolution, image denoising, image demosaicing, and image deblurring) of the embodiments of the present application.
Specifically, the execution device 210 may perform the following process:
S1: determining a knowledge distillation framework, where the teacher network in the knowledge distillation framework is a trained neural network model, the student network in the knowledge distillation framework is an initial binary neural network model M_0, and the teacher network and the student network each comprise N layers of neural networks, N being a positive integer. S2: training the binary neural network model M_j using the (j+1)-th batch of images and a target loss function to obtain a binary neural network model M_{j+1}, where the binary neural network model M_j is obtained by training on the j-th batch of images and j is a positive integer; the target loss function comprises an angle loss term, which describes the difference between a first angle corresponding to the i-th layer neural network in the teacher network and a second angle corresponding to the i-th layer neural network in the student network; the first angle is obtained based on the weight matrix of the i-th layer neural network in the teacher network and the input matrix of the (j+1)-th batch of images at the i-th layer neural network in the teacher network; the second angle is obtained based on the binary weight matrix of the i-th layer neural network in the student network and the binary input matrix of the (j+1)-th batch of images at the i-th layer neural network in the student network; i is a positive integer less than or equal to N. S3: when a preset condition is met, taking the binary neural network model M_{j+1} as the target binary neural network model; otherwise, letting j = j + 1 and repeating step S2.
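As a rough illustration of the S1-S3 loop (a minimal Python sketch; `train_one_batch` and `meets_condition` are hypothetical callables standing in for the per-batch training and the preset condition, which the patent describes separately):

```python
def train_binary_network(teacher, student_m0, batches,
                         train_one_batch, meets_condition):
    """Hypothetical sketch of steps S1-S3; assumes enough batches are
    available until the preset condition is met."""
    model = student_m0            # S1: student = initial binary model M_0
    j = 0
    while True:
        # S2: train M_j on the (j+1)-th batch of images with the target
        # loss function (including the angle loss term) to get M_{j+1}.
        model = train_one_batch(model, teacher, batches[j])
        if meets_condition(model):
            return model          # S3: the target binary neural network model
        j += 1                    # otherwise let j = j + 1 and repeat S2
```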
The executing device 210 can train to obtain a target binary neural network model, and the target binary neural network model can be used for image processing, speech processing, natural language processing, and the like, for example, the target binary neural network model can be used for implementing the image classification, target detection, and image segmentation methods in the embodiments of the present application.
Alternatively, the execution device 210 can be constructed by the above-described procedure as an image processing apparatus that can be used for image processing (for example, can be used to implement the image classification, object detection, and image segmentation methods in the embodiments of the present application).
The user may operate respective user devices (e.g., the local device 301 and the local device 302) to interact with the execution device 210. Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car, or another type of cellular phone, media consumption device, wearable device, set-top box, game console, and so on.
The local device of each user may interact with the execution device 210 through a communication network of any communication mechanism/standard, such as a wide area network, a local area network, a point-to-point connection, or any combination thereof.
In an implementation manner, the local device 301 and the local device 302 acquire relevant parameters of the neural network from the execution device 210, deploy the neural network on the local device 301 and the local device 302, and perform image processing on the image to be processed by using the neural network to obtain a processing result of the image to be processed.
In another implementation, a neural network may be directly deployed on the execution device 210, and the execution device 210 obtains the image to be processed from the local device 301 and the local device 302, and performs image processing on the image to be processed by using the neural network, so as to obtain a processing result of the image to be processed.
In one implementation manner, the local device 301 and the local device 302 acquire relevant parameters of the image processing apparatus from the execution device 210, deploy the image processing apparatus on the local device 301 and the local device 302, and perform image processing on the image to be processed by using the image processing apparatus to obtain a processing result of the image to be processed.
In another implementation, an image processing apparatus may be directly disposed on the execution device 210, and the execution device 210 obtains the image to be processed from the local device 301 and the local device 302, and performs image processing on the image to be processed by using the image processing apparatus, so as to obtain a processing result of the image to be processed.
That is to say, the execution device 210 may also be a cloud device, and in this case, the execution device 210 may be deployed in the cloud; alternatively, the execution device 210 may also be a terminal device, in which case, the execution device 210 may be deployed at a user terminal side, which is not limited in this embodiment of the application.
The method for training a binary neural network model and the image processing method (for example, the image processing method may include image classification, target detection, and image segmentation) according to the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a knowledge distillation framework 600 according to an embodiment of the present application. This is the knowledge distillation framework used in the embodiments of the present application. The knowledge distillation framework 600 comprises a teacher network and a student network, where the teacher network is a trained neural network model and the student network is an initial binary neural network model M_0; the teacher network and the student network each comprise N layers of neural networks, N being a positive integer. The knowledge distillation framework 600 is also referred to in the embodiments of the present application as a layer-wise search (LWS-Det) architecture.
It should be noted that fig. 6 shows only a partial structure of the neural network of the i-th layer in the teacher network and the student network, and the structures of the neural networks of the other layers are the same as those of the i-th layer, i being a positive integer greater than 1.
In the teacher network, the output of the (i-1)-th layer neural network is input to the i-th layer neural network and processed by a batch normalization (BN) layer and an activation function (not shown in fig. 6), etc., to obtain the input matrix a_{i-1} of the i-th layer. A convolution operation is then performed on the input matrix a_{i-1} and the weight matrix w_i to obtain the first convolution output result. Finally, processing by a parametric rectified linear unit (PReLU) and a BN layer, etc., yields the output of the i-th layer neural network. In the student network, the output of the (i-1)-th layer neural network is input to the i-th layer neural network and, after a batch normalization (BN) layer, an activation function, and binarization (not shown in fig. 6), etc., the binary input matrix a^b_{i-1} of the i-th layer is obtained. The binary weight matrix w^b_i of the i-th layer neural network is determined based on a first reference weight matrix, a second reference weight matrix, a first probability matrix, and a second probability matrix; how the binary weight matrix w^b_i is determined will be described in detail in the later embodiment of fig. 7. A convolution operation is then performed on the binary weight matrix w^b_i and the binary input matrix a^b_{i-1} to obtain a reference feature matrix. Finally, processing by the weight scaling scale factor α_i, the PReLU layer, and the BN layer, etc., yields the output of the i-th layer neural network.

It should be understood that the PReLU in FIG. 6 may be replaced by other activation functions; the present application does not limit this. The specific process of performing the convolution operation on the input matrix a_{i-1} and the weight matrix w_i in the i-th layer neural network can be seen in fig. 5 and is not repeated here.
Referring to fig. 7, fig. 7 is a flowchart illustrating a training method 700 for a binary neural network model according to an embodiment of the present disclosure. As shown in fig. 7, the method 700 includes step S1, step S2, and step S3.
In some examples, the method 700 may be performed by the execution device 110 in fig. 1, the chip shown in fig. 3, the execution device 210 in fig. 4, and so on.
Step S1: determining a knowledge distillation framework, where the teacher network in the knowledge distillation framework is a trained neural network model, the student network in the knowledge distillation framework is an initial binary neural network model M_0, and the teacher network and the student network each comprise N layers of neural networks, N being a positive integer.
The trained neural network model serving as the teacher network may use real-number data types including, but not limited to, full precision (32-bit floating point), half precision (16-bit floating point), or integers (8-bit, 4-bit, or 2-bit integers). The initial binary neural network model M_0 is the model obtained after its parameters are initialized.
It should be understood that the embodiments of the present application can be applied to all the training processes of the binary neural network, and the present application is not limited thereto.
Step S2: training the binary neural network model M_j using the (j+1)-th batch of images and a target loss function to obtain a binary neural network model M_{j+1}, where the binary neural network model M_j is obtained by training on the j-th batch of images and j is a positive integer. The target loss function comprises an angle loss term, which describes the difference between a first angle corresponding to the i-th layer neural network in the teacher network and a second angle corresponding to the i-th layer neural network in the student network. The first angle is obtained based on the weight matrix of the i-th layer neural network in the teacher network and the input matrix of the (j+1)-th batch of images at the i-th layer neural network in the teacher network; the second angle is obtained based on the binary weight matrix of the i-th layer neural network in the student network and the binary input matrix of the (j+1)-th batch of images at the i-th layer neural network in the student network; i is a positive integer less than or equal to N.
Specifically, in the (j+1)-th training using the (j+1)-th batch of images, the (j+1)-th batch of images is input into the knowledge distillation framework, i.e., synchronously input into the teacher network and the student network, and the teacher network is used to guide the training of the student network. The (j+1)-th batch of images comprises at least one image.

For each image in the (j+1)-th batch of images, the angle between the input matrix corresponding to the image at the i-th layer neural network in the teacher network and the weight matrix of that layer, i.e., the first angle, is calculated; at the same time, the angle between the binary input matrix corresponding to the image at the i-th layer neural network in the student network and the binary weight matrix of that layer, i.e., the second angle, is calculated; and the angle loss of each image in the i-th layer neural network is calculated based on formula (2). Then the average angle loss is calculated from the angle losses of all images of the (j+1)-th batch in the i-th layer neural network; this average is the angle loss term of the (j+1)-th batch of images in the i-th layer neural network. Finally, the angle loss terms of the (j+1)-th batch of images in every layer of neural network are accumulated to obtain the loss value of the angle loss term in the target function in the (j+1)-th training.
$$ L^{i}_{\mathrm{A}} = \left( \cos\theta_i - \cos\theta^{b}_{i} \right)^{2}, \qquad \cos\theta_i = \frac{w_i \otimes a_{i-1}}{\| w_i \| \, \| a_{i-1} \|}, \qquad \cos\theta^{b}_{i} = \frac{w^{b}_{i} \otimes a^{b}_{i-1}}{\| w^{b}_{i} \| \, \| a^{b}_{i-1} \|} \tag{2} $$

where $L^{i}_{\mathrm{A}}$ is the angle loss of each image in the i-th layer neural network; $\cos\theta_i$ is the first angle corresponding to each image; $\cos\theta^{b}_{i}$ is the second angle corresponding to each image; $a_{i-1}$ is the input matrix of each image at the i-th layer neural network in the teacher network; $w_i$ is the weight matrix of the i-th layer neural network in the teacher network; $a^{b}_{i-1}$ is the binary input matrix of each image at the i-th layer neural network in the student network; $w^{b}_{i}$ is the binary weight matrix of the i-th layer neural network in the student network; and $\otimes$ is the tensor product.
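A minimal sketch of the angle loss computation under the reconstruction of formula (2) above (treating matrices as flattened vectors; names and values are illustrative):

```python
import numpy as np

def cosine_angle(w, a):
    # cos(theta) between a weight matrix and an input matrix, treating
    # both as flattened vectors.
    w, a = w.ravel(), a.ravel()
    return float(np.dot(w, a) / (np.linalg.norm(w) * np.linalg.norm(a)))

def angle_loss(w_t, a_t, w_s_bin, a_s_bin):
    # Difference between the teacher's angle (first angle) and the
    # student's binary angle (second angle) in layer i, for one image.
    return (cosine_angle(w_t, a_t) - cosine_angle(w_s_bin, a_s_bin)) ** 2

rng = np.random.default_rng(2)
w_t, a_t = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
w_s = np.sign(w_t)                 # toy binary student weights
a_s = np.sign(a_t)                 # toy binary student inputs
print(angle_loss(w_t, a_t, w_s, a_s))
```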
It can be seen that in the embodiments of the present application a trained teacher network guides the training of the student network, and an angle loss term is designed in the target loss function to update the parameters in the student network. On the one hand, this makes the student network's feature extraction results for input samples close to those of the teacher network; on the other hand, it makes the angle between the binary weight matrix and the binary input matrix in the student network close to the angle between the weight matrix and the input matrix in the teacher network. In summary, compared with the prior art, in which the angle loss after quantization is not considered when training a binary neural network model, the present application introduces a knowledge distillation framework and an angle loss term in the loss function so that the performance of the trained student network approaches the performance of the teacher network to the greatest extent, thereby improving the prediction accuracy of the target binary neural network model obtained by training in the embodiments of the present application.
In one possible embodiment, the target loss function further includes a convolution result loss term; the convolution result loss term describes the difference between the first convolution output result of the i-th layer neural network in the teacher network and the second convolution output result of the i-th layer neural network in the student network. The first convolution output result is obtained based on the weight matrix of the i-th layer neural network in the teacher network and the input matrix of the (j+1)-th batch of images at the i-th layer neural network in the teacher network; the second convolution output result is obtained based on the binary weight matrix corresponding to the i-th layer neural network in the student network, the corresponding weight scaling scale factor, and the binary input matrix of the (j+1)-th batch of images at the i-th layer neural network in the student network.
Specifically, for each image in the (j+1)-th batch of images, the corresponding convolution result loss in the i-th layer neural network can be calculated using formula (3): the input matrix of each image at the i-th layer neural network in the teacher network is convolved with the weight matrix of that layer to obtain the first convolution output result corresponding to the image; the binary input matrix of each image at the i-th layer neural network in the student network is convolved with the binary weight matrix of that layer to obtain a reference feature matrix, which is then multiplied element by element by the weight scaling scale factor of the i-th layer neural network in the student network to obtain the second convolution output result corresponding to the image. The first and second convolution output results of each image in the i-th layer neural network are subtracted and the two-norm of the difference is computed to obtain the convolution result loss value of each image in the i-th layer neural network. The average convolution result loss is then calculated over the (j+1)-th batch of images; this average is the convolution result loss of the (j+1)-th batch of images in the i-th layer neural network. Finally, the convolution result losses of the (j+1)-th batch of images in every layer of neural network are accumulated to obtain the loss value of the convolution result loss term in the target function in the (j+1)-th training.
$$ L^{i}_{\mathrm{E}} = \left\| w_i \otimes a_{i-1} - \alpha_i \odot \left( w^{b}_{i} \otimes a^{b}_{i-1} \right) \right\|_{2} \tag{3} $$

where $L^{i}_{\mathrm{E}}$ is the convolution result loss of each image in the i-th layer neural network, computed by the convolution result loss function $E_i$ of the i-th layer; $\alpha_i$ is the weight scaling scale factor corresponding to the i-th layer neural network; and $\odot$ is the Hadamard product, indicating element-by-element multiplication of corresponding positions in the two matrices. The physical meanings of the remaining parameters in formula (3) are explained under formula (2) and are not repeated here.
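A minimal sketch of formula (3), assuming for simplicity a scalar weight scaling scale factor (the patent's element-wise scaling also admits per-channel factors; all values are illustrative):

```python
import numpy as np

def conv_result_loss(teacher_out, student_out_raw, alpha):
    # Formula (3): scale the student's binary convolution output by alpha
    # (element-wise) and take the two-norm of the difference from the
    # teacher's convolution output.
    return float(np.linalg.norm(teacher_out - alpha * student_out_raw))

teacher_out = np.array([[1.5, -0.3], [0.8, 2.1]])   # first convolution output
student_raw = np.array([[3.0, -1.0], [2.0, 4.0]])   # reference feature matrix
alpha = 0.5                                         # weight scaling scale factor
print(conv_result_loss(teacher_out, student_raw, alpha))
```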
It can be seen that in the embodiments of the present application the convolution result loss term is introduced into the target loss function so that the second convolution output result in the student network is as close as possible to the first convolution output result in the teacher network, that is, the output of each layer of the student network is as close as possible to the output of the corresponding layer of the teacher network. This ensures that the predicted value output by the student network is close to that of the teacher network, improving the prediction accuracy of the target binary neural network model obtained after training.
In one possible embodiment, the target loss function further includes a weight loss term; the weight loss term describes the difference between the weight matrix of the i-th layer neural network in the teacher network and the binary weight matrix of the i-th layer neural network in the student network.
Specifically, in the (j+1)-th training, the weight loss corresponding to the i-th layer neural network can be calculated using formula (4): first, the binary weight matrix of the i-th layer neural network in the student network is multiplied element by element by the weight scaling scale factor of that layer to obtain a first weight matrix; then the first weight matrix is subtracted from the weight matrix of the i-th layer neural network in the teacher network, and the two-norm of the difference is computed as the weight loss of the i-th layer. The weight loss of each layer of neural network in the (j+1)-th training is calculated according to these steps, and the weight losses of all layers are accumulated to obtain the loss value of the weight loss term in the target loss function in the (j+1)-th training.
$$ L^{i}_{\mathrm{W}} = \left\| w_i - \alpha_i \odot w^{b}_{i} \right\|_{2} \tag{4} $$

where $L^{i}_{\mathrm{W}}$ is the weight loss corresponding to the i-th layer neural network in the (j+1)-th training. The physical meanings of the remaining parameters are explained under formulas (2) and (3) and are not repeated here.
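A minimal sketch of formula (4); choosing α_i as the mean absolute teacher weight is an assumption for illustration, not something this excerpt specifies:

```python
import numpy as np

def weight_loss(w_teacher, w_binary, alpha):
    # Formula (4): two-norm of (teacher weights - alpha * binary weights).
    return float(np.linalg.norm(w_teacher - alpha * w_binary))

w_t = np.array([[0.6, -0.4], [0.2, -0.8]])
w_b = np.sign(w_t)                     # binary student weights in {-1, +1}
alpha = float(np.mean(np.abs(w_t)))    # one common scalar choice (assumption)
print(weight_loss(w_t, w_b, alpha))
```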
It can be seen that in the embodiments of the present application a weight loss term representing the difference between the weight matrix in the teacher network and the binary weight matrix in the student network is introduced into the target loss function, and the weight loss term, the convolution result loss term, and the angle loss term jointly train the student network. Introducing these three model performance metrics into the target loss function to train the student network improves, to the greatest extent, the performance of the target binary neural network model obtained after training, so that its performance approaches that of the teacher network.
In one possible embodiment, training the binary neural network model M_j using the (j+1)-th batch of images and the target loss function to obtain the binary neural network model M_{j+1} comprises: inputting the (j+1)-th batch of images into the binary neural network model M_j to obtain predicted values of the (j+1)-th batch of images; and updating the parameters in each layer of neural network of the binary neural network model M_j based on the predicted values of the (j+1)-th batch of images, the labels of the (j+1)-th batch of images, and the target loss function, to obtain the binary neural network model M_{j+1}.
Specifically, the target loss function further includes a detection loss term, which is obtained based on the predicted values of the (j+1)-th batch of images and the labels of the (j+1)-th batch of images. The detection loss term is calculated as follows: the difference between the predicted value and the label of each image in the (j+1)-th batch of images is calculated, and the average is taken to obtain the loss value of the detection loss term in the target loss function in the (j+1)-th training. It should be understood that the target loss function may include one or more of the angle loss term, the convolution result loss term, the weight loss term, or the detection loss term. The loss function value in the (j+1)-th training is calculated based on the target loss function; the smaller the loss function value, the closer the performance of the student network and the teacher network. The parameters in each layer of neural network of the binary neural network model M_j are updated so that the value of the target loss function becomes smaller and smaller, yielding the binary neural network model M_{j+1}.
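As an illustration of how the loss terms might be combined (a hedged sketch; the patent does not specify weighting coefficients here, so the lambdas below are assumptions):

```python
def target_loss(angle, conv, weight, detection,
                lambdas=(1.0, 1.0, 1.0, 1.0)):
    # Total target loss as a weighted sum of the angle loss term, the
    # convolution result loss term, the weight loss term, and the
    # detection loss term; the lambdas are illustrative assumptions.
    la, lc, lw, ld = lambdas
    return la * angle + lc * conv + lw * weight + ld * detection

print(target_loss(angle=0.02, conv=1.3, weight=0.4, detection=0.9))
```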
Further, optionally, when the parameters in each layer of neural network of the binary neural network model M_j are updated, the parameters may be updated layer by layer in order from the N-th layer neural network to the 1st layer neural network, which is not limited in the present application.
It should be understood that, after the forward propagation of one training pass ends, the embodiment of the present application obtains the image predicted value from the output at the end of forward propagation; a loss function value for that training pass is then calculated based on the image predicted value, the image labels and the target loss function, and the model parameters are updated based on that loss function value. A sketch of the detection loss term follows.
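Under the assumption that the predicted values and labels are comparable tensors (e.g., per-class probability vectors and one-hot labels), the detection loss term described above might be sketched as follows; in practice a task loss such as cross-entropy typically plays this role.

```python
import torch

def detection_loss(predictions: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Detection loss term: the per-image difference between the predicted
    value and the label, averaged over the (j+1)-th batch, per the
    description above. Shapes are assumed to be (batch, num_classes)."""
    per_image = (predictions - labels).abs().flatten(1).sum(dim=1)
    return per_image.mean()
```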
In one possible embodiment, inputting the (j+1)-th batch of images into the binary neural network model M_j to obtain the predicted value of the (j+1)-th batch of images includes: P1: obtaining a binary weight matrix of the i-th layer neural network based on the reference weight matrix and probability matrix corresponding to the i-th layer neural network in the binary neural network model M_j; P2: obtaining a second convolution output result of the i-th layer neural network according to the binary input matrix of the (j+1)-th batch of images in the i-th layer neural network and the binary weight matrix; wherein an element at any position in the probability matrix represents the probability that the element at that position in the binary weight matrix takes the element at the corresponding position in the reference weight matrix; P3: let i = i+1, repeat steps P1-P2, and obtain the predicted value of the (j+1)-th batch of images based on the second convolution output result of the N-th layer neural network.
Specifically, in the forward propagation of the i-th layer neural network in the (j+1)-th training, the binary weight matrix of the i-th layer neural network is obtained based on the reference weight matrix and probability matrix corresponding to the i-th layer neural network; this binary weight matrix is a model parameter of the i-th layer neural network of the student network. For each image in the (j+1)-th batch of images, when forward propagation reaches the N-th layer neural network, the second convolution output result of each image is obtained based on the binary weight matrix of the N-th layer neural network and the corresponding binary input matrix of each image in the N-th layer neural network; the second convolution output result of each image is then fed through structures such as an activation function, a BN layer and a fully-connected Softmax layer to obtain the predicted value of the binary neural network model M_j for each image in the (j+1)-th batch of images.
It should be understood that the activation function may be a linear rectification function (Rectified Linear Unit, ReLU), a PReLU, or another possible activation function, which is not limited in this application.
It can be seen that, in the embodiment of the present application, in the forward propagation of each layer of neural network during each training pass, the binary weight matrix of each layer is determined according to the reference weight matrix and probability matrix corresponding to that layer, and the second convolution output result of each layer is then computed from the layer's binary input matrix and binary weight matrix. When forward propagation reaches the N-th layer neural network, the image predicted value output by the model in that training pass is computed from the second convolution output result of the N-th layer. In the subsequent back propagation, the loss function value for that training pass is calculated using the second convolution output results and first convolution output results obtained in forward propagation, together with the image predicted values and the image labels, and the model parameters are adjusted according to that loss function value, ensuring that an optimal target binary neural network model is obtained. A sketch of this forward propagation appears below.
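A rough sketch of steps P1-P3, assuming each layer is described by reference weight matrices ref1/ref2 (entries ±1), probability matrices p1/p2 and a scale factor alpha, with BN layers omitted for brevity and a supplied classifier as the prediction head; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def binarize_weights(p1, p2, ref1, ref2):
    # P1: normalize the paired probabilities (cf. formulas (5)-(6)) and take
    # each element from the reference matrix with the larger probability.
    probs = torch.softmax(torch.stack([p1, p2]), dim=0)
    return torch.where(probs[0] >= probs[1], ref1, ref2)

def forward_student(images, layers, classifier):
    x = images
    for layer in layers:  # i = 1 .. N
        b_w = binarize_weights(layer["p1"], layer["p2"],
                               layer["ref1"], layer["ref2"])      # P1
        x_b = torch.sign(x)          # binary input matrix (sign(0) = 0 here)
        x = F.conv2d(x_b, b_w, padding=1) * layer["alpha"]        # P2
        x = F.relu(x)                # activation (ReLU; PReLU also possible)
    x = x.mean(dim=(2, 3))           # pool before the prediction head
    return torch.softmax(classifier(x), dim=-1)                   # P3
```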
In a possible embodiment, the reference weight matrix includes a first reference weight matrix and a second reference weight matrix, and the probability matrix includes a first probability matrix and a second probability matrix; obtaining the binary weight matrix of the i-th layer neural network based on the reference weight matrix and probability matrix corresponding to the i-th layer neural network in the binary neural network model M_j includes: determining the element at any position in the target binary weight matrix based on the first probability value, in the first probability matrix, corresponding to the element at that position in the first reference weight matrix, and the second probability value, in the second probability matrix, corresponding to the element at that position in the second reference weight matrix; wherein an element at any position in the first probability matrix represents the probability that the element at that position in the binary weight matrix takes the element at the corresponding position in the first reference weight matrix, and an element at any position in the second probability matrix represents the probability that the element at that position in the binary weight matrix takes the element at the corresponding position in the second reference weight matrix.
Optionally, in the (j+1)-th training, a Differentiable Binary Search (DBS) method may be adopted to search out the binary weight matrix of the i-th layer neural network in the student network from the search space O; the search space may consist of the first reference weight matrix and the second reference weight matrix corresponding to the i-th layer neural network in the student network. Further, optionally, the elements of the first and second reference weight matrices are all 1 or all -1: when the elements of the first reference weight matrix are all 1, the elements of the second reference weight matrix are all -1, and vice versa. The first reference weight matrix corresponds to the first probability matrix, and the second reference weight matrix corresponds to the second probability matrix. It should be understood that the first reference weight matrix, the second reference weight matrix, the first probability matrix and the second probability matrix all have the same width and height.
Specifically, the element at any position of the binary weight matrix in the i-th layer neural network of the student network is determined according to the first probability value, in the first probability matrix, corresponding to the element at that position of the first reference weight matrix, and the second probability value, in the second probability matrix, corresponding to the element at that position of the second reference weight matrix. Further, optionally, the element of the reference weight matrix corresponding to the higher of the first probability value and the second probability value at that position may be taken as the element of the binary weight matrix at that position.
For example, the first reference weight matrix and the second reference weight matrix in the i-th layer neural network may be as shown in fig. 6, as may the first probability matrix and the second probability matrix. When the probability value in the first row and first column of one probability matrix is greater than the probability value at the same position in the other, the element at the corresponding position of the associated reference weight matrix, i.e., -1 in the example of fig. 6, is determined as the element in the first row and first column of the binary weight matrix; each remaining element of the binary weight matrix is determined in turn according to this rule.
Optionally, before determining the binary weight matrix of the i-th layer neural network, normalizing the first probability matrix and the second probability matrix with reference to formula (5); and then determining a binary weight matrix corresponding to the i-th layer neural network according to the formula (6).
$$\hat{p}_{o_k}^{l} = \frac{\exp\left(p_{o_k}^{l}\right)}{\sum_{o'_k \in O} \exp\left(p_{o'_k}^{l}\right)}, \quad \text{s.t. } o_k, o'_k \in O \tag{5}$$

$$b_{i}^{l} = o_{k}, \quad \text{s.t. } \hat{p}_{o_k}^{l} = \max\left(\hat{p}_{o_1}^{l}, \hat{p}_{o_2}^{l}\right) \tag{6}$$

In formulas (5) and (6), $\hat{p}_{o_k}^{l}$ is the normalized probability; the operations $o_k, o'_k \in O$; s.t. means "subject to"; $p_{o_k}^{l}$ represents the first probability value or the second probability value, under operation $o_k$, corresponding to the l-th weight in the binary weight matrix of the i-th layer neural network; $\max(\cdot)$ denotes taking the larger of the first probability value and the second probability value; and $b_{i}^{l}$ is the element at the l-th position of the binary weight matrix corresponding to the i-th layer neural network.
It can be seen that, in the embodiment of the present application, the reference weight matrix in each layer of neural network in the student network includes a first reference weight matrix and a second reference weight matrix, and the probability matrix in each layer includes a first probability matrix and a second probability matrix, where the first probability matrix corresponds to the first reference weight matrix and the second probability matrix corresponds to the second reference weight matrix. Based on the probability values of the elements at the same position in the two probability matrices, the element at that position in the first or the second reference weight matrix is selected as the element at that position in the binary weight matrix; the binary weight matrix corresponding to each layer of neural network in each training pass is obtained by this rule. The second convolution output result of each layer and the target loss function value in each training pass are then calculated based on the binary weight matrix, ensuring a correct training process and an optimal binary neural network model. A sketch of this search follows.
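The following is a minimal sketch of formulas (5) and (6), under the assumption that the first reference weight matrix is all 1 and the second all -1; the tensor values are illustrative.

```python
import torch

def search_binary_weights(p_first: torch.Tensor, p_second: torch.Tensor) -> torch.Tensor:
    """Formula (5): normalize the paired probabilities at each position so
    that they sum to one (softmax over the two candidate operations).
    Formula (6): each element of the binary weight matrix takes the element
    of the reference weight matrix with the maximum normalized probability."""
    normalized = torch.softmax(torch.stack([p_first, p_second]), dim=0)
    return torch.where(normalized[0] >= normalized[1],
                       torch.ones_like(p_first), -torch.ones_like(p_first))

p_first = torch.tensor([[0.2, 1.5], [0.7, -0.3]])
p_second = torch.tensor([[0.9, 0.1], [0.7, 0.8]])
print(search_binary_weights(p_first, p_second))
# tensor([[-1.,  1.],
#         [ 1., -1.]])  -- ties resolved toward the first matrix here
```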
In a possible implementation manner, the obtaining a second convolution output result of the ith layer neural network according to the binary input matrix and the binary weight matrix of the (j + 1) th batch of images in the ith layer neural network includes: respectively performing convolution operation on a binary input matrix and a binary weight matrix of each image in the (j + 1) th batch of images in the ith layer of neural network to obtain a reference feature matrix of each image; and scaling the reference characteristic matrix of each image by using the weight scaling scale factor of the ith layer of neural network to obtain a second convolution output result.
Specifically, as shown in fig. 6, in the i-th layer neural network of the student network, for each image in the (j+1)-th batch of images, the binary input matrix of each image in the i-th layer neural network is obtained after operations such as activation, normalization and binarization are performed on the input matrix of each image in the i-th layer neural network. A convolution operation is performed on the binary input matrix of each image and the binary weight matrix of the i-th layer neural network to obtain the reference feature matrix of each image; finally, the weight scaling scale factor α_i of the i-th layer neural network is multiplied element by element with the reference feature matrix of each image to obtain the second convolution output result of each image.
It can be seen that, in the embodiment of the present application, the reference feature matrix of each image is obtained based on the binary input matrix and binary weight matrix in each layer of neural network, and the reference feature matrix of each image is then scaled by the weight scaling scale factor of each layer to obtain the second convolution output result. The second convolution output result represents the features of the input image; the convolution result loss term is therefore designed in the target loss function so that the trained target binary neural network model retains the feature extraction capability of the teacher network as much as possible, improving the prediction accuracy of the target binary neural network model. A minimal sketch follows.
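A minimal sketch of the second convolution output result, with illustrative shapes; the padding, channel counts and the value of alpha are assumptions.

```python
import torch
import torch.nn.functional as F

def second_conv_output(binary_input: torch.Tensor,
                       binary_weight: torch.Tensor,
                       alpha: float) -> torch.Tensor:
    """Convolve the binary input matrix with the binary weight matrix to
    get the reference feature matrix, then scale it element by element by
    the layer's weight scaling scale factor alpha_i."""
    reference_features = F.conv2d(binary_input, binary_weight, padding=1)
    return alpha * reference_features

# Illustrative shapes: a batch of 1 image, 3 channels, 8x8 pixels, and
# 16 binary 3x3 filters with entries in {-1, +1} (sign(0)=0 is ignored
# here since randn is almost surely nonzero).
x_b = torch.sign(torch.randn(1, 3, 8, 8))
w_b = torch.sign(torch.randn(16, 3, 3, 3))
out = second_conv_output(x_b, w_b, alpha=0.05)
```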
In a possible embodiment, the above parameter comprises at least one of the probability matrix or the weight scaling scale factor.
Specifically, in the (j+1)-th training, after the loss function value for the (j+1)-th training is calculated with reference to the above embodiments, at least one of the probability matrix or the weight scaling scale factor in each layer of neural network in the student network may be adjusted so that the target loss function value gradually decreases.
It can be seen that, in the embodiment of the present application, at least one of the probability matrix and the weight scaling scale factor in each layer of neural network is updated during back propagation of each training pass. In the forward propagation of the next training pass, the binary weight matrix in each layer is therefore updated, and the image predicted value and loss function value output by the model are obtained based on the updated binary weight matrix and/or weight scaling scale factor. The model parameters are then further adjusted based on the obtained loss function value, ensuring that a student network whose performance is close to that of the teacher network is obtained.
Alternatively, when the (j+1)-th batch contains a single image, the target loss function may be expressed as formula (7):

$$L = L_{GT} + \lambda L_{ang} + \mu L_{conv} + \gamma L_{w} + \lambda L_{Lim} \tag{7}$$

where L is the target loss function; $L_{GT}$ is the detection result loss term; $L_{ang}$ is the angle loss term; $L_{conv}$ is the convolution result loss term; $L_{w}$ is the weight loss term; $\lambda L_{Lim}$ is a loss term used to constrain the auxiliary learning of the detection head; and λ, μ and γ are hyper-parameters of the model, which may optionally be set to values such as 0.01 and 0.0001, which is not limited in this application.
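Assuming each loss term has already been computed as in the preceding embodiments, formula (7) might be combined as follows; the function and argument names are illustrative, and λ, μ, γ must be supplied by the caller.

```python
import torch

def target_loss(l_gt: torch.Tensor, l_angle: torch.Tensor,
                l_conv: torch.Tensor, l_weight: torch.Tensor,
                l_lim: torch.Tensor,
                lam: float, mu: float, gamma: float) -> torch.Tensor:
    """Formula (7): weighted sum of the detection result loss, angle loss,
    convolution result loss, weight loss and the detection-head limiting
    term (lambda weights both the angle and limiting terms, per the text)."""
    return l_gt + lam * l_angle + mu * l_conv + gamma * l_weight + lam * l_lim
```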
Step S3: when a preset condition is met, the binary neural network model M_{j+1} is taken as the target binary neural network model; otherwise let j = j+1 and repeat step S2.

Optionally, after the (j+1)-th training ends, the binary neural network model M_{j+1} is taken as the target binary neural network model when any one of the following preset conditions is met:
the preset condition one: the number of training pictures used in the first j +1 training process reaches a preset number. The preset number may be the total number of pictures in the training set or any other number, which is not limited in the present application.
The preset condition two is as follows: and the value of the target loss function in the j +1 training process is smaller than a preset value. The preset value may be set according to a specific scenario, which is not limited in this application.
The preset condition is three: the prediction accuracy obtained based on the predicted values and the labels of the multiple images obtained in the j +1 th training process is higher than a preset ratio. The preset value may be set according to a specific scenario, which is not limited in this application.
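The three preset conditions could be checked after each training pass along the lines of the following sketch; the thresholds are placeholders.

```python
def meets_preset_condition(num_images_used: int, preset_number: int,
                           loss_value: float, preset_value: float,
                           accuracy: float, preset_ratio: float) -> bool:
    """Return True when any of the three preset conditions holds: enough
    training pictures used, loss small enough, or accuracy high enough."""
    return (num_images_used >= preset_number
            or loss_value < preset_value
            or accuracy > preset_ratio)
```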
It can be seen that, in the embodiment of the present application, a trained teacher network in a knowledge distillation framework guides the training of the student network, and an angle loss term, a convolution result loss term and a weight loss term are designed in the target loss function to jointly update the parameters in the student network. Specifically, the angle loss term in the target loss function minimizes the angle loss after quantization of the target binary neural network obtained by training, and the convolution result loss term and the weight loss term minimize the amplitude loss after quantization. The performance of the trained target binary neural network model is therefore brought as close as possible to that of the teacher network, improving the prediction accuracy of the target binary neural network model in the embodiment of the present application.
Referring to fig. 8, fig. 8 is a flowchart illustrating another model training method 800 according to an embodiment of the present disclosure. The method 800 includes step S810 and step S820. The model comprises a teacher network and a student network, wherein the teacher network is a trained neural network model, the student network is a binary neural network model, the teacher network and the student network respectively comprise N layers of neural networks, and N is a positive integer.
Step S810, training a binary neural network model by using a teacher network and a target loss function; the target loss function comprises an angle loss item, and the angle loss item is used for describing the difference between a first angle corresponding to the ith layer of neural network in the teacher network and a second angle corresponding to the ith layer of neural network in the student network; the first angle is obtained based on a weight matrix of an ith layer of neural network in the teacher network and an input matrix of the ith layer of neural network in the teacher network; the second angle is obtained based on a binary weight matrix of an ith layer of neural network in the student network and a binary input matrix of the ith layer of neural network in the student network; i is a positive integer less than or equal to N.
Step S820, repeatedly executing step S810 until an iteration termination condition is met, to obtain a target binary neural network model.
In one possible embodiment, the target loss function further includes a convolution result loss term; the convolution result loss item is used for describing the difference between a first convolution output result of the ith layer of neural network in the teacher network and a second convolution output result of the ith layer of neural network in the student network; the first convolution output result is obtained based on a weight matrix of an ith layer of neural network in the teacher network and an input matrix of the ith layer of neural network in the teacher network; and the second convolution output result is obtained based on a binary weight matrix corresponding to the ith layer of neural network in the student network, a corresponding weight scaling scale factor and a binary input matrix in the ith layer of neural network in the student network.
In one possible embodiment, the target loss function further includes a weight loss term; the weight loss item is used for describing the difference between the weight matrix of the neural network at the ith layer in the teacher network and the binary weight matrix of the neural network at the ith layer in the student network.
In a possible implementation, the training of the binary neural network model using the teacher network and the target loss function includes: inputting the training image into a binary neural network model to obtain a predicted value of the training image; and updating parameters in the binary neural network model based on the predicted values of the training images, the labels of the training images and the target loss function.
In a possible implementation, inputting the training image into the binary neural network model to obtain the predicted value of the training image includes: P1: obtaining a binary weight matrix of the i-th layer neural network based on the reference weight matrix and probability matrix corresponding to the i-th layer neural network in the binary neural network model; wherein an element at any position in the probability matrix represents the probability that the element at that position in the binary weight matrix takes the element at the corresponding position in the reference weight matrix; P2: obtaining a second convolution output result of the i-th layer neural network according to the binary weight matrix and the binary input matrix of the training image in the i-th layer neural network; P3: let i = i+1, repeat steps P1-P2, and obtain the predicted value of the training image based on the second convolution output result of the N-th layer neural network.
In one possible embodiment, the reference weight matrix comprises a first reference weight matrix and a second reference weight matrix, and the probability matrix comprises a first probability matrix and a second probability matrix; obtaining the binary weight matrix of the i-th layer neural network based on the reference weight matrix and probability matrix corresponding to the i-th layer neural network in the binary neural network model includes: determining the element at any position in the target binary weight matrix based on the first probability value, in the first probability matrix, corresponding to the element at that position in the first reference weight matrix, and the second probability value, in the second probability matrix, corresponding to the element at that position in the second reference weight matrix; wherein an element at any position in the first probability matrix represents the probability that the element at that position in the binary weight matrix takes the element at the corresponding position in the first reference weight matrix, and an element at any position in the second probability matrix represents the probability that the element at that position in the binary weight matrix takes the element at the corresponding position in the second reference weight matrix.
In a possible implementation manner, the obtaining a second convolution output result of the ith layer neural network according to the binary weight matrix and a binary input matrix of the training image in the ith layer neural network includes: performing convolution operation on the binary weight matrix and a binary input matrix of the training image in the ith layer of neural network to obtain a reference characteristic matrix of the training image; and scaling the reference characteristic matrix of the training image by using the weight scaling scale factor of the ith layer of neural network to obtain a second convolution output result.
In one possible embodiment, the parameter includes at least one of a probability matrix or a weight scaling scale factor.
It should be understood that the training image in the above embodiment of fig. 8 may be understood as an image used in each training process in fig. 7, and the specific training process in the embodiment of fig. 8 may refer to the description of the corresponding process in fig. 7, which is not described herein again.
Please refer to fig. 9-A to 9-C, which are schematic diagrams of the distribution of features extracted by different network models according to embodiments of the present application. Fig. 9-A shows the distribution of features extracted by the first layer and the last layer neural networks in the teacher network; fig. 9-B shows the distribution of features extracted by the first layer and the last layer neural networks in the binary neural network of the embodiment of the present application; fig. 9-C shows the distribution of features extracted by the first layer and the last layer neural networks in the binary neural network obtained by the Efficient Binarized Object Detector (BiDet) method.
It can be seen that the distribution of features extracted by the binary neural network obtained by the method of the embodiment of the present application is close to the distribution of features extracted by the teacher network, while the distribution of features extracted by the binary neural network obtained by the BiDet method differs greatly from that of the teacher network. This indicates that the performance of the binary neural network model trained by the method of the embodiment of the present application is close to that of the teacher network model, and its prediction accuracy is higher than that of other existing binary neural network models.
Referring to fig. 10, fig. 10 is a schematic diagram of the included angle between the feature matrix and the weight matrix in an embodiment of the present application. The included angle refers to the angle between the weight matrix and the input matrix in any layer of neural network in the above embodiments. In fig. 10, the feature matrices in the teacher network and the binary neural networks are simplified into three-dimensional vectors for illustration. Fig. 10 (a) shows the included angle θ between the weight vector w and the input vector a in the teacher network, and their respective magnitudes. Fig. 10 (d) shows the included angle between the binary weight vector and the binary input vector of the binary neural network in the embodiment of the present application, and their respective magnitudes. Fig. 10 (b) and (c) show the included angles between the binary weight vectors and the binary input vectors in other binary neural networks, and their respective magnitudes. Comparing (b), (c) and (d) in fig. 10 with (a) respectively, it can be seen that the included angle between the binary weight vector and the binary input vector quantized by the method in the embodiment of the present application, i.e., fig. 10 (d), and their respective magnitudes, are substantially the same as the included angle θ and the magnitudes of the corresponding vectors in the teacher network. The binary weight vector and the binary input vector quantized by the method shown in fig. 10 (b) coincide, and the magnitude of the quantized binary weight vector is larger than the magnitude of the weight vector in the teacher network. The binary weight vector and the binary input vector quantized by the method shown in fig. 10 (c) also coincide, i.e., the included angle is zero, and the magnitude of the binary weight vector is far smaller than the magnitude of the weight vector in the teacher network.
It can be seen that, in the embodiment of the present application, introducing the angle loss term into the target loss function minimizes the difference between the included angle between the feature matrix and the weight matrix in the target binary neural network and the included angle between the feature matrix and the weight matrix in the teacher network, so that the performance, i.e., the prediction accuracy, of the target binary neural network obtained through training is close to that of the teacher network.
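A minimal sketch of the angle computation underlying the angle loss term; flattening the matrices to vectors and reducing the teacher-student angle difference with an absolute value are assumptions for illustration, as the exact reduction is not specified here.

```python
import torch
import torch.nn.functional as F

def included_angle(w: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    """Angle between weight matrix w and input matrix a, both flattened
    to vectors; arccos of the cosine similarity, clamped for stability."""
    cos = F.cosine_similarity(w.flatten(), a.flatten(), dim=0)
    return torch.acos(cos.clamp(-1.0, 1.0))

def angle_loss(w_teacher, a_teacher, b_w_student, b_a_student):
    """Difference between the first angle (teacher weight/input matrices)
    and the second angle (student binary weight/binary input matrices)
    for one layer of the network."""
    theta_teacher = included_angle(w_teacher, a_teacher)
    theta_student = included_angle(b_w_student, b_a_student)
    return (theta_teacher - theta_student).abs()
```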
Fig. 11 is a flowchart illustrating an image processing method according to the present application. Method 1100 in fig. 11 includes steps 1110 and 1120.
In some examples, the method 1100 may be performed by an apparatus such as the execution apparatus 110 in fig. 1, the chip shown in fig. 3, and the execution apparatus 210 in fig. 4.
Step 1110, acquiring an image to be processed.

Step 1120, performing image processing on the image to be processed using the target binary neural network model to obtain a predicted value of the image to be processed.
The target binary neural network model is obtained through K training passes, and in the (j+1)-th of the K training passes: the binary neural network model M_j is trained using the (j+1)-th batch of images and the target loss function to obtain the binary neural network model M_{j+1}; the binary neural network model M_j is a student network in a knowledge distillation framework; the teacher network in the knowledge distillation framework is a trained neural network model; the teacher network and the student network each comprise N layers of neural networks, where N is a positive integer; the target loss function contains an angle loss term; K is a positive integer, and j is an integer greater than or equal to zero and less than or equal to K. The angle loss term describes the difference between a first angle corresponding to the i-th layer neural network in the teacher network and a second angle corresponding to the i-th layer neural network in the student network; the first angle is obtained based on the weight matrix corresponding to the i-th layer neural network in the teacher network and the input matrix of the (j+1)-th batch of images in the i-th layer neural network in the teacher network; the second angle is obtained based on the binary weight matrix corresponding to the i-th layer neural network in the student network and the binary input matrix of the (j+1)-th batch of images in the i-th layer neural network in the student network; i is a positive integer less than or equal to N.
In one possible embodiment, the image processing includes at least one of image classification, object detection, or image segmentation.
It can be seen that the method in the embodiment of the present application can be used in any task of image classification, target detection or image segmentation, and the image processing method in the embodiment of the present application can improve the image processing effect in all three tasks, i.e., the model has good universality.
The specific training process of the target binary neural network model may refer to specific descriptions of the method 700 in fig. 7 and the method 800 in fig. 8, and details are not repeated here.
Alternatively, the method 700 and the method 800 may be processed by a CPU, or may be processed by the CPU and a GPU together, or may use other processors suitable for neural network computation instead of the GPU, which is not limited in this application.
The image processing may include image classification, image segmentation, object detection, or other related image processing, which is not specifically limited in this application. The application of the method 1100 to the fields of image classification, image segmentation, and object detection will be described in detail below.
Image classification: the image to be processed is input into the target binary neural network model; the backbone network in the model extracts features of the image to be processed to obtain its feature vector, corresponding calculations are performed layer by layer based on the feature vector, and a predicted value of the image to be processed is finally obtained. The predicted value may be a multi-dimensional vector in which each element corresponds to an image category and represents the probability that the image to be processed belongs to the category corresponding to that element.
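For example, assuming the model returns a per-class probability vector for each image, classification inference might be sketched as follows; `model` is any callable trained as above.

```python
import torch

def classify(model, image: torch.Tensor):
    """Run the target binary neural network model on one image; the
    predicted value is a multi-dimensional vector of per-class
    probabilities, so the predicted category is its argmax."""
    with torch.no_grad():
        probs = model(image.unsqueeze(0)).squeeze(0)  # add/remove batch dim
    return int(probs.argmax()), probs
```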
Image segmentation: inputting an image to be processed into a target binary neural network model, extracting the features of the image to be processed by a trunk network in the model to obtain the feature vector of the image to be processed, and performing corresponding calculation layer by layer based on the feature vector to obtain a plurality of predicted values of the image to be processed, wherein the predicted values correspond to a plurality of pixel points of the image to be processed one by one. Each predicted value in the plurality of predicted values is a multi-dimensional vector, each element in any multi-dimensional vector corresponds to an image category, and each element in any multi-dimensional vector is used for representing the probability value of the pixel point corresponding to any multi-dimensional vector as the image category corresponding to each element.
Target detection: inputting an image to be processed into a target binary neural network model, extracting features of the image to be processed by a trunk network in the model to obtain a feature vector of the image to be processed, firstly identifying and segmenting a target object in the image to be processed based on the extracted feature vector by the model to obtain a target area corresponding to the target object, and finally outputting a multi-dimensional vector corresponding to the target area, wherein the meaning represented by the multi-dimensional vector is the same as that in the image classification, and the description is omitted here.
It is to be understood that the embodiment described in fig. 7 and fig. 8 is a training phase (a phase performed by the training device 120 shown in fig. 1) of the binary neural network model, and a specific training is performed by using any one of the possible implementations based on the embodiment shown in fig. 7 or fig. 8; the embodiment described in fig. 11 may be understood as an application stage (a stage executed by the execution device 110 shown in fig. 1) of the binary neural network model, which may be embodied as using the target binary neural network model obtained by training in the embodiment shown in fig. 7 or fig. 8, and obtaining a predicted value of the image to be processed according to the image to be processed input by the user.
Referring to fig. 12, fig. 12 is a flowchart illustrating another image processing method 1200 according to an embodiment of the present disclosure. Method 1200 in fig. 12 includes steps 1210 and 1220.
In some examples, the method 1200 may be performed by the execution device 110 of fig. 1, the chip shown in fig. 3, and the execution device 210 of fig. 4, among other devices.
Step 1210, acquiring an image to be processed.
Step 1220, performing image processing on the image to be processed using the target binary neural network model to obtain a predicted value of the image to be processed; wherein the target binary neural network model is obtained by training an initial binary neural network model M_0 in a knowledge distillation framework through a target loss function; the initial binary neural network model M_0 is the student network in the knowledge distillation framework, and the teacher network in the knowledge distillation framework is a trained neural network model; the target loss function comprises an angle loss term, which describes the difference between the included angle between the feature matrix and the weight matrix in the teacher network and the included angle between the feature matrix and the weight matrix in the student network.
It can be seen that, in the embodiment of the present application, since the method in the first aspect introduces a knowledge distillation framework during training and introduces a corresponding angle loss term in the target loss function, the model accuracy of the target binary neural network model obtained by training with the method in the first aspect is greatly improved compared with the existing binary neural network model; meanwhile, compared with a neural network model, the binary neural network has fewer model parameters and is lighter, so that the binary neural network has a good application prospect in embedded equipment.
Referring to table 1, table 1 describes the detection effect and related model performance when targets are detected using the teacher network and binary neural network models trained by different methods on the Pattern Analysis, Statistical Modelling and Computational Learning Visual Object Classes (PASCAL VOC) data set.
In the simulation experiments, the Faster Region-based Convolutional Neural Network (Faster RCNN) and the Single Shot MultiBox Detector (SSD) are adopted as target detection frameworks; Faster RCNN is a general two-stage object detection framework, and SSD is a general single-stage object detection framework.
With the short edge and the long edge in Faster RCNN set to 600 and 1000 pixels respectively (i.e., Input is 600 × 1000 in table 1), the backbone network adopts three residual neural networks (ResNet): ResNet-18, ResNet-34 and ResNet-50. Then, in frameworks containing the different backbone networks, the models are quantized using different quantization methods, including: Real-value, where the numerical type in the quantized network is 32-bit floating point; DoReFa-Net, a low-bit-width convolutional neural network trained with low-bit-width gradients, where the numerical type in the quantized network is 4-bit; Bi-Real-Net, a 1-bit convolutional neural network obtained with improved representational capability and an advanced training algorithm; the Efficient Binarized Object Detector (BiDet); ReActNet, towards precise binary neural networks with generalized activation functions; and layer-wise search for 1-bit detectors (LWS-Det). After quantization by the four modes Bi-Real-Net, BiDet, ReActNet and LWS-Det, the numerical type in the network is 1-bit integer; LWS-Det is the quantization mode adopted in the embodiment of the present application. Table 1 also includes the memory usage, GFLOPs and mean average precision (mAP) values of the different quantization modes.
It can be seen that, when tested on the PASCAL VOC data set, the model precisions obtained by Real-value training are 76.4%, 77.8% and 79.5% respectively. Based on the distillation method of the embodiment of the present application, ResNet-18/34/50 are then trained as the backbone networks of the binary target detection model LWS-Det, with accuracies on the test set of 73.2%, 75.8% and 76.9%, which greatly speeds up computation and saves storage space by factors of 6.79/5.88/5.57 respectively. Compared with other binary quantization methods, LWS-Det shows a marked performance improvement. With the ResNet-18 backbone network, its mAP is 12.3%, 10.5% and 3.6% higher than that of Bi-Real-Net, BiDet and ReActNet respectively under the same memory and computational resource usage. Similarly, with the ResNet-34 backbone network, LWS-Det is 12.7%, 10.0% and 3.5% higher in mAP than Bi-Real-Net, BiDet and ReActNet respectively. Furthermore, with the ResNet-50 backbone network, LWS-Det is 11.2% and 3.8% higher in mAP than Bi-Real-Net and ReActNet respectively. In addition, under the ResNet-34 backbone network, compared with the 4-bit DoReFa-Net method, LWS-Det uses lower GFLOPs and less memory yet is 0.2% higher in mAP; the improvement of the embodiment of the present application is thus very significant.
When the short edge and the long edge in the SSD are both 300 pixels (i.e., Input is 300 × 300 in table 1), the backbone network adopts VGG-16, and the quantization modes are Real-value, DoReFa-Net, Bi-Real-Net, BiDet, ReActNet and LWS-Det respectively; the quantized numerical types W/A, memory usage, GFLOPs and mAP under the different quantization modes can be found in table 1 and are not repeated here.
It can be seen that LWS-Det achieves 14.76× computational acceleration and 4.81× storage compression on the SSD framework with a VGG-16 backbone network, with only a small gap in mAP performance compared to Real-value (about 2.9%). Under the same GFLOPs and memory usage, LWS-Det improves mAP by 7.6%, 5.4% and 3.0% over Bi-Real-Net, BiDet and ReActNet respectively. Compared with 4-bit DoReFa-Net, the mAP of LWS-Det is 2.2% higher while its GFLOPs and memory usage are significantly lower.
In conclusion, compared with existing binary neural networks on various detection frameworks, LWS-Det achieves state-of-the-art performance, approaching that of the full-precision Real-value model. This has been demonstrated in a large number of experiments, clearly verifying the advantages of LWS-Det and showing its superiority and universality in different application scenarios.
Table 1 is a comparison table of the experimental results of the binary neural network of the embodiment of the present application and other networks on the PASCAL VOC data set (presented as an image in the original publication).
Referring to fig. 13, fig. 13 is a schematic diagram of a training apparatus 1300 for a binary neural network model according to an embodiment of the present disclosure. The apparatus 1300 comprises a determination unit 1310, a training unit 1320 and a decision unit 1330, wherein:
a determining unit 1310 for executing step S1.
A training unit 1320, configured to perform step S2.
A decision unit 1330, configured to perform step S3.
Step S1: determining a knowledge distillation framework; wherein the teacher network in the knowledge distillation framework is a trained neural network model, the student network in the knowledge distillation framework is an initial binary neural network model M_0, the teacher network and the student network each comprise N layers of neural networks, and N is a positive integer. Step S2: training the binary neural network model M_j using the (j+1)-th batch of images and the target loss function to obtain the binary neural network model M_{j+1}; wherein the binary neural network model M_j is obtained based on training with the j-th batch of images, and j is a positive integer; the target loss function comprises an angle loss term describing the difference between a first angle corresponding to the i-th layer neural network in the teacher network and a second angle corresponding to the i-th layer neural network in the student network; the first angle is obtained based on the weight matrix of the i-th layer neural network in the teacher network and the input matrix of the (j+1)-th batch of images in the i-th layer neural network in the teacher network; the second angle is obtained based on the binary weight matrix of the i-th layer neural network in the student network and the binary input matrix of the (j+1)-th batch of images in the i-th layer neural network in the student network; i is a positive integer less than or equal to N. Step S3: when a preset condition is met, the binary neural network model M_{j+1} is taken as the target binary neural network model; otherwise let j = j+1 and repeat step S2.
In a possible embodiment, the objective loss function further includes a convolution result loss term; the convolution result loss item is used for describing the difference between a first convolution output result of the ith layer of neural network in the teacher network and a second convolution output result of the ith layer of neural network in the student network; the first convolution output result is obtained based on a weight matrix of an ith layer of neural network in the teacher network and an input matrix of a j +1 th batch of images in the ith layer of neural network in the teacher network; and the second convolution output result is obtained based on a binary weight matrix corresponding to the ith layer of neural network in the student network, a corresponding weight scaling scale factor and a binary input matrix of the j +1 th batch of images in the ith layer of neural network in the student network.
In a possible embodiment, the objective loss function further includes a weight loss term; the weight loss item is used for describing the difference between the weight matrix of the ith layer neural network in the teacher network and the binary weight matrix of the ith layer neural network in the student network.
In a possible embodiment, the training unit is specifically configured to: input the (j+1)-th batch of images into the binary neural network model M_j to obtain a predicted value of the (j+1)-th batch of images; and update the parameters of each layer of neural network in the binary neural network model M_j based on the predicted value of the (j+1)-th batch of images, the labels of the (j+1)-th batch of images and the target loss function, to obtain the binary neural network model M_{j+1}.
In one possible embodiment, in terms of inputting the (j+1)-th batch of images into the binary neural network model M_j to obtain the predicted value of the (j+1)-th batch of images, the training unit 1320 is specifically configured to: P1: obtain a binary weight matrix of the i-th layer neural network based on the reference weight matrix and probability matrix corresponding to the i-th layer neural network in the binary neural network model M_j; P2: obtain a second convolution output result of the i-th layer neural network according to the binary input matrix of the (j+1)-th batch of images in the i-th layer neural network and the binary weight matrix; wherein an element at any position in the probability matrix represents the probability that the element at that position in the binary weight matrix takes the element at the corresponding position in the reference weight matrix; P3: let i = i+1, repeat steps P1-P2, and obtain the predicted value of the (j+1)-th batch of images based on the second convolution output result of the N-th layer neural network.
In a possible embodiment, the reference weight matrix includes a first reference weight matrix and a second reference weight matrix, and the probability matrix includes a first probability matrix and a second probability matrix; in terms of obtaining the binary weight matrix of the i-th layer neural network based on the reference weight matrix and probability matrix corresponding to the i-th layer neural network in the binary neural network model M_j, the training unit 1320 is specifically configured to: determine the element at any position in the target binary weight matrix based on the first probability value, in the first probability matrix, corresponding to the element at that position in the first reference weight matrix, and the second probability value, in the second probability matrix, corresponding to the element at that position in the second reference weight matrix; wherein an element at any position in the first probability matrix represents the probability that the element at that position in the binary weight matrix takes the element at the corresponding position in the first reference weight matrix, and an element at any position in the second probability matrix represents the probability that the element at that position in the binary weight matrix takes the element at the corresponding position in the second reference weight matrix.
In a possible implementation manner, in terms of obtaining the second convolution output result of the i-th layer neural network according to the binary input matrix and the binary weight matrix of the j + 1-th batch of images in the i-th layer neural network, the training unit 1320 is specifically configured to: respectively performing convolution operation on a binary input matrix and a binary weight matrix of each image in the (j + 1) th batch of images in the ith layer of neural network to obtain a reference feature matrix of each image; and scaling the reference characteristic matrix of each image by using the weight scaling scale factor of the ith layer of neural network to obtain a second convolution output result.
In one possible embodiment, the parameter includes at least one of a probability matrix or a weight scaling scale factor.
Referring to fig. 14, fig. 14 is a schematic diagram of another model training apparatus 1400 according to an embodiment of the present disclosure. The apparatus 1400 includes a training unit 1410 and a decision unit 1420. The model comprises a teacher network and a student network, wherein the teacher network is a trained neural network model, the student network is a binary neural network model, the teacher network and the student network respectively comprise N layers of neural networks, and N is a positive integer.
A training unit 1410, configured to train the binary neural network model using the teacher network and the target loss function; the target loss function comprises an angle loss term, wherein the angle loss term is used for describing the difference between a first angle corresponding to the ith layer of neural network in the teacher network and a second angle corresponding to the ith layer of neural network in the student network; the first angle is obtained based on a weight matrix of an ith layer of neural network in the teacher network and an input matrix in the ith layer of neural network in the teacher network; the second angle is obtained based on a binary weight matrix of an ith layer of neural network in the student network and a binary input matrix of the ith layer of neural network in the student network; i is a positive integer less than or equal to N.
And a decision unit 1420, configured to repeatedly execute the above steps until an iteration termination condition is met, so as to obtain a target binary neural network model.
In one possible embodiment, the target loss function further includes a convolution result loss term; the convolution result loss item is used for describing the difference between a first convolution output result of the ith layer of neural network in the teacher network and a second convolution output result of the ith layer of neural network in the student network; the first convolution output result is obtained based on a weight matrix of an ith layer of neural network in the teacher network and an input matrix of the ith layer of neural network in the teacher network; and the second convolution output result is obtained based on a binary weight matrix corresponding to the ith layer of neural network in the student network, a corresponding weight scaling scale factor and a binary input matrix in the ith layer of neural network in the student network.
In one possible embodiment, the target loss function further includes a weight loss term; the weight loss item is used for describing the difference between the weight matrix of the ith layer neural network in the teacher network and the binary weight matrix of the ith layer neural network in the student network.
In one possible embodiment, in the training of the binary neural network model using the teacher network and the target loss function, the training unit is specifically configured to: inputting the training image into a binary neural network model to obtain a predicted value of the training image; and updating parameters in the binary neural network model based on the predicted values of the training images, the labels of the training images and the target loss function.
In a possible implementation manner, in terms of inputting the training image into the binary neural network model to obtain the predicted value of the training image, the training unit is specifically configured to: P1: obtain a binary weight matrix of the i-th layer neural network based on the reference weight matrix and probability matrix corresponding to the i-th layer neural network in the binary neural network model; wherein an element at any position in the probability matrix represents the probability that the element at that position in the binary weight matrix takes the element at the corresponding position in the reference weight matrix; P2: obtain a second convolution output result of the i-th layer neural network according to the binary weight matrix and the binary input matrix of the training image in the i-th layer neural network; P3: let i = i+1, repeat steps P1-P2, and obtain the predicted value of the training image based on the second convolution output result of the N-th layer neural network.
In one possible embodiment, the reference weight matrix comprises a first reference weight matrix and a second reference weight matrix, and the probability matrix comprises a first probability matrix and a second probability matrix; in terms of obtaining the binary weight matrix of the i-th layer neural network based on the reference weight matrix and probability matrix corresponding to the i-th layer neural network in the binary neural network model, the training unit is specifically configured to: determine the element at any position in the target binary weight matrix based on the first probability value, in the first probability matrix, corresponding to the element at that position in the first reference weight matrix, and the second probability value, in the second probability matrix, corresponding to the element at that position in the second reference weight matrix; wherein an element at any position in the first probability matrix represents the probability that the element at that position in the binary weight matrix takes the element at the corresponding position in the first reference weight matrix, and an element at any position in the second probability matrix represents the probability that the element at that position in the binary weight matrix takes the element at the corresponding position in the second reference weight matrix.
In a possible implementation manner, in the aspect that the second convolution output result of the i-th layer neural network is obtained according to the binary weight matrix and the binary input matrix of the training image in the i-th layer neural network, the training unit is specifically configured to: performing convolution operation on the binary weight matrix and a binary input matrix of the training image in the ith layer of neural network to obtain a reference characteristic matrix of the training image; and scaling the reference characteristic matrix of the training image by using the weight scaling scale factor of the ith layer of neural network to obtain a second convolution output result.
In one possible embodiment, the parameter includes at least one of a probability matrix or a weight scaling scale factor.
Referring to fig. 15, fig. 15 is a schematic structural diagram of an image processing apparatus 1500 according to an embodiment of the present disclosure. The apparatus 1500 includes an acquisition unit 1510 and a processing unit 1520.
An acquisition unit 1510 is configured to acquire an image to be processed.
A processing unit 1520, configured to perform image processing on the image to be processed using the target binary neural network model to obtain a predicted value of the image to be processed; wherein the target binary neural network model is obtained through K training passes, and in the (j+1)-th of the K training passes: the binary neural network model M_j is trained using the (j+1)-th batch of images and the target loss function to obtain the binary neural network model M_{j+1}; the binary neural network model M_j is a student network in a knowledge distillation framework; the teacher network in the knowledge distillation framework is a trained neural network model; the teacher network and the student network each comprise N layers of neural networks, where N is a positive integer; the target loss function contains an angle loss term; K is a positive integer, and j is an integer greater than or equal to zero and less than or equal to K; the angle loss term describes the difference between a first angle corresponding to the i-th layer neural network in the teacher network and a second angle corresponding to the i-th layer neural network in the student network; the first angle is obtained based on the weight matrix corresponding to the i-th layer neural network in the teacher network and the input matrix of the (j+1)-th batch of images in the i-th layer neural network in the teacher network; the second angle is obtained based on the binary weight matrix corresponding to the i-th layer neural network in the student network and the binary input matrix of the (j+1)-th batch of images in the i-th layer neural network in the student network; i is a positive integer less than or equal to N.
In one possible embodiment, the image processing includes at least one of image classification, object detection, or image segmentation.
Specifically, the image processing apparatus 1500 may be configured to perform the corresponding steps of the image processing method 1100 described in fig. 11, which are not described again here.
Referring to fig. 16, fig. 16 is a schematic diagram of a hardware structure of a model training apparatus 1600 according to an embodiment of the present application. The model training apparatus 1600 shown in FIG. 16 (which apparatus 1600 may specifically be a computer device) includes a memory 1601, a processor 1602, a communication interface 1603, and a bus 1604. The memory 1601, the processor 1602, and the communication interface 1603 are communicatively connected to each other via a bus 1604.
The memory 1601 may be a Read Only Memory (ROM), a static memory device, a dynamic memory device, or a Random Access Memory (RAM). The memory 1601 may store a program, and when the program stored in the memory 1601 is executed by the processor 1602, the processor 1602 and the communication interface 1603 are used to perform the steps of the training method of the binary neural network model according to the embodiment of the present application.
The processor 1602 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or one or more integrated circuits, and is configured to execute related programs to implement the functions required to be executed by the units in the training apparatus of the binary neural network model in the embodiment of the present application, or to execute the model training method in the embodiment of the present application.
The processor 1602 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the training method of the binary neural network model of the present application may be implemented by integrated logic circuits of hardware in the processor 1602 or by instructions in the form of software. The processor 1602 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable or electrically erasable programmable memory, or a register. The storage medium is located in the memory 1601; the processor 1602 reads the information in the memory 1601 and completes the functions to be executed by the units included in the training apparatus of the binary neural network model according to the embodiment of the present application, or executes the training method of the binary neural network model according to the embodiment of the present application.
Communication interface 1603 enables communication between apparatus 1600 and other devices or communication networks using transceiver means such as, but not limited to, a transceiver. For example, training data may be obtained via communication interface 1603.
The bus 1604 may include pathways for communicating information between various components of the device 1600 (e.g., the memory 1601, the processor 1602, the communication interface 1603).
Referring to fig. 17, fig. 17 is a schematic diagram of a hardware structure of an image processing apparatus 1700 according to an embodiment of the present application. The image processing apparatus 1700 may be an automobile, a camera, a computer, a mobile phone, a wearable device, or other possible terminal devices, which is not limited in this application. The image processing apparatus 1700 shown in fig. 17 (the apparatus 1700 may be a computer device in particular) includes a memory 1701, a processor 1702, a communication interface 1703, and a bus 1704. The memory 1701, the processor 1702, and the communication interface 1703 are communicatively connected to each other via the bus 1704.
The memory 1701 may be a Read Only Memory (ROM), a static memory device, a dynamic memory device, or a Random Access Memory (RAM). The memory 1701 may store a program, and when the program stored in the memory 1701 is executed by the processor 1702, the processor 1702 and the communication interface 1703 are used to execute the steps of the image processing method of the embodiment of the present application.
The processor 1702 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU) or one or more integrated circuits, and is configured to execute related programs to implement the functions required to be executed by the units in the image processing apparatus according to the embodiment of the present application, or to execute the image processing method according to the embodiment of the present application.
The processor 1702 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the image processing method of the present application may be implemented by integrated logic circuits of hardware in the processor 1702 or by instructions in the form of software. The processor 1702 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable or electrically erasable programmable memory, or a register. The storage medium is located in the memory 1701; the processor 1702 reads the information in the memory 1701 and, in combination with its hardware, performs the functions required to be performed by the units included in the image processing apparatus of the embodiment of the present application, or performs the image processing method of the method embodiment of the present application.
Communication interface 1703 enables communication between apparatus 1700 and other devices or a communication network using transceiver means, such as, but not limited to, a transceiver. For example, the training data may be obtained through the communication interface 1703.
The bus 1704 may include paths that convey information between various components of the apparatus 1700 (e.g., the memory 1701, the processor 1702, and the communication interface 1703).
It should be noted that although the apparatus 1600 and the apparatus 1700 shown in fig. 16 and 17 only show memories, processors, and communication interfaces, in a specific implementation process, those skilled in the art will understand that the apparatus 1600 and the apparatus 1700 also include other devices necessary for normal operation. Also, those skilled in the art will appreciate that apparatus 1600 and apparatus 1700 may also include hardware components for performing other additional functions, according to particular needs. Furthermore, those skilled in the art will appreciate that apparatus 1600 and apparatus 1700 may also include only those components necessary to implement embodiments of the present application, and need not include all of the components shown in FIG. 16 or FIG. 17.
It is understood that the apparatus 1600 described above corresponds to the training device 120 of fig. 1, and the apparatus 1700 corresponds to the performing device 110 of fig. 1. Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The above-described functions, if implemented in the form of software functional units and sold or used as a separate product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the portions thereof that substantially contribute to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (21)

1. An image processing method, characterized in that the method comprises:
acquiring an image to be processed;
performing image processing on the image to be processed by using a target binary neural network model to obtain a feature vector of the image to be processed;
identifying and segmenting a target object in the image to be processed based on the feature vector to obtain a target area corresponding to the target object and a multi-dimensional vector corresponding to the target area, wherein each element in the multi-dimensional vector corresponds to one image category, and each element is used for representing the probability value of the image to be processed being the image category corresponding to each element;
wherein the target binary neural network model is obtained through K times of training, and in the (j+1)-th of the K times of training: a binary neural network model M_j is trained by using the (j+1)-th batch of images and a target loss function to obtain a binary neural network model M_{j+1}; the binary neural network model M_j is a student network in a knowledge distillation framework; a teacher network in the knowledge distillation framework is a trained neural network model, the teacher network and the student network each comprise N layers of neural networks, and N is a positive integer; the target loss function comprises an angle loss term; K is a positive integer, and j is an integer greater than or equal to zero and less than or equal to K;
the angle loss term is used for describing the difference between a first angle corresponding to the ith layer neural network in the teacher network and a second angle corresponding to the ith layer neural network in the student network; the first angle is obtained based on the weight matrix corresponding to the ith layer neural network in the teacher network and the input matrix of the (j+1)-th batch of images in the ith layer neural network in the teacher network; the second angle is obtained based on the binary weight matrix corresponding to the ith layer neural network in the student network and the binary input matrix of the (j+1)-th batch of images in the ith layer neural network in the student network; and i is a positive integer less than or equal to N.
2. The method of claim 1, wherein the target loss function further comprises a convolution result loss term; wherein the convolution result loss term is used for describing the difference between a first convolution output result of the ith layer neural network in the teacher network and a second convolution output result of the ith layer neural network in the student network;
the first convolution output result is obtained based on the weight matrix of the ith layer neural network in the teacher network and the input matrix of the (j+1)-th batch of images in the ith layer neural network in the teacher network; and the second convolution output result is obtained based on the binary weight matrix corresponding to the ith layer neural network in the student network, the corresponding weight scaling scale factor, and the binary input matrix of the (j+1)-th batch of images in the ith layer neural network in the student network.
3. The method of claim 1 or 2, wherein the objective loss function further comprises a weight loss term;
wherein the weight loss term is used for describing the difference between the weight matrix of the ith layer neural network in the teacher network and the binary weight matrix of the ith layer neural network in the student network.
4. The method according to claim 1 or 2, wherein training the binary neural network model M_j by using the (j+1)-th batch of images and the target loss function to obtain the binary neural network model M_{j+1} comprises:
inputting the (j+1)-th batch of images into the binary neural network model M_j to obtain a predicted value of the (j+1)-th batch of images;
updating parameters in each layer neural network of the binary neural network model M_j based on the predicted value of the (j+1)-th batch of images, the labels of the (j+1)-th batch of images, and the target loss function, to obtain the binary neural network model M_{j+1}.
5. The method according to claim 4, wherein inputting the (j+1)-th batch of images into the binary neural network model M_j to obtain the predicted value of the (j+1)-th batch of images comprises:
P1: obtaining a binary weight matrix of the ith layer neural network based on a reference weight matrix and a probability matrix corresponding to the ith layer neural network in the binary neural network model M_j; wherein the element at any position in the probability matrix is used for representing the probability that the element at that position in the binary weight matrix takes the value of the element at that position in the reference weight matrix;
P2: obtaining a second convolution output result of the ith layer neural network according to the binary input matrix of the (j+1)-th batch of images in the ith layer neural network and the binary weight matrix;
P3: letting i = i + 1, repeating steps P1-P2, and obtaining the predicted value of the (j+1)-th batch of images based on the second convolution output result of the Nth layer neural network.
6. The method of claim 5, wherein the reference weight matrix comprises a first reference weight matrix and a second reference weight matrix, and the probability matrix comprises a first probability matrix and a second probability matrix; and obtaining the binary weight matrix of the ith layer neural network based on the reference weight matrix and the probability matrix corresponding to the ith layer neural network in the binary neural network model M_j comprises:
determining the element at any position in a target binary weight matrix based on a first probability value, in the first probability matrix, corresponding to the element at that position in the first reference weight matrix, and a second probability value, in the second probability matrix, corresponding to the element at that position in the second reference weight matrix;
wherein the element at any position in the first probability matrix is used for representing the probability that the element at that position in the binary weight matrix takes the value of the element at that position in the first reference weight matrix; and the element at any position in the second probability matrix is used for representing the probability that the element at that position in the binary weight matrix takes the value of the element at that position in the second reference weight matrix.
7. The method according to claim 5 or 6, wherein obtaining the second convolution output result of the ith layer neural network according to the binary input matrix of the (j+1)-th batch of images in the ith layer neural network and the binary weight matrix comprises:
performing a convolution operation on the binary input matrix of each image in the (j+1)-th batch of images in the ith layer neural network and the binary weight matrix, to obtain a reference feature matrix of each image;
and scaling the reference feature matrix of each image by using the weight scaling scale factor of the ith layer neural network to obtain the second convolution output result.
8. The method of claim 7, wherein the parameters comprise at least one of the probability matrix or the weight scaling scale factor.
9. An image processing method, characterized in that the method comprises:
acquiring an image to be processed;
performing image processing on the image to be processed by using a target binary neural network model to obtain a feature vector of the image to be processed;
identifying and segmenting a target object in the image to be processed based on the feature vector to obtain a target area corresponding to the target object and a multi-dimensional vector corresponding to the target area, wherein each element in the multi-dimensional vector corresponds to one image category, and each element is used for representing the probability value of the image to be processed being the image category corresponding to each element; wherein the target binary neural network model is obtained by training an initial binary neural network model M_0 in a knowledge distillation framework through a target loss function, the initial binary neural network model M_0 serving as the student network in the knowledge distillation framework, and the teacher network in the knowledge distillation framework being a trained neural network model; the target loss function comprises an angle loss term, and the angle loss term is used for describing the difference between the included angle between the feature matrix and the weight matrix in the teacher network and the included angle between the feature matrix and the weight matrix in the student network.
10. An image processing apparatus, characterized in that the apparatus comprises:
the acquisition unit is used for acquiring an image to be processed;
the processing unit is used for carrying out image processing on the image to be processed by utilizing a target binary neural network model to obtain a feature vector of the image to be processed;
the processing unit is further configured to identify and segment a target object in the image to be processed based on the feature vector to obtain a target region corresponding to the target object and a multidimensional vector corresponding to the target region, where each element in the multidimensional vector corresponds to one image category, and each element is used to represent a probability value of the image to be processed being the image category corresponding to each element;
wherein the target binary neural network model is obtained through K times of training, and in the (j+1)-th of the K times of training: a binary neural network model M_j is trained by using the (j+1)-th batch of images and a target loss function to obtain a binary neural network model M_{j+1}; the binary neural network model M_j is a student network in a knowledge distillation framework; a teacher network in the knowledge distillation framework is a trained neural network model, the teacher network and the student network each comprise N layers of neural networks, and N is a positive integer; the target loss function comprises an angle loss term; K is a positive integer, and j is an integer greater than or equal to zero and less than or equal to K;
the angle loss term is used for describing the difference between a first angle corresponding to the ith layer neural network in the teacher network and a second angle corresponding to the ith layer neural network in the student network; the first angle is obtained based on the weight matrix corresponding to the ith layer neural network in the teacher network and the input matrix of the (j+1)-th batch of images in the ith layer neural network in the teacher network; the second angle is obtained based on the binary weight matrix corresponding to the ith layer neural network in the student network and the binary input matrix of the (j+1)-th batch of images in the ith layer neural network in the student network; and i is a positive integer less than or equal to N.
11. The apparatus of claim 10, wherein the target loss function further comprises a convolution result loss term; wherein the convolution result loss term is used for describing the difference between a first convolution output result of the ith layer neural network in the teacher network and a second convolution output result of the ith layer neural network in the student network;
the first convolution output result is obtained based on the weight matrix of the ith layer neural network in the teacher network and the input matrix of the (j+1)-th batch of images in the ith layer neural network in the teacher network; and the second convolution output result is obtained based on the binary weight matrix corresponding to the ith layer neural network in the student network, the corresponding weight scaling scale factor, and the binary input matrix of the (j+1)-th batch of images in the ith layer neural network in the student network.
12. The apparatus of claim 10 or 11, wherein the target loss function further comprises a weight loss term;
wherein the weight loss term is used for describing the difference between the weight matrix of the i-th layer neural network in the teacher network and the binary weight matrix of the i-th layer neural network in the student network.
13. The apparatus according to claim 10 or 11, wherein the processing unit is specifically configured to:
inputting the (j+1)-th batch of images into the binary neural network model M_j to obtain a predicted value of the (j+1)-th batch of images;
updating parameters of each layer neural network in the binary neural network model M_j based on the predicted value of the (j+1)-th batch of images, the labels of the (j+1)-th batch of images, and the target loss function, to obtain the binary neural network model M_{j+1}.
14. The apparatus according to claim 13, wherein, in the aspect of inputting the (j+1)-th batch of images into the binary neural network model M_j to obtain the predicted value of the (j+1)-th batch of images, the processing unit is specifically configured to:
P1: obtain a binary weight matrix of the ith layer neural network based on a reference weight matrix and a probability matrix corresponding to the ith layer neural network in the binary neural network model M_j;
P2: obtain a second convolution output result of the ith layer neural network according to the binary input matrix of the (j+1)-th batch of images in the ith layer neural network and the binary weight matrix; wherein the element at any position in the probability matrix is used for representing the probability that the element at that position in the binary weight matrix takes the value of the element at that position in the reference weight matrix;
P3: let i = i + 1, repeat steps P1-P2, and obtain the predicted value of the (j+1)-th batch of images based on the second convolution output result of the Nth layer neural network.
15. The apparatus of claim 14, wherein the reference weight matrix comprises a first reference weight matrix and a second reference weight matrix, and the probability matrix comprises a first probability matrix and a second probability matrix; in the aspect of obtaining the binary weight matrix of the ith layer neural network based on the reference weight matrix and the probability matrix corresponding to the ith layer neural network in the binary neural network model M_j, the processing unit is specifically configured to:
determine the element at any position in a target binary weight matrix based on a first probability value, in the first probability matrix, corresponding to the element at that position in the first reference weight matrix, and a second probability value, in the second probability matrix, corresponding to the element at that position in the second reference weight matrix;
wherein the element at any position in the first probability matrix is used for representing the probability that the element at that position in the binary weight matrix takes the value of the element at that position in the first reference weight matrix; and the element at any position in the second probability matrix is used for representing the probability that the element at that position in the binary weight matrix takes the value of the element at that position in the second reference weight matrix.
16. The apparatus according to claim 14 or 15, wherein, in the aspect of obtaining the second convolution output result of the ith layer neural network according to the binary input matrix of the (j+1)-th batch of images in the ith layer neural network and the binary weight matrix, the processing unit is specifically configured to:
perform a convolution operation on the binary input matrix of each image in the (j+1)-th batch of images in the ith layer neural network and the binary weight matrix, to obtain a reference feature matrix of each image;
and scale the reference feature matrix of each image by using the weight scaling scale factor of the ith layer neural network to obtain the second convolution output result.
17. The apparatus of claim 16, wherein the parameters comprise at least one of the probability matrix or the weight scaling scale factor.
18. An image processing apparatus, characterized in that the apparatus comprises:
the acquisition unit is used for acquiring an image to be processed;
the processing unit is used for carrying out image processing on the image to be processed by utilizing a target binary neural network model to obtain a feature vector of the image to be processed;
the processing unit is further configured to identify and segment a target object in the image to be processed based on the feature vector to obtain a target region corresponding to the target object and a multidimensional vector corresponding to the target region, where each element in the multidimensional vector corresponds to one image category, and each element is used to represent a probability value of the image to be processed being the image category corresponding to each element;
wherein the target binary neural network model is obtained by training an initial binary neural network model M_0 in a knowledge distillation framework through a target loss function, the initial binary neural network model M_0 serving as the student network in the knowledge distillation framework, and the teacher network in the knowledge distillation framework being a trained neural network model; the target loss function comprises an angle loss term, and the angle loss term is used for describing the difference between the included angle between the feature matrix and the weight matrix in the teacher network and the included angle between the feature matrix and the weight matrix in the student network.
19. A system on a chip, the system on a chip comprising a processor and a memory; wherein:
the memory is configured to store a target binary neural network model and program instructions;
the processor is configured to read the program instructions to invoke the target binary neural network model to perform the method according to any one of claims 1 to 9.
20. A terminal device, characterized in that the terminal device comprises the system on a chip of claim 19 and a discrete device coupled to the system on a chip; wherein the terminal device comprises an automobile, a camera, a computer, a mobile phone, or a wearable device.
21. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program code for execution by a device, the program code comprising instructions for performing the method of any of claims 1 to 9.
CN202110494162.5A 2021-04-30 2021-04-30 Training method of binary neural network model, image processing method and device Active CN113191489B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110494162.5A CN113191489B (en) 2021-04-30 2021-04-30 Training method of binary neural network model, image processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110494162.5A CN113191489B (en) 2021-04-30 2021-04-30 Training method of binary neural network model, image processing method and device

Publications (2)

Publication Number Publication Date
CN113191489A CN113191489A (en) 2021-07-30
CN113191489B true CN113191489B (en) 2023-04-18

Family

ID=76984122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110494162.5A Active CN113191489B (en) 2021-04-30 2021-04-30 Training method of binary neural network model, image processing method and device

Country Status (1)

Country Link
CN (1) CN113191489B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114358206B (en) * 2022-01-12 2022-11-01 合肥工业大学 Binary neural network model training method and system, and image processing method and system
CN114822510B (en) * 2022-06-28 2022-10-04 中科南京智能技术研究院 Voice awakening method and system based on binary convolutional neural network
CN117474051A (en) * 2022-07-15 2024-01-30 华为技术有限公司 Binary quantization method, training method and device for neural network, and storage medium
CN115147418B (en) * 2022-09-05 2022-12-27 东声(苏州)智能科技有限公司 Compression training method and device for defect detection model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110956263A (en) * 2019-11-14 2020-04-03 深圳华侨城文化旅游科技集团有限公司 Construction method of binarization neural network, storage medium and terminal equipment
CN111723815A (en) * 2020-06-23 2020-09-29 中国工商银行股份有限公司 Model training method, image processing method, device, computer system, and medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10599977B2 (en) * 2016-08-23 2020-03-24 International Business Machines Corporation Cascaded neural networks using test output from the first neural network to train the second neural network
CN108846340B (en) * 2018-06-05 2023-07-25 腾讯科技(深圳)有限公司 Face recognition method and device, classification model training method and device, storage medium and computer equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110956263A (en) * 2019-11-14 2020-04-03 深圳华侨城文化旅游科技集团有限公司 Construction method of binarization neural network, storage medium and terminal equipment
CN111723815A (en) * 2020-06-23 2020-09-29 中国工商银行股份有限公司 Model training method, image processing method, device, computer system, and medium

Also Published As

Publication number Publication date
CN113191489A (en) 2021-07-30

Similar Documents

Publication Publication Date Title
CN110033003B (en) Image segmentation method and image processing device
CN110188795B (en) Image classification method, data processing method and device
CN110378381B (en) Object detection method, device and computer storage medium
CN113191489B (en) Training method of binary neural network model, image processing method and device
CN112236779A (en) Image processing method and image processing device based on convolutional neural network
CN111291809B (en) Processing device, method and storage medium
CN112446270A (en) Training method of pedestrian re-identification network, and pedestrian re-identification method and device
CN111914997B (en) Method for training neural network, image processing method and device
CN110222717B (en) Image processing method and device
CN112288011B (en) Image matching method based on self-attention deep neural network
CN112639828A (en) Data processing method, method and equipment for training neural network model
CN112418392A (en) Neural network construction method and device
CN112446380A (en) Image processing method and device
CN112446834A (en) Image enhancement method and device
CN113705769A (en) Neural network training method and device
US20220157046A1 (en) Image Classification Method And Apparatus
CN112446398A (en) Image classification method and device
CN111882031A (en) Neural network distillation method and device
CN110222718B (en) Image processing method and device
US20220148291A1 (en) Image classification method and apparatus, and image classification model training method and apparatus
CN113065645B (en) Twin attention network, image processing method and device
CN113326930A (en) Data processing method, neural network training method, related device and equipment
CN111797882A (en) Image classification method and device
CN112598597A (en) Training method of noise reduction model and related device
EP4318313A1 (en) Data processing method, training method for neural network model, and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant