WO2021218471A1 - Neural network for image processing and related device

Info

Publication number: WO2021218471A1
Authority: WIPO (PCT)
Prior art keywords: network, classification, image, category, feature extraction
Prior art date
Application number: PCT/CN2021/081238
Other languages: French (fr), Chinese (zh)
Inventors: 王一飞 (Wang Yifei), 刘扶芮 (Liu Furui), 李震国 (Li Zhenguo)
Original assignee: Huawei Technologies Co., Ltd. (华为技术有限公司)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2021218471A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a neural network for image processing and a related device.
  • Artificial intelligence is the use of computers or computer-controlled machines to simulate, extend, and expand human intelligence. Artificial intelligence includes studying the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning, and decision-making. At present, image processing based on deep-learning neural networks is a common application of artificial intelligence.
  • Adversarial training refers to adding the adversarial image and the correct label corresponding to the adversarial image to the training data set to train the neural network, thereby improving the robustness of the neural network to adversarial images; robustness means that the neural network can still accurately recognize the adversarial image.
  • The embodiments of the present application provide a neural network for image processing and a related device. The trained first feature extraction network and the trained second feature extraction network can respectively extract the robust representation and the non-robust representation in an input image, which not only avoids the loss of robustness caused by mixing the two, but also retains both the robust and non-robust representations of the input image, thereby avoiding a decrease in accuracy and improving the robustness and accuracy of the neural network at the same time.
  • In a first aspect, an embodiment of the present application provides a neural network training method, which can be used in the image processing field of the artificial intelligence field. The training device inputs an adversarial image into a first feature extraction network and a second feature extraction network, respectively, to obtain a first robust representation generated by the first feature extraction network and a first non-robust representation generated by the second feature extraction network.
  • The adversarial image is an image that has undergone perturbation processing. Perturbation processing refers to adjusting, on the basis of the original image, the pixel values of pixels in the original image to obtain a perturbed image; to the human eye, it is usually difficult to distinguish the adversarial image from the original image.
  • Both the first robust representation and the first non-robust representation include feature information extracted from the adversarial image.
  • A robust representation refers to features that are not sensitive to perturbation: the classification category corresponding to the robust representation extracted from the original image is consistent with the classification category corresponding to the robust representation extracted from the adversarial image corresponding to that original image. A non-robust representation refers to features that are sensitive to perturbation: the classification category corresponding to the non-robust representation extracted from the original image is inconsistent with the classification category corresponding to the non-robust representation extracted from the adversarial image corresponding to that original image.
  • The feature information included in the robust representation is similar to the features used by the human eye, while the feature information included in the non-robust representation cannot be understood by the human eye; to the human eye, the non-robust representation is noise.
  • The training device inputs the first robust representation into the classification network to obtain the first classification category output by the classification network; the first classification category is a classification category for the object in the adversarial image. The training device inputs the first non-robust representation into the classification network to obtain the second classification category output by the classification network; the second classification category is likewise a classification category for the object in the adversarial image.
  • The training device performs iterative training on the first feature extraction network and the second feature extraction network according to a first loss function until a convergence condition is met, and outputs the trained first feature extraction network and the trained second feature extraction network. The first loss function is used to represent the similarity between the first classification category and the first label category, and the similarity between the second classification category and the second label category; the first loss function may specifically be a cross-entropy loss function or a maximum-margin loss function.
  • The first label category is the correct category corresponding to the adversarial image, and the second label category is the wrong category corresponding to the adversarial image; the first label category includes the correct classification of one or more objects in the adversarial image. Both the first label category and the second label category are used as supervision data in the training phase.
  • The convergence condition may be convergence of the first loss function, or the number of training iterations reaching a preset number.
  • In other words, the neural network includes the first feature extraction network and the second feature extraction network. The adversarial image is input into the first feature extraction network and the second feature extraction network, respectively, to obtain the first robust representation generated by the first feature extraction network and the first non-robust representation generated by the second feature extraction network; the first robust representation is then input into the classification network to obtain the first classification category, and the first non-robust representation is input into the classification network to obtain the second classification category. The purpose of the first loss function is to increase the similarity between the first classification category and the correct category of the adversarial image, and the similarity between the second classification category and the wrong category of the adversarial image; that is, the purpose of training is to extract the robust representation in the input image through the first feature extraction network, and the non-robust representation through the second feature extraction network. Because the robust representation and the non-robust representation are extracted by separate networks, mixing of the two (which would reduce robustness) is avoided while both representations are retained, thereby avoiding a decrease in accuracy and improving the robustness and accuracy of the neural network at the same time. A sketch of this training step follows.
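  • As a concrete illustration, the following PyTorch-style sketch trains the two feature extraction networks with a cross-entropy instance of the first loss function; the robust branch is pulled toward the correct label and the non-robust branch toward the wrong label. All names (feature_net_robust, wrong_label, and so on) are illustrative assumptions rather than identifiers from this application, and the classification network is assumed here to accept a single representation directly.

```python
import torch.nn.functional as F

# Hypothetical sketch of one training step for the first aspect.
# feature_net_robust / feature_net_nonrobust stand in for the first /
# second feature extraction networks; classifier stands in for the
# classification network.
def training_step(feature_net_robust, feature_net_nonrobust, classifier,
                  optimizer, adv_image, correct_label, wrong_label):
    # First robust / first non-robust representations of the adversarial image.
    robust_repr = feature_net_robust(adv_image)
    nonrobust_repr = feature_net_nonrobust(adv_image)

    # First / second classification categories output by the classification network.
    logits_robust = classifier(robust_repr)
    logits_nonrobust = classifier(nonrobust_repr)

    # First loss function (cross-entropy variant): similarity of the first
    # classification category to the first (correct) label category, plus
    # similarity of the second classification category to the second (wrong)
    # label category.
    loss = (F.cross_entropy(logits_robust, correct_label)
            + F.cross_entropy(logits_nonrobust, wrong_label))

    optimizer.zero_grad()
    loss.backward()  # updates both feature extraction networks (and the classifier)
    optimizer.step()
    return loss.item()
```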
  • In one implementation, the method further includes: the training device inputs the original image into the first feature extraction network and the second feature extraction network, respectively, to obtain the second robust representation generated by the first feature extraction network and the second non-robust representation generated by the second feature extraction network; the original image refers to an image that has not undergone perturbation processing, or an image directly collected. The training device combines the second robust representation and the second non-robust representation to obtain a combined first representation, and inputs the combined first representation into the classification network to perform a classification operation based on the combined first representation, obtaining the third classification category output by the classification network.
  • The combination method includes one or more of the following: splicing (concatenation), addition, fusion, and multiplication, as illustrated in the sketch below.
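  • A minimal sketch of the listed combination methods, assuming both representations are feature tensors of equal length; "fusion" is instantiated here as simple averaging, which is one possible choice rather than a detail fixed by the text:

```python
import torch

def combine(robust_repr: torch.Tensor, nonrobust_repr: torch.Tensor,
            method: str = "splicing") -> torch.Tensor:
    if method == "splicing":        # concatenation along the feature axis
        return torch.cat([robust_repr, nonrobust_repr], dim=-1)
    if method == "addition":        # element-wise sum
        return robust_repr + nonrobust_repr
    if method == "multiplication":  # element-wise product
        return robust_repr * nonrobust_repr
    if method == "fusion":          # assumed here: simple averaging
        return 0.5 * (robust_repr + nonrobust_repr)
    raise ValueError(f"unknown combination method: {method}")
```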
  • The training device performing iterative training on the first feature extraction network and the second feature extraction network according to the first loss function until the convergence condition is met may include: the training device iteratively trains the first feature extraction network and the second feature extraction network according to the first loss function and a second loss function until the convergence condition is met. The second loss function is used to indicate the similarity between the third classification category and the third label category, and may specifically be a cross-entropy loss function or a maximum-margin loss function. The third label category is the correct category corresponding to the original image; it is the correct classification of one or more objects in the original image, may include one or more classification categories, and is used as supervision data in the training phase.
  • In one implementation, the method may further include: the training device inputs the original image into the first feature extraction network to obtain the second robust representation generated by the first feature extraction network. The training device then inputs the second robust representation into the classification network to perform a classification operation according to the second robust representation, obtaining the fourth classification category output by the classification network; the fourth classification category includes the category of one or more objects in the original image. Specifically, the training device may combine the second robust representation with a first constant tensor to obtain a combined third representation, input the combined third representation into the classification network, and perform the classification operation according to the combined third representation to obtain the fourth classification category output by the classification network.
  • The training device performing iterative training on the first feature extraction network and the second feature extraction network according to the first loss function until the convergence condition is met may include: the training device iteratively trains the first feature extraction network and the second feature extraction network according to the first loss function and a third loss function until the convergence condition is met. The third loss function is used to indicate the similarity between the fourth classification category and the third label category, and may specifically be a cross-entropy loss function or a maximum-margin loss function; the third label category is the correct category corresponding to the original image. In this way, not only are adversarial images used to train the feature extraction capabilities of the first feature extraction network and the second feature extraction network, but natural images are also used to train the robust-representation extraction capability of the first feature extraction network, further improving the accuracy of the trained first feature extraction network.
  • The length of the first constant tensor is the same as the length of the second non-robust representation, and the relative positions of the second robust representation and the first constant tensor can correspond to the relative positions of the second robust representation and the second non-robust representation: if, in the combined first representation, the second robust representation is in front and the second non-robust representation behind, then in the combined third representation the second robust representation is in front and the first constant tensor behind; if, in the combined first representation, the second non-robust representation is in front and the second robust representation behind, then in the combined third representation the first constant tensor is in front and the second robust representation behind. A sketch of this substitution follows.
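  • A sketch of the constant-tensor substitution, assuming splicing as the combination method and the ordering with the second robust representation in front; the constant value 0.0 is an arbitrary illustrative choice:

```python
import torch

def robust_only_input(robust_repr: torch.Tensor, nonrobust_len: int,
                      const_value: float = 0.0) -> torch.Tensor:
    # First constant tensor: same length as the second non-robust
    # representation, occupying the position that representation has in
    # the combined first representation.
    batch = robust_repr.shape[0]
    const = torch.full((batch, nonrobust_len), const_value)
    # Combined third representation: robust part in front, constant tensor behind.
    return torch.cat([robust_repr, const], dim=-1)
```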
  • In one implementation, the method may further include: the training device inputs the original image into the second feature extraction network to obtain the second non-robust representation generated by the second feature extraction network. The training device then inputs the second non-robust representation into the classification network to perform a classification operation according to the second non-robust representation, obtaining the fifth classification category output by the classification network; the fifth classification category includes the category of one or more objects in the original image. Specifically, the training device may combine the second non-robust representation with a second constant tensor to obtain a combined fourth representation, input the combined fourth representation into the classification network, and perform the classification operation according to the combined fourth representation to obtain the fifth classification category output by the classification network.
  • The training device performing iterative training on the first feature extraction network and the second feature extraction network according to the first loss function until the convergence condition is met may include: the training device iteratively trains the first feature extraction network and the second feature extraction network according to the first loss function and a fourth loss function until the convergence condition is met. The fourth loss function is used to indicate the similarity between the fifth classification category and the third label category, and may specifically be a cross-entropy loss function or a maximum-margin loss function; the third label category is the correct category corresponding to the original image. In this way, not only are adversarial images used to train the feature extraction capabilities of the first feature extraction network and the second feature extraction network, but natural images are also used to train the non-robust-representation extraction capability of the second feature extraction network, further improving the accuracy of the trained second feature extraction network.
  • The length of the second constant tensor is the same as the length of the second robust representation, and the relative positions of the second non-robust representation and the second constant tensor can correspond to the relative positions of the second robust representation and the second non-robust representation: if, in the combined first representation, the second robust representation is in front and the second non-robust representation behind, then in the combined fourth representation the second constant tensor is in front and the second non-robust representation behind; if, in the combined first representation, the second non-robust representation is in front and the second robust representation behind, then in the combined fourth representation the second non-robust representation is in front and the second constant tensor behind.
  • In one implementation, the training device inputs the original image into the first feature extraction network and the second feature extraction network, respectively, to obtain the second robust representation generated by the first feature extraction network and the second non-robust representation generated by the second feature extraction network. The training device combines the second robust representation and the second non-robust representation to obtain the combined first representation, and inputs the combined first representation into the classification network to perform a classification operation according to the combined first representation, obtaining the third classification category output by the classification network. The training device inputs the second robust representation into the classification network to perform a classification operation according to the second robust representation, obtaining the fourth classification category output by the classification network. The training device inputs the second non-robust representation into the classification network to perform a classification operation according to the second non-robust representation, obtaining the fifth classification category output by the classification network.
  • The training device performing iterative training on the first feature extraction network and the second feature extraction network according to the first loss function until the convergence condition is met may include: the training device iteratively trains the first feature extraction network and the second feature extraction network according to the first loss function and a fifth loss function until the convergence condition is met. The fifth loss function is used to represent the similarity between the third classification category and the third label category, the similarity between the fourth classification category and the third label category, and the similarity between the fifth classification category and the third label category; the fifth loss function may specifically be a cross-entropy loss function or a maximum-margin loss function, and the third label category is the correct category corresponding to the original image. In this way, both the trained first feature extraction network and the trained second feature extraction network can accurately extract robust and non-robust representations, which expands the application scenarios of this solution.
  • In one implementation, the method may further include: the training device generates a first gradient according to the function value of the second loss function, performs perturbation processing on the original image according to the first gradient to generate an adversarial image, and determines the third label category as the first label category. Specifically, the training device may generate the function value of the second loss function according to the third classification category and the third label category, generate the first gradient according to that function value, substitute the first gradient into a preset function and multiply by a preset coefficient to obtain the perturbation, and then superimpose the obtained perturbation on the original image to generate the adversarial image; a sketch follows.
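  • The procedure above has the shape of a fast-gradient-sign-style attack. The sketch below assumes sign() as the "preset function" and ε as the "preset coefficient" (ε = 0.3 following the example later in the text); these are common choices rather than details fixed by this application, and the classification network is assumed to accept the spliced representation:

```python
import torch
import torch.nn.functional as F

def make_adversarial(feature_net_robust, feature_net_nonrobust, classifier,
                     original_image, third_label, epsilon=0.3):
    image = original_image.clone().requires_grad_(True)
    # Third classification category: classify the combined first representation.
    combined = torch.cat([feature_net_robust(image),
                          feature_net_nonrobust(image)], dim=-1)
    # Second loss function (cross-entropy variant): similarity between the
    # third classification category and the third label category.
    loss = F.cross_entropy(classifier(combined), third_label)
    loss.backward()
    # First gradient -> preset function (assumed: sign) -> preset coefficient.
    perturbation = epsilon * image.grad.sign()
    # Superimpose the perturbation on the original image.
    return (original_image + perturbation).clamp(0.0, 1.0).detach()
```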
  • In this way, the first gradient is generated according to the similarity between the third classification category and the third label category, and the original image is perturbed according to the first gradient, making the perturbation processing more targeted; this helps speed up the training process of the first feature extraction network and the second feature extraction network and improves the efficiency of the training process.
  • In one implementation, the method may further include: the training device generates a second gradient according to the function value of the third loss function, performs perturbation processing on the original image according to the second gradient to generate an adversarial image, and determines the third label category as the first label category. Specifically, the training device may generate the function value of the third loss function according to the fourth classification category and the third label category, generate the second gradient according to that function value, substitute the second gradient into the preset function and multiply by the preset coefficient to obtain the perturbation, and then superimpose the obtained perturbation on the original image to generate the adversarial image. In this way, the original image is perturbed according to the similarity between the third label category and the fourth classification category that the classification network outputs from the second robust representation, so that the perturbation processing is more targeted with respect to the first feature extraction network, which helps improve the first feature extraction network's ability to extract robust representations.
  • In one implementation, the method may further include: the training device generates a third gradient according to the function value of the fourth loss function, performs perturbation processing on the original image according to the third gradient to generate an adversarial image, and determines the third label category as the first label category. Specifically, the training device may generate the function value of the fourth loss function according to the fifth classification category and the third label category, generate the third gradient according to that function value, substitute the third gradient into the preset function and multiply by the preset coefficient to obtain the perturbation, and then superimpose the obtained perturbation on the original image to generate the adversarial image. In this way, the original image is perturbed according to the similarity between the third label category and the fifth classification category that the classification network outputs from the second non-robust representation, so that the perturbation processing is more targeted with respect to the second feature extraction network, which helps improve the second feature extraction network's ability to extract non-robust representations.
  • In one implementation, the method may further include: the training device combines the first robust representation and the first non-robust representation to obtain a combined second representation, and inputs the combined second representation into the classification network to obtain the sixth classification category output by the classification network; the sixth classification category is the category of the object in the adversarial image. The training device inputting the first robust representation into the classification network to obtain the first classification category, and inputting the first non-robust representation into the classification network to obtain the second classification category, may include: when the sixth classification category is different from the first label category, the training device inputs the first robust representation into the classification network to obtain the first classification category output by the classification network, and inputs the first non-robust representation into the classification network to obtain the second classification category output by the classification network. In this implementation, if the sixth classification category is the same as the first label category, this shows that the perturbation of the perturbed image is too slight: the way the neural network processes it is not much different from the way it processes the natural image. Since the purpose of training here is to enhance the ability of the first feature extraction network and the second feature extraction network to separate robust and non-robust representations from images with larger perturbations, subsequent training operations are performed only when the sixth classification category is different from the first label category, which improves the efficiency of the training process.
  • In one implementation, the method may further include: when the sixth classification category is different from the first label category, the training device determines the sixth classification category as the second label category.
  • In one implementation, the first feature extraction network is a convolutional neural network or a residual neural network, and the second feature extraction network is a convolutional neural network or a residual neural network.
  • In a second aspect, the embodiments of the present application provide an image processing network, which can be used in the image processing field of the artificial intelligence field. The image processing network includes a first feature extraction network, a second feature extraction network, and a feature processing network.
  • The first feature extraction network is used to receive the input first image and generate a robust representation corresponding to the first image; a robust representation refers to features that are not sensitive to perturbation. The second feature extraction network is used to receive the input first image and generate a non-robust representation corresponding to the first image; a non-robust representation refers to features that are sensitive to perturbation. The feature processing network is used to obtain the robust representation and the non-robust representation and output a first processing result corresponding to the first image.
  • The specific implementation of the feature processing network and the specific form of the first processing result are related to the function of the entire image processing network. If the function of the image processing network is image classification, the feature processing network is a classification network, and the first processing result is used to indicate the classification category of the entire image. If the function of the image processing network is image recognition, the feature processing network may be a recognition network, and the first processing result is used to indicate the content recognized from the image, such as the text content in the image. If the function of the image processing network is image segmentation, the feature processing network can include a classification network used to generate the classification category of each pixel in the image; the image is then segmented using the per-pixel classification categories, and the first processing result is the segmented image (see the sketch after this paragraph).
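  • For the segmentation case, a small sketch of how per-pixel classification categories yield the segmented image, assuming the classification network outputs a score map of shape (num_classes, height, width); the names are illustrative:

```python
import torch

def segment(pixel_logits: torch.Tensor) -> torch.Tensor:
    # pixel_logits: (num_classes, H, W) scores from the classification network.
    # The classification category of each pixel is its highest-scoring class;
    # the resulting (H, W) map of category indices is the segmented image.
    return pixel_logits.argmax(dim=0)
```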
  • The first feature extraction network and the second feature extraction network are used to extract the robust representation and the non-robust representation in the input image, which not only avoids the loss of robustness caused by mixing the two, but also retains both representations of the input image, thereby avoiding a decrease in accuracy and improving the robustness and accuracy of the neural network at the same time.
  • In one implementation, the feature processing network can be specifically used to: in a first case, combine the robust representation and the non-robust representation and output the first processing result corresponding to the first image according to the combined representation; in a second case, output the first processing result corresponding to the first image according to the robust representation. The first case and the second case are different cases: the first case may refer to a case where high accuracy of the first processing result is required, while the second case may refer to a case where high robustness of the first processing result is required, that is, a situation where the probability that the input image is a perturbed image is very high. In this implementation, the image processing network includes both a robust path and a standard path, and the user can flexibly choose which path to use according to the actual situation, which expands the application scenarios of this solution and improves its implementation flexibility.
  • In one implementation, the feature processing network may also be used to output, in a third case, the first processing result corresponding to the first image according to the non-robust representation.
  • In one implementation, the feature processing network is specifically a classification network, and the classification network can be used to: perform a classification operation according to the combined representation and output a classification category corresponding to the first image; or perform a classification operation according to the robust representation and output a classification category corresponding to the first image; or perform a classification operation according to the non-robust representation and output a classification category corresponding to the first image (the three paths are sketched below). In this implementation, the provided image processing method is applied to the specific application scenario of image classification, which improves the degree of integration with the application scenario.
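  • A sketch of the three classification paths, assuming splicing as the combination method and zero tensors as the constant tensors so that the classification network always receives an input of the combined length; the mode names mirror the three cases above and all identifiers are illustrative:

```python
import torch

def classify(classifier, robust_repr, nonrobust_repr, mode="combined"):
    if mode == "combined":      # standard path: both representations
        x = torch.cat([robust_repr, nonrobust_repr], dim=-1)
    elif mode == "robust":      # robust path: constant tensor replaces the non-robust part
        x = torch.cat([robust_repr, torch.zeros_like(nonrobust_repr)], dim=-1)
    elif mode == "nonrobust":   # non-robust path: constant tensor replaces the robust part
        x = torch.cat([torch.zeros_like(robust_repr), nonrobust_repr], dim=-1)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return classifier(x).argmax(dim=-1)  # classification category per image
```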
  • In one implementation, if the function of the image processing network is to determine whether the first image is an original image or an adversarial image, the first processing result indicates that the first image is an original image, or the first processing result indicates that the first image is a perturbed image. In this implementation, the feature information extracted by the first feature extraction network and the second feature extraction network can be used not only to obtain a processing result corresponding to an object in the image, but also to obtain a processing result corresponding to the entire image, that is, to judge whether the image is an original image or a perturbed image, which expands the application scenarios of this solution.
  • In one implementation, the feature processing network can be specifically used to: generate a first classification category corresponding to the first image according to the robust representation; generate a second classification category corresponding to the first image according to the non-robust representation; when the first classification category is consistent with the second classification category, output a first processing result indicating that the first image is an original image; and when the first classification category is inconsistent with the second classification category, output a first processing result indicating that the first image is a perturbed image. This method is simple and highly operable; a sketch of the consistency check follows.
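  • A sketch of the described consistency check, under the same illustrative names as the earlier sketches and assuming the classification network accepts a single representation directly: agreement between the two branches marks the image as original, disagreement as perturbed.

```python
def detect_adversarial(classifier, robust_repr, nonrobust_repr):
    # First / second classification categories from the two representations.
    cat_robust = classifier(robust_repr).argmax(dim=-1)
    cat_nonrobust = classifier(nonrobust_repr).argmax(dim=-1)
    # Consistent categories -> original image; inconsistent -> perturbed image.
    is_original = cat_robust == cat_nonrobust
    return is_original
```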
  • In one implementation, the feature processing network can be specifically used to combine the robust representation and the non-robust representation and perform a detection operation based on the combined representation, so as to output a detection result corresponding to the first image; the first processing result includes the detection result. In one case, the detection result may indicate whether the first image is an original image or a perturbed image; in another case, the detection result may indicate which objects are included in the first image, that is, the object type of at least one object included in the first image, and optionally the detection result may also include the position information of each of the aforementioned objects. This provides another implementation of determining whether the first image is an original image or an adversarial image, which enhances the implementation flexibility of this solution.
  • the image processing network is one or more of the following: an image classification network, an image recognition network, an image segmentation network, or an image detection network.
  • the feature processing network includes a perceptron.
  • In one implementation, the first feature extraction network is a convolutional neural network or a residual neural network, and the second feature extraction network is a convolutional neural network or a residual neural network.
  • In a third aspect, an embodiment of the present application provides a neural network training device, which can be used in the image processing field of the artificial intelligence field. The neural network training device may include an input module and a training module.
  • The input module is used to input the adversarial image into the first feature extraction network and the second feature extraction network to obtain the first robust representation generated by the first feature extraction network and the first non-robust representation generated by the second feature extraction network, where the adversarial image is an image obtained by performing perturbation processing on an original image, a robust representation refers to features that are not sensitive to perturbation, and a non-robust representation refers to features that are sensitive to perturbation.
  • the input module is also used to input the first robust representation into the classification network to obtain the first classification category output by the classification network, and input the first non-robust representation into the classification network to obtain the second classification category output by the classification network.
  • The training module is used to iteratively train the first feature extraction network and the second feature extraction network according to the first loss function until the convergence condition is met, and to output the trained first feature extraction network and the trained second feature extraction network. The first loss function is used to represent the similarity between the first classification category and the first label category, and the similarity between the second classification category and the second label category; the first label category is the correct category corresponding to the adversarial image, and the second label category is the wrong category corresponding to the adversarial image.
  • The neural network training device composed of these modules can also be used to implement the steps in the various possible implementations of the first aspect. For the specific implementations of certain steps in the various possible implementations of the third aspect, and for the beneficial effects brought by each possible implementation, reference may be made to the descriptions of the various possible implementations of the first aspect, which will not be repeated here.
  • In a fourth aspect, an embodiment of the present application provides an image processing method, which can be used in the image processing field of the artificial intelligence field. The method may include: the execution device inputs a first image into the first feature extraction network to obtain a robust representation corresponding to the first image generated by the first feature extraction network, where a robust representation refers to features that are not sensitive to perturbation; the execution device inputs the first image into the second feature extraction network to obtain a non-robust representation corresponding to the first image generated by the second feature extraction network, where a non-robust representation refers to features that are sensitive to perturbation; and the execution device outputs, through the feature processing network and according to the robust representation and the non-robust representation, the first processing result corresponding to the first image. The first feature extraction network, the second feature extraction network, and the feature processing network belong to the same image processing network.
  • The execution device may also be used to implement the steps in the various possible implementations of the second aspect. For the specific implementations of certain steps in the various possible implementations of the fourth aspect, and for the beneficial effects brought by each possible implementation, reference may be made to the descriptions of the various possible implementations of the second aspect, which will not be repeated here.
  • An embodiment of the present application provides a training device, which may include a processor coupled to a memory. The memory stores program instructions, and when the program instructions stored in the memory are executed by the processor, the neural network training method of the first aspect is implemented. For the steps executed by the training device in each possible implementation of the first aspect, refer to the first aspect for details, which will not be repeated here.
  • An embodiment of the present application provides an execution device, which may include a processor coupled to a memory. The memory stores program instructions, and when the program instructions stored in the memory are executed by the processor, the functions of the image processing network of the second aspect are implemented. For the steps executed by the image processing network in each possible implementation of the second aspect, refer to the second aspect for details, which will not be repeated here.
  • The embodiments of the present application provide a computer-readable storage medium in which a computer program is stored; when it runs on a computer, the computer executes the neural network training method described in the first aspect, or executes the image processing method described in the fourth aspect.
  • An embodiment of the present application provides a circuit system. The circuit system includes a processing circuit configured to execute the neural network training method described in the first aspect, or configured to execute the image processing method described in the fourth aspect.
  • An embodiment of the present application provides a computer program that, when running on a computer, causes the computer to execute the neural network training method described in the first aspect, or causes the computer to execute the image processing method described in the fourth aspect.
  • An embodiment of the present application provides a chip system. The chip system includes a processor for supporting the training device or the image processing network in implementing the functions involved in the above aspects, for example, sending or processing the data and/or information involved in the above methods. In a possible design, the chip system further includes a memory, which is used to store the program instructions and data necessary for the server or the communication device. The chip system may be composed of chips, or may include chips and other discrete devices.
  • FIG. 1 is a schematic diagram of a structure of an artificial intelligence main framework provided by an embodiment of the present application;
  • FIG. 2 is a system architecture diagram of an image processing system provided by an embodiment of the present application;
  • FIG. 3 is a schematic flowchart of a neural network training method provided by an embodiment of the present application;
  • FIG. 4 is a schematic diagram of a perturbation operation in a neural network training method provided by an embodiment of the present application;
  • FIG. 5 is a schematic diagram of the robust representation and the non-robust representation after visualization processing in the neural network training method provided by an embodiment of the present application;
  • FIG. 6 is a schematic flowchart of an image processing method provided by an embodiment of the present application;
  • FIG. 7 is a schematic diagram of an image processing network in an image processing method provided by an embodiment of the present application;
  • FIG. 8 is a schematic flowchart of an image processing method provided by an embodiment of the present application;
  • FIG. 9 is a schematic structural diagram of a neural network training device provided by an embodiment of the present application;
  • FIG. 10 is a schematic structural diagram of a neural network training device provided by an embodiment of the present application;
  • FIG. 11 is a schematic structural diagram of a neural network training device provided by an embodiment of the present application;
  • FIG. 12 is a schematic structural diagram of an execution device provided by an embodiment of the present application;
  • FIG. 13 is a schematic structural diagram of a training device provided by an embodiment of the present application;
  • FIG. 14 is a schematic diagram of a structure of a chip provided by an embodiment of the present application.
  • The embodiments of the present application provide a neural network for image processing and a related device. The trained first feature extraction network and the trained second feature extraction network can respectively extract the robust representation and the non-robust representation in an input image, which not only avoids the loss of robustness caused by mixing the two, but also retains both the robust and non-robust representations of the input image, thereby avoiding a decrease in accuracy and improving the robustness and accuracy of the neural network at the same time.
  • Figure 1 shows a schematic diagram of the main framework of artificial intelligence.
  • the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has gone through the condensing process of "data-information-knowledge-wisdom".
  • the "IT value chain” from the underlying infrastructure of human intelligence, information (providing and processing technology realization) to the industrial ecological process of the system, reflects the value that artificial intelligence brings to the information technology industry.
  • the infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the basic platform.
  • The smart chips include central processing units (CPU), neural-network processing units (NPU), and graphics processing units (GPU). The basic platform includes distributed computing frameworks and network-related platform guarantees and support, which can include cloud storage and computing, interconnection networks, and so on.
  • sensors communicate with the outside to obtain data, and these data are provided to the smart chip in the distributed computing system provided by the basic platform for calculation.
  • The data at the layer above the infrastructure is used to represent the data sources in the field of artificial intelligence. The data involves graphics, images, voice, and text, as well as Internet of Things data from traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.
  • machine learning and deep learning can symbolize and formalize data for intelligent information modeling, extraction, preprocessing, training, etc.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formal information to conduct machine thinking and solving problems based on reasoning control strategies.
  • the typical function is search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, and usually provides functions such as classification, ranking, and prediction.
  • After the data processing mentioned above, some general capabilities can be formed based on the results of the data processing, such as an algorithm or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, and so on.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, productizing intelligent information decision-making and realizing practical applications. The application fields mainly include intelligent terminals, intelligent manufacturing, intelligent transportation, smart home, smart medical care, smart security, autonomous driving, safe city, and so on.
  • The embodiments of the present application can be applied mainly to image processing scenarios in the above-mentioned application fields. As an example, in the field of autonomous driving, after the sensors on an autonomous vehicle collect an original image, they transmit the original image to the processor of the autonomous vehicle, and the processor uses the image processing network to process the transmitted image. If the pixel values of the original image are not disturbed during transmission, the processor processes the original image; if the pixel values in the original image are disturbed during transmission, the processor processes an adversarial image. That is, the images processed by the processor of the autonomous vehicle may include both original images and adversarial images.
  • As another example, in the field of intelligent terminals, after the smart terminal collects an original image, operations such as light filling and filters may disturb the pixel values of the original image before the neural network processes it, so the image processed by the smart terminal through the neural network may be a perturbed image. That is, in the intelligent terminal field, both original images and adversarial images may be present. The difference between an adversarial image and the original image is imperceptible to the human eye, but the adversarial image will greatly reduce the accuracy of the neural network.
  • It should be understood that the examples here are only for the convenience of understanding the application scenarios of the embodiments of the present application and do not exhaustively list those application scenarios. Moreover, the embodiments of the present application can also be applied to speech processing or text processing scenarios; in the embodiments of the present application, only image processing scenarios are taken as an example for detailed introduction.
  • FIG. 2 is a system architecture diagram of the image processing system provided by an embodiment of the present application. The image processing system 200 includes an execution device 210, a training device 220, a database 230, and a data storage system 240, and the execution device 210 includes a calculation module 211. The database 230 stores a training data set, which includes multiple training images and the label classification of each training image. The training device 220 generates a target model/rule 201 for image processing and iteratively trains the target model/rule 201 using the training data set in the database 230 to obtain a mature target model/rule 201; the target model/rule 201 can be specifically represented as an image processing network.
  • the image processing network obtained by the training device 220 can be applied to different systems or devices.
  • the execution device 210 can call data, codes, etc. in the data storage system 240, and can also store data, instructions, etc. in the data storage system 240.
  • the data storage system 240 may be placed in the execution device 210, or the data storage system 240 may be an external memory relative to the execution device 210.
  • the calculation module 211 may process the image collected by the execution device 210 through the image processing network to obtain the processing result, and the specific expression form of the processing result is related to the function of the image processing network.
  • the "user" can directly interact with the execution device 210, that is, the execution device 210 and the client device are integrated in the same device.
  • FIG. 2 is only a schematic diagram of the architecture of the two image processing systems provided by the embodiment of the present invention, and the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • In other scenarios of the embodiments of the present application, the execution device 210 and the client device may be independent devices. The execution device 210 is equipped with an input/output interface for data interaction with the client device: the client device inputs the collected image through the input/output interface, and the execution device 210 returns the processing result to the client device through the input/output interface.
  • Because the specific implementations of the training phase and the inference phase in the image processing system provided by the embodiments of the present application are different, the embodiments of the present application provide a neural network training method and an image processing method, which are applied to the training phase and the inference phase, respectively. The specific implementation processes of the training phase and the inference phase of the embodiments of the present application are described separately below.
  • the training phase refers to the process in which the training device 220 in FIG. 2 uses training data to perform a training operation.
  • FIG. 3 is a schematic flowchart of a neural network training method provided in an embodiment of the application.
  • the neural network training method provided in an embodiment of the application may include:
  • the training device obtains an original image and a third label category.
  • a training data set is configured on the training device, and the training data set may include an original image and a third label category corresponding to the original image.
  • The original image refers to an image that has not undergone perturbation processing, or an image directly collected.
  • The third label category is the correct category corresponding to the original image; it is the correct classification of one or more objects in the original image, may include one or more classification categories, and is used as supervision data in the training phase. As an example, if an image includes a panda, the corresponding third label category is panda; as another example, if an image includes a panda and a frog, the corresponding third label category is panda and frog. It should be understood that the examples given here are only to facilitate the understanding of this solution and are not used to limit this solution.
  • Perturbation processing refers to slightly adjusting, on the basis of the original image, the pixel values of pixels in the original image to obtain a perturbed image; the perturbed image can also be called an adversarial image. One perturbation process may adjust the pixel value of every pixel in the original image, or adjust the pixel values of only some of the pixels in the original image. The perturbation can be expressed as a two-dimensional matrix whose size is consistent with the size of the original image.
  • FIG. 4 is a schematic diagram of a perturbation operation in the neural network training method provided in an embodiment of the present application, where A1 represents the natural image, A2 represents the perturbation, and A3 represents the adversarial image. The classification category obtained by inputting A1 into the image classification network is panda, while the classification category obtained by inputting A3 into the image classification network is gibbon. It should be understood that the example in FIG. 4 is only to facilitate understanding of the concept of perturbation processing and is not used to limit the solution.
  • The perturbation can be constrained as S = {δ : ‖δ‖_p ≤ ε}, where S represents the restriction on the perturbation δ, ‖δ‖_p represents the p-norm of δ (which can also be called the modulus length of δ), p can be any integer greater than or equal to 1 (as an example, p can be 2, or p can be infinity), and ε is a fixed preset value; the value of ε can be 0.3 or another value. It should be understood that this example is only for further understanding of the concept of the perturbation and is not used to limit the solution.
  • After obtaining the original image, the training device inputs the original image into the first feature extraction network to obtain a second robust representation generated by the first feature extraction network. The first feature extraction network is a convolutional neural network or a residual neural network. As an example, the first feature extraction network may be the feature-extraction part of Wide Residual Networks 34 (WRNS34); as another example, it may be the feature-extraction part of Pre-activated Residual Networks 18 (PRNS18). The first feature extraction network may also be expressed as another type of convolutional neural network or residual neural network, which is not limited here.
  • Robust representation refers to the features that are not sensitive to disturbances among the features extracted from the image.
  • the classification category corresponding to the robust representation extracted from the original image is consistent with the classification category corresponding to the robust representation extracted from the disturbed image corresponding to the original image.
  • the second robust representation includes features that are not sensitive to disturbances among the features extracted from the original image.
• the second robust representation can specifically be expressed as one-dimensional data, two-dimensional data, three-dimensional data, or higher-dimensional data; the length of the second robust representation can be 500, 800, 1000, or another length, which is not limited here.
• After obtaining the original image, the training device inputs the original image into the second feature extraction network to obtain a second non-robust representation generated by the second feature extraction network.
  • the second feature extraction network can also be a convolutional neural network or a residual neural network.
• the second feature extraction network has functions similar to those of the first feature extraction network, except that, after training, the weight parameters of the second feature extraction network differ from those of the first feature extraction network; as a result, what is extracted through the first feature extraction network is the robust representation in the image, while what is extracted through the second feature extraction network is the non-robust representation in the image.
• For the specific manifestation of the second feature extraction network, please refer to the above examples of the first feature extraction network, which will not be repeated here.
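• As a rough illustration of steps 302 and 303, the following PyTorch sketch builds two separately parameterized feature extractors. The text names WRNS34 and PRNS18 as candidate backbones; torchvision's standard resnet18 is substituted here purely as a stand-in, and all shapes are illustrative.

```python
import torch
import torch.nn as nn
import torchvision.models as models

def make_feature_extractor() -> nn.Module:
    """Feature-extraction part of a residual network: everything except the classifier head."""
    backbone = models.resnet18()  # stand-in for the WRNS34 / PRNS18 named in the text
    return nn.Sequential(*list(backbone.children())[:-1], nn.Flatten())

# Two separately parameterized extractors: one trained toward robust
# representations, the other toward non-robust representations.
first_feature_net = make_feature_extractor()   # extracts the robust representation
second_feature_net = make_feature_extractor()  # extracts the non-robust representation

x = torch.randn(1, 3, 32, 32)                  # an original (natural) image
robust_rep = first_feature_net(x)              # second robust representation
non_robust_rep = second_feature_net(x)         # second non-robust representation
print(robust_rep.shape, non_robust_rep.shape)  # e.g. torch.Size([1, 512]) each
```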
  • two specific implementation manners of the first feature extraction network and the second feature extraction network are provided, which improves the implementation flexibility of the solution.
  • the non-robust representation refers to the features that are sensitive to disturbance among the features extracted from the image.
  • the classification category corresponding to the non-robust representation extracted from the original image is inconsistent with the classification category corresponding to the non-robust representation extracted from the disturbed image corresponding to the original image.
  • the second non-robust representation includes features that are sensitive to disturbances among the features extracted from the original image.
• the specific form and length of the second non-robust representation are similar to those of the second robust representation; please refer to the above description, which will not be repeated here.
• FIG. 5 is a schematic diagram of the robust representation and the non-robust representation after visualization processing in the neural network training method provided by the embodiment of the application.
• In FIG. 5, B1 and B2 correspond to the same original image (original image 1), and B3 and B4 correspond to the same original image (original image 2). The object in original image 1 is a squirrel; B1 is obtained after visualization processing of the robust representation extracted from original image 1, and B2 is obtained after visualization processing of the non-robust representation extracted from original image 1.
• Viewing FIG. 5 with the human eye, there is a faint squirrel shape in B1, and B1 also carries the color of a squirrel (which cannot be shown because there is no color in the patent document), whereas the human eye cannot obtain any information from B2.
  • the object in the original image 2 is a ship
  • B3 is obtained by visualizing the robust representation extracted from the original image 2
  • B4 is obtained by visualizing the non-robust representation extracted from the original image 2.
• Similarly, viewing FIG. 5 with the human eye, B3 faintly has the shape of a ship, and B3 also carries the color of the ship (the color cannot be shown in the patent document), while the human eye cannot obtain any information from B4. That is, the feature information included in the robust representation is similar to the features used by the human eye.
• Step 302 can be executed first and then step 303; or step 303 can be executed first and then step 302; steps 302 and 303 can also be executed simultaneously.
• After obtaining the second robust representation and the second non-robust representation, the training device combines the second robust representation and the second non-robust representation to obtain the combined first representation.
• the ways of combination include, but are not limited to, concatenation (concat), addition, fusion, and multiplication.
• The training device inputs the combined first representation into the classification network, so as to perform a classification operation according to the combined first representation through the classification network, to obtain a third classification category output by the classification network.
  • the processing method of steps 304 and 305 may also be referred to as generating a third classification category through a standard path.
• the classification network may include at least one perceptron; the aforementioned perceptron includes at least two neural network layers and may specifically be a double-layer fully connected perceptron.
  • the third classification category indicates the category of the object in the original image.
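• Continuing the sketch above, steps 304 and 305 can be illustrated as follows, assuming concatenation as the combination method, a double-layer fully connected perceptron as the classification network, and illustrative dimensions:

```python
import torch
import torch.nn as nn

REP_DIM = 512      # assumed length of each representation (matches the sketch above)
NUM_CLASSES = 10   # assumed number of classification categories

# Double-layer fully connected perceptron acting as the classification network.
classification_net = nn.Sequential(
    nn.Linear(2 * REP_DIM, 256),  # takes the combined (concatenated) representation
    nn.ReLU(),
    nn.Linear(256, NUM_CLASSES),
)

# Step 304: combine by concatenation (addition or multiplication are alternatives).
combined_first_rep = torch.cat([robust_rep, non_robust_rep], dim=1)

# Step 305: the "standard path" produces the third classification category.
third_category = classification_net(combined_first_rep).argmax(dim=1)
```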
  • the training device inputs the second robust representation into the classification network, so as to perform a classification operation according to the second robust representation through the classification network to obtain a fourth classification category output by the classification network.
• Specifically, the training device may combine the second robust representation with a first constant tensor (such as an all-zero vector) to obtain a combined third representation, and input the combined third representation into the classification network, so that the classification network can perform a classification operation using the feature information included in the second robust representation within the combined third representation, and output a fourth classification category, which is the classification category of the object in the natural image.
• the processing method of step 306 may also be referred to as generating a fourth classification category through a robust path. Further, since the same classification network can be used in step 306 and step 305, the second robust representation is combined with the first constant tensor in order to make the format of the combined third representation consistent with that of the combined first representation. If different classification networks are used in step 306 and step 305, the second robust representation can also be directly input into the classification network.
  • the specific implementation of the combination can be referred to the introduction in step 304, and the specific manifestation of the classification network can be referred to the introduction in step 305, which will not be repeated here.
  • the same classification network may be used as in step 305, or different classification networks may be used respectively.
• the format of the combined third representation and the format of the combined first representation may be the same.
• the first constant tensor refers to a tensor whose values remain unchanged across multiple trainings. It can be expressed as one-dimensional data, two-dimensional data, three-dimensional data, or higher-dimensional data; the values of the constants within a first constant tensor can be the same or different. As examples, the values of all constants in a first constant tensor can all be 0, all be 1, all be 2, or take other values; alternatively, a first constant tensor can include different values such as 1, 2, 3, 5, 12, and 18.
  • the length of the first constant tensor is the same as the length of the second non-robust representation.
• the front and rear positions of the second robust representation and the first constant tensor in the combination may correspond to the front and rear positions of the second robust representation and the second non-robust representation in step 304. That is, if in step 304 the second robust representation is in front and the second non-robust representation is behind, then in step 306 the second robust representation is in front and the first constant tensor is behind; if in step 304 the second non-robust representation is in front and the second robust representation is behind, then in step 306 the first constant tensor is in front and the second robust representation is behind.
  • the training device inputs the second non-robust representation into the classification network, so as to perform a classification operation according to the second non-robust representation through the classification network to obtain a fifth classification category output by the classification network.
• Specifically, the training device may combine the second non-robust representation with a second constant tensor to obtain a combined fourth representation, and input the combined fourth representation into the classification network, so as to perform the classification operation according to the combined fourth representation through the classification network to obtain the fifth classification category output by the classification network.
  • the classification network uses the feature information included in the second non-robust representation in the combined fourth representation to perform a classification operation to output a fifth classification category, which is the classification category of the object in the natural image .
  • the processing method of step 307 may also be referred to as generating a fifth classification category through a non-robust path.
  • the specific implementation of the combination can be referred to the introduction in step 304, and the specific manifestation of the classification network can be referred to the introduction in step 305, which will not be repeated here.
• In step 307, the same classification network can be used as in steps 305 and 306, or different classification networks can be used respectively. If the same classification network is used in step 307 as in steps 305 and 306, then in order to ensure the consistency of the classification network's data processing, the second non-robust representation needs to be combined with the second constant tensor, and the format of the combined fourth representation must be consistent with the format of the combined first representation.
• For the meaning of the second constant tensor, please refer to the description of the first constant tensor.
• the second constant tensor can be the same constant tensor as the first constant tensor, or it can be a different constant tensor, which is not limited here.
  • the front and rear positions of the second non-robust representation and the second constant tensor may correspond to the front and rear positions of the second robust representation and the second non-robust representation.
• That is, if in step 304 the second robust representation is in front and the second non-robust representation is behind, then in step 307 the second constant tensor is in front and the second non-robust representation is behind; if in step 304 the second non-robust representation is in front and the second robust representation is behind, then in step 307 the second non-robust representation is in front and the second constant tensor is behind.
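• The robust path of step 306 and the non-robust path of step 307 can then be sketched under the same assumptions as above, with an all-zero constant tensor occupying the slot of the representation that is left out, so that the combined third and fourth representations keep the same format as the combined first representation:

```python
# Step 306, robust path: second robust representation in front, first constant
# tensor (all zeros here) behind, matching the ordering used in step 304.
first_constant_tensor = torch.zeros_like(non_robust_rep)
combined_third_rep = torch.cat([robust_rep, first_constant_tensor], dim=1)
fourth_category = classification_net(combined_third_rep).argmax(dim=1)

# Step 307, non-robust path: second constant tensor in front, second
# non-robust representation behind.
second_constant_tensor = torch.zeros_like(robust_rep)
combined_fourth_rep = torch.cat([second_constant_tensor, non_robust_rep], dim=1)
fifth_category = classification_net(combined_fourth_rep).argmax(dim=1)
```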
• Steps 304 and 305 can be performed first, then step 306, and then step 307; or step 306 can be performed first, then steps 304 and 305, and then step 307; or step 306 can be performed first, then step 307, and then steps 304 and 305; or steps 304 and 305 can be performed first, then step 307, and then step 306. The sequence of steps 304 and 305, step 306, and step 307 can be arranged arbitrarily, which is not exhaustively listed here. In addition, steps 304 and 305, step 306, and step 307 can also be performed at the same time.
• The training device obtains the adversarial image and the first label category corresponding to the adversarial image.
• the adversarial image is an image that has undergone disturbance processing, and may also be referred to as a post-disturbance image.
• the first label category is the correct category corresponding to the adversarial image, that is, the correct classification of one or more objects in the adversarial image; it may include one or more classification categories and is used as supervision data in the training phase.
• the first label category has a meaning similar to that of the above-mentioned third label category; the difference is that the first label category is for adversarial images, while the third label category is for natural images.
• For an example, please refer to the example of the third label category in step 301, which will not be repeated here.
• the training device may perform perturbation processing on the basis of the above-mentioned natural image to obtain an adversarial image.
• Specifically, the training device may obtain the aforementioned disturbance based on the gradient of the standard path, the robust path, or the non-robust path in each training process, and then obtain the adversarial image; alternatively, the training device may obtain the aforementioned disturbance without relying on the aforementioned gradient. In the two cases the method of generating the adversarial image differs, and the two implementation methods are respectively described below.
• Case A: the adversarial image is generated based on a gradient.
  • the above-mentioned gradient can be divided into the first gradient of the standard path, the second gradient of the robust path, and the third gradient of the non-robust path, which will be introduced separately below.
• In one case, step 308 may include: the training device generates a first gradient according to the function value of the second loss function, performs perturbation processing on the original image according to the first gradient to generate an adversarial image, and determines the third label category as the first label category.
• the second loss function is used to indicate the similarity between the third classification category and the third label category; the second loss function can specifically be a cross-entropy loss function, a maximum margin loss function (max-margin loss), or another type of loss function, which is not limited here.
• In this way, the first gradient is generated according to the similarity between the third classification category and the third label category, and the original image is disturbed according to the first gradient, so that the disturbance processing is more targeted, which is beneficial to speeding up the training process of the first feature extraction network and the second feature extraction network and improves the efficiency of the training process.
• Specifically, the training device may generate the function value of the second loss function according to the third classification category and the third label category, generate the first gradient according to the function value of the second loss function, bring the first gradient into the preset function, multiply by the preset coefficient to obtain the perturbation, and then superimpose the obtained perturbation with the original image to generate an adversarial image.
• the preset function can be a sign function, an identity function, or another function; the value of the preset coefficient can be 0.008, 0.007, 0.006, 0.005, or another coefficient value. The selection of the specific preset function and the value of the preset coefficient can be determined in combination with the actual application environment, and are not limited here.
• As an example, the adversarial image can be generated as x_adv = x + 0.007 · sign(∇_x J(θ, x, y)), where J(θ, x, y) represents the second loss function, θ represents the set of weights of each neural network layer in the first feature extraction network and the second feature extraction network, x represents the natural image input into the first feature extraction network and the second feature extraction network, y represents the third label category, sign represents the preset function (here the sign function), and the preset coefficient value here is 0.007. It should be understood that this example of the formula is only to facilitate understanding of the solution and is not used to limit the solution; the preset function and preset coefficient can also be replaced.
• In another case, the training device can also generate the function value of the second loss function according to the third classification category and the third label category, generate the first gradient according to the function value of the second loss function, multiply the first gradient by the preset coefficient to obtain the disturbance, and then superimpose the obtained disturbance with the original image to generate an adversarial image.
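• A minimal PyTorch sketch of this gradient-based perturbation follows (the sign-function variant with the 0.007 coefficient from the formula above); the cross-entropy choice for the second loss function and all variable names are illustrative assumptions, and the networks come from the earlier sketches.

```python
import torch
import torch.nn.functional as F

def generate_adversarial_image(x, y, first_net, second_net, cls_net, coeff=0.007):
    """Perturb the original image x along the sign of the standard path's gradient."""
    x = x.clone().detach().requires_grad_(True)
    combined = torch.cat([first_net(x), second_net(x)], dim=1)  # standard path
    loss = F.cross_entropy(cls_net(combined), y)                # second loss function (assumed)
    loss.backward()                                             # first gradient: dJ/dx
    with torch.no_grad():
        x_adv = x + coeff * x.grad.sign()                       # preset function: sign
    return x_adv.detach()

y = torch.tensor([3])  # third label category, as a class index (illustrative)
x_adv = generate_adversarial_image(x, y, first_feature_net, second_feature_net,
                                   classification_net)
```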
• In another case, step 308 may include: the training device generates a second gradient according to the function value of the third loss function, perturbs the original image according to the second gradient to generate an adversarial image, and determines the third label category as the first label category.
  • the third loss function is used to indicate the similarity between the fourth classification category and the third label category
  • the type of the third loss function is similar to the type of the second loss function, and no further examples are given here.
  • the difference between the second gradient and the first gradient is that the second gradient is obtained by performing gradient derivation on the function value of the third loss function, and the first gradient is obtained by performing gradient derivation on the function value of the second loss function.
• That is, where J(θ, x, y) in the above formula previously represented the second loss function, it now represents the third loss function.
• In this way, the original image is disturbed according to the similarity between the fourth classification category, which the classification network outputs based on the second robust representation, and the third label category, so that the disturbance processing is more targeted with respect to the first feature extraction network, which helps to improve the feature extraction capability of the first feature extraction network for the robust representation.
• Specifically, the training device may generate the function value of the third loss function according to the fourth classification category and the third label category, generate the second gradient according to the function value of the third loss function, bring the second gradient into the preset function, multiply by the preset coefficient to obtain the perturbation, and then superimpose the obtained perturbation with the original image to generate an adversarial image.
• For the preset function and the preset coefficient, please refer to the description in the above case A, which will not be repeated here.
• In another case, the training device can also generate the function value of the third loss function according to the fourth classification category and the third label category, generate the second gradient according to the function value of the third loss function, multiply the second gradient by the preset coefficient to obtain the disturbance, and then superimpose the obtained disturbance with the original image to generate an adversarial image.
• In another case, step 308 may include: the training device generates a third gradient according to the function value of the fourth loss function, perturbs the original image according to the third gradient to generate an adversarial image, and determines the third label category as the first label category.
  • the fourth loss function is used to indicate the similarity between the fifth classification category and the third label category
  • the type of the fourth loss function is similar to the type of the second loss function, and no further examples are given here.
  • the difference between the third gradient and the first gradient is that the third gradient is obtained by performing gradient derivation on the function value of the fourth loss function, and the first gradient is obtained by performing gradient derivation on the function value of the second loss function.
• That is, where J(θ, x, y) in the above formula previously represented the second loss function, it now represents the fourth loss function.
• In this way, the original image is disturbed according to the similarity between the fifth classification category, which the classification network outputs based on the second non-robust representation, and the third label category, so that the disturbance processing is more targeted with respect to the second feature extraction network, which helps to improve the feature extraction capability of the second feature extraction network for the non-robust representation.
• Specifically, the training device can generate the function value of the fourth loss function according to the fifth classification category and the third label category, generate the third gradient according to the function value of the fourth loss function, bring the third gradient into the preset function, multiply by the preset coefficient to obtain the perturbation, and then superimpose the obtained perturbation with the original image to generate an adversarial image.
• For the preset function and the preset coefficient, please refer to the description in the above case A, which will not be repeated here.
• In another case, the training device may also generate the function value of the fourth loss function according to the fifth classification category and the third label category, generate the third gradient according to the function value of the fourth loss function, multiply the third gradient by the preset coefficient to obtain the disturbance, and then superimpose the obtained disturbance with the original image to generate an adversarial image.
• It should be noted that steps 301 to 307 in the embodiment of this application are optional steps; however, if the adversarial image is obtained based on the gradient of the aforementioned standard path, robust path, or non-robust path, steps 301 to 307 are required, and the execution order of steps 301 to 307 is before step 308.
• Case B: the adversarial image is generated without relying on a gradient. In this case, the training data set configured on the training device may be pre-configured with an adversarial image and the first label category corresponding to the adversarial image, and step 308 may include: the training device obtains the adversarial image and the corresponding first label category from the training data set.
• Alternatively, after the training device obtains the natural image, it generates a disturbance matrix in the form of a two-dimensional matrix according to the size of the two-dimensional matrix corresponding to the natural image, where the values of the disturbance matrix satisfy the constraint in step 301. The value of each parameter in the aforementioned disturbance matrix can be randomly generated; alternatively, the disturbance matrix can be generated in order from small to large values within the constraint range of step 301, or in order from large to small values, or according to other rules, which is not limited here.
• After acquiring the adversarial image, the training device inputs the adversarial image into the first feature extraction network to obtain a first robust representation generated by the first feature extraction network.
  • the concept of robust representation has been introduced in step 302, and will not be repeated here.
• the difference between the first robust representation and the second robust representation is that the second robust representation is the feature information extracted from the original image, while the first robust representation is the feature information extracted from the adversarial image.
• After acquiring the adversarial image, the training device inputs the adversarial image into the second feature extraction network to obtain a first non-robust representation generated by the second feature extraction network.
  • the concept of non-robust representation has been introduced in step 303, and will not be repeated here.
• the difference between the first non-robust representation and the second non-robust representation is that the second non-robust representation is the feature information extracted from the original image, while the first non-robust representation is the feature information extracted from the adversarial image.
• Step 309 can be performed first and then step 310; or step 310 can be performed first and then step 309; steps 309 and 310 can also be performed at the same time.
• After obtaining the first robust representation and the first non-robust representation, the training device combines the first robust representation and the first non-robust representation to obtain a combined second representation.
• The training device inputs the combined second representation into the classification network to obtain a sixth classification category output by the classification network.
  • the classification network used in step 312 and the classification network used in step 305 may be the same classification network or different classification networks.
• the meaning of the sixth classification category is similar to that of the third classification category, except that the sixth classification category indicates the category of the object in the adversarial image.
• The training device judges whether the sixth classification category is the same as the first label category, that is, whether the sixth classification category output by the classification network is the correct classification category corresponding to the adversarial image. If they are not the same, go to step 314; if they are the same, go to step 316.
  • the training device determines the sixth classification category as the second annotation category.
• the second label category refers to the error category corresponding to the adversarial image, that is, a misclassification of one or more objects in the adversarial image; it can include one or more classification categories and is also used as supervision data in the training phase.
• the meaning of the second label category is similar to the meaning of the first label category, except that the second label category is a misclassification corresponding to the adversarial image, while the first label category is the correct classification corresponding to the adversarial image.
  • a method for obtaining the second label category is provided, which is simple to operate and does not require additional steps, which saves computer resources.
• The training device inputs the first robust representation into the classification network to obtain a first classification category output by the classification network.
  • the meaning of the classification network can refer to the description in step 305 above, and step 314 can use the same classification network as in step 305, or a different classification network can be used.
• the first classification category is the classification category of the object in the adversarial image.
• The training device inputs the first non-robust representation into the classification network to obtain a second classification category output by the classification network.
  • the meaning of the classification network can refer to the description in step 305 above, and step 315 can use the same classification network as in step 305, or a different classification network can be used.
• the second classification category is the classification category of the object in the adversarial image.
  • the training device performs iterative training on the first feature extraction network and the second feature extraction network according to the loss function until the convergence condition is met.
• Specifically, the training device may perform iterative training on the first feature extraction network and the second feature extraction network according to the loss function until the convergence condition is satisfied. In each iteration, the training device generates a gradient value according to the function value of the loss function, and uses the gradient value to perform back propagation to update the neuron weights of the first feature extraction network and the second feature extraction network, so as to complete one training of the first feature extraction network and the second feature extraction network.
  • the convergence condition may be that the convergence condition of the loss function is satisfied, or the number of iterations meets the preset number, etc.
• the loss function is used to indicate the similarity between the classification category and the label category; the similarity between the classification category and the label category can also be understood as the difference between the classification category and the label category. Since steps 301 to 307 and step 313 are optional steps, steps 301 to 307 can all be executed, all not executed, or partly executed; and if step 313 is executed, the training device can enter step 316 through step 315 or through step 313. The specific composition of the loss function therefore differs among the foregoing multiple cases, and the various situations are described separately below.
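• Before the case analysis, the following skeleton shows the generic iterative-training shape of step 316 (generate a loss, backpropagate, update weights, stop on a convergence condition); the SGD optimizer, the iteration cap, the threshold, and the hypothetical compute_loss_function() standing in for whichever case applies are all assumptions.

```python
import itertools
import torch

# One optimizer over both feature extraction networks (the classification
# network's parameters could be chained in as well); SGD is an illustrative choice.
optimizer = torch.optim.SGD(
    itertools.chain(first_feature_net.parameters(), second_feature_net.parameters()),
    lr=0.1,
)

MAX_ITERATIONS = 10_000  # "number of iterations meets the preset number"

for step in range(MAX_ITERATIONS):
    loss = compute_loss_function()  # hypothetical: one of the cases described below
    optimizer.zero_grad()
    loss.backward()                 # gradient value used for back propagation
    optimizer.step()                # update neuron weights of both networks
    if loss.item() < 1e-3:          # illustrative convergence condition
        break
```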
  • Step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to the first loss function.
• the first loss function is used to indicate the similarity between the first classification category and the first label category, and to indicate the similarity between the second classification category and the second label category.
• the first loss function may specifically be expressed as a cross-entropy loss function, a maximum interval loss function, or another type of loss function, which is not limited here. In order to understand the first loss function more intuitively, an expression of the first loss function is shown as follows:
• L_AS(θ, x, y) = l(h_r(x_adv; θ_1), y_1) + l(h_n(x_adv; θ_2), y_2), where L_AS(θ, x, y) represents the first loss function, h_r and h_n represent the robust path and the non-robust path respectively, l(h_r(x_adv; θ_1), y_1) represents the similarity between the first classification category and the first label category, l(h_n(x_adv; θ_2), y_2) represents the similarity between the second classification category and the second label category, x_adv represents the adversarial image, θ_1 represents the weights in the first feature extraction network, y_1 represents the first label category, θ_2 represents the weights in the second feature extraction network, and y_2 represents the second label category.
• After obtaining the first classification category and the second classification category through steps 314 and 315, the training device generates the function value of the first loss function, obtains the gradient value corresponding to the function value of the first loss function, and uses that gradient value to perform back propagation to update the neuron weights of the first feature extraction network and the second feature extraction network, thereby completing one training of the first feature extraction network and the second feature extraction network.
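• Continuing the earlier sketches (cross-entropy standing in for the similarity measure l, zeros for the unused slot), one pass that forms the first loss function and backpropagates it could look like this:

```python
# Representations of the adversarial image from both extractors.
adv_robust = first_feature_net(x_adv)
adv_non_robust = second_feature_net(x_adv)
zeros_n = torch.zeros_like(adv_non_robust)
zeros_r = torch.zeros_like(adv_robust)

y1 = y  # first label category: the correct class of x_adv
# Steps 311-313: the sixth classification category (standard path on x_adv)
# becomes the second label category y2 when it differs from y1.
sixth_category = classification_net(
    torch.cat([adv_robust, adv_non_robust], dim=1)).argmax(dim=1)
y2 = sixth_category

# First loss function L_AS: robust path toward y1, non-robust path toward y2.
loss_as = (
    F.cross_entropy(classification_net(torch.cat([adv_robust, zeros_n], dim=1)), y1)
    + F.cross_entropy(classification_net(torch.cat([zeros_r, adv_non_robust], dim=1)), y2)
)
loss_as.backward()  # gradients reach both feature extraction networks
```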
• In another case, step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to the first loss function and the second loss function.
• the second loss function is used to indicate the similarity between the third classification category and the third label category, and the third label category is the correct category corresponding to the original image.
• In this way, not only are adversarial images used to train the feature extraction capabilities of the first feature extraction network and the second feature extraction network, but natural images are also used to train the first feature extraction network and the second feature extraction network, so as to further improve the accuracy of the trained first feature extraction network and the trained second feature extraction network in processing natural images.
• Specifically, after obtaining the third label category through step 301, the third classification category through step 305, and the first classification category and the second classification category through steps 314 and 315, the training device generates the function values of the first loss function and the second loss function.
• The training device can generate a total function value according to the function value of the first loss function and the function value of the second loss function, and use the total function value to perform one training of the first feature extraction network and the second feature extraction network. Specifically, the training device may directly add the function value of the first loss function and the function value of the second loss function to obtain the total function value, or the training device may allocate different weights to the function value of the first loss function and the function value of the second loss function respectively and then add them to obtain the total function value.
  • the specific steps of using function values to complete a training can refer to the above description, which will not be repeated here.
• In another case, step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to the first loss function and the third loss function.
  • the third loss function is used to represent the similarity between the fourth classification category and the third label category, and the third label category is the correct category corresponding to the original image.
• In this way, not only are adversarial images used to train the feature extraction capabilities of the first feature extraction network and the second feature extraction network, but natural images are also used to train the robust representation extraction capability of the first feature extraction network, so as to further improve the accuracy of the trained first feature extraction network.
• Specifically, after obtaining the third label category in step 301, the fourth classification category in step 306, and the first classification category and the second classification category in steps 314 and 315, the training device generates the function values of the first loss function and the third loss function.
• The training device uses the function values of the first loss function and the third loss function to complete one training of the first feature extraction network and the second feature extraction network; for the specific implementation method, please refer to the description of training using the function values of the first loss function and the second loss function, which will not be repeated here.
• In another case, step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to the first loss function and the fourth loss function.
  • the fourth loss function is used to represent the similarity between the fifth classification category and the third label category, and the third label category is the correct category corresponding to the original image.
• In this way, not only are adversarial images used to train the feature extraction capabilities of the first feature extraction network and the second feature extraction network, but natural images are also used to train the second feature extraction network's extraction capability for non-robust representations, so as to further improve the accuracy of the trained second feature extraction network.
• Specifically, after obtaining the third label category through step 301, the fifth classification category through step 307, and the first classification category and the second classification category through steps 314 and 315, the training device generates the function values of the first loss function and the fourth loss function.
  • the training device uses the function values of the first loss function and the fourth loss function to complete a training of the first feature extraction network and the second feature extraction network.
• In another case, step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to the first loss function, the second loss function, and the third loss function.
• After the training device generates the function value of the first loss function, the function value of the second loss function, and the function value of the third loss function, it can generate a total function value based on the function values of the first loss function, the second loss function, and the third loss function, and then use the total function value to train the first feature extraction network and the second feature extraction network.
• For details, please refer to the description of training based on the function values of the first loss function and the second loss function, which will not be repeated here.
• In another case, step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to the first loss function, the second loss function, and the fourth loss function.
• In another case, step 316 may include: the training device iteratively trains the first feature extraction network and the second feature extraction network according to the first loss function, the third loss function, and the fourth loss function until the convergence condition is met.
• In another case, step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to the first loss function and the fifth loss function.
• the fifth loss function is used to represent the similarity between the fourth classification category and the third label category, the similarity between the fifth classification category and the third label category, and the similarity between the sixth classification category and the third label category, where the third label category is the correct category corresponding to the original image.
• the fifth loss function may specifically be expressed as a cross-entropy loss function, a maximum interval loss function, or another type of loss function, which is not limited here.
• In order to understand the fifth loss function more intuitively, an expression is shown as follows: L_total(θ, x, y) = L_AS(θ, x, y) + L_ST(θ, x, y), with L_ST(θ, x, y) = l(h_s(x; θ_3), y_2) + l(h_r(x; θ_1), y_2) + l(h_n(x; θ_2), y_2), where L_total(θ, x, y) represents the total loss function, L_AS(θ, x, y) represents the first loss function, L_ST(θ, x, y) represents the fifth loss function, l(h_s(x; θ_3), y_2) represents the similarity between the fourth classification category and the third label category, x represents the original image, θ_3 represents the weights used by the standard path, y_2 here represents the third label category (unlike in the expression of the first loss function above), l(h_r(x; θ_1), y_2) represents the similarity between the fifth classification category and the third label category, and l(h_n(x; θ_2), y_2) represents the similarity between the sixth classification category and the third label category.
• For the meanings of the other letters in the above formula, please refer to the above description of the first loss function. It should be understood that this example of the fifth loss function is only to facilitate understanding of the solution, and is not used to limit the solution.
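• Under the same illustrative assumptions as the earlier sketches (cross-entropy standing in for l), the fifth loss function and the total loss of this case can be sketched as follows (loss_as comes from the earlier sketch):

```python
# Fifth loss function L_ST: standard, robust, and non-robust paths on the
# natural image x should all predict the third label category.
nat_robust = first_feature_net(x)
nat_non_robust = second_feature_net(x)
z_n = torch.zeros_like(nat_non_robust)
z_r = torch.zeros_like(nat_robust)
y_nat = y  # third label category (correct class of the original image)

loss_st = (
    F.cross_entropy(classification_net(torch.cat([nat_robust, nat_non_robust], dim=1)), y_nat)
    + F.cross_entropy(classification_net(torch.cat([nat_robust, z_n], dim=1)), y_nat)
    + F.cross_entropy(classification_net(torch.cat([z_r, nat_non_robust], dim=1)), y_nat)
)

loss_total = loss_as + loss_st  # L_total = L_AS + L_ST
```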
• In this way, not only are the processing capabilities of the first feature extraction network and the second feature extraction network for natural images improved, but their processing capabilities for adversarial images are also improved; that is, whether the input is a natural image or an adversarial image, the trained first feature extraction network and the trained second feature extraction network can accurately extract the robust representation and the non-robust representation, which expands the application scenarios of this solution.
• If step 316 is entered through step 313, the training device no longer trains the first feature extraction network and the second feature extraction network according to the loss function; instead, it re-enters step 308 to obtain a new adversarial image and a new first label category, that is, it enters a new training process.
• In one case, step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to the second loss function.
• In another case, step 316 may include: the training device trains the first feature extraction network according to the third loss function.
• In another case, step 316 may include: the training device trains the second feature extraction network according to the fourth loss function.
• In another case, step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to the second loss function and the third loss function.
• In another case, step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to the second loss function and the fourth loss function.
• In another case, step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to the third loss function and the fourth loss function.
• In another case, step 316 may include: the training device iteratively trains the first feature extraction network and the second feature extraction network according to the fifth loss function.
• If step 313 is not performed, the specific implementation of step 316 can refer to the description of the various situations in which step 313 is performed and step 316 is entered through step 315, which is not repeated here.
• If the sixth classification category is the same as the first label category, this proves that the disturbance of the perturbed image is too slight, so that processing it is essentially the same as processing the natural image; the purpose of training here, however, is to enhance the ability of the first feature extraction network and the second feature extraction network to separate robust and non-robust representations from images with larger disturbances. Subsequent training operations are therefore performed only when the sixth classification category is different from the first label category, which improves the efficiency of the training process.
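• The branching of step 313 can be sketched as a simple guard around the training step (names continue the earlier sketches):

```python
# Step 313: train only when the disturbance actually fooled the standard path.
if torch.equal(sixth_category, y1):
    # Sixth classification category equals the first label category: the
    # disturbance is too slight, so return to step 308 for a new adversarial image.
    pass
else:
    # The sixth classification category becomes the second label category,
    # and steps 314-316 (first loss function, back propagation) proceed.
    y2 = sixth_category
```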
  • the training device outputs the trained first feature extraction network and the trained second feature extraction network.
• After determining that the convergence condition is satisfied, the training device outputs the trained first feature extraction network and the trained second feature extraction network. The trained first feature extraction network and the trained second feature extraction network can be used as the feature extraction part of various image processing networks; that is, they can be combined with a high-level feature processing network to realize various functions.
  • the aforementioned various functions may include one or more of the following: image classification, image recognition, image segmentation, or image detection.
  • the aforementioned function may also be to determine the image category, for example, to determine whether the image is a natural image or a confrontational image.
  • the technicians discovered during the research process that through adversarial training, the neural network only extracts the robust representation from the input image, while discarding the non-robust representation, resulting in a decrease in the accuracy of the neural network when processing the original image.
• In the embodiment of the application, the trained first feature extraction network and the trained second feature extraction network can respectively extract the robust representation and the non-robust representation in the input image, which avoids the reduction in robustness caused by mixing the two, while also retaining both the robust representation and the non-robust representation in the input image, thereby avoiding the reduction in accuracy and improving the robustness and accuracy of the neural network at the same time.
• The inference stage refers to the process in which the above-mentioned execution device 210 uses the trained image processing network to process the input image. Since the trained first feature extraction network and the trained second feature extraction network are obtained through the embodiments corresponding to FIG. 3, they can be combined with various high-level feature processing network layers to implement various functions; the specific functions have been introduced in step 317. The two types of image processing networks mentioned in step 317 are introduced below.
• First introduced is the image processing network whose processing target is the object in the image, that is, the above-mentioned image classification, image recognition, image segmentation, or image detection; please refer to FIG. 6.
  • FIG. 6 is a schematic flowchart of an image processing method provided by an embodiment of the present application.
  • the image processing method provided by an embodiment of the present application may include:
  • the execution device acquires a first image.
• the execution device may collect the first image in real time, may obtain the first image from a gallery stored on the execution device, or may download the first image through a wireless or wired network.
• the first image may be an original image or an adversarial image.
• the execution device may be specifically represented as a mobile phone, a computer, a wearable device, an autonomous vehicle, a smart home appliance, or a chip; different types of execution devices may obtain the first image in different ways. As an example, if the execution device is a mobile phone, it may collect the first image through a camera on the mobile phone, or may download the first image through a browser.
• As another example, if the execution device is an autonomous driving vehicle, the autonomous driving vehicle can obtain the first image through sensor collection. The specific manner in which the execution device acquires the first image can be determined in combination with actual application scenarios and products, and will not be detailed here.
  • the execution device inputs the first image into the first feature extraction network to obtain a third robust representation generated by the first feature extraction network.
• After acquiring the first image, the execution device inputs the first image into the first feature extraction network, so that the first feature extraction network generates the third robust representation corresponding to the first image. For the specific expression form of the first feature extraction network and the meaning of the robust representation, please refer to the description in the embodiment corresponding to FIG. 3, which will not be repeated here.
• The execution device inputs the first image into the second feature extraction network to obtain a third non-robust representation generated by the second feature extraction network.
• After acquiring the first image, the execution device inputs the first image into the second feature extraction network, so that the second feature extraction network generates the third non-robust representation corresponding to the first image. For the specific expression form of the second feature extraction network and the meaning of the non-robust representation, please refer to the description in the embodiment corresponding to FIG. 3, which will not be repeated here.
• The execution device combines the third robust representation and the third non-robust representation to obtain a combined fourth representation. For the specific implementation of step 604, please refer to the description of step 304 in the embodiment corresponding to FIG. 3, which will not be repeated here.
• The first case refers to the case where the accuracy of the output result of the image processing network is required to be high; the specific circumstances can be determined in combination with the actual application scenario, and are not limited here.
• In the first case, the execution device inputs the combined fourth representation into the feature processing network, so that the feature processing network outputs a first processing result corresponding to the first image according to the combined fourth representation. The first processing result is related to the function of the entire image processing network.
• As an example, if the function of the image processing network is image classification, the feature processing network may be a classification network, and the first processing result is used to indicate the classification category of the entire image; the classification network may specifically be represented as a network that includes at least one perceptron, and the aforementioned perceptron can be a double-layer fully connected perceptron.
• As another example, if the function of the image processing network is image recognition, the feature processing network may be a recognition network, and the first processing result is used to indicate the content recognized from the image, such as text content in the image.
• As another example, if the function of the image processing network is image segmentation, the feature processing network may include a classification network, which is used to generate the classification category of each pixel in the image; the image is then divided using the classification category of each pixel, and the first processing result is the divided image.
• As another example, if the function of the image processing network is image detection, the first processing result may be specifically expressed as a detection result, where the detection result indicates which objects are included in the first image, that is, it may indicate at least one object included in the first image; the detection result can also include the position information of each of the aforementioned objects, which can be determined in combination with actual product requirements and will not be exhaustively listed here.
  • a variety of specific implementation manners of the image processing network are provided, which expands the application scenarios of this solution and improves the implementation flexibility of this solution.
  • the classification network on the execution device may perform a classification operation according to the combined fourth representation, and output a classification category corresponding to the first image.
• As another example, the recognition network on the execution device can perform a recognition operation according to the combined fourth representation and output a recognition result corresponding to the first image; not all application scenarios are exhaustively listed here.
  • the execution device outputs the first processing result corresponding to the first image according to the third robust representation through the feature processing network.
• In the second case, the execution device may input the third robust representation into the feature processing network, so that the feature processing network outputs the first processing result corresponding to the first image according to the third robust representation. For a specific implementation manner, refer to the description of step 306 in the embodiment corresponding to FIG. 3.
  • the first situation and the second situation are different situations.
• The second situation refers to a situation where the robustness of the output result of the image processing network is required to be high, or where the image processing network is in a high-risk state, that is, when the probability that the input image is a perturbed image is high; the specific situation can be determined in combination with the actual application scenario, which is not limited here.
  • the classification network on the execution device may perform the classification operation according to the third robust representation, and output the classification category corresponding to the first image.
  • the recognition network on the execution device can perform the recognition operation according to the third robust representation, and output the recognition result corresponding to the first image, etc. All application scenarios are not exhaustively listed here.
  • the image processing network includes both a robust path and a standard path, and the user can flexibly choose which path to use according to the actual situation, which expands the application scenarios of the solution and improves the flexibility of the solution.
  • the execution device outputs the first processing result corresponding to the first image according to the third non-robust representation through the feature processing network.
  • the execution device may also input the third non-robust representation into the feature processing network, so that the feature processing network outputs the first processing result corresponding to the first image according to the third non-robust representation.
  • the third situation is different from the first situation and the second situation.
  • the classification network on the execution device may perform the classification operation according to the third non-robust representation, and output the classification category corresponding to the first image.
  • the recognition network on the execution device may perform the recognition operation according to the third non-robust representation, and output the recognition result corresponding to the first image. All application scenarios are not exhaustively listed here. In the embodiment of the present application, the provided image processing method falls into the specific application scenario of image classification, which improves the degree of integration with the application scenario.
  • step 607 is an optional step. If step 607 is not performed, the execution can end after step 605 is performed or after step 606 is performed.
  • step 605, step 606, and step 607 shown in the above embodiment are in a parallel relationship; more than one of them can also be performed, for example steps 605 and 606 can be executed, or steps 605 and 607, or steps 606 and 607, or steps 605 to 607, etc.; which specific steps are executed can be determined in conjunction with the specific application scenario and is not limited here.
  • FIG. 7 is a schematic diagram of the image processing network in the image processing method provided by the embodiment of the application.
  • in FIG. 7, an image classification network is taken as an example of the image processing network.
  • the image processing network includes a first feature extraction network, a second feature extraction network, and a classification network.
  • the first image is input into the first feature extraction network and the second feature extraction network, respectively, to obtain a robust representation generated by the first feature extraction network and a non-robust representation generated by the second feature extraction network.
  • the classification network in FIG. 7 includes three paths, namely a robust path, a standard path, and a non-robust path: the robust path means that the classification network performs classification operations based on the robust representation; the standard path means that the robust representation and the non-robust representation are combined and the classification network performs classification operations based on the combined representation; the non-robust path means that the classification network performs classification operations based on the non-robust representation (a sketch of the three paths follows this list item).
  • FIG. 7 shows the robust path, the standard path, and the non-robust path using the same classification network; it should be understood that the example in FIG. 7 is only to facilitate understanding of the solution, and in other implementations the three paths can use three different classification networks.
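The three paths of FIG. 7 can be summarized in a short sketch; it assumes PyTorch modules f_robust, f_nonrobust, and classifier are defined elsewhere, and takes "combining" the representations to be concatenation, which this application does not mandate.

```python
import torch

def classify(image: torch.Tensor, path: str = "standard") -> torch.Tensor:
    r = f_robust(image)       # robust representation
    n = f_nonrobust(image)    # non-robust representation
    if path == "robust":
        rep = r               # robust path
    elif path == "non_robust":
        rep = n               # non-robust path
    else:
        # standard path: combine the two representations (assumed: concatenation)
        rep = torch.cat([r, n], dim=-1)
    # using one shared classification network for all widths is a simplification;
    # FIG. 7 also allows three different classification networks, one per path
    return classifier(rep).argmax(dim=-1)  # predicted classification category
```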
  • the second case introduced is an image processing network whose processing target is the entire image, that is, the image processing network is used to determine whether the input is a natural image or an adversarial image.
  • FIG. 8 is a schematic flowchart of an image processing method provided by an embodiment of the present application.
  • the image processing method provided by an embodiment of the present application may include:
  • the execution device acquires a first image.
  • the execution device inputs the first image into the first feature extraction network to obtain a third robust representation generated by the first feature extraction network.
  • the execution device inputs the first image into the second feature extraction network to obtain a third non-robust representation generated by the second feature extraction network.
  • for the specific implementation of steps 801 to 803, please refer to the description of steps 601 to 603 in the embodiment corresponding to FIG. 6, which will not be repeated here.
  • the execution device outputs, through the feature processing network, a first processing result corresponding to the first image according to the third robust representation and the third non-robust representation; the first processing result indicates that the first image is the original image, or it indicates that the first image is a disturbed image.
  • after the execution device obtains the third robust representation and the third non-robust representation, it can output the first processing result according to them through the feature processing network.
  • the feature processing network may include at least one perceptron.
  • the perceptron refer to the description of step 305 in the embodiment corresponding to FIG. 3.
  • step 804 may include: after obtaining the third robust representation and the third non-robust representation, the execution device inputs them into the feature processing network; through the feature processing network, the seventh classification category corresponding to the first image is determined according to the third robust representation, and the eighth classification category corresponding to the first image is determined according to the third non-robust representation.
  • the feature processing network may include a classification network, and the execution device uses a classification network in the feature processing network to sequentially perform two classification operations to obtain the seventh classification category and the eighth classification category, respectively.
  • the feature processing network may include two classification networks, and the execution device uses the two classification networks in the feature processing network to perform two classification operations in parallel to obtain the seventh classification category and the eighth classification category, respectively.
  • the execution device judges, through the feature processing network, whether the seventh classification category is consistent with the eighth classification category: if they are consistent, the first processing result output by the feature processing network indicates that the first image is the original image; if they are inconsistent, the first processing result output by the feature processing network indicates that the first image is a disturbed image.
  • the first processing result may be expressed in text form, for example as a "natural image" or an "adversarial image".
  • the first processing result can also be expressed in the form of characters, for example as "0 0.3 1 0.7", where 0 refers to a natural image and 1 refers to an adversarial image, that is, the first image is a natural image with probability 0.3 and an adversarial image with probability 0.7, so the first processing result indicates that the first image is an adversarial image.
  • alternatively, the first processing result may be expressed as "0.3 0.7", where 0.3 indicates the probability that the first image is a natural image and 0.7 indicates the probability that the first image is an adversarial image, so the first processing result indicates that the first image is an adversarial image.
  • the example of the first processing result here is only to facilitate understanding of the solution, and is not used to limit the solution.
  • by judging whether the seventh classification category and the eighth classification category are consistent, it is determined whether the first image is the original image or an adversarial image; the method is simple and highly operable (a sketch follows this list item).
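A sketch of this consistency check, under the same assumed modules as the earlier path sketch (f_robust, f_nonrobust, classifier) and for a single input image, might look as follows.

```python
import torch

def is_disturbed(image: torch.Tensor) -> bool:
    # seventh classification category: from the robust representation
    cat_r = classifier(f_robust(image)).argmax(dim=-1)
    # eighth classification category: from the non-robust representation
    cat_n = classifier(f_nonrobust(image)).argmax(dim=-1)
    # consistent categories -> original image; inconsistent -> disturbed image
    return not torch.equal(cat_r, cat_n)
```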
  • in another case, step 804 may include: combining the third robust representation and the third non-robust representation, and performing a detection operation through a detection network according to the combined fifth representation to output a detection result corresponding to the first image, where the detection result is the first processing result.
  • the detection network may include at least one perceptron.
  • for the meaning of the perceptron, refer to the description of step 305 in the embodiment corresponding to FIG. 3.
  • for the specific manifestation of the detection result, refer to the description of the first processing result in the previous case, which will not be repeated here; a sketch of such a detection network follows this list item.
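A hypothetical detection network over the combined fifth representation could be sketched as below; the layer widths (1024 for the concatenated representation, 256 hidden) and the two-class softmax output are illustrative assumptions.

```python
import torch
import torch.nn as nn

# perceptron-style detection network emitting [p_natural, p_adversarial]
detector = nn.Sequential(
    nn.Linear(1024, 256),  # 1024 = assumed width of the combined representation
    nn.ReLU(),
    nn.Linear(256, 2),
)

def detect(rep_robust: torch.Tensor, rep_nonrobust: torch.Tensor) -> torch.Tensor:
    combined = torch.cat([rep_robust, rep_nonrobust], dim=-1)  # fifth representation
    return detector(combined).softmax(dim=-1)  # e.g. [0.3, 0.7] -> disturbed image
```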
  • not only can the feature information extracted by the first feature extraction network and the second feature extraction network be used to obtain the processing result corresponding to an object in the image, but the processing result corresponding to the entire image can also be obtained, that is, it can be used to judge whether the image is the original image or a disturbed image, which expands the application scenarios of this solution.
  • the first feature extraction network and the second feature extraction network extract the robust representation and the non-robust representation in the input image respectively, which not only avoids the mixing of the two that would reduce robustness, but also retains both the robust representation and the non-robust representation in the input image, thereby avoiding a reduction in accuracy and improving the robustness and accuracy of the neural network at the same time.
  • Table 1 takes the first feature extraction network and the second feature extraction network as the feature extraction part of WRNS34 as an example, where S refers to the standard data set, which can include natural images and adversarial images; R refers to the adversarial data set, which only includes adversarial images; and N refers to the natural data set, which only includes natural images.
  • adversarial training (AT) and iterative optimization are two current training methods; the data in Table 1 show that when processing images in the three data sets S, R, and N, the training method provided in the embodiments of this application achieves the highest accuracy rate in every case, that is, the embodiments of the present application provide a training scheme that improves robustness and accuracy at the same time.
  • here too, the first feature extraction network and the second feature extraction network are both the feature extraction part of WRNS34; detection accuracy refers to the proportion of images whose prediction results match the actual situation among all input images, and the image processing network obtained through the training method provided in the embodiments of this application also achieves a high detection accuracy.
  • FIG. 9 is a schematic structural diagram of a neural network training device provided by an embodiment of the application.
  • the training device 900 of the neural network may include an input module 901 and a training module 902.
  • the input module 901 is used to input the confrontation image into the first feature extraction network and the second feature extraction network to obtain the first robust representation generated by the first feature extraction network and the first non-robust representation generated by the second feature extraction network.
  • the input module 901 is further configured to input the first robust representation into the classification network to obtain the first classification category output by the classification network, and input the first non-robust representation into the classification network to obtain the second classification category output by the classification network.
  • the training module 902 is configured to iteratively train the first feature extraction network and the second feature extraction network according to the first loss function until the convergence condition is met, and to output the trained first feature extraction network and the trained second feature extraction network.
  • the first loss function is used to represent the similarity between the first classification category and the first label category, and the similarity between the second classification category and the second label category, where the first label category is the correct category corresponding to the adversarial image and the second label category is the wrong category corresponding to the adversarial image (a sketch of this loss follows this list item).
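A sketch of one training step under the first loss function follows; cross-entropy is one possible similarity measure, while the optimizer and the module names (reused from the earlier sketches, with the same width caveat) are assumptions.

```python
import torch.nn.functional as F

def training_step(adv_image, y_correct, y_wrong, optimizer):
    logits_r = classifier(f_robust(adv_image))     # -> first classification category
    logits_n = classifier(f_nonrobust(adv_image))  # -> second classification category
    # first loss function: pull the robust branch toward the first (correct)
    # label category and the non-robust branch toward the second (wrong) one
    loss = F.cross_entropy(logits_r, y_correct) + F.cross_entropy(logits_n, y_wrong)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```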
  • the technicians discovered during the research process that through adversarial training, the neural network only extracts the robust representation from the input image, while discarding the non-robust representation, resulting in a decrease in the accuracy of the neural network when processing the original image.
  • the robust representation and the non-robust representation in the input image are extracted through the first feature extraction network and the second feature extraction network respectively, which not only avoids the mixing of the two and leads to a decrease in robustness, but also The robust representation and the non-robust representation in the input image can be retained at the same time, thereby avoiding the decrease in accuracy rate, and improving the robustness and accuracy of the neural network at the same time.
  • FIG. 10 is a schematic structural diagram of a neural network training device provided by an embodiment of this application.
  • the input module 901 is also used to input the original image into the first feature extraction network and the second feature extraction network respectively, to obtain a second robust representation generated by the first feature extraction network and a second non-robust representation generated by the second feature extraction network.
  • the device 900 further includes: a combining module 903, configured to combine the second robust representation and the second non-robust representation to obtain the combined first representation.
  • the input module 901 is further configured to input the combined first representation into the classification network to perform a classification operation according to the combined first representation through the classification network to obtain a third classification category output by the classification network.
  • the training module 902 is specifically configured to perform iterative training on the first feature extraction network and the second feature extraction network according to the first loss function and the second loss function until the convergence condition is met, where the second loss function is used to represent the third The similarity between the classification category and the third annotation category, which is the correct category corresponding to the original image.
  • in this way, the training module 902 not only uses adversarial images to train the feature extraction capabilities of the first feature extraction network and the second feature extraction network, but also uses natural images to train those capabilities, so as to further improve the accuracy of the trained first feature extraction network and the trained second feature extraction network when processing natural images.
  • the input module 901 is also used to input the original image into the first feature extraction network to obtain the second robust representation generated by the first feature extraction network.
  • the input module 901 is further configured to input the second robust representation into the classification network, so as to perform a classification operation according to the second robust representation through the classification network to obtain the fourth classification category output by the classification network.
  • the training module 902 is specifically configured to perform iterative training on the first feature extraction network and the second feature extraction network according to the first loss function and the third loss function until the convergence condition is met, where the third loss function is used to represent the fourth The similarity between the classification category and the third annotation category, which is the correct category corresponding to the original image.
  • the training module 902 not only uses adversarial images to train the feature extraction capabilities of the first feature extraction network and the second feature extraction network, but also uses natural images to train the robust representation extraction capability of the first feature extraction network, to further improve the accuracy of the trained first feature extraction network.
  • the input module 901 is also used to input the original image into the second feature extraction network to obtain a second non-robust representation generated by the second feature extraction network.
  • the input module 901 is further configured to input the second non-robust representation into the classification network to perform a classification operation according to the second non-robust representation through the classification network to obtain the fifth classification category output by the classification network.
  • the training module 902 is specifically configured to perform iterative training on the first feature extraction network and the second feature extraction network according to the first loss function and the fourth loss function until the convergence condition is met, where the fourth loss function is used to represent the fifth The similarity between the classification category and the third annotation category, which is the correct category corresponding to the original image.
  • the training module 902 not only uses confrontation images to train the feature extraction capabilities of the first feature extraction network and the second feature extraction network, but also uses natural images to train the second feature extraction network's extraction capabilities for non-robust representations. In order to further improve the accuracy of the second feature extraction network after training.
  • the input module 901 is also used to input the original image into the first feature extraction network and the second feature extraction network, respectively, to obtain the second robust representation and the second feature extraction generated by the first feature extraction network The second non-robust representation generated by the network.
  • the input module 901 is also used to combine the second robust representation and the second non-robust representation to obtain a combined first representation, and to input the combined first representation into the classification network, so as to perform a classification operation according to the combined first representation through the classification network to obtain the third classification category output by the classification network.
  • the input module 901 is further configured to input the second robust representation into the classification network, so as to perform a classification operation according to the second robust representation through the classification network to obtain the fourth classification category output by the classification network.
  • the input module 901 is further configured to input the second non-robust representation into the classification network to perform a classification operation according to the second non-robust representation through the classification network to obtain the fifth classification category output by the classification network.
  • the training module 902 is specifically configured to perform iterative training on the first feature extraction network and the second feature extraction network according to the first loss function and the fifth loss function until the convergence condition is met, where the fifth loss function is used to represent the similarity between the third classification category and the third annotation category, the similarity between the fourth classification category and the third annotation category, and the similarity between the fifth classification category and the third annotation category, and the third label category is the correct category corresponding to the original image.
  • in this way, the processing capabilities of the first feature extraction network and the second feature extraction network are further improved, that is, whether the input is a natural image or an adversarial image, the trained first feature extraction network and the trained second feature extraction network can accurately extract the robust representation and the non-robust representation, which expands the application scenarios of this solution.
  • the device further includes a generating module 904, which is specifically configured to: generate a first gradient according to the function value of the second loss function, perform perturbation processing on the original image according to the first gradient to generate an adversarial image, and determine the third label category as the first label category.
  • the generation module 904 generates the first gradient according to the similarity between the third classification category and the third annotation category and perturbs the original image according to the first gradient, making the disturbance processing more targeted; this helps to speed up the training of the first feature extraction network and the second feature extraction network and improves the efficiency of the training process (a sketch in the spirit of FGSM follows this list item).
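The gradient-based perturbation can be sketched in the spirit of the fast gradient sign method (FGSM); the application only specifies perturbing the original image according to the gradient, so the sign step and the epsilon value below are assumptions, as are the reused modules.

```python
import torch
import torch.nn.functional as F

def perturb(original_image: torch.Tensor, label: torch.Tensor, epsilon: float = 8 / 255):
    image = original_image.clone().detach().requires_grad_(True)
    combined = torch.cat([f_robust(image), f_nonrobust(image)], dim=-1)
    loss = F.cross_entropy(classifier(combined), label)  # e.g. the second loss function
    grad = torch.autograd.grad(loss, image)[0]           # the first gradient
    adv = (image + epsilon * grad.sign()).clamp(0.0, 1.0).detach()
    return adv  # adversarial image; its first (correct) label category remains `label`
```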
  • alternatively, the generating module 904 is specifically configured to: generate a second gradient according to the function value of the third loss function, perform perturbation processing on the original image according to the second gradient to generate an adversarial image, and determine the third label category as the first label category.
  • the generation module 904 perturbs the original image according to the similarity between the fourth classification category, output by the classification network according to the second robust representation, and the third annotation category, so that the perturbation processing is more targeted at the first feature extraction network, which is beneficial to improving the first feature extraction network's ability to extract robust representations.
  • alternatively, the generating module 904 is specifically configured to: generate a third gradient according to the function value of the fourth loss function, perform perturbation processing on the original image according to the third gradient to generate an adversarial image, and determine the third label category as the first label category.
  • the generation module 904 perturbs the original image according to the similarity between the fifth classification category, output by the classification network according to the second non-robust representation, and the third annotation category, so that the perturbation processing is more targeted at the second feature extraction network, which helps to improve the second feature extraction network's ability to extract non-robust representations.
  • the device 900 further includes: a combining module 903, configured to combine the first robust representation and the first non-robust representation to obtain a combined second representation.
  • the input module 901 is also used to input the combined second representation into the classification network to obtain the sixth classification category output by the classification network.
  • the input module 901 is specifically configured to, when the sixth classification category is different from the first label category, input the first robust representation into the classification network to obtain the first classification category output by the classification network, and input the first non-robust representation into the classification network to obtain the second classification category output by the classification network.
  • if the sixth classification category is the same as the first annotation category, it proves that the disturbance applied to the image was too slight and the image can be processed in the same way as a natural image; since the purpose of training here is to enhance the ability of the first feature extraction network and the second feature extraction network to separate robust and non-robust representations from images with larger disturbances, subsequent training operations are performed only when the sixth classification category is different from the first label category, which improves the efficiency of the training process.
  • the device 900 further includes: a determining module 905, configured to determine the sixth classification category as the second label category when the sixth classification category is different from the first label category .
  • a method for obtaining the second label category is thus provided that is simple to operate, requires no additional steps, and saves computer resources (a sketch follows this list item).
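Determining the second label category from the sixth classification category, and skipping samples whose disturbance is too slight, could be sketched as follows under the same assumed modules and for a single sample.

```python
import torch

def second_label_or_skip(adv_image: torch.Tensor, y_correct: torch.Tensor):
    combined = torch.cat([f_robust(adv_image), f_nonrobust(adv_image)], dim=-1)
    sixth = classifier(combined).argmax(dim=-1)  # sixth classification category
    if torch.equal(sixth, y_correct):
        return None   # disturbance too slight: skip subsequent training operations
    return sixth      # reused directly as the second (wrong) label category
```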
  • the first feature extraction network is a convolutional neural network or a residual neural network, and the second feature extraction network is a convolutional neural network or a residual neural network (a sketch of a residual backbone follows this list item).
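For instance, a residual neural network backbone could serve as either feature extraction network; the sketch below assumes a recent torchvision and an arbitrary ResNet-18, purely as an illustration.

```python
import torch.nn as nn
import torchvision.models as models

def make_feature_extractor() -> nn.Module:
    backbone = models.resnet18(weights=None)  # any CNN/ResNet would do
    backbone.fc = nn.Identity()               # drop the classifier, keep 512-d features
    return backbone

f_robust = make_feature_extractor()      # first feature extraction network
f_nonrobust = make_feature_extractor()   # second feature extraction network
```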
  • FIG. 11 is a schematic structural diagram of an image processing network provided by an embodiment of the application.
  • the image processing network 1100 includes a first feature extraction network 1101, a second feature extraction network 1102, and a feature processing network 1103.
  • the first feature extraction network 1101 is configured to receive an input first image and generate a robust representation corresponding to the first image; a robust representation refers to features that are not sensitive to disturbances.
  • the second feature extraction network 1102 is configured to receive the input first image, and generate a non-robust representation corresponding to the first image.
  • the non-robust representation refers to a feature that is sensitive to disturbances.
  • the feature processing network 1103 is used to output the first processing result corresponding to the first image according to the robust representation and the non-robust representation.
  • the first feature extraction network 1101 and the second feature extraction network 1102 are respectively used to extract the robust representation and the non-robust representation in the input image, which not only avoids the mixing of the two that would reduce robustness, but also retains both the robust representation and the non-robust representation in the input image, thereby avoiding a reduction in accuracy and improving the robustness and accuracy of the neural network at the same time.
  • the feature processing network 1103 is specifically used to: in the first case, combine the robust representation and the non-robust representation and output the first processing result corresponding to the first image according to the combined representation; or, in the second case, output the first processing result corresponding to the first image according to the robust representation, where the first case and the second case are different cases; or, output the first processing result corresponding to the first image according to the non-robust representation.
  • the image processing network 1100 includes both a robust path and a standard path, and the user can flexibly choose which path to use according to the actual situation, which expands the application scenarios of this solution and improves the implementation flexibility of this solution.
  • the feature processing network 1103 is specifically used to: perform a classification operation according to the combined representation and output the classification category corresponding to the first image; or perform a classification operation according to the robust representation and output the classification category corresponding to the first image; or perform a classification operation according to the non-robust representation and output the classification category corresponding to the first image.
  • the provided image processing network 1100 falls into the specific application scenario of image classification, which improves the degree of integration with the application scenario.
  • the first processing result indicates that the first image is an original image, or the first processing result indicates that the first image is a disturbed image.
  • not only can the feature information extracted by the first feature extraction network 1101 and the second feature extraction network 1102 be used to obtain the processing result corresponding to an object in the image, but the processing result corresponding to the entire image can also be obtained, that is, it can be used to determine whether the image is the original image or a disturbed image, which expands the application scenarios of this solution.
  • the feature processing network 1103 is specifically used to: determine the first classification category corresponding to the first image according to the robust representation; determine the second classification corresponding to the first image according to the non-robust representation Category; in the case that the first classification category is consistent with the second classification category, the output first processing result indicates that the first image is the original image; in the case where the first classification category is inconsistent with the second classification category, the output first The processing result indicates that the first image is a disturbed image.
  • the feature processing network 1103 determines whether the first image is the original image or an adversarial image by judging whether the two classification categories are consistent; the method is simple and highly operable.
  • the feature processing network 1103 is specifically used to combine the robust representation and the non-robust representation and perform a detection operation according to the combined representation to output the detection result corresponding to the first image, where the first processing result includes the detection result.
  • another implementation manner of determining whether the first image is the original image or the confrontation image is provided, which enhances the implementation flexibility of the solution.
  • the image processing network 1100 is one or more of the following: an image classification network, an image recognition network, an image segmentation network, or an image detection network.
  • the feature processing network 1103 includes a perceptron.
  • the first feature extraction network 1101 is a convolutional neural network or a residual neural network
  • the second feature extraction network 1102 is a convolutional neural network or a residual neural network.
  • FIG. 12 is a schematic structural diagram of the execution device provided by the embodiment of the application; the execution device may specifically take the form of a self-driving vehicle, a smart home appliance, a chip, or another form, which is not limited here.
  • the image processing network 1100 described in the embodiment corresponding to FIG. 11 may be deployed on the execution device 1200 to implement the functions of the execution device in the embodiment corresponding to FIG. 6 to FIG. 8.
  • the execution device 1200 includes: a receiver 1201, a transmitter 1202, a processor 1203, and a memory 1204 (the number of processors 1203 in the execution device 1200 may be one or more; one processor is taken as an example in FIG. 12), where
  • the processor 1203 may include an application processor 12031 and a communication processor 12032.
  • the receiver 1201, the transmitter 1202, the processor 1203, and the memory 1204 may be connected by a bus or other methods.
  • the memory 1204 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1203. A part of the memory 1204 may also include a non-volatile random access memory (NVRAM).
  • the memory 1204 stores operating instructions, executable modules, or data structures, or a subset or an extended set thereof.
  • the operating instructions may include various operating instructions for implementing various operations.
  • the processor 1203 controls the operation of the execution device.
  • in a specific application, the various components of the execution device are coupled together through a bus system, where the bus system may include a power bus, a control bus, a status signal bus, and the like in addition to a data bus; for clarity, the various buses are referred to as the bus system in the figure.
  • the methods disclosed in the above embodiments of the present application may be applied to the processor 1203 or implemented by the processor 1203.
  • the processor 1203 may be an integrated circuit chip with signal processing capability. In the implementation process, the steps of the foregoing method can be completed by an integrated logic circuit of hardware in the processor 1203 or instructions in the form of software.
  • the above-mentioned processor 1203 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the processor 1203 can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, or electrically erasable programmable memory, registers.
  • the storage medium is located in the memory 1204, and the processor 1203 reads the information in the memory 1204, and completes the steps of the foregoing method in combination with its hardware.
  • the receiver 1201 can be used to receive input digital or character information and to generate signal inputs related to the relevant settings and function control of the execution device.
  • the transmitter 1202 can be used to output digital or character information through the first interface.
  • the transmitter 1202 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group.
  • the transmitter 1202 can also include a display device such as a display screen.
  • the processor 1203 is configured to execute the image processing method executed by the execution device in the embodiment corresponding to FIG. 6 to FIG. 8.
  • the application processor 12031 is configured to perform the following steps: input the first image into the first feature extraction network to obtain a robust representation corresponding to the first image generated by the first feature extraction network, where a robust representation refers to features that are insensitive to disturbance; input the first image into the second feature extraction network to obtain a non-robust representation corresponding to the first image generated by the second feature extraction network, where a non-robust representation refers to features that are sensitive to disturbance; and output, through the feature processing network, a first processing result corresponding to the first image according to the robust representation and the non-robust representation, where the first feature extraction network, the second feature extraction network, and the feature processing network belong to the same image processing network.
  • the application processor 12031 is also used to execute the other steps executed by the execution device in the method embodiments corresponding to FIG. 6 to FIG. 8; for details, refer to the descriptions in the respective method embodiments, which will not be repeated here.
  • the embodiment of the present application also provides a training device. Please refer to FIG. 13, which is a schematic structural diagram of the training device provided by the embodiment of the present application.
  • the training device 900 described in the embodiment corresponding to FIG. 9 and FIG. 10 may be deployed on the training device 1300 to realize the function of the training device in the embodiment corresponding to FIG. 3 and FIG. 5.
  • the training device 1300 is implemented by one or more servers and may vary considerably depending on configuration or performance; it may include one or more central processing units (CPU) 1322 (for example, one or more processors), a memory 1332, and one or more storage media 1330 (for example, one or more mass storage devices) storing application programs 1342 or data 1344.
  • the memory 1332 and the storage medium 1330 may be short-term storage or persistent storage.
  • the program stored in the storage medium 1330 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the training device.
  • the central processing unit 1322 may be configured to communicate with the storage medium 1330, and execute a series of instruction operations in the storage medium 1330 on the training device 1300.
  • the training device 1300 may also include one or more power supplies 1326, one or more wired or wireless network interfaces 1350, one or more input and output interfaces 1358, and/or one or more operating systems 1341, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
  • the central processing unit 1322 is configured to execute the neural network training method executed by the training device in the embodiment corresponding to FIG. 3; specifically, the central processing unit 1322 is configured to input the adversarial image into the first feature extraction network and the second feature extraction network to obtain the first robust representation generated by the first feature extraction network and the first non-robust representation generated by the second feature extraction network, where the adversarial image is an image that has undergone disturbance processing, a robust representation refers to features that are not sensitive to disturbance, and a non-robust representation refers to features that are sensitive to disturbance.
  • the first robust representation is input into the classification network to obtain the first classification category output by the classification network, and the first non-robust representation is input into the classification network to obtain the second classification category output by the classification network; the first loss function is used to represent the similarity between the first classification category and the first label category, and the similarity between the second classification category and the second label category, where the first label category is the correct category corresponding to the adversarial image and the second label category is the wrong category corresponding to the adversarial image.
  • the central processing unit 1322 is also used to execute the other steps executed by the training device in the embodiment corresponding to FIG. 3; for details, refer to the descriptions in the corresponding method embodiments, which will not be repeated here.
  • the embodiments of the present application also provide a computer-readable storage medium that stores a program; when the program runs on a computer, the computer executes the steps performed by the training device in the methods described in the embodiments shown in FIGS. 3 to 5.
  • the embodiment of the present application also provides a computer-readable storage medium that stores a program; when the program runs on a computer, the computer executes the steps performed by the execution device in the methods described in the foregoing embodiments shown in FIGS. 6 to 8.
  • the embodiment of the present application also provides a computer program product which, when run on a computer, causes the computer to execute the steps performed by the training device in the methods described in the embodiments shown in FIGS. 3 to 5, or causes the computer to execute the steps performed by the execution device in the methods described in the foregoing embodiments shown in FIGS. 6 to 8.
  • An embodiment of the present application also provides a circuit system, the circuit system includes a processing circuit configured to perform the steps performed by the training device in the method described in the embodiments shown in FIGS. 3 to 5, or The processing circuit is configured to execute the steps performed by the execution device in the method described in the embodiments shown in FIG. 6 to FIG. 8.
  • the training device or execution device of the neural network provided in the embodiment of the present application may specifically be a chip.
  • the chip includes a processing unit and a communication unit; the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, pins, or circuits.
  • the processing unit can execute the computer execution instructions stored in the storage unit, so that the chip in the training device executes the neural network training method described in the embodiments shown in FIGS. 3 to 5, or so that the chip in the execution device executes the above The image processing method described in the embodiments shown in FIGS. 6 to 8.
  • the storage unit is a storage unit in the chip, such as a register, a cache, etc.
  • the storage unit may also be a storage unit located outside the chip in the wireless access device, such as Read-only memory (ROM) or other types of static storage devices that can store static information and instructions, random access memory (RAM), etc.
  • FIG. 14 is a schematic diagram of a structure of a chip provided by an embodiment of the application.
  • the chip may be expressed as a neural network processor (NPU) 140, which is mounted as a coprocessor on the host CPU (Host CPU), and the Host CPU assigns tasks to it.
  • the core part of the NPU is the arithmetic circuit 1403, and the controller 1404 controls the arithmetic circuit 1403 to extract matrix data from the memory and perform multiplication operations.
  • the arithmetic circuit 1403 includes multiple processing units (Process Engine, PE). In some implementations, the arithmetic circuit 1403 is a two-dimensional systolic array. The arithmetic circuit 1403 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1403 is a general-purpose matrix processor.
  • the arithmetic circuit fetches the corresponding data of matrix B from the weight memory 1402 and caches it on each PE in the arithmetic circuit.
  • the arithmetic circuit fetches the matrix A data and matrix B from the input memory 1401 to perform matrix operations, and the partial result or final result of the obtained matrix is stored in an accumulator 1408.
  • the unified memory 1406 is used to store input data and output data.
  • the weight data is transferred directly to the weight memory 1402 through the direct memory access controller (DMAC) 1405.
  • the input data is also transferred to the unified memory 1406 through the DMAC.
  • the BIU is the Bus Interface Unit, that is, the bus interface unit 1410, which is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer (IFB) 1409.
  • the bus interface unit 1410 (Bus Interface Unit, BIU for short) is used for the instruction fetch memory 1409 to obtain instructions from the external memory, and is also used for the storage unit access controller 1405 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1406 or to transfer the weight data to the weight memory 1402 or to transfer the input data to the input memory 1401.
  • the vector calculation unit 1407 includes multiple arithmetic processing units, and if necessary, further processes the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on. It is mainly used in the calculation of non-convolutional/fully connected layer networks in neural networks, such as Batch Normalization, pixel-level summation, and upsampling of feature planes.
  • the vector calculation unit 1407 can store the processed output vector to the unified memory 1406.
  • in some implementations, the vector calculation unit 1407 may apply a linear function and/or a non-linear function to the output of the arithmetic circuit 1403, for example linearly interpolating the feature planes extracted by the convolutional layers, or applying an activation function to a vector of accumulated values to generate activation values.
  • the vector calculation unit 1407 generates normalized values, pixel-level summed values, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 1403, for example for use in a subsequent layer in a neural network.
  • the instruction fetch buffer 1409 connected to the controller 1404 is used to store instructions used by the controller 1404;
  • the unified memory 1406, the input memory 1401, the weight memory 1402, and the fetch memory 1409 are all On-Chip memories.
  • the external memory is private to the NPU hardware architecture.
  • the calculation of each layer in the recurrent neural network can be executed by the arithmetic circuit 1403 or the vector calculation unit 1407.
  • the processor mentioned in any of the foregoing may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control program execution of the method of the first aspect.
  • the device embodiments described above are only illustrative; the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units.
  • the physical unit can be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the connection relationship between the modules indicates that they have a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
  • this application can be implemented by means of software plus the necessary general-purpose hardware; it can also be implemented by dedicated hardware, including dedicated integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and so on.
  • generally, all functions completed by computer programs can easily be implemented with corresponding hardware, and the specific hardware structures used to achieve the same function can be diverse, for example analog circuits, digital circuits, or special-purpose circuits; however, for this application, a software program implementation is the better implementation in most cases.
  • based on this understanding, the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a readable storage medium, such as a floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disk, and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments of this application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) manner.
  • the computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or a data center that integrates one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).

Abstract

A training method for a neural network for image processing and a related device, relating to an image processing technology in the field of artificial intelligence. The method comprises: separately inputting an adversarial image into a first feature extraction network and a second feature extraction network to obtain robust representation and non-robust representation; separately inputting the robust representation and the non-robust representation into a classification network to obtain a first classification category and a second classification category output by the classification network; and performing iterative training according to a first loss function until a convergence condition is met. The first loss function is used for representing the similarity between the first category and a correct category corresponding to the adversarial image, and the similarity between the second classification category and an error category corresponding to the adversarial image. The robustness reduction caused by mixing of the robust representation and the non-robust representation is avoided, the robust representation and the non-robust representation can also be reserved at the same time, and thus the accuracy reduction is avoided, and robustness and accuracy of the neural network are improved at the same time.

Description

A neural network and related equipment for image processing
This application claims priority to the Chinese patent application No. 202010362629.6, filed with the Chinese Patent Office on April 30, 2020 and entitled "A neural network and related equipment for image processing", the entire content of which is incorporated in this application by reference.
Technical field
This application relates to the field of artificial intelligence, and in particular to a neural network and related equipment for image processing.
Background technique
Artificial Intelligence (AI) is the use of computers or computer-controlled machines to simulate, extend and expand human intelligence. Artificial intelligence includes studying the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. At present, image processing based on deep learning neural networks is a common application of artificial intelligence.
Although today's neural networks already have extremely high recognition accuracy, researchers have found that an extremely small disturbance of the input original image can confuse a neural network with high recognition accuracy and make its recognition accuracy drop sharply. We call such a disturbed image an adversarial image.
In order to improve the robustness of neural networks, adversarial training has been proposed, that is, adding the adversarial image and the correct label corresponding to the adversarial image to the training data set to train the neural network, thereby improving the robustness of the neural network to adversarial images. Robustness means that the neural network can still accurately recognize the adversarial image.
However, studies have found that as a neural network's robustness to adversarial images improves, its recognition accuracy on original images keeps decreasing. A solution that improves robustness and recognition accuracy at the same time is therefore needed.
Summary of the invention
The embodiments of this application provide a neural network for image processing and a related device. A trained first feature extraction network and a trained second feature extraction network can respectively extract the robust representation and the non-robust representation in an input image. This avoids the loss of robustness caused by mixing the two, while retaining both the robust representation and the non-robust representation of the input image, thereby avoiding a loss of accuracy and improving the robustness and accuracy of the neural network at the same time.
To solve the above technical problem, the embodiments of this application provide the following technical solutions:
In a first aspect, an embodiment of this application provides a neural network training method, which can be used in the image processing field within the field of artificial intelligence. A training device separately inputs an adversarial image into a first feature extraction network and a second feature extraction network, to obtain a first robust representation generated by the first feature extraction network and a first non-robust representation generated by the second feature extraction network. The adversarial image is an image that has undergone perturbation processing. Perturbation processing means adjusting, on the basis of an original image, the pixel values of pixels in the original image to obtain a perturbed image; for the human eye, it is usually difficult to tell an adversarial image apart from the original image. Both the first robust representation and the first non-robust representation include feature information extracted from the adversarial image. A robust representation refers to features that are insensitive to perturbation: the classification category corresponding to a robust representation extracted from an original image is consistent with the classification category corresponding to a robust representation extracted from the adversarial image corresponding to that original image. A non-robust representation refers to features that are sensitive to perturbation: the classification category corresponding to a non-robust representation extracted from an original image is inconsistent with the classification category corresponding to a non-robust representation extracted from the adversarial image corresponding to that original image. In other words, the feature information included in a robust representation is similar to the features used by the human eye; by contrast, the feature information included in a non-robust representation cannot be understood by the human eye, to which a non-robust representation is noise. The training device inputs the first robust representation into a classification network to obtain a first classification category output by the classification network, the first classification category being a classification category of an object in the adversarial image; it inputs the first non-robust representation into the classification network to obtain a second classification category output by the classification network, the second classification category likewise being a classification category of an object in the adversarial image.
The training device iteratively trains the first feature extraction network and the second feature extraction network according to a first loss function until a convergence condition is met, and outputs the trained first feature extraction network and the trained second feature extraction network. The first loss function is used to represent the similarity between the first classification category and a first label category, and the similarity between the second classification category and a second label category; the first loss function may specifically be a cross-entropy loss function or a max-margin loss function. The first label category is the correct category corresponding to the adversarial image, and the second label category is the wrong category corresponding to the adversarial image; the first label category includes the correct classification of one or more objects in the adversarial image, the second label category includes an incorrect classification of one or more objects in the adversarial image, and both are used as supervision data in the training phase. The convergence condition may be a convergence condition of the first loss function, or may be that the number of training iterations reaches a preset number.
In this implementation, the neural network includes the first feature extraction network and the second feature extraction network. The adversarial image is separately input into the first feature extraction network and the second feature extraction network, to obtain the first robust representation generated by the first feature extraction network and the first non-robust representation generated by the second feature extraction network. The first robust representation is then input into the classification network to obtain the first classification category output by the classification network, and the first non-robust representation is input into the classification network to obtain the second classification category output by the classification network. The first loss function is used to iteratively train the first feature extraction network and the second feature extraction network; its purpose is to pull the first classification category toward the correct category of the adversarial image and to pull the second classification category toward the wrong category of the adversarial image. That is, the purpose of training is to make the first feature extraction network extract the robust representation in the input image and the second feature extraction network extract the non-robust representation in the input image. During research, the technicians found that adversarial training makes a neural network extract only the robust representation from the input image and discard the non-robust representation, which lowers the accuracy of the neural network when processing original images. In the embodiments of this application, the robust representation and the non-robust representation in the input image are extracted by the first feature extraction network and the second feature extraction network respectively, which both avoids the loss of robustness caused by mixing the two and retains the robust representation and the non-robust representation of the input image at the same time, thereby avoiding a loss of accuracy and improving the robustness and accuracy of the neural network simultaneously.
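The first-aspect training step can be pictured with a short sketch. The following is a minimal PyTorch-style illustration, not the patent's mandated implementation; f_rob, f_nonrob, classifier, and the two label tensors are hypothetical stand-ins for the first feature extraction network, the second feature extraction network, the classification network, and the first and second label categories.

```python
import torch
import torch.nn.functional as F

def first_loss_step(f_rob, f_nonrob, classifier, adv_image, correct_label, wrong_label):
    z_rob = f_rob(adv_image)        # first robust representation
    z_nonrob = f_nonrob(adv_image)  # first non-robust representation

    # In the padded variant described later, each representation would be
    # concatenated with a constant tensor so that the classifier input
    # format stays fixed; the plain form is used here for brevity.
    logits_rob = classifier(z_rob)        # yields the first classification category
    logits_nonrob = classifier(z_nonrob)  # yields the second classification category

    # First loss function: pull the first classification category toward the
    # correct (first label) category and the second classification category
    # toward the wrong (second label) category. Cross-entropy is one of the
    # two options named in the text; max-margin is the other.
    loss = F.cross_entropy(logits_rob, correct_label) \
         + F.cross_entropy(logits_nonrob, wrong_label)
    return loss
```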
In a possible implementation of the first aspect, the method further includes: the training device separately inputs the original image into the first feature extraction network and the second feature extraction network, to obtain a second robust representation generated by the first feature extraction network and a second non-robust representation generated by the second feature extraction network. The original image is an image that has not undergone perturbation processing, or may be a directly captured image. The training device combines the second robust representation and the second non-robust representation to obtain a combined first representation, and inputs the combined first representation into the classification network, so that the classification network performs a classification operation according to the combined first representation to obtain a third classification category output by the classification network. The combination method includes one or more of the following: concatenation, addition, fusion, and multiplication. That the training device iteratively trains the first feature extraction network and the second feature extraction network according to the first loss function until the convergence condition is met may include: the training device iteratively trains the first feature extraction network and the second feature extraction network according to the first loss function and a second loss function until the convergence condition is met. The second loss function is used to represent the similarity between the third classification category and a third label category, and may specifically be a cross-entropy loss function or a max-margin loss function. The third label category is the correct category corresponding to the original image, that is, the correct classification of one or more objects in the original image; it may include one or more classification categories and is used as supervision data in the training phase.
In this implementation, during training, not only are adversarial images used to train the feature extraction capabilities of the first feature extraction network and the second feature extraction network, but natural images are also used, to further improve the accuracy of the trained first feature extraction network and the trained second feature extraction network when processing natural images.
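As an illustration of this combined path on the original image, the following hedged sketch (same hypothetical names as above) uses concatenation, one of the four combination methods named in the text:

```python
def second_loss_step(f_rob, f_nonrob, classifier, orig_image, correct_label):
    z_rob = f_rob(orig_image)        # second robust representation
    z_nonrob = f_nonrob(orig_image)  # second non-robust representation

    # Combined first representation; concatenation is chosen here purely for
    # illustration (addition, fusion, or multiplication are also possible).
    z_combined = torch.cat([z_rob, z_nonrob], dim=-1)

    logits = classifier(z_combined)                # third classification category
    return F.cross_entropy(logits, correct_label)  # second loss function
```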
In a possible implementation of the first aspect, the method may further include: the training device inputs the original image into the first feature extraction network to obtain the second robust representation generated by the first feature extraction network. The training device then inputs the second robust representation into the classification network, so that the classification network performs a classification operation according to the second robust representation to obtain a fourth classification category output by the classification network, the fourth classification category including a category of one or more objects in the original image. Specifically, to make the input format of the second robust representation consistent with the format of the combined first representation, the training device may combine the second robust representation with a first constant tensor to obtain a combined third representation, and input the combined third representation into the classification network, so that the classification network performs the classification operation according to the combined third representation to obtain the fourth classification category output by the classification network. That the training device iteratively trains the first feature extraction network and the second feature extraction network according to the first loss function until the convergence condition is met may include: the training device iteratively trains the first feature extraction network and the second feature extraction network according to the first loss function and a third loss function until the convergence condition is met. The third loss function is used to represent the similarity between the fourth classification category and the third label category, and may specifically be a cross-entropy loss function or a max-margin loss function. The third label category is the correct category corresponding to the original image. In this implementation, not only are adversarial images used to train the feature extraction capabilities of the first feature extraction network and the second feature extraction network, but natural images are also used to train the first feature extraction network's ability to extract robust representations, to further improve the accuracy of the trained first feature extraction network.
In a possible implementation of the first aspect, the length of the first constant tensor is the same as the length of the second non-robust representation, and the order of the second robust representation and the first constant tensor may correspond to the order of the second robust representation and the second non-robust representation: if, in the combined first representation, the second robust representation comes first and the second non-robust representation comes second, then in the combined third representation the second robust representation comes first and the first constant tensor comes second; if, in the combined first representation, the second non-robust representation comes first and the second robust representation comes second, then in the combined third representation the first constant tensor comes first and the second robust representation comes second.
In a possible implementation of the first aspect, the method may further include: the training device inputs the original image into the second feature extraction network to obtain the second non-robust representation generated by the second feature extraction network. The training device then inputs the second non-robust representation into the classification network, so that the classification network performs a classification operation according to the second non-robust representation to obtain a fifth classification category output by the classification network, the fifth classification category including a category of one or more objects in the original image. Specifically, to make the input format of the second non-robust representation consistent with the format of the combined first representation, the training device may combine the second non-robust representation with a second constant tensor to obtain a combined fourth representation, and input the combined fourth representation into the classification network, so that the classification network performs the classification operation according to the combined fourth representation to obtain the fifth classification category output by the classification network. That the training device iteratively trains the first feature extraction network and the second feature extraction network according to the first loss function until the convergence condition is met may include: the training device iteratively trains the first feature extraction network and the second feature extraction network according to the first loss function and a fourth loss function until the convergence condition is met. The fourth loss function is used to represent the similarity between the fifth classification category and the third label category, and may specifically be a cross-entropy loss function or a max-margin loss function. The third label category is the correct category corresponding to the original image.
In this implementation, not only are adversarial images used to train the feature extraction capabilities of the first feature extraction network and the second feature extraction network, but natural images are also used to train the second feature extraction network's ability to extract non-robust representations, to further improve the accuracy of the trained second feature extraction network.
In a possible implementation of the first aspect, the length of the second constant tensor is the same as the length of the second robust representation, and the order of the second non-robust representation and the second constant tensor may correspond to the order of the second robust representation and the second non-robust representation: if, in the combined first representation, the second robust representation comes first and the second non-robust representation comes second, then in the combined fourth representation the second constant tensor comes first and the second non-robust representation comes second; if, in the combined first representation, the second non-robust representation comes first and the second robust representation comes second, then in the combined fourth representation the second non-robust representation comes first and the second constant tensor comes second.
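A possible realization of the two constant-tensor paddings, assuming the combined first representation places the robust part first; zero is used as the constant purely as an assumption, since the text only requires a constant tensor of matching length:

```python
def combined_third(z_rob, z_nonrob):
    # Robust representation followed by a first constant tensor whose length
    # equals that of the non-robust representation.
    return torch.cat([z_rob, torch.zeros_like(z_nonrob)], dim=-1)

def combined_fourth(z_rob, z_nonrob):
    # Second constant tensor in the robust slot, non-robust representation
    # kept in its original slot.
    return torch.cat([torch.zeros_like(z_rob), z_nonrob], dim=-1)
```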
In a possible implementation of the first aspect, the training device separately inputs the original image into the first feature extraction network and the second feature extraction network, to obtain the second robust representation generated by the first feature extraction network and the second non-robust representation generated by the second feature extraction network. The training device combines the second robust representation and the second non-robust representation to obtain the combined first representation, and inputs the combined first representation into the classification network, so that the classification network performs a classification operation according to the combined first representation to obtain the third classification category output by the classification network. The training device inputs the second robust representation into the classification network, so that the classification network performs a classification operation according to the second robust representation to obtain the fourth classification category output by the classification network. The training device inputs the second non-robust representation into the classification network, so that the classification network performs a classification operation according to the second non-robust representation to obtain the fifth classification category output by the classification network. That the training device iteratively trains the first feature extraction network and the second feature extraction network according to the first loss function until the convergence condition is met may include: the training device iteratively trains the first feature extraction network and the second feature extraction network according to the first loss function and a fifth loss function until the convergence condition is met, where the fifth loss function is used to represent the similarity between the third classification category and the third label category, the similarity between the fourth classification category and the third label category, and the similarity between the fifth classification category and the third label category. The fifth loss function may specifically be a cross-entropy loss function or a max-margin loss function, and the third label category is the correct category corresponding to the original image.
In this implementation, while the ability of the first feature extraction network and the second feature extraction network to process adversarial images is improved, their ability to process natural images is improved as well. In other words, whether the input is a natural image or an adversarial image, the trained first feature extraction network and the trained second feature extraction network can accurately extract the robust representation and the non-robust representation, which expands the application scenarios of this solution.
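Under this fifth-loss variant, one plausible way to assemble the overall objective from the sketches above is shown below; the weighting coefficients are an assumption, since the text does not specify how the terms are balanced:

```python
def fifth_loss_total(loss_adv, loss_combined, loss_rob_only, loss_nonrob_only,
                     w1=1.0, w2=1.0, w3=1.0):
    # loss_adv comes from first_loss_step on the adversarial image; the three
    # original-image terms score the combined, robust-only (combined_third)
    # and non-robust-only (combined_fourth) paths against the third label
    # category. The weights w1..w3 are a hypothetical balancing choice.
    return loss_adv + w1 * loss_combined + w2 * loss_rob_only + w3 * loss_nonrob_only
```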
In a possible implementation of the first aspect, the method may further include: the training device generates a first gradient according to the function value of the second loss function, performs perturbation processing on the original image according to the first gradient to generate the adversarial image, and determines the third label category as the first label category. Specifically, the training device may generate the function value of the second loss function according to the third classification category and the third label category, generate the first gradient according to that function value, pass the first gradient through a preset function, multiply the result by a preset coefficient to obtain a perturbation, and superimpose the obtained perturbation on the original image to generate the adversarial image. In this implementation, the first gradient is generated according to the similarity between the third classification category and the third label category, and the original image is perturbed according to the first gradient, which makes the perturbation processing more targeted, helps accelerate the training process of the first feature extraction network and the second feature extraction network, and improves the efficiency of the training process.
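The gradient-based perturbation can be illustrated as follows. The text only says the gradient is passed through a preset function and multiplied by a preset coefficient; taking the sign function and a fixed epsilon, in the style of the fast gradient sign method, is an assumption rather than the patent's prescribed choice:

```python
def perturb(orig_image, loss_value, epsilon=8 / 255):
    # orig_image must have requires_grad=True when loss_value is computed.
    grad = torch.autograd.grad(loss_value, orig_image)[0]  # first gradient
    perturbation = epsilon * grad.sign()  # preset function * preset coefficient
    adv_image = (orig_image + perturbation).clamp(0.0, 1.0)
    return adv_image.detach()             # adversarial image
```

The second and third gradients described next follow the same pattern, with the loss value taken from the third or fourth loss function instead of the second.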
In a possible implementation of the first aspect, the method may further include: the training device generates a second gradient according to the function value of the third loss function, performs perturbation processing on the original image according to the second gradient to generate the adversarial image, and determines the third label category as the first label category. Specifically, the training device may generate the function value of the third loss function according to the fourth classification category and the third label category, generate the second gradient according to that function value, pass the second gradient through a preset function, multiply the result by a preset coefficient to obtain a perturbation, and superimpose the obtained perturbation on the original image to generate the adversarial image. In this implementation, the original image is perturbed according to the similarity between the fourth classification category, which the classification network outputs according to the second robust representation, and the third label category, which makes the perturbation processing more targeted to the first feature extraction network and helps improve the first feature extraction network's ability to extract robust representations.
In a possible implementation of the first aspect, the method may further include: the training device generates a third gradient according to the function value of the fourth loss function, performs perturbation processing on the original image according to the third gradient to generate the adversarial image, and determines the third label category as the first label category. Specifically, the training device may generate the function value of the fourth loss function according to the fifth classification category and the third label category, generate the third gradient according to that function value, pass the third gradient through a preset function, multiply the result by a preset coefficient to obtain a perturbation, and superimpose the obtained perturbation on the original image to generate the adversarial image. In this implementation, the original image is perturbed according to the similarity between the fifth classification category, which the classification network outputs according to the second non-robust representation, and the third label category, which makes the perturbation processing more targeted to the second feature extraction network and helps improve the second feature extraction network's ability to extract non-robust representations.
In a possible implementation of the first aspect, the method may further include: the training device combines the first robust representation and the first non-robust representation to obtain a combined second representation, and inputs the combined second representation into the classification network to obtain a sixth classification category output by the classification network, the sixth classification category being a category of an object in the adversarial image. That the training device inputs the first robust representation into the classification network to obtain the first classification category output by the classification network, and inputs the first non-robust representation into the classification network to obtain the second classification category output by the classification network, may include: only when the sixth classification category differs from the first label category does the training device input the first robust representation into the classification network to obtain the first classification category and input the first non-robust representation into the classification network to obtain the second classification category. In this implementation, if the sixth classification category is the same as the first label category, the perturbation of the perturbed image is too slight, and for the first feature extraction network and the second feature extraction network, processing it differs little from processing a natural image; since the purpose of training here is to strengthen the ability of the first feature extraction network and the second feature extraction network to separate robust and non-robust representations from heavily perturbed images, the subsequent training operations are performed only when the sixth classification category differs from the first label category, which improves the efficiency of the training process.
In a possible implementation of the first aspect, the method may further include: when the sixth classification category differs from the first label category, the training device determines the sixth classification category as the second label category. This implementation provides a way of obtaining the second label category that is simple to operate, requires no additional steps, and saves computing resources.
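A sketch of the sample filter and label reuse described in the two preceding implementations, with the same hypothetical names as before; the combined second representation again uses concatenation as an assumed combination method:

```python
def sixth_category_filter(classifier, z_rob_adv, z_nonrob_adv, correct_label):
    # Sixth classification category from the combined second representation.
    logits = classifier(torch.cat([z_rob_adv, z_nonrob_adv], dim=-1))
    sixth = logits.argmax(dim=-1)

    # Continue the training step only where the sixth category differs from
    # the correct (first label) category; the sixth category itself can then
    # serve as the wrong (second label) category.
    keep = sixth != correct_label
    return keep, sixth
```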
In a possible implementation of the first aspect, the first feature extraction network is a convolutional neural network or a residual neural network, and the second feature extraction network is a convolutional neural network or a residual neural network. This implementation provides two specific implementations of the first feature extraction network and the second feature extraction network, which improves the implementation flexibility of this solution.
In a second aspect, an embodiment of this application provides an image processing network, which can be used in the image processing field within the field of artificial intelligence. The image processing network includes a first feature extraction network, a second feature extraction network, and a feature processing network. The first feature extraction network is configured to receive an input first image and generate a robust representation corresponding to the first image, a robust representation being features that are insensitive to perturbation. The second feature extraction network is configured to receive the input first image and generate a non-robust representation corresponding to the first image, a non-robust representation being features that are sensitive to perturbation. The feature processing network is configured to obtain the robust representation and the non-robust representation, and to output a first processing result corresponding to the first image. The specific implementation of the feature processing network and the specific form of the first processing result both depend on the function of the overall image processing network. If the function of the image processing network is image classification, the feature processing network is a classification network, and the first processing result indicates the classification category of the whole image. If the function of the image processing network is image recognition, the feature processing network may be a recognition network, and the first processing result indicates content recognized from the image, for example, text content in the image. If the function of the image processing network is image segmentation, the feature processing network may include a classification network configured to generate a classification category for each pixel in the image; the image is then segmented using the classification category of each pixel, and the first processing result is the segmented image.
In this implementation, the robust representation and the non-robust representation in the input image are extracted by the first feature extraction network and the second feature extraction network respectively, which both avoids the loss of robustness caused by mixing the two and retains the robust representation and the non-robust representation of the input image at the same time, thereby avoiding a loss of accuracy and improving the robustness and accuracy of the neural network simultaneously.
In a possible implementation of the second aspect, the feature processing network may be specifically configured to: in a first case, combine the robust representation and the non-robust representation, and output the first processing result corresponding to the first image according to the combined representation; and in a second case, output the first processing result corresponding to the first image according to the robust representation, the first case and the second case being different cases. The first case may be a case in which high accuracy of the first processing result is required, and the second case may be a case in which high robustness of the first processing result is required, or a case in which the probability that the input image is a perturbed image is high. In this implementation, the image processing network includes both a robust path and a standard path, and the user can flexibly choose which path to use according to the actual situation, which expands the application scenarios of this solution and improves its implementation flexibility.
In a possible implementation of the second aspect, the feature processing network may further be configured to, in a third case, output the first processing result corresponding to the first image according to the non-robust representation.
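Putting the second-aspect structure together, the following hypothetical module routes among the three cases. The constant-tensor padding that keeps the feature processing network's input format fixed is an assumption carried over from the training-side sketches above:

```python
import torch
import torch.nn as nn

class ImageProcessingNetwork(nn.Module):
    def __init__(self, f_rob, f_nonrob, feature_processor):
        super().__init__()
        self.f_rob = f_rob             # first feature extraction network
        self.f_nonrob = f_nonrob       # second feature extraction network
        self.head = feature_processor  # feature processing network

    def forward(self, image, case="first"):
        z_rob = self.f_rob(image)
        z_nonrob = self.f_nonrob(image)
        if case == "first":    # accuracy-oriented: combine both representations
            z = torch.cat([z_rob, z_nonrob], dim=-1)
        elif case == "second": # robustness-oriented: robust representation only
            z = torch.cat([z_rob, torch.zeros_like(z_nonrob)], dim=-1)
        else:                  # third case: non-robust representation only
            z = torch.cat([torch.zeros_like(z_rob), z_nonrob], dim=-1)
        return self.head(z)    # first processing result
```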
In a possible implementation of the second aspect, the feature processing network is specifically embodied as a classification network, and the classification network may be specifically configured to: perform a classification operation according to the combined representation and output a classification category corresponding to the first image; or perform a classification operation according to the robust representation and output a classification category corresponding to the first image; or perform a classification operation according to the non-robust representation and output a classification category corresponding to the first image. In this implementation, the provided image processing method is applied to the specific application scenario of image classification, which improves the degree of integration with the application scenario.
In a possible implementation of the second aspect, if the function of the image processing network is to determine whether the first image is an original image or an adversarial image, the first processing result indicates that the first image is an original image, or the first processing result indicates that the first image is a perturbed image. In this implementation, the feature information extracted by the first feature extraction network and the second feature extraction network can be used not only to obtain a processing result corresponding to objects in the image, but also to obtain a processing result corresponding to the image as a whole, that is, to determine whether the image is an original image or a perturbed image, which expands the application scenarios of this solution.
In a possible implementation of the second aspect, the feature processing network may be configured to: generate a first classification category corresponding to the first image according to the robust representation; generate a second classification category corresponding to the first image according to the non-robust representation; when the first classification category is consistent with the second classification category, output a first processing result indicating that the first image is an original image; and when the first classification category is inconsistent with the second classification category, output a first processing result indicating that the first image is a perturbed image. In this implementation, whether the first image is an original image or an adversarial image is determined by judging whether the two classification categories are consistent; the method is simple and highly operable.
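This agreement test can be written directly on top of the module sketched above; disagreement between the two single-path categories flags a perturbed input (a sketch under the same assumptions, not the patent's mandated detector):

```python
def is_perturbed(net, image):
    cat_rob = net(image, case="second").argmax(dim=-1)   # category from robust path
    cat_nonrob = net(image, case="third").argmax(dim=-1) # category from non-robust path
    # Consistent categories suggest an original image; inconsistent ones
    # suggest a perturbed (adversarial) image.
    return cat_rob != cat_nonrob
```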
In a possible implementation of the second aspect, the feature processing network may be specifically configured to combine the robust representation and the non-robust representation, and perform a detection operation according to the combined representation to output a detection result corresponding to the first image, the first processing result including the detection result. In one case, the detection result may indicate whether the first image is an original image or a perturbed image; in another case, the detection result may indicate which objects are included in the first image, that is, the object type of at least one object included in the first image, and optionally the detection result may further include position information of each of the at least one object. This implementation provides another way of determining whether the first image is an original image or an adversarial image, which enhances the implementation flexibility of this solution.
In a possible implementation of the second aspect, the image processing network is one or more of the following: an image classification network, an image recognition network, an image segmentation network, or an image detection network. This implementation provides multiple specific implementations of the image processing network, which expands the application scenarios of this solution and improves its implementation flexibility.
In a possible implementation of the second aspect, the feature processing network includes a perceptron.
In a possible implementation of the second aspect, the first feature extraction network is a convolutional neural network or a residual neural network, and the second feature extraction network is a convolutional neural network or a residual neural network.
For the specific implementations of certain steps in the second aspect of the embodiments of this application and its various possible implementations, as well as the specific meanings of the terms in each possible implementation, reference may be made to the descriptions of the various possible implementations of the first aspect, and details are not repeated here.
In a third aspect, an embodiment of this application provides a neural network training apparatus, which can be used in the image processing field within the field of artificial intelligence. The neural network training apparatus may include an input module and a training module. The input module is configured to separately input an adversarial image into a first feature extraction network and a second feature extraction network, to obtain a first robust representation generated by the first feature extraction network and a first non-robust representation generated by the second feature extraction network, where the adversarial image is an image obtained by performing perturbation processing on an original image, a robust representation refers to features that are insensitive to perturbation, and a non-robust representation refers to features that are sensitive to perturbation. The input module is further configured to input the first robust representation into a classification network to obtain a first classification category output by the classification network, and to input the first non-robust representation into the classification network to obtain a second classification category output by the classification network. The training module is configured to iteratively train the first feature extraction network and the second feature extraction network according to a first loss function until a convergence condition is met, and to output the trained first feature extraction network and the trained second feature extraction network. The first loss function is used to represent the similarity between the first classification category and a first label category, and the similarity between the second classification category and a second label category, where the first label category is the correct category corresponding to the adversarial image, and the second label category is the wrong category corresponding to the adversarial image.
In the third aspect of the embodiments of this application, the modules of the neural network training apparatus may further be configured to perform the steps in the various possible implementations of the first aspect. For the specific implementations of certain steps in the third aspect and its various possible implementations, as well as the beneficial effects of each possible implementation, reference may be made to the descriptions of the various possible implementations of the first aspect, and details are not repeated here.
In a fourth aspect, an embodiment of this application provides an image processing method, which can be used in the image processing field within the field of artificial intelligence. The method may include: an execution device inputs a first image into a first feature extraction network to obtain a robust representation corresponding to the first image generated by the first feature extraction network, a robust representation being features that are insensitive to perturbation; the execution device inputs the first image into a second feature extraction network to obtain a non-robust representation corresponding to the first image generated by the second feature extraction network, a non-robust representation being features that are sensitive to perturbation; and the execution device outputs, through a feature processing network and according to the robust representation and the non-robust representation, a first processing result corresponding to the first image, where the first feature extraction network, the second feature extraction network, and the feature processing network belong to the same image processing network.
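A hypothetical inference call for the fourth-aspect method, reusing the ImageProcessingNetwork sketch above (f_rob, f_nonrob, classifier, and first_image are assumed to be defined as in the earlier sketches):

```python
net = ImageProcessingNetwork(f_rob, f_nonrob, classifier)
net.eval()
with torch.no_grad():
    first_processing_result = net(first_image, case="first")
```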
In the fourth aspect of the embodiments of this application, the execution device may further be configured to perform the steps in the various possible implementations of the second aspect. For the specific implementations of certain steps in the fourth aspect and its various possible implementations, as well as the beneficial effects of each possible implementation, reference may be made to the descriptions of the various possible implementations of the second aspect, and details are not repeated here.
In a fifth aspect, an embodiment of this application provides a training device, which may include a processor, the processor being coupled to a memory, and the memory storing program instructions; when the program instructions stored in the memory are executed by the processor, the neural network training method described in the first aspect is implemented. For the steps performed by the training device in each possible implementation of the first aspect when executed by the processor, reference may be made to the first aspect, and details are not repeated here.
In a sixth aspect, an embodiment of this application provides an execution device, which may include a processor, the processor being coupled to a memory, and the memory storing program instructions; when the program instructions stored in the memory are executed by the processor, the steps performed by the image processing network described in the second aspect are implemented. For the steps performed by the image processing network in each possible implementation of the second aspect when executed by the processor, reference may be made to the second aspect, and details are not repeated here.
In a seventh aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to perform the neural network training method described in the first aspect, or causes the computer to perform the image processing method described in the fourth aspect.
In an eighth aspect, an embodiment of this application provides a circuit system, the circuit system including a processing circuit configured to perform the neural network training method described in the first aspect, or configured to perform the image processing method described in the fourth aspect.
In a ninth aspect, an embodiment of this application provides a computer program that, when run on a computer, causes the computer to perform the neural network training method described in the first aspect, or causes the computer to perform the image processing method described in the fourth aspect.
In a tenth aspect, an embodiment of this application provides a chip system, the chip system including a processor configured to support a training device or an image processing network in implementing the functions involved in the above aspects, for example, sending or processing the data and/or information involved in the above methods. In a possible design, the chip system further includes a memory configured to store the program instructions and data necessary for the server or the communication device. The chip system may consist of chips, or may include chips and other discrete devices.
Description of the drawings
FIG. 1 is a schematic structural diagram of the artificial intelligence main framework according to an embodiment of this application;
FIG. 2 is a system architecture diagram of an image processing system according to an embodiment of this application;
FIG. 3 is a schematic flowchart of a neural network training method according to an embodiment of this application;
FIG. 4 is a schematic diagram of a perturbation operation in the neural network training method according to an embodiment of this application;
FIG. 5 is a schematic diagram of a robust representation and a non-robust representation after visualization processing in the neural network training method according to an embodiment of this application;
FIG. 6 is a schematic flowchart of an image processing method according to an embodiment of this application;
FIG. 7 is a schematic diagram of an image processing network in the image processing method according to an embodiment of this application;
FIG. 8 is a schematic flowchart of an image processing method according to an embodiment of this application;
FIG. 9 is a schematic structural diagram of a neural network training apparatus according to an embodiment of this application;
FIG. 10 is a schematic structural diagram of a neural network training apparatus according to an embodiment of this application;
FIG. 11 is a schematic structural diagram of a neural network training apparatus according to an embodiment of this application;
FIG. 12 is a schematic structural diagram of an execution device according to an embodiment of this application;
FIG. 13 is a schematic structural diagram of a training device according to an embodiment of this application;
FIG. 14 is a schematic structural diagram of a chip according to an embodiment of this application.
Detailed description of embodiments
The embodiments of this application provide a neural network for image processing and a related device. A trained first feature extraction network and a trained second feature extraction network can respectively extract the robust representation and the non-robust representation in an input image. This avoids the loss of robustness caused by mixing the two, while retaining both the robust representation and the non-robust representation of the input image, thereby avoiding a loss of accuracy and improving the robustness and accuracy of the neural network at the same time.
The terms "first", "second", and the like in the specification, claims, and accompanying drawings of this application are used to distinguish between similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable in appropriate circumstances; this is merely the way of distinguishing objects with the same attribute when describing the embodiments of this application. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device that includes a series of units is not necessarily limited to those units, but may include other units that are not expressly listed or that are inherent to such a process, method, product, or device.
First, the overall workflow of an artificial intelligence system is described. Refer to FIG. 1, which is a schematic structural diagram of a main framework of artificial intelligence. The artificial intelligence framework is described below from two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition to data processing, for example, a general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data undergoes a condensation process of "data - information - knowledge - wisdom". The "IT value chain", ranging from the underlying infrastructure of artificial intelligence and information (providing and processing technology implementations) to the industrial ecology of the system, reflects the value that artificial intelligence brings to the information technology industry.
(1) Infrastructure
The infrastructure provides computing capability support for the artificial intelligence system, implements communication with the outside world, and implements support through a basic platform. The infrastructure communicates with the outside through sensors. The computing capability is provided by smart chips; as an example, the smart chips include hardware acceleration chips such as a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), and a field programmable gate array (FPGA). The basic platform includes related platform assurance and support such as a distributed computing framework and a network, and may include cloud storage and computing, an interconnection network, and the like. For example, the sensors communicate with the outside to obtain data, and the data is provided to smart chips in a distributed computing system provided by the basic platform for computation.
(2) Data
Data at the layer above the infrastructure indicates a data source in the field of artificial intelligence. The data involves graphics, images, speech, and text, and also involves Internet-of-Things data of conventional devices, including service data of existing systems and perception data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing usually includes manners such as data training, machine learning, deep learning, search, reasoning, and decision-making.
Machine learning and deep learning may perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and the like on data.
Reasoning is a process of simulating a human intelligent reasoning manner in a computer or intelligent system and, based on a reasoning control policy, using formalized information to perform machine thinking and solve problems; typical functions are search and matching.
Decision-making is a process of making a decision after intelligent information is reasoned about, and usually provides functions such as classification, ranking, and prediction.
(4) General capabilities
After the foregoing data processing is performed on the data, some general capabilities may further be formed based on the data processing results, for example, an algorithm or a general system such as translation, text analysis, computer vision processing, speech recognition, and image recognition.
(5) Smart products and industry applications
Smart products and industry applications are products and applications of artificial intelligence systems in various fields, and are an encapsulation of an overall artificial intelligence solution that productizes intelligent information decision-making and implements landed applications. Application fields mainly include smart terminals, smart manufacturing, smart transportation, smart home, smart healthcare, smart security, autonomous driving, safe city, and the like.
The embodiments of this application may be mainly applied to image processing scenarios in the foregoing application fields. As an example, in the field of autonomous driving, after a sensor on an autonomous vehicle collects an original image, the sensor transmits the original image to a processor of the autonomous vehicle, and the processor of the autonomous vehicle processes the transmitted image by using an image processing network. If the pixel values of the original image are not perturbed during transmission, the processor processes the original image; if the pixel values in the original image are perturbed during transmission, the processor processes an adversarial image. That is, the images processed by the processor of the autonomous vehicle may include both original images and adversarial images. As another example, in the field of smart terminals such as mobile phones, computers, and wearable devices, after a smart terminal collects an original image, if the pixel values of the original image are perturbed during operations such as supplementing light or adding a filter, the image processed by the smart terminal through a neural network may be the perturbed image; that is, in the smart terminal field, original images and adversarial images may also coexist. The difference between an adversarial image and the original image is imperceptible to the human eye, but the adversarial image greatly reduces the accuracy of the neural network. It should be understood that the examples here are merely intended to facilitate understanding of the application scenarios of the embodiments of this application, and are not an exhaustive list of those application scenarios. The embodiments of this application may also be applied to speech processing or text processing scenarios; in the embodiments of this application, application to an image processing scenario is merely used as an example for detailed description.
The embodiments of this application are described below with reference to the accompanying drawings. A person of ordinary skill in the art may learn that, with the development of technologies and the emergence of new scenarios, the technical solutions provided in the embodiments of this application are also applicable to similar technical problems.
To facilitate understanding of this solution, the system architecture of the image processing system provided in the embodiments of this application is first described with reference to FIG. 2, which is a system architecture diagram of an image processing system according to an embodiment of this application. In FIG. 2, the image processing system 200 includes an execution device 210, a training device 220, a database 230, and a data storage system 240, and the execution device 210 includes a calculation module 211.
In the training phase, the database 230 stores a training data set, and the training data set includes a plurality of training images and an annotation classification of each training image. The training device 220 generates a target model/rule 201 for images, and iteratively trains the target model/rule 201 by using the training data set in the database to obtain a mature target model/rule 201. The target model/rule 201 may be specifically represented as an image processing network. The image processing network obtained by the training device 220 may be applied to different systems or devices.
In the inference phase, the execution device 210 may invoke data, code, and the like in the data storage system 240, and may also store data, instructions, and the like in the data storage system 240. The data storage system 240 may be placed inside the execution device 210, or the data storage system 240 may be an external memory relative to the execution device 210.
The calculation module 211 may process, through the image processing network, images collected by the execution device 210 to obtain a processing result. The specific form of the processing result is related to the function of the image processing network.
In some embodiments of this application, for example in FIG. 2, a "user" may directly interact with the execution device 210, that is, the execution device 210 and the client device are integrated in the same device. However, FIG. 2 is merely a schematic architectural diagram of the image processing system provided by this embodiment, and the positional relationships among the devices, components, modules, and the like shown in the figure do not constitute any limitation. In some other embodiments of this application, the execution device 210 and the client device may be separate, independent devices. In that case, the execution device 210 is configured with an input/output interface for data interaction with the client device; the "user" may input the collected image to the input/output interface through the client device, and the execution device 210 returns the processing result to the client device through the input/output interface.
With reference to the foregoing description, the embodiments of this application provide a neural network training method and an image processing method, which are applied to the training phase and the inference phase, respectively. Because the specific implementations of the training phase and the inference phase in the image processing system provided by the embodiments of this application are different, the specific implementation procedures of the training phase and the inference phase are described below.
1. Training phase
In the embodiments of this application, the training phase refers to the process in which the training device 220 in FIG. 2 performs a training operation by using training data. Refer to FIG. 3, which is a schematic flowchart of a neural network training method according to an embodiment of this application. The neural network training method provided in this embodiment of this application may include the following steps.
301. The training device obtains an original image and a third annotation category.
In this embodiment of this application, a training data set is configured on the training device, and the training data set may include an original image and a third annotation category corresponding to the original image. The original image is an image that has not undergone perturbation processing, or may be a directly collected image. The third annotation category is the correct classification corresponding to the original image, that is, the correct classification of one or more objects in the original image; it may include one or more classification categories and is used as supervision data in the training phase. As an example, if an image includes a panda, the corresponding third annotation category is panda; as another example, if an image includes a panda and a frog, the corresponding third annotation category is panda and frog. The examples here are merely intended to facilitate understanding of this solution and are not intended to limit this solution.
Perturbation processing means slightly adjusting the pixel values of pixels in the original image to obtain a perturbed image. The perturbed image may also be referred to as an adversarial image, and it is usually difficult for the human eye to distinguish the adversarial image from the original image. Specifically, one perturbation process may adjust the pixel value of every pixel in the original image, or may adjust the pixel values of only some pixels in the original image. The perturbation may be specifically represented as a two-dimensional matrix whose size is consistent with the size of the original image. For a more intuitive understanding of this solution, refer to FIG. 4, which is a schematic diagram of a perturbation operation in the neural network training method according to an embodiment of this application. A1 represents a natural image, A2 represents a perturbation, and A3 represents an adversarial image. The classification category obtained by inputting A1 into an image classification network is panda, while the classification category obtained by inputting A3 into the image classification network is gibbon. It should be understood that the example in FIG. 4 is merely intended to facilitate understanding of the concept of perturbation processing, and is not intended to limit this solution.
Further, the foregoing perturbation may be constrained, and the constraint on the perturbation may be expressed by the following formula:

S = {δ : ||δ||_p ≤ ε};

where S represents the constraint on the perturbation δ, δ represents the perturbation, ||δ||_p represents the p-norm of δ (which may also be referred to as the modulus length of δ), and p may be any integer greater than or equal to 1; as an example, p may be 2, or p may be infinity. ε is a fixed preset value; as an example, the value of ε may be 0.3 or another value. It should be understood that this example is merely intended to further explain the concept of the perturbation, and is not intended to limit this solution.
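As a minimal sketch of this constraint (assuming p = ∞ and pixel values normalized to [0, 1]; the function names below are illustrative assumptions, not part of this application):

```python
import torch

def project_perturbation(delta: torch.Tensor, eps: float = 0.3) -> torch.Tensor:
    # Constrain the perturbation to the l-infinity ball ||delta||_inf <= eps.
    return delta.clamp(-eps, eps)

def apply_perturbation(image: torch.Tensor, delta: torch.Tensor, eps: float = 0.3) -> torch.Tensor:
    # Superimpose the constrained perturbation on the image and keep pixel values valid.
    adv = image + project_perturbation(delta, eps)
    return adv.clamp(0.0, 1.0)  # assumes pixel values normalized to [0, 1]
```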
302. The training device inputs the original image into the first feature extraction network to obtain a second robust representation generated by the first feature extraction network.
In this embodiment of this application, after obtaining the original image, the training device inputs the original image into the first feature extraction network to obtain the second robust representation generated by the first feature extraction network.
The first feature extraction network is a convolutional neural network or a residual neural network. As an example, the first feature extraction network may be the feature extraction part of Wide Residual Networks 34 (WRNS34); as another example, it may be the feature extraction part of Pre-activated Residual Networks 18 (PRNS18). The first feature extraction network may also be another type of convolutional neural network or residual neural network, which is not limited here.
A robust representation refers to features, among the features extracted from an image, that are insensitive to perturbation. The classification category corresponding to the robust representation extracted from an original image is consistent with the classification category corresponding to the robust representation extracted from the perturbed image corresponding to that original image. The second robust representation includes the perturbation-insensitive features extracted from the original image. The second robust representation may be specifically represented as one-dimensional data, two-dimensional data, three-dimensional data, or higher-dimensional data, and its length may be 500, 800, 1000, or another length, which is not limited here.
303. The training device inputs the original image into the second feature extraction network to obtain a second non-robust representation generated by the second feature extraction network.
In some embodiments of this application, after obtaining the original image, the training device inputs the original image into the second feature extraction network to obtain the second non-robust representation generated by the second feature extraction network.
The second feature extraction network may also be a convolutional neural network or a residual neural network. Its function is similar to that of the first feature extraction network; the difference is that the trained second feature extraction network and the trained first feature extraction network have different weight parameters, so that the first feature extraction network extracts the robust representation in an image while the second feature extraction network extracts the non-robust representation in the image. For examples of the specific form of the second feature extraction network, refer to the foregoing examples of the first feature extraction network; details are not repeated here. This embodiment of this application provides two specific implementations of the first feature extraction network and the second feature extraction network, which improves the implementation flexibility of this solution.
A non-robust representation refers to features, among the features extracted from an image, that are sensitive to perturbation. The classification category corresponding to the non-robust representation extracted from an original image is inconsistent with the classification category corresponding to the non-robust representation extracted from the perturbed image corresponding to that original image. The second non-robust representation includes the perturbation-sensitive features extracted from the original image; its specific form and length are similar to those of the second robust representation described above and are not repeated here.
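As an illustrative sketch of the two feature extraction networks, the following uses a small convolutional extractor as a stand-in for WRNS34 or PRNS18 (the architecture, dimensions, and names are assumptions for illustration, not the networks actually claimed):

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """A small convolutional feature extractor standing in for WRNS34 / PRNS18."""
    def __init__(self, rep_dim: int = 512):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(128, rep_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.body(x))

# Same architecture, independently trained weights: after training, one network
# extracts the robust representation and the other the non-robust representation.
robust_extractor = FeatureExtractor()      # first feature extraction network
non_robust_extractor = FeatureExtractor()  # second feature extraction network
```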
For a more intuitive understanding of robust and non-robust representations, refer to FIG. 5, which is a schematic diagram of a visualized robust representation and a visualized non-robust representation in the neural network training method according to an embodiment of this application. B1 and B2 correspond to the same original image (original image 1), and B3 and B4 correspond to the same original image (original image 2). The object in original image 1 is a squirrel; B1 is obtained by visualizing the robust representation extracted from original image 1, and B2 is obtained by visualizing the non-robust representation extracted from original image 1. When FIG. 5 is viewed by the human eye, B1 faintly shows the shape of a squirrel and also carries the color of the squirrel (not shown, because color cannot be presented in a patent document), whereas the human eye obtains no information from B2. The object in original image 2 is a ship; B3 is obtained by visualizing the robust representation extracted from original image 2, and B4 is obtained by visualizing the non-robust representation extracted from original image 2. When FIG. 5 is viewed by the human eye, B3 faintly shows the shape of a ship and also carries the color of the ship (not shown for the same reason), whereas the human eye obtains no information from B4. That is, the feature information included in a robust representation is similar to the features used by the human eye; in contrast, the feature information included in a non-robust representation cannot be understood by the human eye, to which the non-robust representation is noise. It should be understood that the example in FIG. 5 is merely intended to facilitate understanding of the concepts of robust and non-robust representations, and is not intended to limit this solution.
It should be noted that this embodiment of this application does not limit the execution order of steps 302 and 303: step 302 may be performed before step 303, step 303 may be performed before step 302, or steps 302 and 303 may be performed at the same time.
304. The training device combines the second robust representation and the second non-robust representation to obtain a combined first representation.
In this embodiment of this application, in one case, after obtaining the second robust representation and the second non-robust representation, the training device combines the second robust representation and the second non-robust representation to obtain the combined first representation. The manner of combination includes, but is not limited to, concatenation (concat), addition (add), fusion, and multiplication.
305. The training device inputs the combined first representation into the classification network, so that the classification network performs a classification operation according to the combined first representation, to obtain a third classification category output by the classification network.
In this embodiment of this application, after obtaining the combined first representation, the training device inputs the combined first representation into the classification network, so that the classification network performs a classification operation according to the combined first representation to obtain the third classification category output by the classification network. The processing manner of steps 304 and 305 may also be referred to as generating the third classification category through the standard path. The classification network may include at least one perceptron, where the perceptron includes at least two neural network layers and may specifically be a two-layer fully connected perceptron. The third classification category indicates the category of the object in the original image.
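A hedged sketch of this standard path (steps 304 and 305), assuming concatenation as the combination manner and a two-layer fully connected perceptron as the classification network; all dimensions and names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """A two-layer fully connected perceptron, one possible classification network."""
    def __init__(self, in_dim: int = 1024, num_classes: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, num_classes))

    def forward(self, rep: torch.Tensor) -> torch.Tensor:
        return self.net(rep)

classifier = Classifier()  # in_dim = 2 * rep_dim when two 512-dim representations are concatenated

def standard_path(x, robust_extractor, non_robust_extractor, classifier):
    r = robust_extractor(x)                 # second robust representation
    nr = non_robust_extractor(x)            # second non-robust representation
    combined = torch.cat([r, nr], dim=-1)   # combination by concatenation (step 304)
    return classifier(combined)             # logits for the third classification category (step 305)
```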
306. The training device inputs the second robust representation into the classification network, so that the classification network performs a classification operation according to the second robust representation, to obtain a fourth classification category output by the classification network.
In this embodiment of this application, in another case, the training device may also combine the second robust representation with a first constant tensor (for example, an all-zero vector) to obtain a combined third representation, and input the combined third representation into the classification network, so that the classification network performs a classification operation according to the combined third representation to obtain the fourth classification category output by the classification network. Specifically, because the first constant tensor part of the combined third representation does not change and does not include feature information of the natural image, the classification network can perform the classification operation by using the feature information included in the second robust representation of the combined third representation, and output the fourth classification category, which is the classification category of the object in the natural image. The processing manner of step 306 may also be referred to as generating the fourth classification category through the robust path. Further, because the same classification network may be used in steps 306 and 305, the second robust representation is combined with the first constant tensor so that the format of the combined third representation is consistent with that of the combined first representation. If different classification networks are used in steps 306 and 305, the second robust representation may also be input directly into the classification network.
For the specific implementation of the combination, refer to the description in step 304, and for the specific form of the classification network, refer to the description in step 305; details are not repeated here. Step 306 may use the same classification network as step 305, or a different classification network. The format of the combined third representation may be the same as that of the combined first representation. The first constant tensor is a tensor whose values remain unchanged across multiple training iterations; it may be specifically represented as one-dimensional data, two-dimensional data, three-dimensional data, or higher-dimensional data, and the values of all constants in a first constant tensor may be the same or different. As an example, all constants in a first constant tensor may be 0, 1, 2, or another value; as another example, a first constant tensor may include different values such as 1, 2, 3, 5, 12, and 18. The length of the first constant tensor is the same as the length of the second non-robust representation. The relative positions of the second robust representation and the first constant tensor may correspond to the relative positions of the second robust representation and the second non-robust representation: if in step 304 the second robust representation comes first and the second non-robust representation comes second, then in step 306 the second robust representation comes first and the first constant tensor comes second; if in step 304 the second non-robust representation comes first and the second robust representation comes second, then in step 306 the first constant tensor comes first and the second robust representation comes second.
307. The training device inputs the second non-robust representation into the classification network, so that the classification network performs a classification operation according to the second non-robust representation, to obtain a fifth classification category output by the classification network.
In this embodiment of this application, in yet another case, the training device may combine the second non-robust representation with a second constant tensor to obtain a combined fourth representation, and input the combined fourth representation into the classification network, so that the classification network performs a classification operation according to the combined fourth representation to obtain the fifth classification category output by the classification network. Specifically, similar to step 306, the classification network performs the classification operation by using the feature information included in the second non-robust representation of the combined fourth representation, and outputs the fifth classification category, which is the classification category of the object in the natural image. The processing manner of step 307 may also be referred to as generating the fifth classification category through the non-robust path.
For the specific implementation of the combination, refer to the description in step 304, and for the specific form of the classification network, refer to the description in step 305; details are not repeated here. Step 307 may use the same classification network as steps 305 and 306, or a different classification network. If step 307 uses the same classification network as steps 305 and 306, then to ensure the consistency of the classification network during data processing, the second non-robust representation needs to be combined with the second constant tensor, and the format of the combined fourth representation needs to be consistent with the format of the combined first representation. For the meaning and specific form of the second constant tensor, refer to the description of the first constant tensor; the second constant tensor may be the same constant tensor as the first constant tensor or a different one, which is not limited here. The relative positions of the second non-robust representation and the second constant tensor may correspond to the relative positions of the second robust representation and the second non-robust representation: if in step 304 the second robust representation comes first and the second non-robust representation comes second, then in step 307 the second constant tensor comes first and the second non-robust representation comes second; if in step 304 the second non-robust representation comes first and the second robust representation comes second, then in step 307 the second non-robust representation comes first and the second constant tensor comes second.
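The robust path of step 306 and the non-robust path of step 307 can then be sketched as follows, assuming an all-zero constant tensor and the same concatenation order as the standard path above (an illustrative sketch under these assumptions, not the definitive implementation):

```python
import torch

def robust_path(x, robust_extractor, classifier):
    r = robust_extractor(x)
    const = torch.zeros_like(r)  # first constant tensor (here, all zeros)
    # Robust representation first, constant tensor second, mirroring step 304.
    return classifier(torch.cat([r, const], dim=-1))   # fourth classification category

def non_robust_path(x, non_robust_extractor, classifier):
    nr = non_robust_extractor(x)
    const = torch.zeros_like(nr)  # second constant tensor
    # Constant tensor first, non-robust representation second, preserving positions.
    return classifier(torch.cat([const, nr], dim=-1))  # fifth classification category
```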
It should be noted that this embodiment of this application does not limit the execution order among steps 304 and 305, step 306, and step 307. Steps 304 and 305 may be performed first, then step 306, and then step 307; step 306 may be performed first, then steps 304 and 305, and then step 307; step 306 may be performed first, then step 307, and then steps 304 and 305; or steps 304 and 305 may be performed first, then step 307, and then step 306. The order among steps 304 and 305, step 306, and step 307 may be arranged arbitrarily, and the possibilities are not exhaustively listed here. In addition, steps 304 and 305, step 306, and step 307 may also be performed at the same time.
308. The training device obtains an adversarial image and a first annotation category.
In this embodiment of this application, the training device obtains an adversarial image and a first annotation category corresponding to the adversarial image. The adversarial image is an image that has undergone perturbation processing, and may also be referred to as a perturbed image; for the specific meaning of the perturbation, refer to the description in step 301, which is not repeated here. The first annotation category is the correct category corresponding to the adversarial image, that is, the correct classification of one or more objects in the adversarial image; it may include one or more classification categories and is used as supervision data in the training phase. The meaning of the first annotation category is similar to that of the foregoing third annotation category; the difference is that the first annotation category is for the adversarial image while the third annotation category is for the natural image. For examples of the first annotation category, refer to the examples of the third annotation category in step 301, which are not repeated here.
Specifically, the training device may perform perturbation processing on the foregoing natural image to obtain the adversarial image. In one implementation, in each training iteration the training device may obtain the foregoing perturbation based on the gradient of the foregoing standard path, robust path, or non-robust path, and then obtain the perturbed image. In another implementation, the training device may obtain the foregoing perturbation without relying on the foregoing gradient. The manner of generating the adversarial image differs between these two cases, and the two implementations are described separately below.
(1) The adversarial image is generated based on a gradient
Further, the foregoing gradient may be a first gradient of the standard path, a second gradient of the robust path, or a third gradient of the non-robust path, which are introduced separately below.
A. Obtaining the perturbation based on the first gradient of the standard path
In this embodiment, step 308 may include: the training device generates a first gradient according to the function value of a second loss function, performs perturbation processing on the original image according to the first gradient to generate the adversarial image, and determines the third annotation category as the first annotation category. The second loss function indicates the similarity between the third classification category and the third annotation category, and may specifically be a cross-entropy loss function, a max-margin loss function, or another type of loss function, which is not limited here. In this embodiment of this application, the first gradient is generated according to the similarity between the third classification category and the third annotation category, and the original image is perturbed according to the first gradient. This makes the perturbation processing more targeted, helps accelerate the training process of the first feature extraction network and the second feature extraction network, and improves the efficiency of the training process.
Specifically, in one case, the training device may generate the function value of the second loss function according to the third classification category and the third annotation category, generate the first gradient according to the function value of the second loss function, substitute the first gradient into a preset function and then multiply by a preset coefficient to obtain the perturbation, and then superimpose the obtained perturbation on the original image to generate the adversarial image. The preset function may be a sign function, an identity function, or another function; the value of the preset coefficient may be 0.008, 0.007, 0.006, 0.005, or another coefficient value. The specific choice of the preset function and the value of the preset coefficient may be determined with reference to the actual application environment, and are not limited here.
To further understand this solution, one expression for generating the perturbation is shown (here the preset function is the sign function and the preset coefficient is 0.007):

η = 0.007 · sign(∇_x J(θ, x, y));

where η represents the perturbation, J(θ, x, y) represents the second loss function, θ represents the set of the weights of the neural network layers in the first feature extraction network and the second feature extraction network, x represents the natural image input into the first feature extraction network and the second feature extraction network, y represents the third annotation category, ∇_x represents taking the derivative of the second loss function with respect to x, and sign indicates that the preset function is the sign function. It should be understood that this formula is merely an example intended to facilitate understanding of this solution and is not intended to limit this solution; the preset function and the preset coefficient may also be replaced.
In another case, the training device may also generate the function value of the second loss function according to the third classification category and the third annotation category, generate the first gradient according to the function value of the second loss function, multiply the first gradient by the preset coefficient to obtain the perturbation, and then superimpose the obtained perturbation on the original image to generate the adversarial image.
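A minimal sketch of both cases of gradient-based perturbation, assuming cross-entropy as the loss function, a preset coefficient of 0.007, and pixel values in [0, 1]; `forward_fn` stands for whichever path (standard, robust, or non-robust) supplies the loss, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def generate_adversarial(x, y, forward_fn, coeff: float = 0.007, use_sign: bool = True):
    # Perturb x along the gradient of the loss of the chosen path.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(forward_fn(x), y)   # e.g., the second loss function
    grad, = torch.autograd.grad(loss, x)       # e.g., the first gradient
    # Case 1: sign function as the preset function; case 2: identity (raw gradient).
    delta = coeff * (grad.sign() if use_sign else grad)
    return (x + delta).clamp(0.0, 1.0).detach()

# Usage example, taking the standard path as the source of the gradient:
# adv = generate_adversarial(
#     x, y, lambda img: standard_path(img, robust_extractor, non_robust_extractor, classifier))
```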
B. Obtaining the perturbation based on the second gradient of the robust path
In this embodiment, step 308 may include: the training device generates a second gradient according to the function value of a third loss function, performs perturbation processing on the original image according to the second gradient to generate the adversarial image, and determines the third annotation category as the first annotation category. The third loss function indicates the similarity between the fourth classification category and the third annotation category; its type is similar to that of the second loss function, and examples are not repeated here. The difference between the second gradient and the first gradient is that the second gradient is obtained by taking the gradient of the function value of the third loss function, whereas the first gradient is obtained by taking the gradient of the function value of the second loss function. Further, in terms of the foregoing formula, when the perturbation is generated by using the first gradient, J(θ, x, y) in the formula represents the second loss function; when the perturbation is generated by using the second gradient, J(θ, x, y) in the formula represents the third loss function. In this embodiment of this application, the original image is perturbed according to the similarity between the fourth classification category, output by the classification network according to the second robust representation, and the third annotation category, which makes the perturbation processing more targeted with respect to the first feature extraction network and helps improve the capability of the first feature extraction network to extract robust representations.
Specifically, in one case, the training device may generate the function value of the third loss function according to the fourth classification category and the third annotation category, generate the second gradient according to the function value of the third loss function, substitute the second gradient into the preset function and then multiply by the preset coefficient to obtain the perturbation, and then superimpose the obtained perturbation on the original image to generate the adversarial image. For the specific implementations of the preset function and the preset coefficient, refer to the description in case A above, which is not repeated here. In another case, the training device may also generate the function value of the third loss function according to the fourth classification category and the third annotation category, generate the second gradient according to the function value of the third loss function, multiply the second gradient by the preset coefficient to obtain the perturbation, and then superimpose the obtained perturbation on the original image to generate the adversarial image.
C. Obtaining the perturbation based on the third gradient of the non-robust path
In this embodiment, step 308 may include: the training device generates a third gradient according to the function value of a fourth loss function, performs perturbation processing on the original image according to the third gradient to generate the adversarial image, and determines the third annotation category as the first annotation category. The fourth loss function indicates the similarity between the fifth classification category and the third annotation category; its type is similar to that of the second loss function, and examples are not repeated here. The difference between the third gradient and the first gradient is that the third gradient is obtained by taking the gradient of the function value of the fourth loss function, whereas the first gradient is obtained by taking the gradient of the function value of the second loss function. Further, in terms of the foregoing formula, when the perturbation is generated by using the first gradient, J(θ, x, y) in the formula represents the second loss function; when the perturbation is generated by using the third gradient, J(θ, x, y) in the formula represents the fourth loss function. In this embodiment of this application, the original image is perturbed according to the similarity between the fifth classification category, output by the classification network according to the second non-robust representation, and the third annotation category, which makes the perturbation processing more targeted with respect to the second feature extraction network and helps improve the capability of the second feature extraction network to extract non-robust representations.
Specifically, in one case, the training device may generate the function value of the fourth loss function according to the fifth classification category and the third annotation category, generate the third gradient according to the function value of the fourth loss function, substitute the third gradient into the preset function and then multiply by the preset coefficient to obtain the perturbation, and then superimpose the obtained perturbation on the original image to generate the adversarial image. For the specific implementations of the preset function and the preset coefficient, refer to the description in case A above, which is not repeated here. In another case, the training device may also generate the function value of the fourth loss function according to the fifth classification category and the third annotation category, generate the third gradient according to the function value of the fourth loss function, multiply the third gradient by the preset coefficient to obtain the perturbation, and then superimpose the obtained perturbation on the original image to generate the adversarial image.
It should be noted that steps 301 to 307 in this embodiment of this application are optional. However, if the adversarial image is obtained based on the gradient of the foregoing standard path, robust path, or non-robust path, steps 301 to 307 are mandatory, and steps 301 to 307 are performed before step 308.
(2) The adversarial image is generated without relying on a gradient
In this embodiment, the training data set configured on the training device may be preconfigured with an adversarial image and a first annotation category corresponding to the adversarial image, and step 308 may include: the training device obtains, from the training data set, the adversarial image and the first annotation category corresponding to the adversarial image.
Further, regarding the manner of generating the adversarial images in the training data set: after obtaining a natural image, the training device generates a perturbation matrix in the form of a two-dimensional matrix according to the size of the two-dimensional matrix corresponding to the natural image, where the value of each parameter in the perturbation matrix satisfies the constraint in step 301. The value of each parameter in the perturbation matrix may be randomly generated, or the perturbation matrix may be generated in ascending order within the constraint range of step 301, or in descending order within the constraint range of step 301, or according to another rule, which is not limited here.
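A sketch of this gradient-free variant, assuming each entry of the perturbation matrix is drawn uniformly at random within the constraint (the name and the uniform distribution are illustrative assumptions):

```python
import torch

def random_perturbation(image: torch.Tensor, eps: float = 0.3) -> torch.Tensor:
    # Draw a random perturbation matrix with each entry within [-eps, eps],
    # then superimpose it on the natural image to obtain an adversarial image.
    delta = torch.empty_like(image).uniform_(-eps, eps)
    return (image + delta).clamp(0.0, 1.0)
```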
309. The training device inputs the adversarial image into the first feature extraction network to obtain a first robust representation generated by the first feature extraction network.
In this embodiment of this application, after obtaining the adversarial image, the training device inputs the adversarial image into the first feature extraction network to obtain the first robust representation generated by the first feature extraction network. The concept of a robust representation has been introduced in step 302 and is not repeated here. The difference between the first robust representation and the second robust representation is that the second robust representation is the feature information extracted from the original image, whereas the first robust representation is the feature information extracted from the adversarial image.
310. The training device inputs the adversarial image into the second feature extraction network to obtain a first non-robust representation generated by the second feature extraction network.
In this embodiment of this application, after obtaining the adversarial image, the training device inputs the adversarial image into the second feature extraction network to obtain the first non-robust representation generated by the second feature extraction network. The concept of a non-robust representation has been introduced in step 303 and is not repeated here. The difference between the first non-robust representation and the second non-robust representation is that the second non-robust representation is the feature information extracted from the original image, whereas the first non-robust representation is the feature information extracted from the adversarial image.
It should be noted that this embodiment of this application does not limit the execution order between step 309 and step 310: step 309 may be performed before step 310, step 310 may be performed before step 309, or steps 309 and 310 may be performed at the same time.
311. The training device combines the first robust representation and the first non-robust representation to obtain a combined second representation.
In this embodiment of this application, after obtaining the first robust representation and the first non-robust representation, the training device combines the first robust representation and the first non-robust representation to obtain the combined second representation. For the manner of combination, refer to the description in step 304, which is not repeated here.
312. The training device inputs the combined second representation into the classification network to obtain a sixth classification category output by the classification network.
In this embodiment of this application, after obtaining the combined second representation, the training device inputs the combined second representation into the classification network to obtain the sixth classification category output by the classification network. For the specific form of the classification network, refer to the description in step 305; the classification network used in step 312 may be the same classification network as the one used in step 305, or a different classification network. The meaning of the sixth classification category is similar to that of the third classification category; the difference is that the sixth classification category indicates the category of the object in the adversarial image.
313. The training device judges whether the sixth classification category is the same as the first annotation category; if they are different, step 314 is entered, and if they are the same, step 316 is entered.

In this embodiment of the present application, after obtaining the sixth classification category output by the classification network, the training device judges whether the sixth classification category is the same as the first annotation category, that is, whether the sixth classification category output by the classification network is the correct classification category corresponding to the adversarial image. If they are different, step 314 is entered; if they are the same, step 316 is entered.
Optionally, when the sixth classification category differs from the first annotation category, the training device determines the sixth classification category as a second annotation category. The second annotation category refers to the incorrect category corresponding to the adversarial image, that is, a misclassification of one or more objects in the adversarial image; it may include one or more classification categories and is likewise used as supervision data in the training phase. The meaning of the second annotation category is similar to that of the first annotation category, except that the second annotation category is the incorrect classification corresponding to the adversarial image, while the first annotation category is the correct classification corresponding to the adversarial image.

This embodiment of the present application thus provides a way to obtain the second annotation category that is simple to carry out and requires no additional steps, saving computing resources.
314. The training device inputs the first robust representation into the classification network to obtain a first classification category output by the classification network.

In this embodiment of the present application, the training device inputs the first robust representation into the classification network to obtain the first classification category output by the classification network. For the meaning of the classification network, refer to the description in step 305 above; step 314 may use the same classification network as step 305 or a different one. The first classification category is a classification category of an object in the adversarial image.
315. The training device inputs the first non-robust representation into the classification network to obtain a second classification category output by the classification network.

In this embodiment of the present application, the training device inputs the first non-robust representation into the classification network to obtain the second classification category output by the classification network. For the meaning of the classification network, refer to the description in step 305 above; step 315 may use the same classification network as step 305 or a different one. The second classification category is a classification category of an object in the adversarial image.
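To make steps 309 to 315 concrete, the following PyTorch sketch runs an adversarial image through both feature extraction networks and the classification heads. All module shapes and names (f_r, f_n, head_std, head_r, head_n) are hypothetical stand-ins, since this application does not fix the architectures, and the three heads may equally well be a single shared classification network.

```python
import torch
from torch import nn

f_r = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))  # first feature extraction network
f_n = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))  # second feature extraction network
head_std = nn.Linear(128, 10)  # classification network for the combined representation (step 312)
head_r = nn.Linear(64, 10)     # classification network used in step 314 (may equal head_std)
head_n = nn.Linear(64, 10)     # classification network used in step 315

x_adv = torch.rand(8, 3, 32, 32)  # batch of adversarial images (placeholder data)
y_1 = torch.randint(0, 10, (8,))  # first annotation category

r1 = f_r(x_adv)                          # first robust representation      (step 309)
n1 = f_n(x_adv)                          # first non-robust representation  (step 310)
second_rep = torch.cat([r1, n1], dim=1)  # combined second representation   (step 311)

sixth_cat = head_std(second_rep).argmax(dim=1)  # sixth classification category (step 312)
y_2 = sixth_cat  # second annotation category, used where sixth_cat != y_1  (step 313)

first_logits = head_r(r1)   # argmax gives the first classification category  (step 314)
second_logits = head_n(n1)  # argmax gives the second classification category (step 315)
```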
316. The training device iteratively trains the first feature extraction network and the second feature extraction network according to the loss function until a convergence condition is met.

In this embodiment of the present application, the training device may iteratively train the first feature extraction network and the second feature extraction network according to the loss function until the convergence condition is satisfied. Specifically, the training device generates a gradient value according to the function value of the loss function and uses the gradient value for back propagation to update the neuron weights of the first feature extraction network and the second feature extraction network, thereby completing one training pass over the two networks.
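A minimal sketch of one such update, reusing the hypothetical modules from the sketch above; the optimizer choice and learning rate are illustrative, not mandated by this application:

```python
optimizer = torch.optim.SGD(list(f_r.parameters()) + list(f_n.parameters()), lr=0.1)

def train_step(loss: torch.Tensor) -> None:
    """One training pass: back-propagate the loss value and update the
    neuron weights of both feature extraction networks."""
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```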
The convergence condition may be that the loss function converges, that the number of iterations reaches a preset number, or the like. The loss function is used to indicate the similarity between a classification category and an annotation category; this similarity can also be understood as the difference between the classification category and the annotation category. Since steps 301 to 307 and step 313 are all optional, steps 301 to 307 may all be executed, none of them may be executed, or some may be executed and others not; and if step 313 is executed, the training device may enter step 316 through step 315 or directly through step 313. The specific meaning of the loss function differs among these cases, which are described separately below.
In one case, if none of steps 301 to 307 is executed, step 313 is executed, and step 316 is entered through step 315 (that is, the sixth classification category differs from the first annotation category), step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to a first loss function. The first loss function is used to indicate the similarity between the first classification category and the first annotation category, and the similarity between the second classification category and the second annotation category. The first loss function may specifically be a cross-entropy loss function, a maximum-margin loss function, or another type of loss function, which is not limited here. For a more intuitive sense of the first loss function, one of its expressions is shown below:
L_AS(θ, x, y) = l(h_r(x_adv; θ_1), y_1) + l(h_n(x_adv; θ_2), ŷ);

where L_AS(θ, x, y) denotes the first loss function; l(h_r(x_adv; θ_1), y_1) denotes the similarity between the first classification category and the first annotation category; x_adv denotes the adversarial image; θ_1 denotes the weights in the first feature extraction network; y_1 denotes the first annotation category; l(h_n(x_adv; θ_2), ŷ) denotes the similarity between the second classification category and the second annotation category; θ_2 denotes the weights in the second feature extraction network; and ŷ denotes the second annotation category. It should be understood that this example of the first loss function is given only to facilitate understanding of the solution and is not intended to limit it.
Specifically, after obtaining the first classification category and the second classification category through steps 314 and 315, the training device generates the function value of the first loss function, obtains the gradient value corresponding to that function value, and uses this gradient value for back propagation to update the neuron weights of the first feature extraction network and the second feature extraction network, thereby completing one training pass over the two networks.
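Continuing the sketch above, the first loss function can be instantiated with cross entropy (one of the admissible choices named here) and used for one training pass; variable names carry over from the earlier hypothetical sketches:

```python
import torch.nn.functional as F

# L_AS = l(h_r(x_adv; θ1), y_1) + l(h_n(x_adv; θ2), ŷ), with cross entropy as l.
loss_as = F.cross_entropy(first_logits, y_1) + F.cross_entropy(second_logits, y_2)
train_step(loss_as)  # back propagation and weight update as described above
```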
In another case, if steps 301 to 305 are executed, steps 306 and 307 are not executed, step 313 is executed, and step 316 is entered through step 315, step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to the first loss function and a second loss function. The second loss function is used to indicate the similarity between the third classification category and a third annotation category, where the third annotation category is the correct category corresponding to the original image. In this embodiment of the present application, during training, not only are adversarial images used to train the feature extraction capabilities of the first and second feature extraction networks, but natural images are used as well, so as to further improve the accuracy of the trained first feature extraction network and the trained second feature extraction network when processing natural images.
Specifically, after obtaining the third annotation category through step 301, the third classification category through step 305, and the first and second classification categories through steps 314 and 315, the training device generates the function values of the first loss function and the second loss function. The training device may then generate a total function value from these two function values and use the total function value to perform one training pass over the first and second feature extraction networks. Specifically, the training device may add the two function values directly to obtain the total function value, or it may assign different weights to the two function values before adding them. For the specific steps of completing one training pass with a function value, refer to the description above, which is not repeated here.
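A small sketch of the two ways of forming the total function value (direct sum by default, or a weighted sum); the helper name and the example weights are hypothetical:

```python
def combine_losses(loss_values, weights=None):
    """Total function value: direct sum by default, weighted sum otherwise."""
    if weights is None:
        weights = [1.0] * len(loss_values)
    return sum(w * l for w, l in zip(weights, loss_values))

# e.g. total = combine_losses([loss_as, loss_2], weights=[1.0, 0.5])
```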
In another case, if steps 301 to 303 and step 306 are executed, steps 304, 305 and 307 are not executed, step 313 is executed, and step 316 is entered through step 315, step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to the first loss function and a third loss function. The third loss function is used to indicate the similarity between the fourth classification category and the third annotation category, where the third annotation category is the correct category corresponding to the original image. In this embodiment of the present application, adversarial images are used to train the feature extraction capabilities of both networks, and natural images are additionally used to train the first feature extraction network's ability to extract robust representations, so as to further improve the accuracy of the trained first feature extraction network.

Specifically, after obtaining the third annotation category through step 301, the fourth classification category through step 306, and the first and second classification categories through steps 314 and 315, the training device generates the function values of the first loss function and the third loss function and uses them to complete one training pass over the first and second feature extraction networks. For the specific implementation, refer to the description of training with the function values of the first and second loss functions, which is not repeated here.
In another case, if steps 301 to 303 and step 307 are executed, steps 304 to 306 are not executed, step 313 is executed, and step 316 is entered through step 315, step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to the first loss function and a fourth loss function. The fourth loss function is used to indicate the similarity between the fifth classification category and the third annotation category, where the third annotation category is the correct category corresponding to the original image. In this embodiment of the present application, adversarial images are used to train the feature extraction capabilities of both networks, and natural images are additionally used to train the second feature extraction network's ability to extract non-robust representations, so as to further improve the accuracy of the trained second feature extraction network.

Specifically, after obtaining the third annotation category through step 301, the fifth classification category through step 307, and the first and second classification categories through steps 314 and 315, the training device generates the function values of the first loss function and the fourth loss function and uses them to complete one training pass over the first and second feature extraction networks. For the specific implementation, refer to the description of training with the function values of the first and second loss functions, which is not repeated here.
In another case, if steps 301 to 306 are executed, step 307 is not executed, step 313 is executed, and step 316 is entered through step 315, step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to the first loss function, the second loss function and the third loss function. Specifically, after generating the function values of the first, second and third loss functions, the training device may generate a total function value from them and use the total function value to train the two networks. For the specific implementation, refer to the description of training with the function values of the first and second loss functions, which is not repeated here.
In another case, if steps 301 to 305 and step 307 are executed, step 306 is not executed, step 313 is executed, and step 316 is entered through step 315, step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to the first loss function, the second loss function and the fourth loss function. For the specific implementation, refer to the description above, which is not repeated here.
In another case, if steps 301 to 303 and steps 306 and 307 are executed, steps 304 and 305 are not executed, step 313 is executed, and step 316 is entered through step 315, step 316 may include: the training device iteratively trains the first feature extraction network and the second feature extraction network according to the first loss function, the third loss function and the fourth loss function until the convergence condition is met. For the specific implementation, refer to the description above, which is not repeated here.
In another case, if steps 301 to 307 are all executed, step 313 is executed, and step 316 is entered through step 315, step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to the first loss function and a fifth loss function. The fifth loss function is used to indicate the similarity between the third classification category and the third annotation category, the similarity between the fourth classification category and the third annotation category, and the similarity between the fifth classification category and the third annotation category, where the third annotation category is the correct category corresponding to the original image. The fifth loss function may specifically be a cross-entropy loss function, a maximum-margin loss function, or another type of loss function, which is not limited here. For a more intuitive sense of the fifth loss function, one of its expressions is shown below:
L_total(θ, x, y) = L_AS(θ, x, y) + L_ST(θ, x, y);

L_ST(θ, x, y) = l(h_s(x; θ_3), y_2) + l(h_r(x; θ_1), y_2) + l(h_n(x; θ_2), y_2);
where L_total(θ, x, y) denotes the total loss function; L_AS(θ, x, y) denotes the first loss function, whose meaning is described above and not repeated here; L_ST(θ, x, y) denotes the fifth loss function; l(h_s(x; θ_3), y_2) denotes the similarity between the third classification category and the third annotation category; x denotes the original image; θ_3 denotes the weights of the first feature extraction network and the second feature extraction network; y_2 denotes the third annotation category; l(h_r(x; θ_1), y_2) denotes the similarity between the fourth classification category and the third annotation category; and l(h_n(x; θ_2), y_2) denotes the similarity between the fifth classification category and the third annotation category. For the other symbols in the above formulas, refer to the description of the first loss function above. It should be understood that this example of the fifth loss function is given only to facilitate understanding of the solution and is not intended to limit it.
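Continuing the earlier hypothetical sketches, L_ST can be evaluated on a natural image and added to L_AS; cross entropy again stands in for l, and the batch and labels are placeholders:

```python
x_nat = torch.rand(8, 3, 32, 32)    # original (natural) images
y_nat = torch.randint(0, 10, (8,))  # third annotation category y_2

r, n = f_r(x_nat), f_n(x_nat)
loss_st = (F.cross_entropy(head_std(torch.cat([r, n], dim=1)), y_nat)  # l(h_s(x; θ3), y2)
           + F.cross_entropy(head_r(r), y_nat)                         # l(h_r(x; θ1), y2)
           + F.cross_entropy(head_n(n), y_nat))                        # l(h_n(x; θ2), y2)

# Recompute L_AS on the adversarial batch so both terms share one graph.
loss_as = F.cross_entropy(head_r(f_r(x_adv)), y_1) + F.cross_entropy(head_n(f_n(x_adv)), y_2)
loss_total = loss_as + loss_st      # L_total = L_AS + L_ST
train_step(loss_total)
```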
In this embodiment of the present application, while the ability of the first and second feature extraction networks to process adversarial images is improved, their ability to process natural images is improved as well. In other words, whether the input is a natural image or an adversarial image, the trained first feature extraction network and the trained second feature extraction network can accurately extract the robust representation and the non-robust representation, which broadens the application scenarios of this solution.
In another case, if none of steps 301 to 307 is executed and step 316 is entered through step 313, the training device no longer trains the first feature extraction network and the second feature extraction network according to the loss function; instead, it returns to step 308 to obtain a new adversarial image and a new first annotation category, that is, it enters a new round of the training process.
In another case, if steps 301 to 305 are executed, steps 306 and 307 are not executed, and step 316 is entered through step 313, step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to the second loss function.

In another case, if steps 301 to 303 and step 306 are executed, steps 304, 305 and 307 are not executed, and step 316 is entered through step 313, step 316 may include: the training device trains the first feature extraction network according to the third loss function.

In another case, if steps 301 to 303 and step 307 are executed, steps 304 to 306 are not executed, and step 316 is entered through step 313, step 316 may include: the training device trains the second feature extraction network according to the fourth loss function.

In another case, if steps 301 to 306 are executed, step 307 is not executed, and step 316 is entered through step 313, step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to the second loss function and the third loss function.

In another case, if steps 301 to 305 and step 307 are executed, step 306 is not executed, and step 316 is entered through step 313, step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to the second loss function and the fourth loss function.

In another case, if steps 301 to 303 and steps 306 and 307 are executed, steps 304 and 305 are not executed, and step 316 is entered through step 313, step 316 may include: the training device trains the first feature extraction network and the second feature extraction network according to the third loss function and the fourth loss function.

In another case, if steps 301 to 307 are all executed and step 316 is entered through step 313, step 316 may include: the training device iteratively trains the first feature extraction network and the second feature extraction network according to the fifth loss function.
In another case, if step 313 is not executed, for the specific implementation of step 316, refer to the descriptions of the above cases in which step 313 is executed and step 316 is entered through step 315, which are not repeated here.
In this embodiment of the present application, if the sixth classification category is the same as the first annotation category, this shows that the perturbation of the perturbed image is too slight: to the first and second feature extraction networks, processing it differs little from processing a natural image. Since the purpose of training here is to strengthen the ability of the two networks to separate robust and non-robust representations from heavily perturbed images, the subsequent training operations are performed only when the sixth classification category differs from the first annotation category, which improves the efficiency of the training process.
317. The training device outputs the trained first feature extraction network and the trained second feature extraction network.

In this embodiment of the present application, after determining that the convergence condition is met, the training device outputs the trained first feature extraction network and the trained second feature extraction network. They can serve as the feature extraction part of various image processing networks; that is, they can be combined with higher-level feature processing networks to implement various functions. As one example, these functions may include one or more of the following: image classification, image recognition, image segmentation, or image detection. As another example, the function may also be image-category judgment, for example judging whether an image is a natural image or an adversarial image.
In this embodiment of the present application, it was found during research that adversarial training makes the neural network extract only the robust representation from the input image and discard the non-robust representation, which lowers the accuracy of the neural network when processing original images. In this embodiment, by contrast, the trained first feature extraction network and the trained second feature extraction network can extract the robust representation and the non-robust representation of the input image separately, which both avoids mixing the two (and the resulting loss of robustness) and retains both representations of the input image (avoiding the loss of accuracy), thereby improving the robustness and the accuracy of the neural network at the same time.
II. Inference phase
In this embodiment of the present application, the inference phase refers to the process in which the above execution device 210 uses the trained image processing network to process an input image. The embodiments corresponding to FIG. 3 yield the trained first feature extraction network and the trained second feature extraction network, and these, combined with various higher-level feature processing network layers, can implement a variety of functions, as already introduced in step 317. The two broad categories of image processing network mentioned in step 317 are introduced separately below.
The first category is image processing networks whose processing target is an object in the image, that is, the functions mentioned above: image classification, image recognition, image segmentation, or image detection. Referring to FIG. 6, which is a schematic flowchart of an image processing method provided by an embodiment of the present application, the image processing method may include:
601. The execution device acquires a first image.

In this embodiment of the present application, the execution device may capture the first image in real time, obtain it from a gallery stored on the execution device, or download it over a wireless or wired network. The first image may be an original image or an adversarial image. Since the execution device may specifically be a mobile phone, a computer, a wearable device, an autonomous vehicle, a smart home appliance, a chip, or the like, different forms of execution device may acquire the first image in different ways. As one example, if the execution device is a mobile phone, it may capture the first image with the phone's camera or download it with a browser. As another example, if the execution device is an autonomous vehicle, the vehicle may acquire the first image through its sensors, and so on. The specific way the execution device acquires the first image can be determined in light of the actual application scenario and product, which is not described further here.
602. The execution device inputs the first image into the first feature extraction network to obtain a third robust representation generated by the first feature extraction network.

In this embodiment of the present application, after acquiring the first image, the execution device inputs the first image into the first feature extraction network, so that the first feature extraction network generates, from the input first image, the third robust representation corresponding to it. For the specific form of the first feature extraction network and the meaning of a robust representation, refer to the description in the embodiment corresponding to FIG. 3, which is not repeated here.
603. The execution device inputs the first image into the second feature extraction network to obtain a third non-robust representation generated by the second feature extraction network.

In this embodiment of the present application, after acquiring the first image, the execution device inputs the first image into the second feature extraction network, so that the second feature extraction network generates, from the input first image, the third non-robust representation corresponding to it. For the specific form of the second feature extraction network and the meaning of a non-robust representation, refer to the description in the embodiment corresponding to FIG. 3, which is not repeated here.
604. In a first case, the execution device combines the third robust representation and the third non-robust representation to obtain a combined fourth representation.

In this embodiment of the present application, in the first case, the execution device combines the third robust representation and the third non-robust representation to obtain the combined fourth representation. For the manner of combination and the specific implementation of step 604, refer to the description of step 304 in the embodiment corresponding to FIG. 3. The first case may refer to a situation in which high accuracy is required of the image processing network's output; the specific situation can be determined in light of the actual application scenario and is not limited here.
605. The execution device outputs, through the feature processing network, a first processing result corresponding to the first image according to the combined fourth representation.

In this embodiment of the present application, after obtaining the combined fourth representation, the execution device inputs it into the feature processing network, so that the feature processing network outputs the first processing result corresponding to the first image according to the combined fourth representation. The specific implementation of the feature processing network and the specific form of the first processing result both depend on the function of the overall image processing network. As one example, if the function of the image processing network is image classification, the feature processing network may be a classification network, and the first processing result indicates the classification category of the whole image; further, the classification network may specifically be a neural network including at least one perceptron, for example a two-layer fully connected perceptron. As another example, if the function is image recognition, the feature processing network may be a recognition network, and the first processing result indicates the content recognized from the image, for example the text in the image. As a further example, if the function is image segmentation, the feature processing network may include a classification network that generates the classification category of each pixel in the image; the image is then segmented using these per-pixel categories, and the first processing result is the segmented image. As yet another example, if the function is image detection, the first processing result may specifically be a detection result indicating which objects are included in the first image, that is, the object type of each of at least one object in the first image; optionally, the detection result may also include position information of each of these objects, and so on, which can be determined according to actual product requirements and is not exhaustively listed here. This embodiment of the present application thus provides multiple specific implementations of the image processing network, which broadens the application scenarios of this solution and improves its implementation flexibility.
Specifically, if the feature processing network is a classification network, the classification network on the execution device may perform a classification operation according to the combined fourth representation and output the classification category corresponding to the first image. If the feature processing network is a recognition network, the recognition network on the execution device may perform a recognition operation according to the combined fourth representation and output the recognition result corresponding to the first image, and so on. Not all application scenarios are exhaustively listed here.
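As a minimal sketch of steps 601 to 605 with a classification head, carrying over the hypothetical modules from the training sketches above (the module names are assumptions, not fixed by this application):

```python
def standard_path(x: torch.Tensor) -> torch.Tensor:
    """Standard path: classify the first image from the combined representation."""
    r3 = f_r(x)                          # third robust representation      (step 602)
    n3 = f_n(x)                          # third non-robust representation  (step 603)
    fourth_rep = torch.cat([r3, n3], 1)  # combined fourth representation   (step 604)
    return head_std(fourth_rep).argmax(dim=1)  # first processing result    (step 605)
```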
606. In a second case, the execution device outputs, through the feature processing network, the first processing result corresponding to the first image according to the third robust representation.

In this embodiment of the present application, in the second case, the execution device may input the third robust representation into the feature processing network, so that the feature processing network outputs the first processing result corresponding to the first image according to the third robust representation. For the specific implementation, refer to the description of step 306 in the embodiment corresponding to FIG. 3. The first case and the second case are different situations: the second case may refer to a situation in which high robustness is required of the image processing network's output, or a situation in which the image processing network is in a high-risk state, that is, the probability that the input image is a perturbed image is high. The specific situation can be determined in light of the actual application scenario and is not limited here.
Specifically, if the feature processing network is a classification network, the classification network on the execution device may perform a classification operation according to the third robust representation and output the classification category corresponding to the first image. If the feature processing network is a recognition network, the recognition network on the execution device may perform a recognition operation according to the third robust representation and output the recognition result corresponding to the first image, and so on. Not all application scenarios are exhaustively listed here.

In this embodiment of the present application, the image processing network includes both a robust path and a standard path, and the user can flexibly choose which path to use according to the actual situation, which broadens the application scenarios of this solution and improves its implementation flexibility.
607. In a third case, the execution device outputs, through the feature processing network, the first processing result corresponding to the first image according to the third non-robust representation.

In some embodiments of the present application, in the third case, the execution device may also input the third non-robust representation into the feature processing network, so that the feature processing network outputs the first processing result corresponding to the first image according to the third non-robust representation, where the third case is different from the first case and the second case. For the specific implementation, refer to the description of step 307 in the embodiment corresponding to FIG. 3.

Specifically, if the feature processing network is a classification network, the classification network on the execution device may perform a classification operation according to the third non-robust representation and output the classification category corresponding to the first image. If the feature processing network is a recognition network, the recognition network on the execution device may perform a recognition operation according to the third non-robust representation and output the recognition result corresponding to the first image, and so on. Not all application scenarios are exhaustively listed here. In this embodiment of the present application, the provided image processing method is applied to the specific scenario of image classification, which improves the degree of integration with the application scenario.
It should be noted that step 607 is optional; if step 607 is not performed, execution may end after step 605 or after step 606. In addition, steps 605, 606 and 607 are presented above as alternatives, but in some embodiments they may also be performed together. As an example, steps 605 and 606 may both be performed, or steps 605 and 607, or steps 606 and 607, or steps 605 to 607 may all be performed, and so on. Which steps are performed can be determined in light of the specific application scenario and is not limited here.
For a further understanding of this solution, refer to FIG. 7, which is a schematic diagram of the image processing network in the image processing method provided by an embodiment of the present application. FIG. 7 takes the image processing network being an image classification network as an example; the image processing network includes the first feature extraction network, the second feature extraction network, and a classification network. The first image is input into the first feature extraction network and the second feature extraction network respectively, yielding the robust representation generated by the first feature extraction network and the non-robust representation generated by the second feature extraction network. The classification network in FIG. 7 includes three paths: a robust path, a standard path and a non-robust path. On the robust path, the classification network performs the classification operation according to the robust representation; on the standard path, the robust representation and the non-robust representation are combined and the classification network performs the classification operation according to the combined representation; on the non-robust path, the classification network performs the classification operation according to the non-robust representation. In FIG. 7, the robust path, the standard path and the non-robust path use the same classification network; it should be understood that the example in FIG. 7 is only for ease of understanding, and in other implementations the three paths may use three different classification networks.
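A sketch of FIG. 7's three paths as a single dispatch function, again with the hypothetical modules above; FIG. 7 itself uses one shared classification network, so the separate heads shown here are only one possible realization:

```python
def classify(x: torch.Tensor, path: str = "standard") -> torch.Tensor:
    """Select among the robust, standard and non-robust paths of FIG. 7."""
    r, n = f_r(x), f_n(x)
    if path == "robust":      # step 606: classify from the robust representation only
        return head_r(r).argmax(dim=1)
    if path == "non_robust":  # step 607: classify from the non-robust representation only
        return head_n(n).argmax(dim=1)
    return head_std(torch.cat([r, n], dim=1)).argmax(dim=1)  # steps 604 and 605
```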
The second category is image processing networks whose processing target is the image itself, that is, an image processing network used to determine whether the input is a natural image or an adversarial image.

Referring to FIG. 8, which is a schematic flowchart of an image processing method provided by an embodiment of the present application, the image processing method may include:
801. The execution device acquires a first image.

802. The execution device inputs the first image into the first feature extraction network to obtain a third robust representation generated by the first feature extraction network.

803. The execution device inputs the first image into the second feature extraction network to obtain a third non-robust representation generated by the second feature extraction network.

In this embodiment of the present application, for the specific implementation of steps 801 to 803, refer to the description of steps 601 to 603 in the embodiment corresponding to FIG. 6, which is not repeated here.
804. The execution device outputs, through the feature processing network, a first processing result corresponding to the first image according to the third robust representation and the third non-robust representation, where the first processing result indicates that the first image is an original image, or the first processing result indicates that the first image is a perturbed image.

In this embodiment of the present application, after obtaining the third robust representation and the third non-robust representation, the execution device may output the first processing result through the feature processing network according to the third robust representation and the third non-robust representation. The feature processing network may include at least one perceptron; for the meaning of a perceptron, refer to the description of step 305 in the embodiment corresponding to FIG. 3.
Specifically, in one implementation, step 804 may include: after obtaining the third robust representation and the third non-robust representation, the execution device inputs them into the feature processing network, so that the feature processing network determines a seventh classification category corresponding to the first image according to the robust representation and an eighth classification category corresponding to the first image according to the non-robust representation. More specifically, in one case the feature processing network may include one classification network, and the execution device uses it to perform two classification operations in sequence, obtaining the seventh classification category and the eighth classification category respectively. In another case, the feature processing network may include two classification networks, and the execution device uses them to perform the two classification operations in parallel, obtaining the seventh classification category and the eighth classification category respectively.
The execution device then judges, through the feature processing network, whether the seventh classification category and the eighth classification category are consistent. If the seventh classification category is consistent with the eighth classification category, the first processing result output by the feature processing network indicates that the first image is an original image; if they are inconsistent, the first processing result output by the feature processing network indicates that the first image is a perturbed image. The first processing result may specifically take the form of text, for example "natural image" or "adversarial image". The first processing result may also take the form of characters. As one example, the first processing result may be "0 0.3 1 0.7", where 0 refers to a natural image and 1 refers to an adversarial image, that is, the probability of a natural image is 0.3 and the probability of an adversarial image is 0.7, so the first processing result indicates that the first image is an adversarial image. As another example, the first processing result may be "0.3 0.7", where 0.3 indicates the probability that the first image is a natural image and 0.7 indicates the probability that it is an adversarial image, so the first processing result again indicates that the first image is an adversarial image. It should be understood that these examples of the first processing result are only for ease of understanding and are not intended to limit this solution.
In this embodiment of the present application, whether the first image is an original image or an adversarial image is determined by judging whether the seventh classification category and the eighth classification category are consistent; the method is simple and highly practicable.
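A sketch of this consistency check, with the hypothetical modules as above; the two heads may equally be one classification network applied twice:

```python
def detect_adversarial(x: torch.Tensor) -> torch.Tensor:
    """First processing result of FIG. 8: True marks a perturbed image."""
    seventh_cat = head_r(f_r(x)).argmax(dim=1)  # seventh classification category
    eighth_cat = head_n(f_n(x)).argmax(dim=1)   # eighth classification category
    return seventh_cat != eighth_cat  # inconsistent -> perturbed; consistent -> original
```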
在另一种情况下,步骤804可以包括:将第三鲁棒表示和第三非鲁棒表示组合,并根据组合后的第五表示执行检测操作,以输出与第一图像对应的检测结果,检测结果为一种第一处理结果。其中,组合的方式以及组合后的第五表示的具体表现方式均可以参阅图3对应实施例中的描述。检测网络中可以包括至少一个感知机,感知机的含义可以参阅图3对应实施例中对步骤305的描述。检测结果的具体表现形式可以参阅上种情况中对第一处理结果的描述,此处均不做赘述。In another case, step 804 may include: combining the third robust representation and the third non-robust representation, and performing a detection operation according to the combined fifth representation to output a detection result corresponding to the first image, The detection result is a first processing result. For the combination mode and the specific expression mode of the fifth representation after the combination, please refer to the description in the embodiment corresponding to FIG. 3. The detection network may include at least one perceptron. For the meaning of the perceptron, refer to the description of step 305 in the embodiment corresponding to FIG. 3. For the specific manifestation of the detection result, please refer to the description of the first processing result in the previous case, which will not be repeated here.
This embodiment of this application provides another implementation of determining whether the first image is an original image or an adversarial image, which enhances the implementation flexibility of the solution.
In this embodiment of this application, the feature information extracted by the first feature extraction network and the second feature extraction network can be used not only to obtain a processing result corresponding to an object in the image, but also to obtain a processing result corresponding to the image as a whole, that is, to determine whether the image is an original image or a perturbed image, which expands the application scenarios of this solution.
In the embodiments of this application, the inventors found during research that extracting only the robust representation while discarding the non-robust representation reduces the accuracy of the neural network when it processes original images. In the embodiments of this application, the robust representation and the non-robust representation in the input image are extracted by the first feature extraction network and the second feature extraction network respectively, which avoids mixing the two (which would reduce robustness) while retaining both representations of the input image (which avoids the drop in accuracy), thereby improving the robustness and the accuracy of the neural network at the same time.
To provide a more intuitive understanding of the beneficial effects brought by the embodiments of this application, these beneficial effects are further described below with reference to the data in the following tables.
Table 1

                                     S      R      N
  Adversarial training               89.0   89.0   10.0
  Iterative optimization             86.8   79.9   81.9
  Embodiments of this application    94.8   91.8   93.8
Table 1 takes as an example the case where the first feature extraction network and the second feature extraction network are both the feature extraction part of WRNS34. S refers to the standard data set, which may include both natural images and adversarial images; R refers to the adversarial data set, which includes only adversarial images; N refers to the natural data set, which includes only natural images. Adversarial training (AT) and iterative optimization are two existing training approaches. As shown by the data in Table 1, when processing images from the S, R, and N data sets, the training method provided in the embodiments of this application achieves the highest accuracy in every case; that is, the embodiments of this application provide a training scheme that improves robustness and accuracy at the same time.
In addition, experiments were also conducted on a data set in which the ratio of natural samples to adversarial samples is one to one, that is, the image processing network corresponding to FIG. 8 is used to predict whether an input image is a natural image or an adversarial image, and an image processing network trained by iterative optimization is used to make the same prediction. The results are as follows:
Table 2

                        Iterative optimization    Embodiments of this application
  Detection accuracy    4.9                       64.8
Table 2 takes as an example the case where the first feature extraction network and the second feature extraction network are both the feature extraction part of WRNS34. Detection accuracy refers to the proportion of images whose prediction matches the actual situation among all input images. Clearly, the image processing network obtained through the training method provided in the embodiments of this application achieves a much higher detection accuracy than the network trained by iterative optimization.
On the basis of the embodiments corresponding to FIG. 1 to FIG. 8, in order to better implement the above solutions of the embodiments of this application, related devices for implementing the above solutions are further provided below. Referring to FIG. 9, FIG. 9 is a schematic structural diagram of a neural network training apparatus provided by an embodiment of this application. The neural network training apparatus 900 may include an input module 901 and a training module 902. The input module 901 is configured to input an adversarial image into a first feature extraction network and a second feature extraction network respectively, to obtain a first robust representation generated by the first feature extraction network and a first non-robust representation generated by the second feature extraction network, where the adversarial image is an image obtained by performing perturbation processing on an original image, a robust representation refers to features that are insensitive to perturbation, and a non-robust representation refers to features that are sensitive to perturbation. The input module 901 is further configured to input the first robust representation into a classification network to obtain a first classification category output by the classification network, and input the first non-robust representation into the classification network to obtain a second classification category output by the classification network. The training module 902 is configured to iteratively train the first feature extraction network and the second feature extraction network according to a first loss function until a convergence condition is met, and output the trained first feature extraction network and the trained second feature extraction network. The first loss function is used to represent the similarity between the first classification category and a first annotation category and the similarity between the second classification category and a second annotation category, where the first annotation category is the correct category corresponding to the adversarial image, and the second annotation category is the incorrect category corresponding to the adversarial image.
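For illustration only, the training objective just described can be sketched as follows, under the assumption that the first loss function is a sum of two cross-entropy terms (the application does not fix a concrete loss form, so this is an assumed instantiation):

```python
import torch.nn.functional as F

def first_loss(classifier, robust_repr, non_robust_repr,
               correct_label, wrong_label):
    """Assumed form of the first loss function: the robust branch should
    predict the correct category of the adversarial image, while the
    non-robust branch should predict the (incorrect) category induced
    by the perturbation. Representations have shape (N, D) and labels
    have shape (N,)."""
    loss_robust = F.cross_entropy(classifier(robust_repr), correct_label)
    loss_non_robust = F.cross_entropy(classifier(non_robust_repr), wrong_label)
    return loss_robust + loss_non_robust
```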
In this embodiment of this application, the inventors found during research that adversarial training causes the neural network to extract only the robust representation from the input image while discarding the non-robust representation, which reduces the accuracy of the neural network when it processes original images. In this embodiment of this application, the robust representation and the non-robust representation in the input image are extracted by the first feature extraction network and the second feature extraction network respectively, which avoids mixing the two (which would reduce robustness) while retaining both representations of the input image (which avoids the drop in accuracy), thereby improving the robustness and the accuracy of the neural network at the same time.
In a possible design, referring to FIG. 10, FIG. 10 is a schematic structural diagram of a neural network training apparatus provided by an embodiment of this application. The input module 901 is further configured to input the original image into the first feature extraction network and the second feature extraction network respectively, to obtain a second robust representation generated by the first feature extraction network and a second non-robust representation generated by the second feature extraction network. The apparatus 900 further includes: a combining module 903, configured to combine the second robust representation and the second non-robust representation to obtain a combined first representation. The input module 901 is further configured to input the combined first representation into the classification network, so that the classification network performs a classification operation based on the combined first representation, to obtain a third classification category output by the classification network. The training module 902 is specifically configured to iteratively train the first feature extraction network and the second feature extraction network according to the first loss function and a second loss function until the convergence condition is met, where the second loss function is used to represent the similarity between the third classification category and a third annotation category, and the third annotation category is the correct category corresponding to the original image.
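Continuing the assumed cross-entropy formulation from the sketch above, the possible design just described adds a term on the combined representation of the original image. Again, the concatenation, the loss form, and the equal weighting are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def second_loss(classifier, robust_repr_nat, non_robust_repr_nat, true_label):
    """Assumed form of the second loss: the category predicted from the
    combined representation of the original (natural) image should match
    the correct category of that image."""
    combined = torch.cat([robust_repr_nat, non_robust_repr_nat], dim=-1)
    return F.cross_entropy(classifier(combined), true_label)

# Joint objective for this design (the 1:1 weighting is an assumption):
# total_loss = first_loss(...) + second_loss(...)
```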
In this embodiment of this application, during training, the training module 902 uses not only adversarial images but also natural images to train the feature extraction capabilities of the first feature extraction network and the second feature extraction network, so as to further improve the accuracy of the trained first feature extraction network and the trained second feature extraction network when processing natural images.
In a possible design, the input module 901 is further configured to input the original image into the first feature extraction network to obtain the second robust representation generated by the first feature extraction network. The input module 901 is further configured to input the second robust representation into the classification network, so that the classification network performs a classification operation based on the second robust representation, to obtain a fourth classification category output by the classification network. The training module 902 is specifically configured to iteratively train the first feature extraction network and the second feature extraction network according to the first loss function and a third loss function until the convergence condition is met, where the third loss function is used to represent the similarity between the fourth classification category and the third annotation category, and the third annotation category is the correct category corresponding to the original image.
In this embodiment of this application, the training module 902 uses adversarial images to train the feature extraction capabilities of the first feature extraction network and the second feature extraction network, and additionally uses natural images to train the first feature extraction network's ability to extract robust representations, so as to further improve the accuracy of the trained first feature extraction network.
In a possible design, the input module 901 is further configured to input the original image into the second feature extraction network to obtain the second non-robust representation generated by the second feature extraction network. The input module 901 is further configured to input the second non-robust representation into the classification network, so that the classification network performs a classification operation based on the second non-robust representation, to obtain a fifth classification category output by the classification network. The training module 902 is specifically configured to iteratively train the first feature extraction network and the second feature extraction network according to the first loss function and a fourth loss function until the convergence condition is met, where the fourth loss function is used to represent the similarity between the fifth classification category and the third annotation category, and the third annotation category is the correct category corresponding to the original image.
In this embodiment of this application, the training module 902 uses adversarial images to train the feature extraction capabilities of the first feature extraction network and the second feature extraction network, and additionally uses natural images to train the second feature extraction network's ability to extract non-robust representations, so as to further improve the accuracy of the trained second feature extraction network.
In a possible design, the input module 901 is further configured to input the original image into the first feature extraction network and the second feature extraction network respectively, to obtain the second robust representation generated by the first feature extraction network and the second non-robust representation generated by the second feature extraction network. The input module 901 is further configured to combine the second robust representation and the second non-robust representation to obtain the combined first representation, and input the combined first representation into the classification network, so that the classification network performs a classification operation based on the combined first representation, to obtain the third classification category output by the classification network. The input module 901 is further configured to input the second robust representation into the classification network, so that the classification network performs a classification operation based on the second robust representation, to obtain the fourth classification category output by the classification network. The input module 901 is further configured to input the second non-robust representation into the classification network, so that the classification network performs a classification operation based on the second non-robust representation, to obtain the fifth classification category output by the classification network. The training module 902 is specifically configured to iteratively train the first feature extraction network and the second feature extraction network according to the first loss function and a fifth loss function until the convergence condition is met, where the fifth loss function is used to represent the similarity between the third classification category and the third annotation category, the similarity between the fourth classification category and the third annotation category, and the similarity between the fifth classification category and the third annotation category, and the third annotation category is the correct category corresponding to the original image.
In this embodiment of this application, while the processing capability of the first feature extraction network and the second feature extraction network for adversarial images is improved, their processing capability for natural images is also improved; that is, whether the input is a natural image or an adversarial image, the trained first feature extraction network and the trained second feature extraction network can both accurately extract the robust representation and the non-robust representation, which expands the application scenarios of this solution.
In a possible design, referring to FIG. 10, the apparatus further includes a generation module 904, specifically configured to: generate a first gradient according to the function value of the second loss function; perform perturbation processing on the original image according to the first gradient to generate the adversarial image; and determine the third annotation category as the first annotation category.
In this embodiment of this application, the generation module 904 generates the first gradient according to the similarity between the third classification category and the third annotation category, and perturbs the original image according to the first gradient, which makes the perturbation processing more targeted, helps accelerate the training process of the first feature extraction network and the second feature extraction network, and improves the efficiency of the training process.
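A common way to realize "perturb the original image according to a gradient of a loss" is a signed-gradient step in the style of FGSM. The sketch below illustrates this under the assumption that such a step is used here; the step size epsilon, the sign rule, and the pixel range are assumptions of the sketch, not values fixed by this application:

```python
import torch

def perturb_image(image: torch.Tensor, loss: torch.Tensor,
                  epsilon: float = 8 / 255) -> torch.Tensor:
    """Generate an adversarial image from the gradient of `loss`
    with respect to `image` (image.requires_grad must be True)."""
    grad, = torch.autograd.grad(loss, image)
    adversarial = image + epsilon * grad.sign()  # step along the gradient
    return adversarial.clamp(0.0, 1.0).detach()  # keep valid pixel range
```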
In a possible design, referring to FIG. 10, the apparatus further includes a generation module 904, specifically configured to: generate a second gradient according to the function value of the third loss function; perform perturbation processing on the original image according to the second gradient to generate the adversarial image; and determine the third annotation category as the first annotation category.
In this embodiment of this application, the generation module 904 perturbs the original image according to the similarity between the fourth classification category (output by the classification network based on the second robust representation) and the third annotation category, which makes the perturbation processing more targeted with respect to the first feature extraction network and helps improve the first feature extraction network's ability to extract robust representations.
In a possible design, referring to FIG. 10, the apparatus further includes a generation module 904, specifically configured to: generate a third gradient according to the function value of the fourth loss function; perform perturbation processing on the original image according to the third gradient to generate the adversarial image; and determine the third annotation category as the first annotation category.
In this embodiment of this application, the generation module 904 perturbs the original image according to the similarity between the fifth classification category (output by the classification network based on the second non-robust representation) and the third annotation category, which makes the perturbation processing more targeted with respect to the second feature extraction network and helps improve the second feature extraction network's ability to extract non-robust representations.
In a possible design, referring to FIG. 10, the apparatus 900 further includes: a combining module 903, configured to combine the first robust representation and the first non-robust representation to obtain a combined second representation. The input module 901 is further configured to input the combined second representation into the classification network to obtain a sixth classification category output by the classification network. The input module 901 is specifically configured to: when the sixth classification category is different from the first annotation category, input the first robust representation into the classification network to obtain the first classification category output by the classification network, and input the first non-robust representation into the classification network to obtain the second classification category output by the classification network.
In this embodiment of this application, if the sixth classification category is the same as the first annotation category, it shows that the perturbation of the perturbed image is too slight, in which case the first feature extraction network and the second feature extraction network process the image in almost the same way as a natural image. Since the purpose of training here is to strengthen the ability of the first feature extraction network and the second feature extraction network to separate robust representations from non-robust representations in images with larger perturbations, the subsequent training operations are performed only when the sixth classification category is different from the first annotation category, so as to improve the efficiency of the training process.
In a possible design, referring to FIG. 10, the apparatus 900 further includes: a determining module 905, configured to determine the sixth classification category as the second annotation category when the sixth classification category is different from the first annotation category. This embodiment of this application provides a way of obtaining the second annotation category that is simple to operate, requires no additional steps, and saves computing resources.
In a possible design, the first feature extraction network is a convolutional neural network or a residual neural network, and the second feature extraction network is a convolutional neural network or a residual neural network. This embodiment of this application provides two specific implementations for each of the first feature extraction network and the second feature extraction network, which improves the implementation flexibility of the solution.
It should be noted that the information exchange and execution processes among the modules/units in the neural network training apparatus 900 are based on the same concept as the method embodiments corresponding to FIG. 3 to FIG. 5 in this application. For details, refer to the descriptions in the foregoing method embodiments of this application, which are not repeated here.
An embodiment of this application further provides an image processing network. Referring to FIG. 11, FIG. 11 is a schematic structural diagram of an image processing network provided by an embodiment of this application. The image processing network 1100 includes a first feature extraction network 1101, a second feature extraction network 1102, and a feature processing network 1103. The first feature extraction network 1101 is configured to receive an input first image and generate a robust representation corresponding to the first image, where a robust representation refers to features that are insensitive to perturbation. The second feature extraction network 1102 is configured to receive the input first image and generate a non-robust representation corresponding to the first image, where a non-robust representation refers to features that are sensitive to perturbation. The feature processing network 1103 is configured to obtain the robust representation and the non-robust representation, so as to output a first processing result corresponding to the first image.
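The overall data flow of such a network can be sketched as follows. The three sub-module classes are placeholders (any convolutional or residual backbone could stand in for the two extraction networks), so this is an illustrative composition rather than the concrete architecture of this application:

```python
import torch
import torch.nn as nn

class ImageProcessingNetwork(nn.Module):
    """Two parallel feature extractors feeding one feature processing head."""

    def __init__(self, robust_extractor: nn.Module,
                 non_robust_extractor: nn.Module,
                 feature_processor: nn.Module):
        super().__init__()
        self.robust_extractor = robust_extractor          # e.g. a ResNet trunk
        self.non_robust_extractor = non_robust_extractor  # e.g. a CNN trunk
        self.feature_processor = feature_processor        # classifier/detector

    def forward(self, first_image: torch.Tensor) -> torch.Tensor:
        robust_repr = self.robust_extractor(first_image)
        non_robust_repr = self.non_robust_extractor(first_image)
        # The feature processing network consumes both representations
        # and outputs the first processing result.
        return self.feature_processor(robust_repr, non_robust_repr)
```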
In this embodiment of this application, the robust representation and the non-robust representation in the input image are extracted by the first feature extraction network 1101 and the second feature extraction network 1102 respectively, which avoids mixing the two (which would reduce robustness) while retaining both representations of the input image (which avoids the drop in accuracy), thereby improving the robustness and the accuracy of the neural network at the same time.
In a possible design, the feature processing network 1103 is specifically configured to: in a first case, combine the robust representation and the non-robust representation and, based on the combined representation, output the first processing result corresponding to the first image; or, in a second case, output the first processing result corresponding to the first image based on the robust representation, where the first case and the second case are different cases; or output the first processing result corresponding to the first image based on the non-robust representation. In this embodiment of this application, the image processing network 1100 includes both a robust path and a standard path, and the user can flexibly choose which path to use according to the actual situation, which expands the application scenarios of this solution and improves its implementation flexibility.
In a possible design, the feature processing network 1103 is specifically configured to: perform a classification operation based on the combined representation and output a classification category corresponding to the first image; or perform a classification operation based on the robust representation and output a classification category corresponding to the first image; or perform a classification operation based on the non-robust representation and output a classification category corresponding to the first image. In this embodiment of this application, the provided image processing network 1100 is applied to the specific application scenario of image classification, which improves the degree of integration with the application scenario.
In a possible design, the first processing result indicates that the first image is an original image, or the first processing result indicates that the first image is a perturbed image. In this embodiment of this application, the feature information extracted by the first feature extraction network 1101 and the second feature extraction network 1102 can be used not only to obtain a processing result corresponding to an object in the image, but also to obtain a processing result corresponding to the image as a whole, that is, to determine whether the image is an original image or a perturbed image, which expands the application scenarios of this solution.
In a possible design, the feature processing network 1103 is specifically configured to: determine a first classification category corresponding to the first image based on the robust representation; determine a second classification category corresponding to the first image based on the non-robust representation; when the first classification category is consistent with the second classification category, output a first processing result indicating that the first image is an original image; and when the first classification category is inconsistent with the second classification category, output a first processing result indicating that the first image is a perturbed image.
In this embodiment of this application, the feature processing network 1103 determines whether the first image is an original image or an adversarial image by checking whether the first classification category is consistent with the second classification category. The method is simple and highly operable.
In a possible design, the feature processing network 1103 is specifically configured to combine the robust representation and the non-robust representation and perform a detection operation based on the combined representation, to output a detection result corresponding to the first image, where the first processing result includes the detection result. This embodiment of this application provides another implementation of determining whether the first image is an original image or an adversarial image, which enhances the implementation flexibility of the solution.
In a possible design, the image processing network 1100 is one or more of the following: an image classification network, an image recognition network, an image segmentation network, or an image detection network. This embodiment of this application provides multiple specific implementations of the image processing network 1100, which expands the application scenarios of this solution and improves its implementation flexibility.
In a possible design, the feature processing network 1103 includes a perceptron.
In a possible design, the first feature extraction network 1101 is a convolutional neural network or a residual neural network, and the second feature extraction network 1102 is a convolutional neural network or a residual neural network.
An embodiment of this application further provides an execution device. Referring to FIG. 12, FIG. 12 is a schematic structural diagram of the execution device provided by an embodiment of this application. The execution device 1200 may specifically be a mobile phone, a computer, a wearable device, a self-driving vehicle, a smart home appliance, a chip, or another form of device, which is not limited here. The image processing network 1100 described in the embodiment corresponding to FIG. 11 may be deployed on the execution device 1200 to implement the functions of the execution device in the embodiments corresponding to FIG. 6 to FIG. 8. The execution device 1200 includes: a receiver 1201, a transmitter 1202, a processor 1203, and a memory 1204 (the number of processors 1203 in the execution device 1200 may be one or more; one processor is taken as an example in FIG. 12), where the processor 1203 may include an application processor 12031 and a communication processor 12032. In some embodiments of this application, the receiver 1201, the transmitter 1202, the processor 1203, and the memory 1204 may be connected by a bus or in other ways.
The memory 1204 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1203. A part of the memory 1204 may further include a non-volatile random access memory (NVRAM). The memory 1204 stores operation instructions, executable modules, or data structures, or a subset or an extended set thereof, where the operation instructions may include various operation instructions for implementing various operations.
The processor 1203 controls the operation of the execution device. In a specific application, the components of the execution device are coupled together through a bus system, where the bus system may include a power bus, a control bus, a status signal bus, and the like in addition to a data bus. For clarity of description, however, the various buses are all referred to as the bus system in the figure.
The methods disclosed in the foregoing embodiments of this application may be applied to the processor 1203 or implemented by the processor 1203. The processor 1203 may be an integrated circuit chip with signal processing capability. In an implementation process, the steps of the foregoing methods may be completed by an integrated logic circuit of hardware in the processor 1203 or by instructions in the form of software. The processor 1203 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 1203 may implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in the embodiments of this application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1204, and the processor 1203 reads information from the memory 1204 and completes the steps of the foregoing methods in combination with its hardware.
The receiver 1201 may be configured to receive input digital or character information and generate signal inputs related to the settings and function control of the execution device. The transmitter 1202 may be configured to output digital or character information through a first interface; the transmitter 1202 may further be configured to send instructions to a disk group through the first interface to modify data in the disk group; and the transmitter 1202 may further include a display device such as a display screen.
In this embodiment of this application, the processor 1203 is configured to perform the image processing method performed by the execution device in the embodiments corresponding to FIG. 6 to FIG. 8. Specifically, the application processor 12031 is configured to perform the following steps: inputting a first image into a first feature extraction network to obtain a robust representation corresponding to the first image generated by the first feature extraction network, where a robust representation refers to features that are insensitive to perturbation; inputting the first image into a second feature extraction network to obtain a non-robust representation corresponding to the first image generated by the second feature extraction network, where a non-robust representation refers to features that are sensitive to perturbation; and outputting, through a feature processing network and based on the robust representation and the non-robust representation, a first processing result corresponding to the first image, where the first feature extraction network, the second feature extraction network, and the feature processing network belong to the same image processing network.
It should be noted that the application processor 12031 is further configured to perform other steps performed by the execution device in the method embodiments corresponding to FIG. 6 to FIG. 8. For the specific implementation of the image processing method performed by the application processor 12031 and the beneficial effects brought thereby, refer to the descriptions in the method embodiments corresponding to FIG. 2 to FIG. 8, which are not repeated here.
An embodiment of this application further provides a training device. Referring to FIG. 13, FIG. 13 is a schematic structural diagram of the training device provided by an embodiment of this application. The training apparatus 900 described in the embodiments corresponding to FIG. 9 and FIG. 10 may be deployed on the training device 1300 to implement the functions of the training device in the embodiments corresponding to FIG. 3 and FIG. 5. Specifically, the training device 1300 is implemented by one or more servers, and may vary considerably depending on configuration or performance. It may include one or more central processing units (CPU) 1322 (for example, one or more processors), a memory 1332, and one or more storage media 1330 (for example, one or more mass storage devices) storing application programs 1342 or data 1344. The memory 1332 and the storage medium 1330 may be transient storage or persistent storage. The program stored in the storage medium 1330 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the training device. Further, the central processing unit 1322 may be configured to communicate with the storage medium 1330 and execute, on the training device 1300, the series of instruction operations in the storage medium 1330.
The training device 1300 may further include one or more power supplies 1326, one or more wired or wireless network interfaces 1350, one or more input/output interfaces 1358, and/or one or more operating systems 1341, such as Windows Server™, Mac OS X™, Unix™, Linux™, and FreeBSD™.
In this embodiment of this application, the central processing unit 1322 is configured to perform the neural network training method performed by the training device in the embodiment corresponding to FIG. 3. Specifically, the central processing unit 1322 is configured to: input an adversarial image into a first feature extraction network and a second feature extraction network respectively, to obtain a first robust representation generated by the first feature extraction network and a first non-robust representation generated by the second feature extraction network, where the adversarial image is an image that has undergone perturbation processing, a robust representation refers to features that are insensitive to perturbation, and a non-robust representation refers to features that are sensitive to perturbation; input the first robust representation into a classification network to obtain a first classification category output by the classification network, and input the first non-robust representation into the classification network to obtain a second classification category output by the classification network; and iteratively train the first feature extraction network and the second feature extraction network according to a first loss function until a convergence condition is met, and output the trained first feature extraction network and the trained second feature extraction network. The first loss function is used to represent the similarity between the first classification category and a first annotation category and the similarity between the second classification category and a second annotation category, where the first annotation category is the correct category corresponding to the adversarial image, and the second annotation category is the incorrect category corresponding to the adversarial image.
It should be noted that the central processing unit 1322 is further configured to perform other steps performed by the training device in the embodiment corresponding to FIG. 3. For the specific implementation of the neural network training method performed by the central processing unit 1322 and the beneficial effects brought thereby, refer to the descriptions in the method embodiments corresponding to FIG. 3, which are not repeated here.
An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium stores a program that, when run on a computer, causes the computer to perform the steps performed by the training device in the methods described in the embodiments shown in FIG. 3 to FIG. 5.
An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium stores a program that, when run on a computer, causes the computer to perform the steps performed by the execution device in the methods described in the embodiments shown in FIG. 6 to FIG. 8.
An embodiment of this application further provides a computer program product that, when run on a computer, causes the computer to perform the steps performed by the training device in the methods described in the embodiments shown in FIG. 3 to FIG. 5, or causes the computer to perform the steps performed by the execution device in the methods described in the embodiments shown in FIG. 6 to FIG. 8.
An embodiment of this application further provides a circuit system. The circuit system includes a processing circuit configured to perform the steps performed by the training device in the methods described in the embodiments shown in FIG. 3 to FIG. 5, or configured to perform the steps performed by the execution device in the methods described in the embodiments shown in FIG. 6 to FIG. 8.
The neural network training apparatus or the execution device provided in the embodiments of this application may specifically be a chip. The chip includes a processing unit and a communication unit. The processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit may execute computer-executable instructions stored in a storage unit, so that a chip in the training device performs the neural network training method described in the embodiments shown in FIG. 3 to FIG. 5, or so that a chip in the execution device performs the image processing method described in the embodiments shown in FIG. 6 to FIG. 8. Optionally, the storage unit is a storage unit in the chip, such as a register or a cache; alternatively, the storage unit may be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
Specifically, referring to FIG. 14, FIG. 14 is a schematic structural diagram of a chip provided by an embodiment of this application. The chip may be embodied as a neural-network processing unit NPU 140. The NPU 140 is mounted on a host CPU as a coprocessor, and the host CPU assigns tasks to it. The core part of the NPU is an arithmetic circuit 1403, and a controller 1404 controls the arithmetic circuit 1403 to extract matrix data from memory and perform multiplication operations.
In some implementations, the arithmetic circuit 1403 internally includes multiple processing engines (PE). In some implementations, the arithmetic circuit 1403 is a two-dimensional systolic array. The arithmetic circuit 1403 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1403 is a general-purpose matrix processor.
For example, assume there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from a weight memory 1402 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit fetches the data of matrix A from an input memory 1401, performs a matrix operation with matrix B, and stores partial or final results of the resulting matrix in an accumulator 1408.
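The accumulate-as-you-go computation described here can be pictured with the following sketch, which mimics in plain Python how partial products of A and B build up in an accumulator. It is purely illustrative and does not model the actual PE array or dataflow of the NPU:

```python
def matmul_with_accumulator(A, B):
    """Compute C = A x B, accumulating partial results in an accumulator
    buffer, in the spirit of the arithmetic circuit and accumulator 1408."""
    rows, inner, cols = len(A), len(B), len(B[0])
    accumulator = [[0.0] * cols for _ in range(rows)]
    for k in range(inner):              # one "pass" per slice of A and B
        for i in range(rows):
            for j in range(cols):
                accumulator[i][j] += A[i][k] * B[k][j]  # partial result
    return accumulator                  # final result of the matrix product
```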
A unified memory 1406 is configured to store input data and output data. Weight data is transferred to the weight memory 1402 through a direct memory access controller (DMAC) 1405. Input data is also transferred to the unified memory 1406 through the DMAC.
The BIU is the bus interface unit 1410, which is used for interaction between the AXI bus and the DMAC and an instruction fetch buffer (IFB) 1409.
The bus interface unit 1410 (BIU) is used by the instruction fetch buffer 1409 to obtain instructions from an external memory, and is also used by the storage unit access controller 1405 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly configured to transfer input data from the external memory DDR to the unified memory 1406, to transfer weight data to the weight memory 1402, or to transfer input data to the input memory 1401.
A vector calculation unit 1407 includes multiple arithmetic processing units, and further processes the output of the arithmetic circuit when necessary, for example performing vector multiplication, vector addition, exponential operations, logarithmic operations, and size comparison. It is mainly used for network computation of layers other than convolutional/fully connected layers in a neural network, such as batch normalization, pixel-level summation, and upsampling of feature planes.
In some implementations, the vector calculation unit 1407 can store processed output vectors to the unified memory 1406. For example, the vector calculation unit 1407 may apply a linear function and/or a non-linear function to the output of the arithmetic circuit 1403, for example to linearly interpolate feature planes extracted by a convolutional layer, or, for another example, to a vector of accumulated values, so as to generate activation values. In some implementations, the vector calculation unit 1407 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 1403, for example for use in a subsequent layer of the neural network.
An instruction fetch buffer 1409 connected to the controller 1404 is configured to store instructions used by the controller 1404.
The unified memory 1406, the input memory 1401, the weight memory 1402, and the instruction fetch buffer 1409 are all on-chip memories. The external memory is private to the NPU hardware architecture.
The operations of the layers in a recurrent neural network may be performed by the arithmetic circuit 1403 or the vector calculation unit 1407.
The processor mentioned in any of the foregoing may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits configured to control program execution of the method of the foregoing first aspect.
It should additionally be noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. In addition, in the drawings of the apparatus embodiments provided in this application, the connection relationship between modules indicates that they have a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
From the description of the foregoing implementations, a person skilled in the art can clearly understand that this application can be implemented by software plus necessary general-purpose hardware, and can certainly also be implemented by dedicated hardware including dedicated integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, any function completed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structures used to achieve the same function can also be diverse, such as analog circuits, digital circuits, or dedicated circuits. However, for this application, a software program implementation is the better implementation in most cases. Based on such an understanding, the technical solutions of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of this application.
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, it can be implemented in the form of a computer program product in whole or in part.
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(Solid State Disk,SSD))等。The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the processes or functions described in the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, server, or data center. Transmission to another website, computer, server or data center via wired (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, wireless, microwave, etc.). The computer-readable storage medium may be any available medium that can be stored by a computer or a data storage device such as a server or a data center integrated with one or more available media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).

Claims (44)

  1. A neural network training method, characterized in that the method comprises:
    inputting an adversarial image into a first feature extraction network and a second feature extraction network respectively, to obtain a first robust representation generated by the first feature extraction network and a first non-robust representation generated by the second feature extraction network, wherein the adversarial image is an image obtained by performing perturbation processing on an original image, a robust representation refers to a feature that is insensitive to perturbation, and a non-robust representation refers to a feature that is sensitive to perturbation;
    inputting the first robust representation into a classification network to obtain a first classification category output by the classification network, and inputting the first non-robust representation into the classification network to obtain a second classification category output by the classification network;
    performing iterative training on the first feature extraction network and the second feature extraction network according to a first loss function until a convergence condition is met, and outputting the trained first feature extraction network and the trained second feature extraction network;
    wherein the first loss function is used to represent a similarity between the first classification category and a first annotation category and a similarity between the second classification category and a second annotation category, the first annotation category being the correct category corresponding to the adversarial image, and the second annotation category being an incorrect category corresponding to the adversarial image.
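For illustration only (not part of the claims), the training step recited in claim 1 can be sketched in PyTorch as follows. The module names (extractor_r, extractor_nr, classifier) and the use of cross-entropy as the similarity measure are assumptions; the claim does not fix a particular loss form.

```python
import torch
import torch.nn.functional as F

def training_step(extractor_r, extractor_nr, classifier, optimizer,
                  adv_image, correct_label, wrong_label):
    # First robust and first non-robust representations of the adversarial image.
    z_r = extractor_r(adv_image)   # first feature extraction network
    z_nr = extractor_nr(adv_image) # second feature extraction network

    # First and second classification categories from the shared classification network.
    logits_r = classifier(z_r)
    logits_nr = classifier(z_nr)

    # First loss function: similarity of the first classification category to the
    # correct (first annotation) category, plus similarity of the second
    # classification category to the incorrect (second annotation) category.
    loss = F.cross_entropy(logits_r, correct_label) + \
           F.cross_entropy(logits_nr, wrong_label)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The robust branch is pulled toward the correct category while the non-robust branch is deliberately pulled toward the wrong category, which is what separates the two kinds of features.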
  2. The method according to claim 1, characterized in that the method further comprises:
    inputting the original image into the first feature extraction network and the second feature extraction network respectively, to obtain a second robust representation generated by the first feature extraction network and a second non-robust representation generated by the second feature extraction network;
    combining the second robust representation and the second non-robust representation to obtain a combined first representation;
    inputting the combined first representation into the classification network, so that the classification network performs a classification operation according to the combined first representation, to obtain a third classification category output by the classification network;
    wherein the performing iterative training on the first feature extraction network and the second feature extraction network according to the first loss function until the convergence condition is met comprises:
    performing iterative training on the first feature extraction network and the second feature extraction network according to the first loss function and a second loss function until the convergence condition is met, wherein the second loss function is used to represent a similarity between the third classification category and a third annotation category, the third annotation category being the correct category corresponding to the original image.
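A possible form of the combination and the second loss of claim 2, again as a hypothetical sketch: the claim does not fix the combination operator, so element-wise addition is used here so that a single classification network can accept either an individual or a combined representation.

```python
import torch.nn.functional as F

def second_loss(extractor_r, extractor_nr, classifier, clean_image, correct_label):
    z_r = extractor_r(clean_image)    # second robust representation
    z_nr = extractor_nr(clean_image)  # second non-robust representation
    combined = z_r + z_nr             # combined first representation (operator assumed)
    logits = classifier(combined)     # third classification category
    # Similarity to the third annotation category (the correct clean-image label).
    return F.cross_entropy(logits, correct_label)
```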
  3. The method according to claim 1, characterized in that the method further comprises:
    inputting the original image into the first feature extraction network to obtain a second robust representation generated by the first feature extraction network;
    inputting the second robust representation into the classification network, so that the classification network performs a classification operation according to the second robust representation, to obtain a fourth classification category output by the classification network;
    wherein the performing iterative training on the first feature extraction network and the second feature extraction network according to the first loss function until the convergence condition is met comprises:
    performing iterative training on the first feature extraction network and the second feature extraction network according to the first loss function and a third loss function until the convergence condition is met, wherein the third loss function is used to represent a similarity between the fourth classification category and a third annotation category, the third annotation category being the correct category corresponding to the original image.
  4. The method according to claim 1, characterized in that the method further comprises:
    inputting the original image into the second feature extraction network to obtain a second non-robust representation generated by the second feature extraction network;
    inputting the second non-robust representation into the classification network, so that the classification network performs a classification operation according to the second non-robust representation, to obtain a fifth classification category output by the classification network;
    wherein the performing iterative training on the first feature extraction network and the second feature extraction network according to the first loss function until the convergence condition is met comprises:
    performing iterative training on the first feature extraction network and the second feature extraction network according to the first loss function and a fourth loss function until the convergence condition is met, wherein the fourth loss function is used to represent a similarity between the fifth classification category and a third annotation category, the third annotation category being the correct category corresponding to the original image.
  5. The method according to claim 1, characterized in that the method further comprises:
    inputting the original image into the first feature extraction network and the second feature extraction network respectively, to obtain a second robust representation generated by the first feature extraction network and a second non-robust representation generated by the second feature extraction network;
    combining the second robust representation and the second non-robust representation to obtain a combined first representation, and inputting the combined first representation into the classification network, so that the classification network performs a classification operation according to the combined first representation, to obtain a third classification category output by the classification network;
    inputting the second robust representation into the classification network, so that the classification network performs a classification operation according to the second robust representation, to obtain a fourth classification category output by the classification network;
    inputting the second non-robust representation into the classification network, so that the classification network performs a classification operation according to the second non-robust representation, to obtain a fifth classification category output by the classification network;
    wherein the performing iterative training on the first feature extraction network and the second feature extraction network according to the first loss function until the convergence condition is met comprises:
    performing iterative training on the first feature extraction network and the second feature extraction network according to the first loss function and a fifth loss function until the convergence condition is met, wherein the fifth loss function is used to represent a similarity between the third classification category and a third annotation category, a similarity between the fourth classification category and the third annotation category, and a similarity between the fifth classification category and the third annotation category, the third annotation category being the correct category corresponding to the original image.
  6. The method according to claim 2, characterized in that the method further comprises:
    generating a first gradient according to a function value of the second loss function;
    performing perturbation processing on the original image according to the first gradient to generate the adversarial image, and determining the third annotation category as the first annotation category.
  7. The method according to claim 3, characterized in that the method further comprises:
    generating a second gradient according to a function value of the third loss function;
    performing perturbation processing on the original image according to the second gradient to generate the adversarial image, and determining the third annotation category as the first annotation category.
  8. The method according to claim 4, characterized in that the method further comprises:
    generating a third gradient according to a function value of the fourth loss function;
    performing perturbation processing on the original image according to the third gradient to generate the adversarial image, and determining the third annotation category as the first annotation category.
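Claims 6 to 8 all derive the adversarial image from the gradient of one of the losses with respect to the original image. A minimal FGSM-style sketch of this pattern; the step size eps, the sign operation, and the clamping range are assumptions not recited in the claims.

```python
import torch
import torch.nn.functional as F

def make_adversarial(extractor_r, extractor_nr, classifier, clean_image,
                     correct_label, eps=8 / 255):
    image = clean_image.clone().detach().requires_grad_(True)
    # A loss on the original image, e.g. the second loss of claim 2 used in claim 6.
    combined = extractor_r(image) + extractor_nr(image)
    loss = F.cross_entropy(classifier(combined), correct_label)
    grad = torch.autograd.grad(loss, image)[0]  # the first gradient
    # Perturb the original image along the gradient to obtain the adversarial image;
    # its first annotation category is the original (third) annotation category.
    return (image + eps * grad.sign()).clamp(0, 1).detach()
```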
  9. The method according to any one of claims 1 to 8, characterized in that the method further comprises:
    combining the first robust representation and the first non-robust representation to obtain a combined second representation;
    inputting the combined second representation into the classification network to obtain a sixth classification category output by the classification network;
    wherein the inputting the first robust representation into the classification network to obtain the first classification category output by the classification network, and inputting the first non-robust representation into the classification network to obtain the second classification category output by the classification network comprises:
    in a case where the sixth classification category is different from the first annotation category, inputting the first robust representation into the classification network to obtain the first classification category output by the classification network, and inputting the first non-robust representation into the classification network to obtain the second classification category output by the classification network.
  10. The method according to claim 9, characterized in that the method further comprises:
    in a case where the sixth classification category is different from the first annotation category, determining the sixth classification category as the second annotation category.
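Claims 9 and 10 gate the first loss on whether the combined representation of the adversarial image is already misclassified, and reuse the wrong prediction as the second annotation category. A hypothetical sketch of that gating (batch handling is an assumption):

```python
import torch
import torch.nn.functional as F

def first_loss_when_fooled(extractor_r, extractor_nr, classifier,
                           adv_image, correct_label):
    z_r, z_nr = extractor_r(adv_image), extractor_nr(adv_image)
    combined = z_r + z_nr                       # combined second representation
    sixth = classifier(combined).argmax(dim=1)  # sixth classification category
    fooled = sixth != correct_label             # differs from first annotation category
    if not fooled.any():
        return None  # skip the first loss when nothing is misclassified
    wrong_label = sixth[fooled]                 # second annotation category (claim 10)
    return (F.cross_entropy(classifier(z_r[fooled]), correct_label[fooled]) +
            F.cross_entropy(classifier(z_nr[fooled]), wrong_label))
```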
  11. The method according to any one of claims 1 to 8, characterized in that the first feature extraction network is a convolutional neural network or a residual neural network, and the second feature extraction network is a convolutional neural network or a residual neural network.
  12. An image processing network, characterized in that the image processing network comprises a first feature extraction network, a second feature extraction network, and a feature processing network;
    the first feature extraction network is configured to receive an input first image and generate a robust representation corresponding to the first image, the robust representation referring to a feature that is insensitive to perturbation;
    the second feature extraction network is configured to receive the input first image and generate a non-robust representation corresponding to the first image, the non-robust representation referring to a feature that is sensitive to perturbation;
    the feature processing network is configured to obtain the robust representation and the non-robust representation, to output a first processing result corresponding to the first image.
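A hypothetical PyTorch module mirroring the three-part structure of claim 12; all names are assumed, and the feature processing network is left abstract because claims 13 to 17 allow several behaviors.

```python
import torch.nn as nn

class ImageProcessingNetwork(nn.Module):
    def __init__(self, extractor_r, extractor_nr, feature_head):
        super().__init__()
        self.extractor_r = extractor_r    # first feature extraction network
        self.extractor_nr = extractor_nr  # second feature extraction network
        self.feature_head = feature_head  # feature processing network

    def forward(self, first_image):
        robust = self.extractor_r(first_image)       # robust representation
        non_robust = self.extractor_nr(first_image)  # non-robust representation
        # First processing result corresponding to the first image.
        return self.feature_head(robust, non_robust)
```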
  13. The network according to claim 12, characterized in that the feature processing network is specifically configured to:
    combine the robust representation and the non-robust representation, and output, according to the combined representation, a first processing result corresponding to the first image; or
    output, according to the robust representation, a first processing result corresponding to the first image; or
    output, according to the non-robust representation, a first processing result corresponding to the first image.
  14. The network according to claim 12 or 13, characterized in that the feature processing network is specifically configured to:
    perform a classification operation according to the combined representation, and output a classification category corresponding to the first image; or
    perform a classification operation according to the robust representation, and output a classification category corresponding to the first image; or
    perform a classification operation according to the non-robust representation, and output a classification category corresponding to the first image.
  15. The network according to claim 12, characterized in that the first processing result indicates that the first image is an original image, or the first processing result indicates that the first image is a perturbed image.
  16. The network according to claim 15, characterized in that the feature processing network is specifically configured to:
    determine, according to the robust representation, a first classification category corresponding to the first image;
    determine, according to the non-robust representation, a second classification category corresponding to the first image;
    in a case where the first classification category is consistent with the second classification category, output a first processing result indicating that the first image is an original image; and
    in a case where the first classification category is inconsistent with the second classification category, output a first processing result indicating that the first image is a perturbed image.
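The detection behavior of claim 16 compares the categories predicted from the two representations. One way this could look, as a sketch; the batched boolean output is an assumption.

```python
def detect_perturbed(classifier, robust, non_robust):
    first_category = classifier(robust).argmax(dim=1)       # from the robust branch
    second_category = classifier(non_robust).argmax(dim=1)  # from the non-robust branch
    # Agreement indicates an original image; disagreement indicates a perturbed image.
    return first_category != second_category
```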
  17. The network according to claim 12, characterized in that the feature processing network is specifically configured to:
    combine the robust representation and the non-robust representation, and perform a detection operation according to the combined representation, to output a detection result corresponding to the first image, the first processing result comprising the detection result.
  18. The network according to claim 12 or 13, characterized in that:
    the image processing network is one or more of the following: an image classification network, an image recognition network, an image segmentation network, or an image detection network.
  19. The network according to any one of claims 12 to 13, characterized in that the feature processing network comprises a perceptron.
  20. The network according to any one of claims 12 to 13, characterized in that the first feature extraction network is a convolutional neural network or a residual neural network, and the second feature extraction network is a convolutional neural network or a residual neural network.
  21. A neural network training apparatus, characterized in that the apparatus comprises:
    an input module, configured to input an adversarial image into a first feature extraction network and a second feature extraction network respectively, to obtain a first robust representation generated by the first feature extraction network and a first non-robust representation generated by the second feature extraction network, wherein the adversarial image is an image obtained by performing perturbation processing on an original image, a robust representation refers to a feature that is insensitive to perturbation, and a non-robust representation refers to a feature that is sensitive to perturbation;
    the input module is further configured to input the first robust representation into a classification network to obtain a first classification category output by the classification network, and input the first non-robust representation into the classification network to obtain a second classification category output by the classification network;
    a training module, configured to perform iterative training on the first feature extraction network and the second feature extraction network according to a first loss function until a convergence condition is met, and output the trained first feature extraction network and the trained second feature extraction network;
    wherein the first loss function is used to represent a similarity between the first classification category and a first annotation category and a similarity between the second classification category and a second annotation category, the first annotation category being the correct category corresponding to the adversarial image, and the second annotation category being an incorrect category corresponding to the adversarial image.
  22. The apparatus according to claim 21, characterized in that:
    the input module is further configured to input the original image into the first feature extraction network and the second feature extraction network respectively, to obtain a second robust representation generated by the first feature extraction network and a second non-robust representation generated by the second feature extraction network;
    the apparatus further comprises a combination module, configured to combine the second robust representation and the second non-robust representation to obtain a combined first representation;
    the input module is further configured to input the combined first representation into the classification network, so that the classification network performs a classification operation according to the combined first representation, to obtain a third classification category output by the classification network;
    the training module is specifically configured to perform iterative training on the first feature extraction network and the second feature extraction network according to the first loss function and a second loss function until the convergence condition is met, wherein the second loss function is used to represent a similarity between the third classification category and a third annotation category, the third annotation category being the correct category corresponding to the original image.
  23. The apparatus according to claim 21 or 22, characterized in that:
    the input module is further configured to input the original image into the first feature extraction network to obtain a second robust representation generated by the first feature extraction network;
    the input module is further configured to input the second robust representation into the classification network, so that the classification network performs a classification operation according to the second robust representation, to obtain a fourth classification category output by the classification network;
    the training module is specifically configured to perform iterative training on the first feature extraction network and the second feature extraction network according to the first loss function and a third loss function until the convergence condition is met, wherein the third loss function is used to represent a similarity between the fourth classification category and a third annotation category, the third annotation category being the correct category corresponding to the original image.
  24. The apparatus according to claim 21 or 22, characterized in that:
    the input module is further configured to input the original image into the second feature extraction network to obtain a second non-robust representation generated by the second feature extraction network;
    the input module is further configured to input the second non-robust representation into the classification network, so that the classification network performs a classification operation according to the second non-robust representation, to obtain a fifth classification category output by the classification network;
    the training module is specifically configured to perform iterative training on the first feature extraction network and the second feature extraction network according to the first loss function and a fourth loss function until the convergence condition is met, wherein the fourth loss function is used to represent a similarity between the fifth classification category and a third annotation category, the third annotation category being the correct category corresponding to the original image.
  25. The apparatus according to claim 21, characterized in that:
    the input module is further configured to input the original image into the first feature extraction network and the second feature extraction network respectively, to obtain a second robust representation generated by the first feature extraction network and a second non-robust representation generated by the second feature extraction network;
    the input module is further configured to combine the second robust representation and the second non-robust representation to obtain a combined first representation, and input the combined first representation into the classification network, so that the classification network performs a classification operation according to the combined first representation, to obtain a third classification category output by the classification network;
    the input module is further configured to input the second robust representation into the classification network, so that the classification network performs a classification operation according to the second robust representation, to obtain a fourth classification category output by the classification network;
    the input module is further configured to input the second non-robust representation into the classification network, so that the classification network performs a classification operation according to the second non-robust representation, to obtain a fifth classification category output by the classification network;
    the training module is specifically configured to perform iterative training on the first feature extraction network and the second feature extraction network according to the first loss function and a fifth loss function until the convergence condition is met, wherein the fifth loss function is used to represent a similarity between the third classification category and a third annotation category, a similarity between the fourth classification category and the third annotation category, and a similarity between the fifth classification category and the third annotation category, the third annotation category being the correct category corresponding to the original image.
  26. The apparatus according to claim 22, characterized in that the apparatus further comprises a generation module, specifically configured to:
    generate a first gradient according to a function value of the second loss function; and
    perform perturbation processing on the original image according to the first gradient to generate the adversarial image, and determine the third annotation category as the first annotation category.
  27. The apparatus according to claim 23, characterized in that the apparatus further comprises a generation module, specifically configured to:
    generate a second gradient according to a function value of the third loss function; and
    perform perturbation processing on the original image according to the second gradient to generate the adversarial image, and determine the third annotation category as the first annotation category.
  28. The apparatus according to claim 24, characterized in that the apparatus further comprises a generation module, specifically configured to:
    generate a third gradient according to a function value of the fourth loss function; and
    perform perturbation processing on the original image according to the third gradient to generate the adversarial image, and determine the third annotation category as the first annotation category.
  29. The apparatus according to any one of claims 21 to 28, characterized in that:
    the apparatus further comprises a combination module, configured to combine the first robust representation and the first non-robust representation to obtain a combined second representation;
    the input module is further configured to input the combined second representation into the classification network to obtain a sixth classification category output by the classification network;
    the input module is specifically configured to: in a case where the sixth classification category is different from the first annotation category, input the first robust representation into the classification network to obtain the first classification category output by the classification network, and input the first non-robust representation into the classification network to obtain the second classification category output by the classification network.
  30. The apparatus according to claim 29, characterized in that the apparatus further comprises a determination module, configured to: in a case where the sixth classification category is different from the first annotation category, determine the sixth classification category as the second annotation category.
  31. The apparatus according to any one of claims 21 to 28, characterized in that the first feature extraction network is a convolutional neural network or a residual neural network, and the second feature extraction network is a convolutional neural network or a residual neural network.
  32. An image processing method, characterized in that the method comprises:
    inputting a first image into a first feature extraction network to obtain a robust representation corresponding to the first image generated by the first feature extraction network, the robust representation referring to a feature that is insensitive to perturbation;
    inputting the first image into a second feature extraction network to obtain a non-robust representation corresponding to the first image generated by the second feature extraction network, the non-robust representation referring to a feature that is sensitive to perturbation;
    outputting, by a feature processing network according to the robust representation and the non-robust representation, a first processing result corresponding to the first image.
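Claim 32 is the inference-time counterpart of the training method. A hypothetical end-to-end instantiation, assuming two ResNet-18 backbones as the feature extraction networks (claim 39 allows convolutional or residual networks), a linear head as the feature processing network, and addition as the combination; none of these architecture choices is recited in the claim.

```python
import torch
from torchvision.models import resnet18

extractor_r = resnet18(num_classes=512)   # first feature extraction network (assumed)
extractor_nr = resnet18(num_classes=512)  # second feature extraction network (assumed)
feature_head = torch.nn.Linear(512, 10)   # feature processing network (assumed)

first_image = torch.rand(1, 3, 224, 224)  # stand-in first image
with torch.no_grad():
    robust = extractor_r(first_image)       # robust representation
    non_robust = extractor_nr(first_image)  # non-robust representation
    first_result = feature_head(robust + non_robust)  # first processing result
```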
  33. The method according to claim 32, characterized in that the outputting, by the feature processing network according to the robust representation and the non-robust representation, the first processing result corresponding to the first image comprises:
    combining the robust representation and the non-robust representation, and outputting, by the feature processing network according to the combined representation, a first processing result corresponding to the first image; or
    outputting, by the feature processing network according to the robust representation, a first processing result corresponding to the first image; or
    outputting, by the feature processing network according to the non-robust representation, a first processing result corresponding to the first image.
  34. The method according to claim 32 or 33, characterized in that:
    a classification operation is performed by the feature processing network according to the combined representation, and a classification category corresponding to the first image is output; or
    a classification operation is performed by the feature processing network according to the robust representation, and a classification category corresponding to the first image is output; or
    a classification operation is performed by the feature processing network according to the non-robust representation, and a classification category corresponding to the first image is output.
  35. The method according to claim 32, characterized in that the first processing result indicates that the first image is an original image, or the first processing result indicates that the first image is a perturbed image.
  36. The method according to claim 35, characterized in that the outputting, by the feature processing network according to the robust representation and the non-robust representation, the first processing result corresponding to the first image comprises:
    determining, by the feature processing network, a first classification category corresponding to the first image according to the robust representation, and a second classification category corresponding to the first image according to the non-robust representation;
    in a case where the first classification category is consistent with the second classification category, the first processing result output by the feature processing network indicates that the first image is an original image;
    in a case where the first classification category is inconsistent with the second classification category, the first processing result output by the feature processing network indicates that the first image is a perturbed image.
  37. The method according to claim 32, characterized in that the outputting, by the feature processing network according to the robust representation and the non-robust representation, the first processing result corresponding to the first image comprises:
    combining, by the feature processing network, the robust representation and the non-robust representation, and performing a detection operation according to the combined representation, to output a detection result corresponding to the first image, the first processing result comprising the detection result.
  38. The method according to claim 32 or 33, characterized in that the feature processing network comprises a perceptron.
  39. The method according to claim 32 or 33, characterized in that the first feature extraction network is a convolutional neural network or a residual neural network, and the second feature extraction network is a convolutional neural network or a residual neural network.
  40. A training device, characterized by comprising a processor, the processor being coupled to a memory, the memory storing program instructions, wherein the method according to any one of claims 1 to 11 is implemented when the program instructions stored in the memory are executed by the processor.
  41. An execution device, characterized in that an image processing network is configured in the execution device, the image processing network being the image processing network according to any one of claims 12 to 20.
  42. The execution device according to claim 41, characterized in that the execution device is one or more of the following: a mobile phone, a computer, a wearable device, an autonomous vehicle, a smart home appliance, or a chip.
  43. A computer-readable storage medium, characterized by comprising a program which, when run on a computer, causes the computer to perform the method according to any one of claims 1 to 11, or causes the computer to perform the method according to any one of claims 32 to 39.
  44. A circuit system, characterized in that the circuit system comprises a processing circuit, the processing circuit being configured to perform the method according to any one of claims 1 to 11, or the processing circuit being configured to perform the method according to any one of claims 32 to 39.
PCT/CN2021/081238 2020-04-30 2021-03-17 Neural network for image processing and related device WO2021218471A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010362629.6 2020-04-30
CN202010362629.6A CN111695596A (en) 2020-04-30 2020-04-30 Neural network for image processing and related equipment

Publications (1)

Publication Number Publication Date
WO2021218471A1 (en)

Family

ID=72476927

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/081238 WO2021218471A1 (en) 2020-04-30 2021-03-17 Neural network for image processing and related device

Country Status (2)

Country Link
CN (1) CN111695596A (en)
WO (1) WO2021218471A1 (en)


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111695596A (en) * 2020-04-30 2020-09-22 华为技术有限公司 Neural network for image processing and related equipment
US20220101116A1 (en) * 2020-09-28 2022-03-31 Robert Bosch Gmbh Method and system for probably robust classification with detection of adversarial examples
CN112465017A (en) * 2020-11-26 2021-03-09 平安科技(深圳)有限公司 Classification model training method and device, terminal and storage medium
JP7158515B2 (en) * 2021-02-18 2022-10-21 本田技研工業株式会社 LEARNING DEVICE, LEARNING METHOD AND PROGRAM
CN113569822B (en) * 2021-09-24 2021-12-21 腾讯科技(深圳)有限公司 Image segmentation method and device, computer equipment and storage medium
TWI810993B (en) * 2022-01-06 2023-08-01 鴻海精密工業股份有限公司 Model generating apparatus and method


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11113599B2 (en) * 2017-06-22 2021-09-07 Adobe Inc. Image captioning utilizing semantic text modeling and adversarial learning
US11030485B2 (en) * 2018-03-30 2021-06-08 Arizona Board Of Regents On Behalf Of Arizona State University Systems and methods for feature transformation, correction and regeneration for robust sensing, transmission, computer vision, recognition and classification
CN110097059B (en) * 2019-03-22 2021-04-02 中国科学院自动化研究所 Document image binarization method, system and device based on generation countermeasure network
CN110516745B (en) * 2019-08-28 2022-05-24 北京达佳互联信息技术有限公司 Training method and device of image recognition model and electronic equipment
CN110852363B (en) * 2019-10-31 2022-08-02 大连理工大学 Anti-sample defense method based on deception attacker

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200074234A1 (en) * 2018-09-05 2020-03-05 Vanderbilt University Noise-robust neural networks and methods thereof
CN109784148A (en) * 2018-12-06 2019-05-21 北京飞搜科技有限公司 Biopsy method and device
CN110569916A (en) * 2019-09-16 2019-12-13 电子科技大学 Confrontation sample defense system and method for artificial intelligence classification
CN110717522A (en) * 2019-09-18 2020-01-21 平安科技(深圳)有限公司 Countermeasure defense method of image classification network and related device
CN111695596A (en) * 2020-04-30 2020-09-22 华为技术有限公司 Neural network for image processing and related equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116723058A (en) * 2023-08-10 2023-09-08 井芯微电子技术(天津)有限公司 Network attack detection and protection method and device
CN116723058B (en) * 2023-08-10 2023-12-01 井芯微电子技术(天津)有限公司 Network attack detection and protection method and device

Also Published As

Publication number Publication date
CN111695596A (en) 2020-09-22

Similar Documents

Publication Publication Date Title
WO2021218471A1 (en) Neural network for image processing and related device
US20210012198A1 (en) Method for training deep neural network and apparatus
WO2022083536A1 (en) Neural network construction method and apparatus
WO2022017245A1 (en) Text recognition network, neural network training method, and related device
CN111401406B (en) Neural network training method, video frame processing method and related equipment
CN112183577A (en) Training method of semi-supervised learning model, image processing method and equipment
WO2022111617A1 (en) Model training method and apparatus
WO2021238333A1 (en) Text processing network, neural network training method, and related device
WO2022016556A1 (en) Neural network distillation method and apparatus
CN113095475A (en) Neural network training method, image processing method and related equipment
CN111414915B (en) Character recognition method and related equipment
WO2022012668A1 (en) Training set processing method and apparatus
WO2022111387A1 (en) Data processing method and related apparatus
CN113065997B (en) Image processing method, neural network training method and related equipment
CN113435520A (en) Neural network training method, device, equipment and computer readable storage medium
CN111566646A (en) Electronic device for obfuscating and decoding data and method for controlling the same
CN113162787B (en) Method for fault location in a telecommunication network, node classification method and related devices
WO2023179482A1 (en) Image processing method, neural network training method and related device
WO2023185925A1 (en) Data processing method and related apparatus
CN113869496A (en) Acquisition method of neural network, data processing method and related equipment
CN115081616A (en) Data denoising method and related equipment
WO2022100607A1 (en) Method for determining neural network structure and apparatus thereof
CN115238909A (en) Data value evaluation method based on federal learning and related equipment thereof
CN113449548A (en) Method and apparatus for updating object recognition model
WO2023231753A1 (en) Neural network training method, data processing method, and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21797590

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21797590

Country of ref document: EP

Kind code of ref document: A1