CN113344189B - Neural network training method and device, computer equipment and storage medium

Neural network training method and device, computer equipment and storage medium

Info

Publication number
CN113344189B
CN113344189B (application CN202110696473.XA)
Authority
CN
China
Prior art keywords
sample
image
sample image
training
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110696473.XA
Other languages
Chinese (zh)
Other versions
CN113344189A
Inventor
郑明凯 (Mingkai Zheng)
游山 (Shan You)
王飞 (Fei Wang)
钱晨 (Chen Qian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN202110696473.XA
Publication of CN113344189A
Application granted
Publication of CN113344189B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The present disclosure provides a neural network training method and apparatus, a computer device, and a storage medium. The method includes: determining, by using a target network to be trained, a first image feature of each first sample image in a first sample group and a second image feature of the second sample image corresponding to each first sample image in the second sample group corresponding to the first sample group; obtaining a first class representation of each first sample image based on the first image features, and a second class representation of each second sample image based on the second image features; determining first similarity information based on the first class representation of each first sample image; determining second similarity information based on the second class representation of each second sample image; and training the target network to be trained by taking the first similarity information as supervision information for the second sample group and the second similarity information as supervision information for the first sample group.

Description

Neural network training method and device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of computer and image processing technologies, and in particular, to a neural network training method and apparatus, a computer device, and a storage medium.
Background
With the development of artificial intelligence technology, neural networks are increasingly applied to tasks such as image classification. Depending on the classification requirements, a neural network is trained on sample images to obtain a neural network model, which is then used for classification.
At present, during training, the neural network treats each sample image as its own class and classifies accordingly. However, some sample images are similar; assigning similar images to different classes reduces classification accuracy.
Disclosure of Invention
The embodiment of the disclosure at least provides a training method and device of a neural network, computer equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a method for training a neural network, including:
determining a first image feature of each first sample image in a first sample group and a second image feature of a second sample image corresponding to each first sample image in a second sample group corresponding to the first sample group by using a target network to be trained; the first sample image and the corresponding second sample image are different enhanced images of the same original sample image;
classifying the plurality of first sample images based on the first image characteristics to obtain a first class representation of each first sample image, and classifying the plurality of second sample images based on the second image characteristics to obtain a second class representation of each second sample image;
determining first similarity information between every two first sample images in the first sample group based on the first class representation of each first sample image;
determining second similarity information between every two second sample images in the second sample group based on the second class representation of each second sample image;
and taking the first similarity information as supervision information of the second sample group, taking the second similarity information as supervision information of the first sample group, and training the target network to be trained.
The method clusters the sample images and uses the first similarity information and the second similarity information as supervision information. On the one hand, classification is performed over similar samples, which effectively improves classification accuracy compared with treating each image as its own category. On the other hand, the similarity information of each group of enhanced images serves as supervision information for the other group of enhanced images, so the target network can be trained by self-supervised learning, improving training efficiency. Meanwhile, each original sample image is enhanced only twice to obtain two corresponding enhanced images (the first sample image and the second sample image), which avoids the loss of sample diversity that arises when the same original sample image is enhanced many times and the image features of different enhanced images overlap heavily.
In an optional embodiment, the training the target network to be trained by using the first similarity information as the supervision information of the second sample group and using the second similarity information as the supervision information of the first sample group includes:
determining a first training loss based on the first class representation of each first sample image and the second similarity information;
determining a second training loss based on the second class representation of each second sample image and the first similarity information;
training the target network to be trained based on the first training loss and the second training loss.
In the embodiments of the present disclosure, the second similarity information relatively accurately characterizes the classification result of the first sample images, and the first similarity information relatively accurately characterizes the classification result of the second sample images. Training the target network with a first training loss determined from the first class representation of each first sample image and the second similarity information, together with a second training loss determined from the second class representation of each second sample image and the first similarity information, therefore improves training accuracy.
In an optional embodiment, the method further comprises: determining a third training loss based on the first image features of each first sample image and the corresponding second image features of each second sample image;
the training the target network to be trained based on the first training loss and the second training loss includes:
training the target network to be trained based on the first training loss, the second training loss, and the third training loss.
In the embodiments of the present disclosure, each sample image (first or second) is treated as its own class to establish the third training loss, which is then combined with the supervision information obtained by clustering similar sample images to train the target network. The target network thus learns both to discriminate individual sample images and to group similar samples, improving its training accuracy.
In an alternative embodiment, the determining the third training loss based on the first image feature of each first sample image and the corresponding second image feature of each second sample image includes:
for each first sample image, determining loss information corresponding to the first sample image based on first image features of the first sample image and second image features of a second sample image corresponding to the first sample image;
and determining a third training loss based on the loss information corresponding to each first sample image.
In the embodiments of the present disclosure, the third training loss can be accurately determined from the image features of the two sample images derived from the same original sample image. Training with this loss drives the target network to maximize the consistency between different enhanced images of the same original sample image, improving the classification precision for such images.
In an optional embodiment, the determining first similarity information between every two first sample images in the first sample group based on the first class representation of each first sample image includes:
connecting, based on the first class representation of each first sample image, every two first sample images whose similarity is greater than a first preset threshold, to generate at least one first connected graph;
for each first connected graph, determining similarity information corresponding to the first connected graph based on the first class representations of all first sample images in the first connected graph;
determining first similarity information between every two first sample images in the first sample group based on the similarity information corresponding to each first connected graph;
the determining second similarity information between every two second sample images in the second sample group based on the second class representation of each second sample image includes:
connecting, based on the second class representation of each second sample image, every two second sample images whose similarity is greater than a second preset threshold, to generate at least one second connected graph;
for each second connected graph, determining similarity information corresponding to the second connected graph based on the second class representations of all second sample images in the second connected graph;
determining second similarity information between every two second sample images in the second sample group based on the similarity information corresponding to each second connected graph.
In the connected graphs formed in the embodiments of the present disclosure, any pair of sample images within one connected subgraph is similar, while sample images in different connected subgraphs are dissimilar. Determining similarity information via connected graphs is fast and accurate, which improves training efficiency and accuracy.
In an optional embodiment, the method further comprises:
respectively obtaining a plurality of third sample images corresponding to each original sample image; wherein the third sample image is an enhanced image of the corresponding original sample image;
and simultaneously using each third sample image in the third sample images as a first sample image and a second sample image to train the target network.
According to the embodiment of the disclosure, a plurality of third sample images corresponding to each original sample image are obtained and are used as the first sample image and the second sample image, so that the number and diversity of the sample images are increased, and the training precision is improved.
In an optional embodiment, the separately acquiring a plurality of third sample images corresponding to each original sample image includes:
determining a current first iteration number;
and respectively acquiring a plurality of third sample images generated based on each original sample image under the condition that the first iteration times are less than the preset iteration times.
According to the embodiment of the disclosure, under the condition of less iteration times, the number and diversity of the sample images are increased by directly using the enhanced images of the original sample images, and the training precision can be effectively improved.
In an optional embodiment, the separately acquiring a plurality of third sample images corresponding to each original sample image includes:
determining, for each original sample image, a target image feature corresponding to each original sample image other than that original sample image, wherein the target image feature corresponding to another original sample image is the first image feature or the second image feature corresponding to that other original sample image;
respectively determining third similarity information between the first image feature or the second image feature of the original sample image and the target image feature corresponding to each other original sample image;
and acquiring a plurality of third sample images corresponding to the original sample image based on the determined third similarity information.
According to the embodiment of the disclosure, a plurality of third sample images for target network training can be determined accurately through the similarity between the image features, so that the number and diversity of the sample images are increased, and the training accuracy of the target network is improved.
In an optional embodiment, the obtaining, based on the determined third similarity information, a plurality of third sample images corresponding to the original sample image includes:
screening out the target image features whose similarity meets a preset condition;
and generating a plurality of third sample images based on the first sample image or the second sample image corresponding to the target image features obtained by screening.
The embodiment of the disclosure can generate the enhanced images of the plurality of first sample images or the second sample images, namely the third sample image, on the basis of the first sample images or the second sample images, thereby increasing the number and diversity of the sample images and being beneficial to improving the training accuracy of the target network.
In an optional embodiment, the determining, for each original sample image, a target image feature corresponding to each other original sample image except the original sample image includes:
determining a current second iteration number;
and under the condition that the second iteration number is greater than the preset iteration number, determining, for each original sample image, the target image features corresponding to each other original sample image.
In the embodiments of the present disclosure, when the number of iterations is large, the target network performs better and can extract more accurate first or second image features; third sample images of higher quality can then be determined from these features, which improves the training accuracy of the target network.
In an optional embodiment, after acquiring the plurality of third sample images, the method further includes:
determining a third image characteristic of each third sample image by using a target network to be trained;
the training the target network with each of the third sample images as a first sample image and a second sample image at the same time includes:
based on the third image features, classifying the plurality of third sample images to obtain a third class representation of each third sample image;
for each third sample image, determining second loss information between each first sample image and the third sample image based on the third class representation of the third sample image and the first class representation of each first sample image;
for each third sample image, determining third loss information between each second sample image and the third sample image based on the third class representation of the third sample image and the second class representation of each second sample image;
determining a fourth training loss based on the determined second loss information and the third loss information;
training the target network to be trained based on the first training loss, the second training loss, the third training loss, and the fourth training loss.
In the embodiments of the present disclosure, a fourth training loss is determined using the third class representations of the newly added third sample images and is combined with the first, second, and third training losses, which helps improve the training precision of the target network.
In an optional embodiment, the training the target network to be trained based on the first training loss, the second training loss, the third training loss, and the fourth training loss includes:
for each third sample image, determining fourth loss information between each first sample image and the third sample image based on third image features of the third sample image and first image features of each first sample image respectively;
for each third sample image, determining fifth loss information between each second sample image and the third sample image based on third image features of the third sample image and second image features of each second sample image, respectively;
determining a fifth training loss based on the determined fourth loss information and the fifth loss information;
training the target network to be trained based on the first training loss, the second training loss, the third training loss, the fourth training loss, and the fifth training loss.
In the embodiments of the present disclosure, a fifth training loss is determined using the third image features of the newly added third sample images and is combined with the first, second, third, and fourth training losses, which helps improve the training precision of the target network.
In a second aspect, an embodiment of the present disclosure further provides a neural network training apparatus configured to perform the following:
determining a first image feature of each first sample image in a first sample group and a second image feature of a second sample image corresponding to each first sample image in a second sample group corresponding to the first sample group by using a target network to be trained; the first sample image and the corresponding second sample image are different enhanced images of the same original sample image;
classifying the plurality of first sample images based on the first image characteristics to obtain a first class representation of each first sample image, and classifying the plurality of second sample images based on the second image characteristics to obtain a second class representation of each second sample image;
determining first similarity information between every two first sample images in the first sample group based on the first class characterization of each first sample image;
determining second similarity information between every two second sample images in the second sample group based on the second class characterization of each second sample image;
and taking the first similarity information as supervision information of the second sample group, taking the second similarity information as supervision information of the first sample group, and training the target network to be trained.
In a third aspect, an embodiment of the present disclosure further provides a computer device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the computer device is running, the machine-readable instructions when executed by the processor performing the steps of the first aspect described above, or any possible implementation of the first aspect.
In a fourth aspect, the disclosed embodiments further provide a computer-readable storage medium, where a computer program is stored, and the computer program is executed by a processor to perform the steps in the first aspect or any one of the possible implementation manners of the first aspect.
For a description of the effects of the neural network training apparatus, the computer device, and the computer-readable storage medium, reference is made to the description of the neural network training method above; details are not repeated here.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required by the embodiments are briefly described below. The drawings, which are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain its technical solutions. The following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those skilled in the art can derive additional related drawings from them without inventive effort.
Fig. 1 shows a flowchart of a neural network training method provided by an embodiment of the present disclosure;
Fig. 2 shows an effect schematic diagram of sample images provided by an embodiment of the present disclosure;
Fig. 3 shows a schematic flowchart of network training provided by an embodiment of the present disclosure;
Fig. 4 shows a schematic diagram of a connected graph provided by an embodiment of the present disclosure;
Fig. 5 shows an effect diagram of clustering sample images provided by an embodiment of the present disclosure;
Fig. 6 shows a schematic diagram of a neural network training apparatus provided by an embodiment of the present disclosure;
Fig. 7 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. The components of the embodiments, as generally described and illustrated in the figures, could be arranged and designed in a wide variety of configurations. Thus, the following detailed description is not intended to limit the scope of the claimed disclosure but merely represents selected embodiments. All other embodiments obtained by those skilled in the art without creative effort based on the embodiments of the present disclosure fall within the protection scope of the disclosure.
With the development of artificial intelligence technology, neural networks are increasingly applied to tasks such as image classification. Depending on the classification requirements, a neural network is trained on sample images to obtain a neural network model, which is then used for classification.
At present, the neural network is trained by treating each sample image as its own class. However, some sample images are similar; assigning similar images to different classes reduces classification accuracy.
Based on this, the embodiments of the present disclosure provide a neural network training method and apparatus, an electronic device, and a computer storage medium. The embodiments cluster the sample images and use the first and second similarity information as supervision information. On the one hand, classification is performed over similar samples, which effectively improves classification accuracy compared with treating each image as its own category; on the other hand, the similarity information of each group of enhanced images serves as supervision information for the other group, enabling self-supervised learning of the target network and improving training efficiency. Meanwhile, each original sample image is enhanced only twice to obtain two corresponding enhanced images (the first sample image and the second sample image), which avoids the loss of sample diversity that arises when the same original sample image is enhanced many times and the image features of different enhanced images overlap heavily.
The drawbacks above were identified by the inventors through practical and careful study; therefore, both the discovery of these problems and the solutions proposed below should be regarded as the inventors' contribution to the present disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
To facilitate understanding of the present embodiments, the neural network training method disclosed in the embodiments of the present disclosure is first described in detail. The execution subject of the method is generally a computer device with certain computing capability, such as a server or another processing device. In some possible implementations, the method may be implemented by a processor calling computer-readable instructions stored in a memory.
The following describes the neural network training method provided by the embodiments of the present disclosure, taking a server as the execution subject.
First, note that the neural network training method provided in the embodiments of the present disclosure trains a network by self-supervised contrastive learning. A network obtained through contrastive learning has strong representational capability and generally needs only a small amount of labeled data for fine-tuning to achieve excellent performance, and it can serve downstream computer vision tasks such as classification, segmentation, and detection.
A network based on self-supervised learning can learn from unlabeled images by itself, generating labels for the images.
Contrastive learning simultaneously maximizes the consistency between different enhanced images of the same original sample image and minimizes the consistency between enhanced images of different original sample images. Therefore, a network trained by self-supervised contrastive learning can, after the same unlabeled original sample image is enhanced multiple times, recognize the different enhancements as coming from the same original sample image, thereby maximizing the similarity of those enhanced images; likewise, it can recognize enhancements of different unlabeled original sample images as coming from different originals, thereby minimizing the similarity of those enhanced images.
Referring to Fig. 1, a flowchart of a neural network training method provided by an embodiment of the present disclosure is shown; the method includes S101 to S105, where:
S101: determining, by using a target network to be trained, a first image feature of each first sample image in each first sample group and a second image feature of the second sample image corresponding to each first sample image in the corresponding second sample group; the first sample image and the corresponding second sample image are different enhanced images of the same original sample image.
In the embodiments of the present disclosure, the original sample images are unlabeled, and the first and second sample images are enhanced images obtained by applying image enhancement processing to an original sample image. Image enhancement is an image processing method that adds information or transforms data by some means, selectively emphasizing interesting features in the original sample image or suppressing (masking) unwanted ones. Image enhancement processing may include flipping, cropping, color transformation, changing resolution, and the like, but is not limited to this list; any feasible enhancement processing falls within the scope of the present disclosure.
Fig. 2 shows an effect schematic of the sample images. It includes three original sample images; each original sample image corresponds to one first sample image and one second sample image, both obtained by image enhancement processing of that original sample image, and each containing part of the image features of the original. As can be seen from Fig. 2, the first and second sample images are produced by different image enhancement processes and therefore differ from each other.
In some possible embodiments, image enhancement could be applied to each original sample image many times to obtain many different sample images, increasing the number of training samples. However, when an original sample image is enhanced too many times, the image features of the different enhanced images may overlap heavily without contributing new features, which harms sample diversity. In the embodiments of the present disclosure, therefore, only two image enhancement processes are applied to each original sample image, yielding the two enhanced images: the first sample image and the second sample image.
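As a concrete illustration of this two-enhancement scheme, the following is a minimal PyTorch-style sketch assuming torchvision transforms; the particular pipeline (crop size, jitter strengths) and the names augment and two_views are illustrative assumptions rather than part of the patent.

```python
from PIL import Image
import torchvision.transforms as T

# Assumed enhancement pipeline; the patent only names flipping, cropping,
# color transformation and resolution changes as example enhancements.
augment = T.Compose([
    T.RandomResizedCrop(224),           # cropping / changing resolution
    T.RandomHorizontalFlip(),           # flipping
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),  # color transformation
    T.ToTensor(),
])

def two_views(original: Image.Image):
    """Apply two independent enhancements to one unlabeled original sample
    image, yielding the first sample image x1 and the second sample image x2."""
    x1 = augment(original)  # first sample image
    x2 = augment(original)  # second sample image
    return x1, x2
```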
In a specific embodiment, two enhanced images of the same original sample image may be divided into two corresponding sample groups to obtain a first sample group and a second sample group, where the first sample group includes any one of the two sample images corresponding to the same original sample image, and the second sample group includes the remaining one of the two sample images corresponding to the same original sample image.
In a further embodiment, when there are many original sample images, the original sample image may be divided into a plurality of original sample image groups, and then for each original sample image group, two sample images of each original sample image in the original sample image group are divided into two corresponding sample groups to form a sample group pair corresponding to each original sample image group, so that a plurality of sample group pairs corresponding to the plurality of original sample image groups may be obtained. For the first sample group and the second sample group in each sample group pair, the first sample group includes any one of the two sample images corresponding to the same original sample image, and the second sample group includes the other remaining one of the two sample images corresponding to the same original sample image.
The target network to be trained refers to a convolutional neural network to be trained by contrastive learning; it can perform feature extraction on an input sample image to obtain the corresponding image features. Specifically, each first sample image may be input into the target network to be trained to obtain its first image feature, and each second sample image may be input to obtain its second image feature.
In some embodiments, the target network to be trained may be a neural network for image classification, which may include a feature extraction module for extracting image features and a classifier for classifying based on the extracted image features. The first sample image and the second sample image may be input to a target network to be trained, and a first image feature and a second image feature output by a feature extraction module in the target network to be trained are obtained.
In the network training process, after the first sample group and the second sample group are obtained, the first sample group and the second sample group may be input into a target network to be trained, and an image feature corresponding to each sample image in each sample image group is obtained.
In the embodiments of the present disclosure, an original sample image group containing a plurality of original sample images may be denoted {X}, the first sample image corresponding to each original sample image denoted x1, and the second sample image corresponding to each first sample image x1 denoted x2. Each original sample in the group {X} then corresponds to one first sample group {x1} and one second sample group {x2}. As in the network training flow shown in Fig. 3, the first and second sample groups may each be input into the target network, which extracts the first image feature h1 = F(x1) for each first sample image in the first sample group and the second image feature h2 = F(x2) for each second sample image in the second sample group.
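A minimal sketch of this feature-extraction step, assuming a ResNet backbone as the target network F; the architecture and the output dimension are assumptions, since the patent does not fix them.

```python
import torch
import torchvision.models as models

# Hypothetical choice for the target network F; the patent does not specify
# an architecture. The 128-dimensional output is likewise an assumption.
backbone = models.resnet18(num_classes=128)

def extract_features(x1_batch: torch.Tensor, x2_batch: torch.Tensor):
    """h1 = F(x1) and h2 = F(x2): image features for the first sample group
    {x1} and the second sample group {x2}."""
    h1 = backbone(x1_batch)  # first image features
    h2 = backbone(x2_batch)  # second image features
    return h1, h2
```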
S102: classifying the plurality of first sample images based on the first image features to obtain a first class representation of each first sample image, and classifying the plurality of second sample images based on the second image features to obtain a second class representation of each second sample image.
In the embodiments of the present disclosure, S102 mainly implements the self-supervised learning between the first and second sample images, using embedding vectors to explore the relationships between different enhanced images.
As described above, the first and second image features are extracted by the target network. To learn the relationships between enhanced images of different original sample images, these features must be converted into class representations in an embedding space, which then serve as supervision information to attract similar images in that space. As shown in Fig. 3, a head structure φ may be used to obtain the class representations: inputting the first image feature h1 of each first sample image and the second image feature h2 of each second sample image into φ yields the first class representation V1 = φ(h1) and the second class representation V2 = φ(h2) in the embedding space.
Next, the cosine distance between first sample images can be computed from their first class representations in the embedding space, so that the plurality of first sample images is classified and first similarity information y1 between each first sample image and the other first sample images in the same first sample group is obtained. Similarly, the plurality of second sample images is classified based on the second class representations in the embedding space, yielding second similarity information y2 between each second sample image and the other second sample images in the same second sample group.
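A sketch of the head structure φ and the cosine-similarity computation, under the same assumptions as the backbone sketch above; the layer sizes are illustrative, and L2-normalizing the class representations makes plain dot products equal to cosine similarities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical head structure phi; the layer sizes are assumptions chosen
# to match the 128-dimensional backbone sketch above.
phi = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 64))

def class_representation(h: torch.Tensor) -> torch.Tensor:
    """V = phi(h), L2-normalized so dot products between rows are cosine
    similarities in the embedding space."""
    return F.normalize(phi(h), dim=1)

def pairwise_cosine(v: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between every two samples of one sample group."""
    return v @ v.t()
```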
In one possible implementation, the first similarity information and the second similarity information may be determined by means of connected graphs.
Specifically, every two first sample images whose similarity is greater than a first preset threshold may be connected based on their first class representations in the embedding space, generating at least one first connected subgraph. Here the cosine similarity between first class representations measures the distance between two first sample images, and similarity is treated as transitive: if first sample image A is similar to first sample image B, and B is similar to first sample image C, then A is considered similar to C; that is, once the connections A-B and B-C are constructed, the connection A-C is constructed as well. Thus, within one first connected subgraph, any two first sample images are connected to each other.
When the plurality of first sample images generates at least one first connected subgraph, the similarity information corresponding to each first connected subgraph is determined based on the first class representations of the first sample images in that subgraph; first similarity information between every two first sample images in the first sample group is then determined from the similarity information of each first connected subgraph. Any pair of first sample images within one first connected subgraph is similar, while first sample images in different first connected subgraphs are dissimilar.
Similarly, every two second sample images whose similarity is greater than a second preset threshold can be connected based on their second class representations in the embedding space, generating at least one second connected subgraph. Again, cosine similarity between second class representations measures the distance between two second sample images, and similarity is transitive: if second sample image A is similar to second sample image B, and B is similar to second sample image C, then A is considered similar to C, and after constructing the connections A-B and B-C, the connection A-C is constructed. Thus, within one second connected subgraph, any two second sample images are connected to each other.
When the plurality of second sample images generate at least one second connected subgraph, for each second connected subgraph, determining similarity information corresponding to the second connected subgraph based on a second class characterization corresponding to the second sample image in the second connected subgraph; and then, determining second similarity information between every two second sample images in the second sample group based on the similarity information corresponding to each second connected subgraph. Wherein any pair of second sample images in each second connected subgraph is similar, but the second sample images between different second connected subgraphs are not.
In the connected-graph schematic shown in Fig. 4, three connected subgraphs are formed. In the subgraph A-B-C-D-E, B, C, and E are first sample images whose similarity to A exceeds the first preset threshold, and D is a first sample image whose similarity to C exceeds the first preset threshold. By the transitivity rule above, A, B, C, D, and E are all mutually similar, so a connected subgraph joining them pairwise can be formed. Each connected subgraph contains a set of similar first sample images, and the similarity information for the first sample images in a subgraph can be determined quickly and accurately in this way, improving training efficiency and accuracy.
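The thresholding-plus-transitivity construction amounts to finding connected components of the thresholded similarity graph. Below is a sketch using union-find, assuming L2-normalized class representations as produced above; the function name and the 0/1 matrix encoding follow the patent's description of similarity information but are otherwise assumptions.

```python
import torch

def similarity_targets(v: torch.Tensor, threshold: float) -> torch.Tensor:
    """Build the connected graph over one sample group and return a 0/1
    similarity matrix y with y[i, j] = 1 iff samples i and j fall in the
    same connected subgraph. v holds L2-normalized class representations."""
    n = v.size(0)
    adj = (v @ v.t()) > threshold        # direct connections above threshold
    parent = list(range(n))              # union-find over the samples

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if adj[i, j]:
                parent[find(i)] = find(j)  # merge the two subgraphs

    root = [find(i) for i in range(n)]
    return torch.tensor([[1.0 if root[i] == root[j] else 0.0
                          for j in range(n)] for i in range(n)])
```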
S103: first similarity information between the first sample images in the first sample group is determined based on the first class representation of each first sample image.
S104: second similarity information between the second sample images in the second sample group is determined based on the second class representation of each second sample image.
In S103 and S104, the cosine similarity may be used to respectively calculate first similarity information between two first sample images and second similarity information between two second sample images.
Specifically, the first similarity information may be represented as 1 or 0, where 1 indicates that two sample images are similar and 0 indicates that they are dissimilar.
S105: taking the first similarity information as supervision information for the second sample group and the second similarity information as supervision information for the first sample group, and training the target network to be trained.
In the embodiments of the present disclosure, the first similarity information is used as supervision information for the second sample group, the second similarity information as supervision information for the first sample group, and the parameters of the target network are adjusted so that, after adjustment, the similarity information between second sample images derived from the target network's predictions on the second sample group is closer to the first similarity information, and the similarity information between first sample images derived from its predictions on the first sample group is closer to the second similarity information.
Alternatively, the first training loss may be determined based on the first class representation of each first sample image and the second similarity information, and the second training loss based on the second class representation of each second sample image and the first similarity information.
Based on the determined first training loss and second training loss, the total loss for training the target network to be trained may be obtained as their sum:

$$L_{swap} = L_{sup}(V_1, y_2) + L_{sup}(V_2, y_1)$$

where $L_{sup}(V_1, y_2)$ is the first training loss, $L_{sup}(V_2, y_1)$ is the second training loss, and the supervised term takes the form

$$L_{sup}(V_1, y_2) = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{|P(i)|} \sum_{j \in P(i)} \log \frac{\exp(v_i \cdot v_j / \tau)}{\sum_{k=1, k \neq i}^{N} \exp(v_i \cdot v_k / \tau)}, \qquad P(i) = \{\, j \neq i : y_{2,ij} = 1 \,\}$$

with $L_{sup}(V_2, y_1)$ defined analogously. Here $v_i$, $v_j$, and $v_k$ denote the elements of $V_1$ corresponding to the $i$-th, $j$-th, and $k$-th first sample images of the first sample group (or, in $L_{sup}(V_2, y_1)$, the elements of $V_2$ corresponding to the second sample images of the second sample group), $\tau$ is a temperature parameter, and $N$ is the number of first sample images or second sample images. Training on this loss until completion yields the trained target network.
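A sketch of this swapped loss under the reconstruction above, assuming the 0/1 similarity matrices come from the connected-graph step and that τ is a temperature; the names sup_loss and swap_loss are illustrative.

```python
import torch
import torch.nn.functional as F

def sup_loss(v: torch.Tensor, y: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """L_sup(V, y): within one sample group, pull together the pairs marked
    similar by the other group's 0/1 similarity matrix y."""
    n = v.size(0)
    off_diag = ~torch.eye(n, dtype=torch.bool)
    logits = ((v @ v.t()) / tau).masked_fill(~off_diag, float('-inf'))
    log_p = F.log_softmax(logits, dim=1)
    targets = y * off_diag                      # self-pairs carry no target
    targets = targets / targets.sum(dim=1, keepdim=True).clamp(min=1)
    return -(targets * log_p.masked_fill(~off_diag, 0.0)).sum(dim=1).mean()

def swap_loss(v1, v2, y1, y2, tau: float = 0.1) -> torch.Tensor:
    """L_swap = L_sup(V1, y2) + L_sup(V2, y1)."""
    return sup_loss(v1, y2, tau) + sup_loss(v2, y1, tau)
```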
In the embodiments of the present disclosure, the head structure φ serves as an auxiliary to the single head structure used in the prior art. On the one hand, based on the first image features of the first sample images and the second image features of the second sample images, enhanced images originating from different original sample images can be clustered into similar samples, effectively improving classification accuracy; on the other hand, the similarity information of each group of enhanced images serves as supervision information for the other group, enabling self-supervised learning of the target network and improving its training efficiency. The head structure φ thereby overcomes the shortcoming of prior-art instance discrimination with a single head structure, which treats every sample image as its own category and cannot group similar sample images into one class.
To fully explain the training process of the target network, it is described below with reference to the instance-discrimination head structure G and the auxiliary head structure φ.
As shown in fig. 3, after the first image feature h1 of the first sample image and the second image feature h2 of the second sample image are extracted based on the target network, the first image feature h1 and the second image feature h2 may be input into the head structure G to obtain a fourth image feature z1 of the first sample image and a fifth image feature z2 of the second sample image, and then a third training loss may be determined based on the fourth image feature z1 of each first sample image and the fifth image feature z2 of each second sample image.
For each first sample image, loss information is determined based on the fourth image feature z1 of that first sample image and the fifth image feature z2 of its corresponding second sample image, and the third training loss is determined from the loss information over all sample pairs; it takes the form

$$L_{contrast} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(z_i \cdot z_j / \tau)}{\sum_{k=1, k \neq i}^{2N} \exp(z_i \cdot z_k / \tau)}$$

where $\tau$ is a temperature parameter, $N$ is the number of first sample images or second sample images, $z_i$ and $z_j$ are the fourth image feature of the first sample image and the fifth image feature of the second sample image corresponding to the $i$-th original sample image, and the denominator runs over all other enhanced images in the batch.
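A sketch of this instance-level loss in the common NT-Xent form, assuming z1 and z2 are the fourth and fifth image features of matched batches; the exact negative set used by the patent is not spelled out, so treating all other enhanced images in the batch as negatives is an assumption.

```python
import torch
import torch.nn.functional as F

def contrast_loss(z1: torch.Tensor, z2: torch.Tensor,
                  tau: float = 0.1) -> torch.Tensor:
    """Instance loss: the two enhanced images of the same original sample
    image form a positive pair; all other enhanced images are negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    n = z1.size(0)
    z = torch.cat([z1, z2], dim=0)           # 2N enhanced images
    sim = (z @ z.t()) / tau
    sim.fill_diagonal_(float('-inf'))        # exclude self-similarity
    # z1[i] is matched with z2[i] (row n + i) and vice versa.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)
```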
Considering that the diversity of the sample images plays a key role in contrastive learning, and that its effect improves as the variety of sample images increases, in one possible implementation a plurality of third sample images corresponding to each original sample image may be acquired, and each third sample image is then used simultaneously as a first sample image and as a second sample image to train the target network.
Here, a third sample image may be obtained by directly enhancing the original sample image. In a specific implementation, each of a preset number of original sample images may be enhanced once to obtain a plurality of third sample images, and different third sample images may be produced by different enhancement modes. In particular, the enhancement used for a third sample image differs from the enhancements used to generate the first and second sample images: the original sample image may undergo a first enhancement process (for example, contrast enhancement) to obtain the first sample image, a second enhancement process (which may be the same kind as the first; here, also contrast enhancement) to obtain the second sample image, and a third enhancement process (for example, contrast reduction) to obtain the third sample image.
In the training process, the third sample image can be added into the first sample group to form a first sample image, and the target network is trained; and simultaneously adding the third sample image into the second sample group to form a second sample image, and training the target network.
In a possible implementation manner, for each original sample image, a target image feature corresponding to each other original sample image except the original sample image may also be determined, where the target image feature corresponding to the other original sample image is a first image feature or a second image feature corresponding to the other original sample image; respectively determining the similarity between the first image characteristic or the second image characteristic of the original sample image and the target image characteristic corresponding to each other original sample image; and acquiring a plurality of third sample images corresponding to the original sample image based on the determined similarity.
As explained above, enhancing each original sample image yields a first sample image and a corresponding second sample image. A target image feature may be determined for every original sample image other than the current one, the target image feature being either the first image feature or the second image feature of that other original sample image; that is, for each other original sample image, only one of its two features is selected as its target image feature. The main consideration is that a first sample image and its corresponding second sample image come from the same original sample image, so the similarity between their features is inherently high, and there is no need to generate third sample images from both of them at once; hence only the first image feature or the second image feature is selected into the candidate set from which third sample images are chosen.
Specifically, the similarity between the first (or second) image feature of the current original sample image and the target image feature of each other original sample image may be determined; the target image features may then be screened in descending order of similarity, and image enhancement processing applied to the first or second sample images corresponding to the screened target image features to obtain the corresponding plurality of third sample images.
Considering that the target network and the head structures extract image features with limited accuracy when the number of training iterations is small, the acquisition mode of the third sample images can be selected by comparing the current iteration number with a preset iteration number.
In one possible embodiment, the current first iteration number is determined first; when the first iteration number is less than the preset iteration number, a plurality of third sample images generated from each original sample image are acquired. That is, when few iterations have run, third sample images are obtained by enhancing the original sample images directly. This requires no feature extraction by the target network and head structures, so the limited feature accuracy at this stage has no effect.
In another possible embodiment, the current second iteration number is determined first; when the second iteration number is greater than the preset iteration number, the target image features corresponding to the other original sample images are determined for each original sample image. That is, when many iterations have run, third sample images are generated by applying image enhancement processing to the first or second sample images corresponding to the screened target image features. Because the second iteration number exceeds the preset number, the current target network and head structures extract image features more capably, so the extracted features are more accurate.
In a specific implementation, during the first 25% of the preset total number of iterations, the plurality of third sample images may be generated directly from each original sample image; during the remaining 75%, image enhancement processing may be applied to the first or second sample images corresponding to the screened target image features to generate the third sample images.
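A sketch of this iteration-dependent acquisition of third sample images, reusing the hypothetical augment pipeline from the earlier sketch; the 25% split follows the text, while k, the feature normalization, and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

def gather_third_samples(iteration: int, total_iters: int, originals: list,
                         feats: torch.Tensor = None, k: int = 4) -> list:
    """Early in training, re-enhance each original sample image directly;
    later, re-enhance the samples whose first or second image features are
    most similar to each sample's own feature."""
    if iteration < 0.25 * total_iters:
        return [augment(img) for img in originals]   # direct enhancement
    sim = F.normalize(feats, dim=1) @ F.normalize(feats, dim=1).t()
    sim.fill_diagonal_(float('-inf'))                # exclude the sample itself
    nearest = sim.topk(k, dim=1).indices             # k most similar samples
    return [augment(originals[j]) for row in nearest.tolist() for j in row]
```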
After the plurality of third sample images is obtained, the third image feature of each third sample image is determined using the target network to be trained, and the plurality of third sample images is classified based on the third image features to obtain a third class representation of each third sample image. For each third sample image, second loss information between each first sample image and the third sample image is determined based on the third class representation of the third sample image and the first class representation of each first sample image; likewise, third loss information between each second sample image and the third sample image is determined based on the third class representation of the third sample image and the second class representation of each second sample image. A fourth training loss is determined based on the second and third loss information, and the target network to be trained is trained based on the first, second, third, and fourth training losses.
The third image feature may be an image feature extracted by the target network. The third image feature may then be input into the head structure φ to obtain the third class characterization of the third sample image. Based on the third class characterization of the third sample image and the first class characterization of each first sample image, fourth similarity information between the third sample image and each first sample image may be calculated, and second loss information between each first sample image and the third sample image may then be determined from the fourth similarity information and the first similarity information. Likewise, fifth similarity information between the third sample image and each second sample image may be calculated from the third class characterization of the third sample image and the second class characterization of each second sample image, and third loss information between each second sample image and the third sample image determined from the fifth similarity information and the second similarity information.
Next, a fourth training loss, denoted here as L_cswap, may be determined by weighted summation of the second loss information between each first sample image and each third sample image and the third loss information between each second sample image and each third sample image. The fourth training loss is the training loss between similar samples obtained using the head structure φ after the third sample images are added.
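A minimal sketch of the L_cswap term follows, assuming the class characterizations are soft assignment vectors produced by the head structure φ and that the comparison with the supervision targets uses binary cross-entropy; the loss form and the equal weighting are illustrative assumptions, not fixed by the disclosure.

```python
import torch
import torch.nn.functional as F

def cswap_loss(third_cls, first_cls, second_cls, first_sim, second_sim):
    # third_cls: (M, K) class characterizations of third sample images;
    # first_cls / second_cls: (N, K) characterizations of the two groups;
    # first_sim / second_sim: (M, N) supervision targets in [0, 1].
    sim_3_to_1 = third_cls @ first_cls.t()    # fourth similarity information
    sim_3_to_2 = third_cls @ second_cls.t()   # fifth similarity information
    eps = 1e-6
    # Second loss: fourth similarity compared with first similarity targets.
    loss_1 = F.binary_cross_entropy(sim_3_to_1.clamp(eps, 1 - eps), first_sim)
    # Third loss: fifth similarity compared with second similarity targets.
    loss_2 = F.binary_cross_entropy(sim_3_to_2.clamp(eps, 1 - eps), second_sim)
    return 0.5 * (loss_1 + loss_2)            # assumed equal weighting
```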
Further, for each third sample image, fourth loss information between each first sample image and the third sample image may be respectively determined based on the third image feature of the third sample image and the first image feature of each first sample image; for each third sample image, fifth loss information between each second sample image and the third sample image is respectively determined based on the third image feature of the third sample image and the second image feature of each second sample image; a fifth training loss is determined based on the determined fourth loss information and fifth loss information; and the target network to be trained is trained based on the first training loss, the second training loss, the third training loss, the fourth training loss, and the fifth training loss.
In the above process, the third image feature may be input into the head structure G to obtain a sixth image feature of the third sample image. Then, for each third sample image, the fourth loss information between each first sample image and the third sample image is respectively determined based on the sixth image feature of the third sample image and the seventh image features obtained by processing the first image features of each first sample image with the head structure G; and, for each third sample image, the fifth loss information between each second sample image and the third sample image is respectively determined based on the sixth image feature of the third sample image and the eighth image features obtained by processing the second image features of each second sample image with the head structure G.
Specifically, determining the fourth loss information may comprise calculating similarity information between the third sample image and each first sample image based on the sixth image feature of the third sample image and the seventh image feature of each first sample image, and then determining the fourth loss information between each first sample image and the third sample image from that similarity information. The process for the fifth loss information is similar and is not repeated here.
Next, a fifth training loss, denoted here as L_cNCE, may be determined by weighted summation of the fourth loss information between each first sample image and each third sample image and the fifth loss information between each second sample image and each third sample image.
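As one possible instantiation, these feature-level losses against the third samples can be written in an InfoNCE style over the head-G projections; the temperature tau, the positive-pair definition via pos_idx, and the equal weighting are assumptions, not taken from the disclosure.

```python
import torch
import torch.nn.functional as F

def cnce_loss(sixth_feats, seventh_feats, eighth_feats, pos_idx, tau=0.1):
    # sixth_feats: (M, D) head-G projections of third sample images;
    # seventh_feats / eighth_feats: (N, D) head-G projections of the
    # first / second groups; pos_idx: (M,) index of the source sample
    # each third sample image was generated from.
    z3 = F.normalize(sixth_feats, dim=1)
    z1 = F.normalize(seventh_feats, dim=1)
    z2 = F.normalize(eighth_feats, dim=1)
    logits_1 = z3 @ z1.t() / tau              # similarity to first group
    logits_2 = z3 @ z2.t() / tau              # similarity to second group
    # Cross-entropy pulls each third sample toward its source sample.
    return 0.5 * (F.cross_entropy(logits_1, pos_idx) +
                  F.cross_entropy(logits_2, pos_idx))
```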
Finally, a total loss can be obtained from the first training loss, the second training loss, the third training loss, the fourth training loss, and the fifth training loss: L_overall = L_NCE + λL_cNCE + βL_swap + γL_cswap, where L_swap is the sum of the first training loss and the second training loss, L_NCE is the third training loss, L_cswap is the fourth training loss, L_cNCE is the fifth training loss, and λ, β, γ are weight hyperparameters.
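A direct transcription of this objective, with placeholder default weights since λ, β, and γ are left as hyperparameters in the disclosure:

```python
def total_loss(l_nce, l_cnce, l_swap, l_cswap, lam=1.0, beta=1.0, gamma=1.0):
    # L_overall = L_NCE + λ·L_cNCE + β·L_swap + γ·L_cswap
    # The default weight values here are placeholders only.
    return l_nce + lam * l_cnce + beta * l_swap + gamma * l_cswap
```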
The trained target network can cluster similar sample images instead of treating each sample image as its own class, which improves classification accuracy. For example, in the clustering effect diagram shown in fig. 5, sample images containing airplanes are clustered into one class, sample images containing parrots into another, and likewise for those containing cats and giraffes.
The method and device cluster the sample images and use the first similarity information and the second similarity information as supervision information. On the one hand, classifying according to similar samples can effectively improve classification accuracy compared with treating each image as its own class; on the other hand, using the similarity information of each group of enhanced images as the supervision information of the other group enables self-supervised learning of the target network and improves its training efficiency. Meanwhile, each original sample image is enhanced only twice to obtain two corresponding enhanced images (the first sample image and the second sample image); this avoids the situation in which enhancing the same original sample image many times yields enhanced images with largely overlapping image features, which would reduce the diversity of the sample images.
It will be understood by those skilled in the art that, in the above methods of the specific embodiments, the order in which the steps are written does not imply a strict order of execution or impose any limitation on the implementation; the specific order of execution of the steps should be determined by their functions and possible inherent logic.
Based on the same inventive concept, an embodiment of the present disclosure further provides a training apparatus for a neural network corresponding to the training method for a neural network. Since the principle by which the apparatus in the embodiment of the present disclosure solves the problem is similar to that of the training method described above, the implementation of the apparatus may refer to the implementation of the method, and repeated details are not repeated here.
Referring to fig. 6, a schematic architecture diagram of a training apparatus for a neural network provided in an embodiment of the present disclosure is shown. The apparatus includes: a first determining module 601, a classification module 602, a second determining module 603, a third determining module 604, and a training module 605, wherein:
a first determining module 601, configured to determine, by using a target network to be trained, a first image feature of each first sample image in a first sample group, and a second image feature of a second sample image corresponding to each first sample image in a second sample group corresponding to the first sample group; the first sample image and the corresponding second sample image are different enhanced images of the same original sample image;
a classification module 602, configured to perform classification processing on the multiple first sample images based on the first image features to obtain a first class characterization of each first sample image, and perform classification processing on the multiple second sample images based on the second image features to obtain a second class characterization of each second sample image;
a second determining module 603, configured to determine, based on the first class characterization of each first sample image, first similarity information between every two first sample images in the first sample group;
a third determining module 604, configured to determine second similarity information between every two second sample images in the second sample group based on the second class characterization of each second sample image;
a training module 605, configured to use the first similarity information as the supervision information of the second sample group, use the second similarity information as the supervision information of the first sample group, and train the target network to be trained.
By clustering the sample images and using the first similarity information and the second similarity information as supervision information, on the one hand, classification is performed according to similar samples, which can effectively improve classification accuracy compared with treating each image as its own class; on the other hand, the similarity information of each group of enhanced images serves as the supervision information of the other group of enhanced images, enabling self-supervised learning of the target network and improving its training efficiency. Meanwhile, each original sample image is enhanced only twice to obtain two corresponding enhanced images (the first sample image and the second sample image), which avoids the situation in which enhancing the same original sample image many times yields enhanced images with largely overlapping image features, reducing the diversity of the sample images.
In a possible implementation, the training module 605 is specifically configured to: determining a first training loss based on the first class characterization of each first sample image and the second similarity information;
determining a second training loss based on the second class characterization of each second sample image and the first similarity information;
training the target network to be trained based on the first training loss and the second training loss.
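As an illustration only, these two swapped losses can be sketched as follows, assuming the similarity predicted from one group's class characterizations is matched to the other group's similarity information via binary cross-entropy; the concrete loss form is an assumption, since the embodiment fixes only the swapped supervision.

```python
import torch
import torch.nn.functional as F

def swap_loss(first_cls, second_cls, first_sim, second_sim):
    # first_cls / second_cls: (N, K) soft class characterizations of the
    # two groups; first_sim / second_sim: (N, N) similarity targets.
    sim_from_first = first_cls @ first_cls.t()    # predicted by group 1
    sim_from_second = second_cls @ second_cls.t() # predicted by group 2
    eps = 1e-6
    # First training loss: group-1 predictions, group-2 targets.
    l1 = F.binary_cross_entropy(sim_from_first.clamp(eps, 1 - eps), second_sim)
    # Second training loss: group-2 predictions, group-1 targets.
    l2 = F.binary_cross_entropy(sim_from_second.clamp(eps, 1 - eps), first_sim)
    return l1 + l2    # L_swap is the sum of the two losses
```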
In a possible implementation, the apparatus further comprises:
a fourth determining module, configured to determine a third training loss based on the first image feature of each first sample image and the corresponding second image feature of each second sample image;
the training module 605 is specifically configured to: training the target network to be trained based on the first training loss, the second training loss, and the third training loss.
In a possible implementation manner, the fourth determining module is specifically configured to: for each first sample image, determining loss information corresponding to the first sample image based on first image features of the first sample image and second image features of a second sample image corresponding to the first sample image;
and determining a third training loss based on the loss information corresponding to each first sample image.
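For illustration, the third training loss can be instantiated as a standard InfoNCE-style contrastive loss between the two enhanced views of each original sample, one common choice consistent with the L_NCE naming used above; the temperature tau and the exact form are assumptions.

```python
import torch
import torch.nn.functional as F

def nce_loss(first_feats, second_feats, tau=0.1):
    # first_feats / second_feats: (N, D) image features; row i of each
    # comes from the same original sample image, so the positives lie
    # on the diagonal of the similarity matrix.
    z1 = F.normalize(first_feats, dim=1)
    z2 = F.normalize(second_feats, dim=1)
    logits = z1 @ z2.t() / tau
    targets = torch.arange(z1.size(0))
    return F.cross_entropy(logits, targets)
```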
In a possible implementation, the classification module 602 is specifically configured to:
based on the first class characterization of each first sample image, connecting two first sample images whose similarity is greater than a first preset threshold to generate at least one first connected graph;
for each first connected graph, determining similarity information corresponding to the first connected graph based on the first class characterizations corresponding to all first sample images in the first connected graph;
determining first similarity information between every two first sample images in the first sample group based on the similarity information corresponding to each first connected graph;
the determining second similarity information between every two second sample images in the second sample group based on the second class characterization of each second sample image comprises:
based on the second category characterization of each second sample image, connecting two second sample images with similarity greater than a second preset threshold value to generate at least one second connected graph;
for each second connected graph, determining similarity information corresponding to the second connected graph based on second class representations corresponding to all second sample images in the second connected graph;
and determining second similarity information between every two second sample images in the second sample group based on the similarity information corresponding to each second connected graph.
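A compact sketch of the connected-graph construction described above, assuming cosine similarity between class characterizations and binarized targets (pairs inside the same connected graph get similarity 1, all other pairs 0); the union-find helper and the binarization are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def connected_graph_similarity(cls_repr, threshold):
    # cls_repr: (N, K) class characterizations of one sample group.
    z = F.normalize(cls_repr, dim=1)
    sim = z @ z.t()                           # pairwise similarity
    parent = list(range(len(z)))

    def find(i):                              # union-find root lookup
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Connect every pair whose similarity exceeds the preset threshold.
    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            if sim[i, j] > threshold:
                parent[find(i)] = find(j)

    roots = torch.tensor([find(i) for i in range(len(z))])
    # Target matrix: 1 for pairs in one connected graph, else 0.
    return (roots.unsqueeze(0) == roots.unsqueeze(1)).float()
```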
In a possible implementation, the apparatus further comprises:
the acquisition module is used for respectively acquiring a plurality of third sample images corresponding to each original sample image; wherein the third sample image is an enhanced image of the corresponding original sample image;
a training module, configured to train the target network by using each of the plurality of third sample images as both a first sample image and a second sample image;
in a possible implementation manner, the obtaining module is specifically configured to:
determining a current first iteration count;
and, in a case where the first iteration count is less than a preset iteration count, respectively acquiring a plurality of third sample images generated from each original sample image.
In a possible implementation manner, the obtaining module is specifically configured to:
determining, for each original sample image, target image features corresponding to each other original sample image except the original sample image, wherein the target image features corresponding to the other original sample images are the first image features or second image features of the other original sample images;
respectively determining third similarity information between the first image features or second image features of the original sample image and the target image features corresponding to each other original sample image;
and acquiring a plurality of third sample images corresponding to the original sample image based on the determined third similarity information.
In a possible implementation manner, the obtaining module is specifically configured to:
screening out the target image features whose third similarity information meets a preset condition;
and generating a plurality of third sample images based on the first sample images or second sample images corresponding to the target image features obtained by screening.
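A minimal sketch of this screening step, assuming the "preset condition" is modeled as keeping the top-k most similar target image features; augment_fn and top_k are illustrative stand-ins not fixed by the disclosure.

```python
import torch
import torch.nn.functional as F

def screen_and_generate(anchor_feat, other_feats, other_images,
                        augment_fn, top_k=4):
    # anchor_feat: (D,) first or second image feature of the original
    # sample; other_feats: (N, D) target image features of the other
    # originals; other_images: their first/second sample images.
    a = F.normalize(anchor_feat, dim=0)
    o = F.normalize(other_feats, dim=1)
    third_sim = o @ a                         # third similarity information
    idx = third_sim.topk(min(top_k, o.size(0))).indices.tolist()
    # Enhance the screened sample images once more to obtain third
    # sample images for the anchor.
    return [augment_fn(other_images[i]) for i in idx]
```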
In a possible implementation manner, the obtaining module is specifically configured to:
determining a current second iteration count;
and, in a case where the second iteration count is greater than the preset iteration count, determining the target image features corresponding to each other original sample image except the original sample image.
In a possible embodiment, the apparatus further comprises:
a fifth determining module, configured to determine, by using a target network to be trained, a third image feature of each third sample image;
the training module is specifically configured to perform classification processing on the plurality of third sample images based on the third image features to obtain a third class representation of each third sample image;
for each third sample image, respectively determining second loss information between each first sample image and the third sample image based on the third class characterization of the third sample image and the first class characterization of each first sample image;
for each third sample image, determining third loss information between each second sample image and the third sample image based on the third class characterization of the third sample image and the second class characterization of each second sample image, respectively;
determining a fourth training loss based on the determined second loss information and the third loss information;
training the target network to be trained based on the first training loss, the second training loss, the third training loss, and the fourth training loss.
In a possible embodiment, the training module is specifically configured to determine, for each third sample image, fourth loss information between each first sample image and the third sample image based on third image features of the third sample image and first image features of each first sample image, respectively;
for each third sample image, determining fifth loss information between each second sample image and the third sample image based on third image features of the third sample image and second image features of each second sample image respectively;
determining a fifth training loss based on the determined fourth loss information and the fifth loss information;
training the target network to be trained based on the first training loss, the second training loss, the third training loss, the fourth training loss, and the fifth training loss.
The description of the processing flow of each module in the apparatus and the interaction flow between the modules may refer to the relevant description in the above method embodiments, and will not be described in detail here.
Based on the same technical concept, an embodiment of the present disclosure further provides a computer device. Referring to fig. 7, a schematic structural diagram of a computer device 700 provided in an embodiment of the present disclosure includes a processor 701, a memory 702, and a bus 703. The memory 702 is used for storing execution instructions and includes an internal memory 7021 and an external memory 7022; the internal memory 7021 temporarily stores operation data for the processor 701 and data exchanged with the external memory 7022, such as a hard disk. The processor 701 exchanges data with the external memory 7022 through the internal memory 7021. When the computer device 700 runs, the processor 701 communicates with the memory 702 through the bus 703, causing the processor 701 to execute the following instructions:
determining a first image feature of each first sample image in a first sample group and a second image feature of a second sample image corresponding to each first sample image in a second sample group corresponding to the first sample group by using a target network to be trained; the first sample image and the corresponding second sample image are different enhanced images of the same original sample image;
based on the first image characteristics, carrying out classification processing on the multiple first sample images to obtain a first class characterization of each first sample image, and based on the second image characteristics, carrying out classification processing on the multiple second sample images to obtain a second class characterization of each second sample image;
determining first similarity information between every two first sample images in the first sample group based on the first class characterization of each first sample image;
determining second similarity information between every two second sample images in the second sample group based on the second class characterization of each second sample image;
and taking the first similarity information as supervision information of the second sample group, taking the second similarity information as supervision information of the first sample group, and training the target network to be trained.
The embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the steps of the training method for a neural network described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
An embodiment of the present disclosure also provides a computer program product carrying program code; the instructions included in the program code may be used to execute the steps of the neural network training method described in the foregoing method embodiments. For details, refer to the foregoing method embodiments, which are not repeated here.
The computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium; in another alternative embodiment, the computer program product is embodied in a software product, such as a software development kit (SDK).
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and apparatus described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative; for example, the division of the units is only one logical division, and there may be other divisions in actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that: the above-mentioned embodiments are merely specific embodiments of the present disclosure, used to illustrate the technical solutions of the present disclosure rather than to limit them, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure is described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that any person skilled in the art can still modify the technical solutions described in the foregoing embodiments, or easily conceive of changes, or make equivalent substitutions of some of their technical features within the technical scope of the present disclosure; such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure and shall be covered within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (15)

1. A method of training a neural network, comprising:
determining a first image feature of each first sample image in a first sample group and a second image feature of a second sample image corresponding to each first sample image in a second sample group corresponding to the first sample group by using a target network to be trained; the first sample image and the corresponding second sample image are different enhanced images of the same original sample image;
based on the first image characteristics, carrying out classification processing on the multiple first sample images to obtain a first class characterization of each first sample image, and based on the second image characteristics, carrying out classification processing on the multiple second sample images to obtain a second class characterization of each second sample image;
determining first similarity information between every two first sample images in the first sample group based on the first class characterization of each first sample image;
determining second similarity information between every two second sample images in the second sample group based on the second class characterization of each second sample image;
and taking the first similarity information as supervision information of the second sample group, taking the second similarity information as supervision information of the first sample group, and training the target network to be trained.
2. The method according to claim 1, wherein the training the target network to be trained by using the first similarity information as the supervision information of the second sample group and the second similarity information as the supervision information of the first sample group comprises:
determining a first training loss based on the first class characterization of each first sample image and the second similarity information;
determining a second training loss based on the second class characterization of each second sample image and the first similarity information;
training the target network to be trained based on the first training loss and the second training loss.
3. The method of claim 2, further comprising:
determining a third training loss based on the first image features of each first sample image and the corresponding second image features of each second sample image;
the training the target network to be trained based on the first training loss and the second training loss includes:
training the target network to be trained based on the first training loss, the second training loss, and the third training loss.
4. The method of claim 3, wherein determining a third training loss based on the first image features of each first sample image and the corresponding second image features of each second sample image comprises:
for each first sample image, determining loss information corresponding to the first sample image based on first image features of the first sample image and second image features of a second sample image corresponding to the first sample image;
and determining a third training loss based on the loss information corresponding to each first sample image.
5. The method according to any one of claims 1 to 4, wherein the determining of the first similarity information between every two first sample images in the first sample group based on the first class characterization of each first sample image comprises:
based on the first class characterization of each first sample image, connecting two first sample images with similarity greater than a first preset threshold value to generate at least one first connected graph;
for each first connected graph, determining similarity information corresponding to the first connected graph based on first class characterizations corresponding to all first sample images in the first connected graph;
determining first similarity information between every two first sample images in the first sample group based on the similarity information corresponding to each first connected graph;
the determining second similarity information between every two second sample images in the second sample group based on the second class characterization of each second sample image comprises:
based on the second category characterization of each second sample image, connecting two second sample images with similarity greater than a second preset threshold value to generate at least one second connected graph;
for each second connected graph, determining similarity information corresponding to the second connected graph based on second class representations corresponding to all second sample images in the second connected graph;
and determining second similarity information between every two second sample images in the second sample group based on the similarity information corresponding to each second connected graph.
6. The method of claim 3, further comprising:
respectively obtaining a plurality of third sample images corresponding to each original sample image; wherein the third sample image is an enhanced image of the corresponding original sample image;
and simultaneously using each third sample image in the third sample images as a first sample image and a second sample image to train the target network.
7. The method according to claim 6, wherein the separately obtaining a plurality of third sample images corresponding to each original sample image comprises:
determining a current first iteration count;
and respectively acquiring a plurality of third sample images generated based on each original sample image in a case where the first iteration count is less than a preset iteration count.
8. The method according to claim 6, wherein the separately acquiring a plurality of third sample images corresponding to each original sample image comprises:
determining target image features corresponding to other original sample images except the original sample image for each original sample image, wherein the target image features corresponding to the other original sample images are first image features or second image features corresponding to the other original sample images;
respectively determining first image characteristics or second image characteristics of the original sample image and third similarity information between target image characteristics corresponding to each other original sample image;
and acquiring a plurality of third sample images corresponding to the original sample image based on the determined third similarity information.
9. The method according to claim 8, wherein the obtaining a plurality of third sample images corresponding to the original sample image based on the determined third similarity information includes:
screening the target image characteristics of which the third similarity information meets preset conditions;
and generating a plurality of third sample images based on the first sample image or the second sample image corresponding to the target image features obtained by screening.
10. The method of claim 8, wherein the determining, for each original sample image, the target image feature corresponding to each other original sample image except the original sample image comprises:
determining a current second iteration count;
and, in a case where the second iteration count is greater than the preset iteration count, determining the target image features corresponding to each other original sample image except the original sample image.
11. The method of any one of claims 6 to 10, further comprising, after acquiring the plurality of third sample images:
determining a third image characteristic of each third sample image by using a target network to be trained;
the training the target network by using each third sample image in the third sample images as a first sample image and a second sample image simultaneously comprises:
based on the third image characteristics, carrying out classification processing on the multiple third sample images to obtain a third class representation of each third sample image;
for each third sample image, respectively determining second loss information between each first sample image and the third sample image based on the third class characterization of the third sample image and the first class characterization of each first sample image;
for each third sample image, determining third loss information between each second sample image and the third sample image based on a third class characterization of the third sample image and a second class characterization of each second sample image, respectively;
determining a fourth training loss based on the determined second loss information and the third loss information;
training the target network to be trained based on the first training loss, the second training loss, the third training loss, and the fourth training loss.
12. The method of claim 11, wherein the training the target network to be trained based on the first training loss, the second training loss, the third training loss, and the fourth training loss comprises:
for each third sample image, determining fourth loss information between each first sample image and the third sample image based on third image features of the third sample image and first image features of each first sample image respectively;
for each third sample image, determining fifth loss information between each second sample image and the third sample image based on third image features of the third sample image and second image features of each second sample image respectively;
determining a fifth training loss based on the determined fourth loss information and the fifth loss information;
training the target network to be trained based on the first training loss, the second training loss, the third training loss, the fourth training loss, and the fifth training loss.
13. An apparatus for training a neural network, comprising:
the device comprises a first determining module, a second determining module and a third determining module, wherein the first determining module is used for determining a first image characteristic of each first sample image in a first sample group and a second image characteristic of a second sample image corresponding to each first sample image in a second sample group corresponding to the first sample group by using a target network to be trained; the first sample image and the corresponding second sample image are different enhanced images of the same original sample image;
the classification module is used for classifying the plurality of first sample images based on the first image characteristics to obtain a first class representation of each first sample image, and classifying the plurality of second sample images based on the second image characteristics to obtain a second class representation of each second sample image;
the second determining module is used for determining first similarity information between every two first sample images in the first sample group based on the first class representation of each first sample image;
a third determining module, configured to determine second similarity information between every two second sample images in the second sample group based on a second class characterization of each second sample image;
and the training module is used for taking the first similarity information as the supervision information of the second sample group and taking the second similarity information as the supervision information of the first sample group to train the target network to be trained.
14. A computer device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when a computer device is run, the machine-readable instructions when executed by the processor performing the steps of the method of training a neural network of any one of claims 1 to 12.
15. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the method of training a neural network as claimed in any one of claims 1 to 12.
CN202110696473.XA 2021-06-23 2021-06-23 Neural network training method and device, computer equipment and storage medium Active CN113344189B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110696473.XA CN113344189B (en) 2021-06-23 2021-06-23 Neural network training method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113344189A (en) 2021-09-03
CN113344189B (en) 2022-10-18

Family

ID=77477939

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110696473.XA Active CN113344189B (en) 2021-06-23 2021-06-23 Neural network training method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113344189B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792821B (en) * 2021-11-15 2022-02-15 北京爱笔科技有限公司 Model training method and device for extracting human skeleton features

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950633A (en) * 2020-08-12 2020-11-17 深圳市商汤科技有限公司 Neural network training method, neural network target detection method, neural network training device, neural network target detection device and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11429809B2 (en) * 2019-09-24 2022-08-30 Beijing Sensetime Technology Development Co., Ltd Image processing method, image processing device, and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GreedyNAS: Towards Fast One-Shot NAS with Greedy Supernet; Shan You et al.; IEEE Xplore; 2020-08-05; full text *
Research on few-shot commodity image classification based on deep metric learning; Xu Chuanyun et al.; Journal of Chongqing University of Technology (Natural Science); 2020-09-15 (No. 09); full text *

Also Published As

Publication number Publication date
CN113344189A (en) 2021-09-03

Similar Documents

Publication Publication Date Title
Tian et al. Prior guided feature enrichment network for few-shot segmentation
Zhang et al. Detection of co-salient objects by looking deep and wide
Tang et al. Deepchart: Combining deep convolutional networks and deep belief networks in chart classification
Tao et al. Principal component 2-D long short-term memory for font recognition on single Chinese characters
Byeon et al. Scene labeling with lstm recurrent neural networks
Van der Maaten et al. Visualizing data using t-SNE.
Donahue et al. Decaf: A deep convolutional activation feature for generic visual recognition
Chen et al. Embedding attention and residual network for accurate salient object detection
Mao et al. Deep residual pooling network for texture recognition
Huang et al. Multiple attention Siamese network for high-resolution image change detection
Mottaghi et al. Analyzing semantic segmentation using hybrid human-machine crfs
Zhang et al. Sparse reconstruction for weakly supervised semantic segmentation
Hidru et al. EquiNMF: Graph regularized multiview nonnegative matrix factorization
JP6107531B2 (en) Feature extraction program and information processing apparatus
Xia et al. Weakly supervised multimodal kernel for categorizing aerial photographs
CN111125469A (en) User clustering method and device for social network and computer equipment
Liang et al. CEModule: A computation efficient module for lightweight convolutional neural networks
Zhou et al. Attention transfer network for nature image matting
Xu et al. Graphical modeling for multi-source domain adaptation
CN110598022B (en) Image retrieval system and method based on robust deep hash network
CN113344189B (en) Neural network training method and device, computer equipment and storage medium
Lin et al. Two stream active query suggestion for active learning in connectomics
Zhang et al. Discriminative learning of imaginary data for few-shot classification
CN114004364A (en) Sampling optimization method and device, electronic equipment and storage medium
CN110197213A (en) Image matching method, device and equipment neural network based

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant