CN114462290A - Method and device for generating pre-training artificial intelligence model

Method and device for generating pre-training artificial intelligence model

Info

Publication number
CN114462290A
Authority
CN
China
Prior art keywords: model, original, apparent, original image, apparent feature
Prior art date
Legal status: Pending
Application number
CN202110173151.7A
Other languages
Chinese (zh)
Inventor
魏龙辉 (Wei Longhui)
谢凌曦 (Xie Lingxi)
何建忠 (He Jianzhong)
田奇 (Tian Qi)
Current Assignee
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Cloud Computing Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co Ltd
Publication of CN114462290A


Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F 30/00 Computer-aided design [CAD]
            • G06F 30/20 Design optimisation, verification or simulation
              • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
          • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
            • G06F 16/50 Information retrieval of still image data
              • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
                • G06F 16/583 Retrieval using metadata automatically derived from the content
                • G06F 16/5866 Retrieval using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
          • G06F 18/00 Pattern recognition
            • G06F 18/20 Analysing
              • G06F 18/22 Matching criteria, e.g. proximity measures
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 Computing arrangements based on biological models
            • G06N 3/02 Neural networks
              • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Library & Information Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the application relates to a method for generating a pre-trained AI model, comprising the following steps: a plurality of original images is determined, and the apparent feature corresponding to each original image is determined according to an apparent feature extraction model; a corresponding pseudo label is then generated for each original image according to its apparent feature and its original label; finally, a first initial AI model is trained with the original data carrying the pseudo labels to obtain the pre-trained AI model. In the method, the apparent features of the original images are extracted by an apparent feature extraction model with strong generalization capability, and pseudo labels are then generated based on those apparent features and the original labels pre-stored with the original images. Because each pseudo label carries both apparent features and artificial semantics, the pre-trained AI model obtained by training on the original images carrying the pseudo labels inherits apparent features with generalization capability and gains a richer, fine-grained feature capture capability.

Description

Method and device for generating pre-training artificial intelligence model
Technical Field
The present application relates to the field of Artificial Intelligence (AI), and in particular, to a method and an apparatus for generating a pre-trained AI model.
Background
With the continuous development of AI, its applications are becoming more and more extensive. Different application scenarios typically call for AI models suited to their particular characteristics, and different AI models trained on data from different scenarios can be used to perform the specific tasks of those scenarios. However, constructing and training a dedicated AI model from scratch for each application scenario or task consumes a great deal of resources and time. A pre-trained AI model can therefore be obtained in advance and then further trained with the downstream task training data of the corresponding scenario, yielding a dedicated AI model that meets that scenario's requirements.
However, with current methods it is difficult to obtain a pre-trained AI model that offers both strong generalization ability and good precision.
Disclosure of Invention
An embodiment of the application provides a method for generating a pre-trained AI model. An apparent feature extraction model with good overall performance is obtained by self-supervised learning on original images; corresponding pseudo labels are then added to each original image based on the obtained apparent feature extraction model; and fully supervised contrastive learning on the pseudo-labelled images yields a general-purpose pre-trained AI model. This general pre-trained AI model combines the generalization capability of global apparent features with the semantic characteristics contained in the pseudo labels.
In a first aspect, a method for generating a pre-trained AI model is provided. The method comprises: determining, from a plurality of original images and a pre-trained apparent feature extraction model, the apparent feature corresponding to each original image; generating a corresponding pseudo label for each original image according to its apparent feature and its original label (in some examples, each original image may have at least one pseudo label); and training a first initial AI model with the original images carrying the pseudo labels to obtain the pre-trained AI model, for example by fully supervised contrastive learning. In this method, the apparent features of the original images are extracted by an apparent feature extraction model with strong generalization capability, and pseudo labels are then generated based on those apparent features and the original labels pre-stored with the original images. Because each pseudo label carries both apparent information and artificial semantics, the pre-trained AI model trained on the pseudo-labelled images inherits apparent features with generalization capability and also gains a richer, fine-grained feature capture capability.
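As a non-limiting illustration of the first aspect, the following minimal Python sketch outlines the claimed steps; every function and variable name here is hypothetical and not taken from the specification.

```python
# Illustrative sketch of the claimed pipeline; all names are hypothetical.
def generate_pretrained_model(images, original_labels,
                              feature_extractor, first_initial_model):
    # Step 1: determine the apparent feature of each original image.
    features = [feature_extractor(img) for img in images]
    # Step 2: generate one or more pseudo labels per image from the
    # apparent features and the pre-stored original labels.
    pseudo_labels = generate_pseudo_labels(features, original_labels)
    # Step 3: fully supervised contrastive training on the pseudo-labelled
    # images yields the pre-trained AI model.
    return contrastive_train(first_initial_model, images, pseudo_labels)
```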
In one possible embodiment, determining the pseudo label of each original image according to its apparent feature and its preset original label may include: for each original image, determining its apparent feature, referred to as the first apparent feature; determining the apparent feature of at least one other original image having the same original label (in one example, the apparent feature of such an image among the plurality of original images is referred to as a second apparent feature); determining the similarity between the first apparent feature and each second apparent feature; and generating a pseudo label for the original image together with the other original images whose second apparent features satisfy a preset similarity condition with respect to the first apparent feature. The application thus subdivides a set of original images sharing the same original label and generates finer-grained pseudo labels from the similarity of apparent features between different original images.
In one possible embodiment, determining the pseudo labels of the original image and of the other original images whose similarity to the first apparent feature satisfies the preset condition may include: determining k other original images according to the similarity between the first and second apparent features, where k is a positive integer, and determining the pseudo label for the original image corresponding to the first apparent feature together with the k selected images. Generating finer-grained pseudo labels for the subset of mutually similar images sharing an original label gives the original images richer artificial semantics, so that the finally trained pre-trained AI model has a richer fine-grained feature capture capability.
In one possible embodiment, determining the pseudo labels of the original image and the k other original images may include: ranking the at least one other original image by the similarity between the first and second apparent features, and then selecting the top k other original images together with the original image to generate the pseudo label. Selecting the k images by similarity ranking ensures that they fit the generated pseudo label more closely, so that a pre-trained AI model with a richer fine-grained feature capture capability is finally trained.
In one possible embodiment, determining the pseudo labels of the original image and the other original images may include: determining, from the similarity between the first and second apparent features, the other original images whose similarity is greater than or equal to a preset similarity threshold, and generating the pseudo label for the original image together with those images. By presetting a similarity threshold and labelling together only the images above it, the original images gain richer artificial semantics, and a pre-trained AI model with a richer fine-grained feature capture capability is finally trained.
In one possible embodiment, when the number of other original images with similarity greater than or equal to the similarity threshold is itself greater than or equal to k, the k other original images may be determined according to a preset rule, and pseudo labels are added to the original image and those k images, where k is a positive integer. Generating pseudo labels for only some of the sufficiently similar images prevents images semantically far from the pseudo label from receiving it, and thus avoids overfitting of the pre-trained AI model to the fine-grained division.
In one possible embodiment, the original image and the other original images whose second apparent features satisfy the preset similarity condition with respect to the first apparent feature receive the same pseudo label. In one example, the pseudo label is a subclass of the original label.
In one possible embodiment, determining the similarity between the first apparent feature and the second apparent feature may include: for each original image, determining the first and second apparent features with the apparent feature extraction model, computing the cosine distance between them, and taking that cosine distance as their similarity. The cosine distance reflects the degree of similarity between two original images well.
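As a hedged illustration of this embodiment, the cosine-based similarity between two apparent feature vectors can be computed as follows (a minimal sketch; the function name is illustrative).

```python
import numpy as np

# Minimal sketch: cosine similarity between two apparent feature vectors.
# With unit-normalized features this equals their dot product.
def cosine_sim(f1: np.ndarray, f2: np.ndarray) -> float:
    return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))
```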
In one possible embodiment, training the first initial AI model with the plurality of pseudo-labelled original images to obtain the pre-trained AI model may include: taking the pseudo-labelled original images as the input of the first initial AI model, measuring the difference between the model's output and the ground truth with a contrastive loss function, and adjusting the network parameters of the first initial AI model by back-propagation; the pre-trained AI model is obtained after several iterations. Iterating the first initial AI model on the pseudo-labelled original images yields a pre-trained AI model that inherits apparent features with generalization capability and has a richer fine-grained feature capture capability. In one example, the first initial AI model may be trained by fully supervised contrastive learning.
In one possible implementation, the apparent feature extraction model may be determined in advance by self-supervised learning training of a second initial AI model on the plurality of original images.
In one possible embodiment, the method may further comprise: acquiring a plurality of training data items for an application scenario, and then training the pre-trained AI model with them to determine a target AI model for that scenario. The pre-trained AI model can thus be trained in a targeted way to meet a variety of task requirements.
In a second aspect, an apparatus for generating a pre-trained AI model is provided. The apparatus comprises: an acquisition unit, which may be used to acquire a plurality of original images; a pseudo label generation unit, which may be used to determine the apparent features of the original images according to the original images and an apparent feature extraction model, and to determine a pseudo label for each original image according to its apparent feature and its preset original label; and a first training unit, which trains a first initial AI model with the pseudo-labelled original images to obtain the pre-trained AI model. As in the first aspect, the apparent features are extracted by an apparent feature extraction model with strong generalization capability, pseudo labels are generated from those features and the original labels pre-stored with the original images, and the pre-trained AI model trained on the pseudo-labelled images inherits apparent features with generalization capability while gaining a richer fine-grained feature capture capability.
In one possible embodiment, the pseudo label generation unit is further configured to: for each original image, determine the similarity between the first apparent feature and the second apparent feature, where the first apparent feature is the apparent feature of the original image and the second apparent feature is the apparent feature of at least one other original image among the plurality having the same original label; and determine the pseudo labels of the original image and of the other original images whose second apparent features satisfy the preset similarity condition. A set of original images sharing the same original label is thus subdivided, and finer-grained pseudo labels are generated from the similarity of apparent features between different images.
In one possible embodiment, the pseudo label generation unit is further configured to determine the pseudo labels of the original image and of k other original images according to the similarity between the first and second apparent features, where k is a positive integer. Generating finer-grained pseudo labels for the mutually similar subset of images sharing an original label gives the original images richer artificial semantics, so that a pre-trained AI model with a richer fine-grained feature capture capability is finally trained.
In one possible embodiment, the pseudo label generation unit is further configured to rank the at least one other original image by the similarity between the first and second apparent features and to determine the pseudo labels of the original image and the top k other original images in the ranking. Selecting the k images by similarity ranking ensures that they fit the generated pseudo label more closely.
In one possible embodiment, the pseudo label generation unit is further configured to determine, from the similarity between the first and second apparent features, the other original images whose similarity is greater than or equal to the similarity threshold, and to determine the pseudo labels of the original image and those images. By presetting a similarity threshold and labelling together only the images above it, the original images gain richer artificial semantics, and a pre-trained AI model with a richer fine-grained feature capture capability is finally trained.
In one possible embodiment, the pseudo label generation unit is further configured to: if the number of other original images with similarity greater than or equal to the similarity threshold is greater than or equal to k, determine k other original images according to a preset rule, where k is a positive integer, and determine the pseudo labels of the original image and those k images. Generating pseudo labels for only some of the sufficiently similar images prevents images semantically far from the pseudo label from receiving it, and thus avoids overfitting of the pre-trained AI model to the fine-grained division.
In one possible embodiment, the original image and the other original images whose second apparent features satisfy the preset similarity condition receive the same pseudo label, the pseudo label being a subclass of the original label.
In one possible embodiment, the pseudo label generation unit is further configured to: for each original image, determine the first and second apparent features from the apparent feature extraction model, compute their cosine distance, and take that cosine distance as their similarity. The cosine distance reflects the degree of similarity between two original images well.
In one possible embodiment, the first training unit is further configured to take the pseudo-labelled original images as the input of the first initial AI model and to iterate the network parameters of the first initial AI model with a contrastive loss function, obtaining the pre-trained AI model. Iterating on the pseudo-labelled images yields a pre-trained AI model that inherits apparent features with generalization capability and has a richer fine-grained feature capture capability. In one example, the first initial AI model may be trained by fully supervised contrastive learning.
In one possible embodiment, the apparatus further comprises a second training unit, configured to perform self-supervised learning training of a second initial AI model on the plurality of original images in advance and to determine the apparent feature extraction model.
In one possible embodiment, the acquisition unit is further configured to acquire a plurality of training data items for an application scenario, and the apparatus further comprises a third training unit, configured to train the pre-trained AI model with those training data and determine a target AI model for the application scenario. The pre-trained AI model can thus be trained in a targeted way to meet a variety of task requirements.
In a third aspect, a computing device is provided, comprising a processor and a memory; when the processor reads and executes the instructions stored in the memory, the computing device performs the method of any implementation of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, having instructions stored therein which, when run on a computing device, cause the computing device to perform the method of any implementation of the first aspect.
The application discloses a method and an apparatus for generating a pre-trained AI model: the apparent features of the original images are extracted by an apparent feature extraction model with strong generalization capability, pseudo labels are then generated based on those apparent features and the original labels pre-stored with the original images, and the pre-trained AI model is obtained by training on the original images carrying the pseudo labels. Because each pseudo label carries both apparent features and artificial semantics, the pre-trained AI model inherits apparent features with generalization capability and gains a richer fine-grained feature capture capability.
Drawings
Fig. 1 is a schematic view of an application scenario of a pre-trained AI model according to an embodiment of the present disclosure;
fig. 2 is a schematic view of an application scenario of another pre-trained AI model according to an embodiment of the present disclosure;
FIG. 3 is a block diagram of a system for generating a pre-trained AI model according to an embodiment of the present disclosure;
fig. 4 is a flowchart of a method for generating a pre-trained AI model according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of an artificial intelligence model network according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of another artificial intelligence model network structure provided in the embodiments of the present application;
FIG. 7 is a schematic diagram of a self-supervised contrastive learning architecture;
FIG. 8 is a schematic diagram of a fully supervised contrastive learning architecture;
fig. 9 is a flowchart of a method for applying a pre-trained AI model according to an embodiment of the present disclosure;
fig. 10 is a schematic diagram of an apparatus for generating a pre-trained AI model according to an embodiment of the present disclosure;
fig. 11 is a schematic diagram of another apparatus for generating a pre-trained AI model according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
The application mainly concerns pre-training scenarios for AI models, such as the one shown in fig. 1. As shown there, obtaining a dedicated AI model (here a fine-grained object classification AI model) requires the assistance of a pre-trained AI model. Before the fine-grained object classification AI model can be obtained, an initial AI model must be pre-trained on raw data to yield a pre-trained AI model; that pre-trained AI model is then trained with downstream training data from different application scenarios to obtain a dedicated AI model suited to each scenario. For example, fine-grained object classification has wide application (flower and bird recognition, garbage classification, pet recognition, etc.), and labeling its training data usually requires a strong background of professional knowledge. In actual business, the amount of training data available to a fine-grained object classification task is often small; with only a few training samples, it is difficult to obtain a suitable AI model by training an initial AI model directly on the scenario's data, so the assistance of a pre-trained AI model is required. Note that the term initial AI model in this application denotes an untrained AI model. The pre-trained AI model is obtained by pre-training the initial AI model on raw data; it can then be applied to different application scenarios and trained with the corresponding downstream training data to obtain dedicated AI models for those scenarios.
In the fine-grained object classification scenario, once the pre-trained AI model has been obtained, training it with a small amount of fine-grained object classification data yields the fine-grained object classification AI model; this model is the dedicated AI model of the fine-grained classification scenario in the figure. The fine-grained object classification AI model performs the classification task corresponding to the content of its training data: with flower and bird data it classifies flowers and birds; with garbage classification data it classifies and identifies different types of garbage.
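For illustration only, the sketch below shows how such a dedicated model might be derived by fine-tuning the pre-trained backbone on a small fine-grained data set; it assumes a PyTorch backbone whose output dimension is feat_dim, and all names and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical fine-tuning sketch: attach a task head to the pre-trained
# backbone and train on the scarce downstream classification data.
# Assumes backbone(images) returns a (N, feat_dim) feature tensor.
def finetune(backbone: nn.Module, feat_dim: int, num_classes: int,
             loader, epochs: int = 10, lr: float = 1e-3) -> nn.Module:
    model = nn.Sequential(backbone, nn.Linear(feat_dim, num_classes))
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:          # downstream training data
            opt.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            opt.step()
    return model                               # the dedicated AI model
```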
Of course, in other application scenarios, such as the object-detection scenario shown in fig. 2, the pre-trained AI model may be trained with object detection data after it is obtained, yielding an object detection AI model. The object detection data may be general object detection data or specific object detection data; either kind is scarcer than image classification data, so the dedicated AI model still needs the assistance of a pre-trained AI model. The common practice at present is to train on image classification data to obtain a well-trained pre-trained AI model and then train that model with object detection data, which accelerates the model's convergence or improves its accuracy after final convergence, finally yielding a dedicated AI model for object detection, i.e. the object detection AI model. With general object detection data, the resulting model identifies general objects; with specific object detection data, it identifies the specific objects.
Those skilled in the art will understand that the application can likewise be applied to scenarios such as speech recognition and character recognition; the only difference is the data used during training, which ultimately determines the scenario to which the trained dedicated AI model applies. The data used during training may include the raw data used to generate the pre-trained AI model and the downstream training data used to train downstream tasks. The terms sample, training sample, and training data are used herein as the same concept.
When the initial AI model is pre-trained with raw data to obtain a pre-trained AI model, either self-supervised learning or fully supervised learning can usually be used. Self-supervised learning, also called unsupervised learning, needs no manual labels on the raw data: it learns directly on the raw data, guided only by a proxy task. Fully supervised learning, also simply called supervised learning, infers a function from a labelled training data set, using the manual labels as guidance to learn knowledge related to the target task. In the machine-learning domain, the knowledge learned by either approach amounts to the individual parameters within the respective AI model.
Self-supervised learning is currently one of the popular research directions in computer vision. Compared with fully supervised learning, self-supervised contrastive learning can learn features with stronger generalization capability, which serve better as the parameters of a pre-trained model for improving downstream tasks such as object detection or instance segmentation.
Current experimental results suggest that, even with training data sets of the same scale, self-supervised learning is better suited than fully supervised learning as the learning scheme for a pre-trained AI model. From the point of view of information theory this result is counter-intuitive: manual labeling can be viewed as adding human-summarized semantics to the raw data, so a labelled data set contains more knowledge than an unlabelled one, and ideally the knowledge learned on a pre-training data set by self-supervised learning should be a subset of the knowledge learned by fully supervised learning, which should therefore perform better on downstream tasks. In practice, however, self-supervised learning performs better on downstream tasks because fully supervised learning overfits: it usually focuses only on regions related to the target task in order to acquire invariant semantic features. Such semantic features are biased knowledge that works well only on the target task and is hard to migrate to downstream tasks. Self-supervised learning, by contrast, divides instances through a proxy task and thereby learns a global characterization feature that describes the whole image. But although global characterization features migrate to downstream tasks better than local semantic features, they are coarse and weak at capturing fine-grained local features.
Therefore, the method for generating a pre-trained AI model provided by the application extracts the apparent features of the original images with an apparent feature extraction model of strong generalization capability, and then generates finer-grained pseudo labels from those apparent features combined with the original labels pre-stored with the original images. Each pseudo label carries both apparent information and artificial semantics, and the initial AI model is trained with the pseudo-labelled original images to obtain the pre-trained AI model. The resulting pre-trained AI model inherits apparent features with generalization capability and has a richer fine-grained feature capture capability, and it can then be migrated to downstream tasks to improve their performance. The apparent feature extraction model itself can be trained on original images that carry no original labels.
The technical solutions in the embodiments of the present application will be described in detail below with reference to the drawings in the embodiments of the present application.
The scheme is described with images serving as both the raw data and the training data; in some cases text or speech data may be used instead, so that the pre-trained AI model obtained after training performs text recognition or speech recognition tasks.
Fig. 3 is a diagram of a system framework for generating a pre-trained AI model according to an embodiment of the present disclosure.
As shown in fig. 3, the application provides a system for generating a pre-trained AI model. The system first acquires a plurality of original images, which are input into the apparent feature learning module 301 for unsupervised contrastive learning to obtain an apparent feature extraction model. The original images used in this unsupervised contrastive learning may or may not carry original labels; when they do, the original labels are simply not used during the training. The trained apparent feature extraction model has good generalization capability and global apparent features. The apparent feature learning module 301 then uses this model to determine the apparent feature of each original image and inputs each image's apparent feature into the pseudo label generator 302, which, combined with each image's preconfigured original label, generates one or more corresponding pseudo labels per image. Once every original image has one or more pseudo labels, the pseudo-labelled images can be input into the supervised contrastive learning module 303 for supervised contrastive learning, yielding a pre-trained AI model that combines the generalization capability of global apparent features with semantic discrimination. After the pre-trained AI model provided by the application is obtained, targeted downstream task training can be performed by combining it with downstream training data: the downstream training data and the pre-trained AI model are input into the downstream AI model training module 304, which trains a downstream task AI model for the specific application scenario.
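Read as a data flow, the four modules of fig. 3 compose as sketched below; the module names follow the figure, but every function signature is invented for illustration.

```python
# Hedged data-flow sketch of the fig. 3 system; all signatures are invented.
def run_system(raw_images, original_labels, downstream_data):
    extractor = apparent_feature_learning(raw_images)            # module 301
    features = [extractor(img) for img in raw_images]
    pseudo = pseudo_label_generator(features, original_labels)   # module 302
    pretrained = supervised_contrastive_learning(raw_images,     # module 303
                                                 pseudo)
    return downstream_ai_model_training(pretrained,              # module 304
                                        downstream_data)
```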
In some examples, the step of inputting the original images into the apparent feature learning module 301 for unsupervised contrastive learning to obtain the apparent feature extraction model may be performed in advance. In that case, the original images, their original labels, and each image's apparent feature can be input directly into the pseudo label generator 302 to generate one or more corresponding pseudo labels per image.
In one example, the apparent feature learning module 301 may train with a self-supervised contrastive learning method to obtain an apparent feature extraction model with good generalization capability; such a model describes the apparent content of an image well. Other equivalent self-supervised learning methods may of course be used, and the application is not limited in this respect.
In another example, the pseudo label generator 302 further divides the original labels of the original images using each image's apparent features, generating finer-grained pseudo labels. The generated pseudo labels thus embed the apparent features while fusing in artificial prior semantic information.
In another example, the supervised contrastive learning module 303 performs fully supervised contrastive learning guided by the finer-grained pseudo labels produced by the pseudo label generator 302 and by a contrastive loss function, obtaining the pre-trained AI model of the application. Through the pseudo labels this model inherits a good apparent feature space, while the semantic information contained in the pseudo labels adjusts local regions of that space, pulling samples with similar apparent features and similar semantics closer together and pushing samples with similar apparent features but different semantics farther apart.
The system for generating a pre-trained AI model according to the application can clearly be applied widely to downstream tasks such as classification, detection, and segmentation; its deployed form can therefore be a built-in pre-trained AI model offered as a cloud service, helping downstream users improve the training speed and final precision of various downstream tasks.
Fig. 4 is a flowchart of a method for generating a pre-trained artificial intelligence model according to an embodiment of the present disclosure.
As shown in fig. 4, the application provides a method of generating a pre-trained AI model that may run in the system shown in fig. 3. The method mainly serves to provide a pre-trained AI model that has both generalization capability and a richer fine-grained feature capture capability, so that when a specific downstream recognition or classification task is performed, the model adapts better to the downstream task and yields a target AI model for the corresponding scenario; the target AI model is the dedicated AI model.
In one example, the method of generating a pre-trained AI model according to the application may also be referred to as supervised contrastive learning in the neighborhood (SCAN). The method may comprise the following steps:
s401, a plurality of original images are acquired.
A plurality of raw images is first acquired; these raw images serve as pre-training data for pre-training the AI model.
S402, according to the plurality of original images and the apparent feature extraction model, determining the apparent feature corresponding to each original image.
An apparent feature extraction model is acquired, and the plurality of original images is input into it to obtain the apparent feature corresponding to each original image.
In one example, the apparent feature extraction model may be pre-trained by the apparent feature learning module 301 described above. For example, before S402, a second initial AI model may be trained by self-supervised learning on a plurality of original images. The original images used in this self-supervised training may be those obtained in S401; if they carry original labels, the labels can be ignored, i.e. the original labels are not used during self-supervised learning training. The second initial AI model is one of the initial AI models to which the application relates. The initial AI models referred to in the application may use network structures such as MobileNet or a residual network (ResNet); other equivalent network structures may also be adopted as the initial AI model, and the application is not limited herein.
In one example, fig. 5 shows a network structure using MobileNet as the second initial AI model; fig. 5 is only one possible MobileNet structure. It may include a 3 × 3 depthwise separable convolution layer 501, a first batch normalization (BN) layer 502, a first linear rectification function (ReLU) layer 503, a 1 × 1 convolution layer 504, a second BN layer 505, and a second ReLU layer 506. The first ReLU layer 503 and the second ReLU layer 506 may also be called activation function layers.
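A minimal PyTorch rendering of the fig. 5 block is sketched below, under the assumption that the layer order is as listed above; channel counts and stride are illustrative, not taken from the figure.

```python
import torch.nn as nn

# Hedged sketch of the fig. 5 depthwise-separable block; in_ch, out_ch and
# stride are illustrative assumptions.
class DepthwiseSeparableBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # 3x3 depthwise convolution (layer 501): one filter per input channel.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)     # first BN layer (502)
        self.relu1 = nn.ReLU(inplace=True)   # first ReLU layer (503)
        # 1x1 pointwise convolution (layer 504) mixing channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)    # second BN layer (505)
        self.relu2 = nn.ReLU(inplace=True)   # second ReLU layer (506)

    def forward(self, x):
        x = self.relu1(self.bn1(self.depthwise(x)))
        return self.relu2(self.bn2(self.pointwise(x)))
```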
In another example, as shown in fig. 6, ResNet may also be used as the network structure of the second initial AI model. In the ResNet structure, the original image first passes through a 7 × 7 convolutional layer and, after pooling, is fed into the residual blocks. Each pair of 3 × 3 convolutional layers can be regarded as one residual block, whose output is the sum of the second convolutional layer's output and the block's input; the convolutional branch thus learns the residual, i.e. the difference between the block's output and its input. In other examples, each residual block may include more or fewer convolutional or other layers, the kernel size of each convolutional layer may be set arbitrarily according to the actual situation, and the number of residual blocks may likewise be set arbitrarily; the values in the figure are only one possible example and the application is not limited thereto.
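Similarly, a hedged PyTorch sketch of one fig. 6 residual block follows (two 3 × 3 convolutions plus an identity shortcut); normalization placement and channel counts are assumptions.

```python
# Hedged sketch of one fig. 6 residual block; channel count is illustrative.
class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # shortcut: add the block input back
```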
Of course, in other examples, other equivalent or similar network structures may be adopted as the second initial AI model, and the application is not limited herein. The network structure used may include more or fewer layers, such as convolutional, pooling, or fully connected layers.
When the second initial AI model undergoes self-supervised learning training, the plurality of original images may be taken as its input and self-supervision performed by clustering. If the original images used for this training carry original labels (for example, the images acquired in S401 may specifically be used), the labels may be omitted during the self-supervised training. In some examples, the self-supervised training adopts contrastive learning, also called momentum contrast (MoCo), to construct a characterization feature space. In the example shown in fig. 7, the encoder can be understood as the second initial AI model, and a momentum encoder is maintained alongside it. As fig. 7 shows, after an original image is input into the encoder (i.e. the second initial AI model), the output q is compared for similarity, one by one, with each vector in the memory bank queue; each vector in the queue is a feature vector of a contrast image produced by the momentum encoder. The comparison can be computed, for example, with a contrastive loss function. A q with low similarity to every vector in the queue can then be written into the memory bank queue; if the number of vectors in the queue is preset, the vector most similar to q can be replaced when the queue is updated. In some examples, a similarity threshold may be set as the criterion for judging whether q is similar to the queue vectors; for example, a first similarity threshold may be set for comparing the second initial AI model's output q with the memory bank queue during self-supervised training. The similarity may be calculated with a cosine function, the Euclidean distance, the Manhattan distance, and so on, and the application is not limited herein.
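The momentum bookkeeping described above can be sketched as follows. The momentum update is the standard MoCo exponential moving average; the queue update shown is the common first-in-first-out policy, whereas the text above also describes a similarity-based replacement variant. All names and the momentum coefficient are illustrative.

```python
import torch

# Hedged MoCo-style sketch; encoder_q is the encoder, encoder_k the
# momentum encoder, and `queue` the (D x K) memory bank of contrast features.
@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m: float = 0.999):
    # Momentum encoder parameters track the encoder as a moving average.
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

@torch.no_grad()
def enqueue_dequeue(queue: torch.Tensor, keys: torch.Tensor, ptr: int) -> int:
    # FIFO memory bank update: new keys overwrite the oldest entries.
    # Assumes the queue length K is a multiple of the batch size; a
    # similarity-based replacement rule could be used here instead.
    batch = keys.shape[0]
    queue[:, ptr:ptr + batch] = keys.T
    return (ptr + batch) % queue.shape[1]
```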
Introducing the memory bank queue clearly avoids the limitation imposed by GPU memory size and provides a large number of contrast pairs in each batch. Continuously updating the queue enlarges the set of images available for comparison, and updating the sample features in the queue by momentum keeps them stable and consistent with the original image feature space.
Through this contrastive learning, the network parameters in the encoder are continuously updated, so that the apparent feature extraction model finally learned can possess good global apparent features.
In one example, the apparent feature extraction model can include an apparent feature space mapping function f(·) and an apparent feature space momentum mapping function g(·), where f(·) is the encoder and g(·) the momentum encoder mentioned in the contrastive learning above.
In some schemes, training stops here and the apparent feature extraction model is used directly as the final pre-trained AI model. But the model obtained at this point possesses only global apparent features with strong generalization capability; because the training images lacked the guidance of original labels, it cannot identify fine-grained semantic information, and its feature richness is therefore very insufficient.
Other schemes apply contrastive learning directly to fully supervised training, replacing the traditional normalized exponential (softmax) loss function with a contrastive loss function. Fig. 8 shows such a supervised contrastive learning (SCL) method. In its first stage, the positive and negative pairs of the contrastive loss function are constructed from the original labels carried by the original images, for example by applying data enhancement twice to two different original images, first at 2048 dimensions and then at 128 dimensions (the 2048-D and 128-D shown in the figure); the dimensionality and the number of enhancement passes may of course be set arbitrarily according to the actual situation, and the application is not limited in this respect. Once the positive and negative pairs are constructed, normal supervised learning is performed on the main neural network, for example training the network parameters of a fully connected (FC) layer through cross entropy and normalizing through a 1000-D softmax loss layer, to obtain a trained pre-trained AI model. However, a pre-trained AI model learned in the SCL manner only improves the performance of the specific target task: if the labels in the training images denote dogs, the model improves only at identifying dogs, not at identifying cats, tables, horses, or other objects. The deep semantics learned by such a model is still biased, and the generated pre-trained AI model is severely overfitted to the target task, so it cannot be effectively generalized to downstream tasks and its performance there cannot be guaranteed.
Therefore, to ensure that the pre-trained AI model migrates better to downstream tasks, the application obtains the apparent feature extraction model by self-supervised contrastive learning training, using a self-supervised learning method to obtain a model loosely coupled to any task. The resulting apparent feature extraction model can describe the global information of the whole image, realizing instance-level classification. The apparent features of the original images are then obtained through the apparent feature extraction model, and pseudo labels are added to the original images. In some examples, after the apparent feature extraction model is obtained, each original image from S401 is input into it to obtain the corresponding apparent feature. For example, if the model was obtained by contrastive learning, the apparent feature of an original image xi can be represented by f(·), i.e. the apparent feature of xi is f(xi). The above process can be accomplished in the apparent feature learning module 301.
S403, determining a pseudo label for each original image according to the plurality of apparent features and the original labels preset for the original images.
After the apparent feature f(·) of each original image is obtained in S402, the apparent feature learning module 301 inputs it into the pseudo label generator 302, which determines the pseudo label of each original image by combining the original labels preconfigured for each image. Each original image may have one or more pseudo labels.
In one example, the pseudo labels of the respective original images may be constructed using the similarity between different original images. For example, one original image xi may be selected from the plurality of original images acquired in S401, and its apparent feature f(xi) determined; f(xi) may be written as the first apparent feature. The apparent features of all remaining original images are then determined, and the apparent feature of each other original image may be written as a second apparent feature: denoting any original image other than xi as xj, the second apparent feature can be expressed as f(xj). The similarity of the first apparent feature to the second apparent feature is then determined, for example by equation 1:

similarity(xi, xj) = f(xi) · f(xj)    (equation 1)

where similarity(xi, xj) denotes the similarity between original images xi and xj.
In one example, the cosine distance between original images xi and xj may be used as their similarity; in other examples, any equivalent measure such as the Euclidean or Manhattan distance may be used instead, and the application is not limited thereto. The pseudo label generator 302 may then set the same pseudo label for original image xi and every original image xj whose similarity(xi, xj) satisfies a preset condition.
Of course, in some instances, to reduce computation and to avoid cases where apparent features are similar but semantics are completely different, the candidate images xj may be determined according to the preset original labels. For example, all original images with the same original label are first treated as one group, so that the plurality of original images is divided into at least one group by original label. Within each group, any one image is selected as xi, the remaining images in the group serve as xj, and the similarities between the apparent features of the different original images are computed.
In one example, the preset condition may operate as follows: for each original image xi, determine the similarity(xi, xj) of every image xj, and select k images xj according to a preset rule, where k is a positive integer; the original image xi and the k selected images xj are given the same pseudo label. In some examples, the preset rule sorts the images xj by similarity(xi, xj) in descending order, and xi and the top k images xj in the sequence receive the same pseudo label. Of course, the k images xj may be selected by any other feasible preset rule, for example randomly or by a preset formula, and the application is not limited thereto. In this way, every original image in a group takes its turn as xi, so that finally every image in the group carries at least one pseudo label; after all the original images acquired in S401 have been processed in this manner, each of them carries at least one pseudo label.
In another example, the preset condition may be based on a preset similarity threshold, for example a second similarity threshold. All original images x_j whose similarity(x_i, x_j) is greater than or equal to the second similarity threshold are assigned the same pseudo label as the original image x_i. Of course, to avoid too many original images sharing the same pseudo label, in some examples only some of the original images x_j whose similarity(x_i, x_j) is greater than or equal to the second similarity threshold may be assigned the same pseudo label as the original image x_i. For example, the one or more original images x_j whose similarity(x_i, x_j) is greater than or equal to the second similarity threshold are determined and sorted in descending order of similarity(x_i, x_j), and k original images x_j are then selected according to a preset rule. For example, the preset rule may be to select the top k original images x_j in the sorted order. Of course, the k original images x_j may also be selected from the one or more original images x_j according to any other possible preset rule, for example randomly, or by computing them with a predetermined formula, and the present application is not limited thereto.
It will be appreciated that in some examples, if the value of k is preconfigured and the number of original images x_j whose similarity(x_i, x_j) is greater than or equal to the second similarity threshold is less than k, all of the original images x_j whose similarity(x_i, x_j) is greater than or equal to the second similarity threshold may be assigned the same pseudo label as the original image x_i.
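The threshold variant, including the cap of k neighbors and the fallback when fewer than k candidates qualify, might look like this sketch; the function name and calling convention are assumed for illustration.

```python
def select_neighbors_by_threshold(candidates, similarities, threshold, k=None):
    """Keep candidates whose similarity meets the second similarity threshold;
    if more than k qualify, keep only the k most similar, and if fewer than
    k qualify, keep all of them."""
    qualified = [(j, s) for j, s in zip(candidates, similarities) if s >= threshold]
    qualified.sort(key=lambda pair: -pair[1])      # descending similarity
    if k is not None:
        qualified = qualified[:k]
    return [j for j, _ in qualified]

# Example: candidate images 0-3 with similarities to x_i; threshold 0.6, k = 2.
print(select_neighbors_by_threshold([0, 1, 2, 3], [0.9, 0.5, 0.7, 0.65], 0.6, k=2))
# -> [0, 2]
```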
When the pseudo labels are determined in the above manners, the determined pseudo labels are grouped according to the original labels, and a plurality of different pseudo labels are determined within each group. The pseudo labels determined in this way are therefore subclasses of the corresponding original labels. For example, when the original label is dog, the pseudo labels under the corresponding category may be dog 1, dog 2, dog 3, and so on. Meanwhile, since the same original image may have high similarity with several different original images, each original image may carry one or more different pseudo labels.
S404, training the first initial AI model by using the original images carrying the pseudo labels to obtain a pre-trained AI model.
After the pseudo label generator 302 determines at least one pseudo label for each original image, the local supervised contrastive learning module 303 may train the first initial AI model using the original images carrying the pseudo labels, obtaining a pre-trained AI model that has both the generalization capability of global apparent features and the discrimination capability for semantic features. In one example, training the first initial AI model using the original images carrying the pseudo labels may be performed as fully supervised contrastive learning. It is to be understood that the first initial AI model is also the initial AI model described herein.
In one example, the network structure described in S402 above may be employed as the first initial AI model. The original images carrying the pseudo labels are taken as the input of the first initial AI model, and fully supervised contrastive learning is carried out in a contrastive-learning manner. It can be appreciated that, since each input original image now carries a pseudo label, the first initial AI model can be trained with fully supervised contrastive learning. Owing to the contrastive-learning setup, the first initial AI model can include a mapping function f'(·) corresponding to the encoder and a momentum mapping function g'(·) corresponding to the momentum encoder. The network parameters of the first initial AI model are adjusted by the neighbor-adjusted contrastive loss function described in Equation 2, iterating f'(·) and g'(·) continuously.
Equation 2 can be shown as follows:

$$\mathcal{L} = -\frac{1}{b(K+1)} \sum_{i=1}^{b(K+1)} \frac{1}{K} \sum_{\substack{j \neq i \\ \hat{y}_j = \hat{y}_i}} \log \frac{\exp\!\left(f'(\tilde{x}_i) \cdot g'(\tilde{x}'_j)/\tau\right)}{\exp\!\left(f'(\tilde{x}_i) \cdot g'(\tilde{x}'_j)/\tau\right) + N(\tilde{x}_i)} \qquad \text{(Equation 2)}$$

where the fraction inside the logarithm indicates the degree of similarity between any one original image $i$ and the other original images; $x_i$ and $x_j$ represent different original images, i.e., the original image $x_i$ and the original image $x_j$; $\hat{y}_i$ denotes the pseudo label corresponding to $x_i$, and $\hat{y}_j$ denotes the pseudo label corresponding to $x_j$; $\tilde{x}_i$ represents one data augmentation of $x_i$, and $\tilde{x}'_j$ represents a data augmentation of $x_j$ different from $\tilde{x}_i$. It is understood that the data augmentation can be performed in any manner known in the art, and the present application is not limited thereto. $\tau$ represents the temperature, a preset hyper-parameter used for smoothing during the training of the first initial AI model. $b$ is the number of training images selected in advance from the plurality of original images carrying pseudo labels as one batch of training samples, where $b$ is a positive integer; $b(K+1)$ indicates that, for each of the $b$ original images, the remaining $K$ original images determined to have the same pseudo label are also obtained, and the fully supervised contrastive learning training is performed on the basis of these $b(K+1)$ original images.

The negative term $N(\tilde{x}_i)$ is given by Equation 3:

$$N(\tilde{x}_i) = \sum_{t:\, \hat{y}_t \neq \hat{y}_i} \exp\!\left(f'(\tilde{x}_i) \cdot g'(\tilde{x}'_t)/\tau\right) + \sum_{l=1}^{L} \exp\!\left(f'(\tilde{x}_i) \cdot z_l/\tau\right) \qquad \text{(Equation 3)}$$

where $t$ is used to index an original image $x_t$ whose pseudo label differs from that of the original image $i$; $\hat{y}_t$ denotes the pseudo label corresponding to $x_t$, and $\tilde{x}'_t$ represents a data augmentation of $x_t$ different from $\tilde{x}_i$. $z_l$ is used to represent the $l$-th feature in the memory-bank queue used in contrastive learning, where the memory-bank queue contains $L$ features in total and $L$ is a positive integer.
The present application iterates f'(·) and g'(·) through Equations 2 and 3 above, so that images with the same pseudo label and similar apparent features are drawn closer together, while images with different pseudo labels but similar apparent features are pushed further apart. It is understood that, in contrastive learning, g' is updated as g'(t+1) = (1 − M) · g'(t) + M · f'(t), where M is a preset momentum parameter. Thus, the network parameters of g'(·) can be dynamically iterated by continually adjusting the network parameters of f'(·). The iterative update of the network parameters of f'(·) can in turn be computed by back-propagation from the output of the contrastive loss function. Of course, in other examples, the network parameters of f'(·) may be iteratively updated in any equivalent manner, which is not limited herein.
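Because the published text renders Equations 2 and 3 only as image placeholders, the following PyTorch sketch shows one way the momentum update and a neighbor-based supervised contrastive loss consistent with the definitions above could be implemented; the exact loss in the patent may differ, and all tensor shapes, names, and default hyper-parameter values are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(f_net: torch.nn.Module, g_net: torch.nn.Module, m: float = 0.001):
    """Parameter-wise momentum update g'(t+1) = (1 - M) * g'(t) + M * f'(t).

    A small M (assumed here) makes the momentum encoder g' track the
    encoder f' slowly, as is usual in contrastive learning.
    """
    for p_f, p_g in zip(f_net.parameters(), g_net.parameters()):
        p_g.mul_(1.0 - m).add_(p_f, alpha=m)

def neighbor_contrast_loss(q: torch.Tensor,      # (B, D) anchor features f'(x~_i)
                           k_pos: torch.Tensor,  # (B, P, D) g'(x~'_j), same pseudo label
                           k_neg: torch.Tensor,  # (B, N, D) g'(x~'_t), different pseudo label
                           queue: torch.Tensor,  # (L, D) memory-bank features z_l
                           tau: float = 0.07) -> torch.Tensor:
    """Supervised contrastive loss in the spirit of Equations 2 and 3."""
    q, k_pos = F.normalize(q, dim=-1), F.normalize(k_pos, dim=-1)
    k_neg, queue = F.normalize(k_neg, dim=-1), F.normalize(queue, dim=-1)

    pos = torch.einsum('bd,bpd->bp', q, k_pos) / tau    # positive logits
    neg = torch.einsum('bd,bnd->bn', q, k_neg) / tau    # different-pseudo-label logits
    bank = torch.einsum('bd,ld->bl', q, queue) / tau    # memory-bank logits

    # Equation 3: sum of exponentiated negative logits for each anchor.
    neg_sum = torch.exp(neg).sum(1, keepdim=True) + torch.exp(bank).sum(1, keepdim=True)
    # Equation 2: average -log of each positive's share of (itself + negatives).
    log_prob = pos - torch.log(torch.exp(pos) + neg_sum)
    return -log_prob.mean()
```

In this sketch, minimizing the loss pulls each anchor toward same-pseudo-label keys while pushing it away from different-pseudo-label keys and the memory bank, matching the behavior described above.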
Fig. 9 is a flowchart of a method for applying a pre-trained AI model according to an embodiment of the present disclosure.
As shown in fig. 9, the present application further provides a method for applying a pre-trained AI model: after the pre-trained AI model having both the generalization capability of global apparent features and the discrimination capability for semantic features is obtained at S404, adaptation to a downstream task can be performed based on the obtained pre-trained AI model. Therefore, after S404, the method further comprises:
S901, a plurality of training data are acquired.
A plurality of training data for a corresponding application scenario are acquired, where the application scenarios of the plurality of training data may be the same.
S902, training the pre-trained AI model using the plurality of training data, and determining a target AI model for the corresponding application scenario.
The plurality of training data for the same application scenario acquired in S901 are fed as input to the pre-trained AI model obtained in S404, and training proceeds in a fully supervised manner, finally producing a target AI model corresponding to the application scenario of the training data, so that the target AI model can execute the corresponding target task in that application scenario. It can be understood that, since the pre-trained AI model obtained in S404 has both the generalization capability of global apparent features and the discrimination capability for semantic features, the method shown in fig. 9 is compatible with a wide variety of downstream tasks and yields a task-specific AI model.
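A minimal sketch of the downstream adaptation in S901-S902: attach a task head to the pre-trained backbone and fine-tune with full supervision. The helper names, the assumption that the backbone outputs (batch, feature_dim) features, and the optimizer settings are all illustrative.

```python
import torch
import torch.nn as nn

def build_target_model(backbone: nn.Module, feature_dim: int, num_classes: int) -> nn.Module:
    """Attach a classification head to the pre-trained backbone; assumes the
    backbone maps an image batch to (batch, feature_dim) features."""
    return nn.Sequential(backbone, nn.Linear(feature_dim, num_classes))

def finetune(model: nn.Module, loader, epochs: int = 10, lr: float = 1e-3) -> nn.Module:
    """Fully supervised fine-tuning on training data from one application scenario."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:    # loader yields (image batch, label batch)
            opt.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            opt.step()
    return model
```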
In the manner described in fig. 1 to 9, the shortcomings of using either self-supervised learning or fully supervised learning alone to generate a pre-trained AI model are remedied. First, a good apparent feature space can be obtained by learning from large-scale original images, for example in a self-supervised manner. Finer-grained pseudo labels are then generated from the distances between the original images in that apparent feature space together with the original labels of the original images. Fully supervised contrastive learning is then carried out under the supervision of the pseudo labels, guided by a contrastive loss function, to obtain a better pre-trained AI model. The resulting pre-trained AI model not only possesses the global apparent features obtained by self-supervised learning, but also carries semantic knowledge from the original labels. The features finally learned are therefore richer and better suited to training on downstream tasks. The obtained pre-trained AI model can then serve as the initialization parameters for a downstream task, accelerating the convergence of the downstream task and improving the final model accuracy.
The label-based learning method for efficiently generating a pre-trained AI model comprehensively weighs the advantages and disadvantages of existing self-supervised and fully supervised algorithms as pre-training schemes, effectively strengthens the apparent feature expression capability and the generalization of the final pre-trained AI model, and improves the performance of each downstream task. For example, in the global apparent feature learning stage, a relatively good apparent feature space can be initialized using a related self-supervised learning method, and finer-grained pseudo labels can then be generated from the relative positions of the original images in that apparent feature space together with the original coarse labels. The pseudo labels thus contain both the original semantic information and the spatial manifold structure of the apparent features. In the local supervised contrastive fine-tuning stage, fully supervised contrastive learning can be carried out under the guidance of the pseudo labels. Consequently, the features finally learned by the pre-trained AI model preserve the apparent feature structure of the previous stage while being fine-tuned locally, so that images with closer semantics are pulled together and images with different semantics are pushed apart. The finally learned pre-trained AI model can therefore have both the generalization capability of global apparent features and the discrimination capability for semantic features.
It is understood that the methods described in fig. 1 to 9 take the image field as an example; of course, if the data is replaced by speech data or text data, the methods can also be applied to the speech field, the text-semantics field, and the like. The type of training data is not limited in the present application.
Compared with self-supervised algorithms, the present method and apparatus not only extract effective pre-training knowledge from large-scale unlabeled data but also better incorporate the semantic knowledge provided by the original labels when determining the pseudo labels, thereby further improving the capability of the final pre-trained AI model. Unlike fully supervised algorithms, which focus on improving a single specific task, the present method and system extract knowledge with stronger generalization capability, so that the knowledge transfers better to downstream tasks.
Fig. 10 is a schematic diagram of an apparatus for generating a pre-trained AI model according to an embodiment of the present disclosure. It should be understood that the division of units in the apparatus for generating a pre-trained AI model in fig. 10 is merely exemplary, and the units may be divided in other ways.
As shown in fig. 10, the present application provides an apparatus 1000 for generating a pre-trained AI model, the apparatus 1000 comprising: the acquisition unit 1001 may be configured to acquire a plurality of original images. A pseudo label generating unit 1002, configured to determine apparent features corresponding to the plurality of original images according to the plurality of original images and the apparent feature extraction model; and determining a pseudo label of each original image according to the apparent feature of each original image and the preset original label of each original image. The first training unit 1003 is configured to train the first initial AI model by using a plurality of original images carrying pseudo labels, so as to obtain a pre-training AI model. In one example, the first training unit 1003 is configured to perform a fully supervised contrast learning training on the first initial AI model.
In one possible implementation, the pseudo label generating unit 1002 is further configured to: for each original image, determine the similarity of a first apparent feature to a second apparent feature, where the first apparent feature is the apparent feature of the original image, and the second apparent feature is the apparent feature of at least one other original image among the plurality of original images that has the same original label as the original image; and determine the pseudo labels of the original image and of the other original images corresponding to second apparent features whose similarity to the first apparent feature satisfies a preset condition.
In one possible implementation, the pseudo label generating unit 1002 is further configured to: determine the pseudo labels of the original image and of k other original images according to the similarity of the first apparent feature to the second apparent feature, where k is a positive integer.
In one possible implementation, the pseudo label generating unit 1002 is further configured to: sort the at least one other original image according to the similarity of the first apparent feature to the second apparent feature, and determine the pseudo labels of the original image and of the top k other original images in the sorted order.
In one possible implementation, the pseudo label generating unit 1002 is further configured to: determine, according to the similarity of the first apparent feature to the second apparent feature, the other original images whose similarity is greater than or equal to a similarity threshold, and determine the pseudo labels of the original image and of those other original images.
In one possible implementation, the pseudo label generating unit 1002 is further configured to: if the number of other original images whose similarity is greater than or equal to the similarity threshold is greater than or equal to k, determine k other original images according to a preset rule, where k is a positive integer, and determine the pseudo labels of the original image and of the k other original images.
In one possible implementation, the pseudo labels of the original image and of the other original images corresponding to second apparent features whose similarity to the first apparent feature satisfies the preset condition are the same, where the pseudo label is a subclass of the original label.
In one possible implementation, the pseudo label generating unit 1002 is further configured to: for each original image, determine the first apparent feature and the second apparent feature from the apparent feature extraction model, determine the cosine distance between the first apparent feature and the second apparent feature, and take the cosine distance as the similarity of the first apparent feature to the second apparent feature.
In one possible implementation, the first training unit 1003 is further configured to: take the plurality of original images carrying the pseudo labels as input to the first initial AI model, and iterate the network parameters of the first initial AI model using a contrastive loss function to obtain the pre-trained AI model.
In one possible implementation, the apparatus 1000 further comprises: a second training unit 1004, configured to perform self-supervised learning training on a second initial AI model in advance using a plurality of original images, so as to determine the apparent feature extraction model.
In one possible implementation, the acquisition unit 1001 is further configured to acquire a plurality of training data under an application scenario. The apparatus 1000 further comprises: a third training unit 1005, configured to train the pre-trained AI model with the plurality of training data and determine a target AI model for the application scenario.
In this way, the apparent features of the original images are extracted by an apparent feature extraction model with strong generalization capability, and the pseudo labels are then generated based on those apparent features and the original labels preset for the original images. Because the pseudo labels carry both apparent information and human-assigned semantics, a pre-trained AI model obtained by training on original images carrying the pseudo labels inherits apparent features with generalization capability while also capturing richer fine-grained features.
Fig. 11 is a schematic diagram of another apparatus for generating a pre-trained AI model according to an embodiment of the present disclosure.
As shown in fig. 11, the present application provides an apparatus 1100 for generating a pre-trained AI model, and the apparatus 1100 may include: a processor 1101, a memory 1102, a communication interface 1103, and a bus 1104. Of course, it should be understood that the illustrated structure of the embodiments of the present application does not constitute a specific limitation on the apparatus 1100. In other embodiments of the present application, the apparatus 1100 may include more or fewer components than illustrated, some components may be combined or split, or a different arrangement of components may be used. The illustrated components may be implemented in hardware, software, or a combination of software and hardware, which is not limited in this application.
The processor 1101 may be an advanced RISC machine (ARM) processor, an X86 processor, a microprocessor without interlocked pipeline stages (MIPS) processor, or a processor of another architecture. The processor 1101 may include one or more processing units; for example, the processor 1101 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). The different processing units may be separate devices or may be integrated into one or more processors.
The memory 1102 may include internal memory. The internal memory may be used to store computer-executable program code, which includes instructions. The internal memory may include a high-speed random access memory, and may further include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS). The processor 1101 performs the various functional applications and data processing of the apparatus 1100 by executing the instructions stored in this memory and/or instructions stored in a memory provided within the processor. Of course, in other examples, the memory 1102 may also include auxiliary memory, also referred to as external memory, such as a non-removable memory or a removable memory card.
The communication interface 1103 may be used to obtain original images or original training data from other devices over a wired or wireless connection.
The processor 1101, the memory 1102, and the communication interface 1103 are connected by a bus 1104 to exchange data.
When the processor 1101 in the apparatus 1100 provided by the present application runs the instructions stored in the memory 1102, any function in the examples described in fig. 1 to 9 may be implemented, and for a specific implementation manner, reference may be made to the corresponding description in fig. 1 to 9, which is not described herein again.
The scheme involved in the present application improves the capability of the pre-trained AI model by exploiting the additional semantic information provided by labels. Given that the global features extracted by self-supervision generalize well but describe semantic detail poorly, while the semantic features extracted by full supervision express semantics well but generalize poorly, a learning scheme for generating a pre-trained AI model based on efficient label utilization is provided, effectively combining the characteristics of fully supervised and self-supervised learning to better improve the capability of the pre-trained AI model. For example, a global representation space with better generalization capability can first be obtained by a self-supervised learning method, and each local region of the representation space is then adjusted under the guidance of the pseudo labels, so that samples belonging to the same class move closer together while sample pairs that are apparently similar but not of the same class move further apart. Finally, without destroying the global representation space, the distance between same-class, globally similar sample pairs becomes smaller than that between non-same-class, globally similar sample pairs. The finally learned pre-trained AI model can therefore exhibit strong generalization capability, like a self-supervised learning method, while expressing richer fine-grained semantic features, and can be better migrated to downstream tasks.
The present method and apparatus fully exploit the apparent features extracted by self-supervision and the semantic features provided by the original labels, enhancing the richness of the features of the pre-trained AI model, and achieve industry-leading performance on the public datasets of various downstream tasks.
It will be further appreciated by those of ordinary skill in the art that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described above generally in terms of their functions in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiments of the present application also provide a computer-readable storage medium. It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read-only memory, a flash memory, a hard disk, a solid-state drive, a magnetic tape, a floppy disk, an optical disc, or any combination thereof.
Embodiments of the present application also provide a computer program product comprising one or more computer instructions. When the computer instructions are loaded and executed on a computing device, the processes or functions described in accordance with the embodiments of the present application are produced in whole or in part.
The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one web site, computer, or data center to another web site, computer, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, or microwave).
When the computer program product is executed by a computer, the computer performs the method of the preceding method embodiments. The computer program product may be a software installation package, which may be downloaded and executed on a computer when the method described above needs to be used.
Each of the flows or structures corresponding to the above drawings is described with its own emphasis; for a part not described in detail in one flow or structure, reference may be made to the related description of other flows or structures.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (18)

1. A method of generating a pre-trained artificial intelligence, AI, model, the method comprising:
acquiring a plurality of original images;
determining apparent features of the plurality of original images according to the plurality of original images and an apparent feature extraction model;
determining a pseudo label of each original image according to the apparent feature of each original image and an original label preset by each original image;
and training a first initial AI model by using the plurality of original images carrying the pseudo labels to obtain a pre-trained AI model.
2. The method of claim 1, wherein determining the pseudo label of each original image according to the apparent feature of each original image and the preset original label of each original image comprises:
for each original image, determining similarity of a first apparent feature and a second apparent feature, wherein the first apparent feature is the apparent feature of the original image, and the second apparent feature is the apparent feature of at least one other original image which has the same original label as the original image in the plurality of original images;
and determining the pseudo labels of the original image and other original images corresponding to second apparent features of which the similarity with the first apparent features meets preset conditions.
3. The method of claim 2, wherein the determining the pseudo label of the original image and other original images corresponding to second apparent features whose similarity to the first apparent features satisfies a preset condition comprises:
and determining the pseudo labels of the original image and k other original images according to the similarity of the first apparent feature to the second apparent feature, wherein the k other original images are the top k other original images in descending order of similarity, or other original images whose similarity is greater than a preset threshold, and k is a positive integer.
4. The method according to claim 2 or 3, wherein the original image and the pseudo label of the other original image corresponding to the second apparent feature whose similarity to the first apparent feature satisfies a preset condition are the same, wherein the pseudo label is a subclass of the original label.
5. The method of any one of claims 2-4, wherein determining the similarity of the first apparent feature to the second apparent feature comprises:
for each raw image, determining the first apparent feature and the second apparent feature according to the apparent feature extraction model;
determining cosine distances of the first apparent feature and the second apparent feature, and taking the cosine distances as the similarity of the first apparent feature and the second apparent feature.
6. The method according to any one of claims 1 to 5, wherein the training of the first initial AI model using the plurality of original images carrying the pseudo-label to obtain a pre-trained AI model comprises:
and taking the original images carrying the pseudo labels as input of the first initial AI model, and performing iterative updating on network parameters in the first initial AI model by adopting a contrast loss function to obtain the pre-training AI model.
7. The method of any one of claims 1-6, wherein the apparent feature extraction model is determined by performing self-supervised learning training on a second initial AI model using a plurality of pre-acquired images.
8. The method of any one of claims 1-7, further comprising:
acquiring a plurality of training data under an application scene;
and training the pre-trained AI model by adopting the plurality of training data, and determining a target AI model aiming at the application scene.
9. An apparatus for generating a pre-trained artificial intelligence, AI, model, the apparatus comprising:
an acquisition unit configured to acquire a plurality of original images;
a pseudo label generation unit, configured to determine apparent features of the plurality of original images according to the plurality of original images and an apparent feature extraction model; and,
determining a pseudo label of each original image according to the apparent feature of each original image and an original label preset by each original image;
and a first training unit, configured to train a first initial AI model by using the plurality of original images carrying the pseudo labels to obtain a pre-trained AI model.
10. The apparatus of claim 9, wherein the pseudo tag generation unit is further to:
for each original image, determining similarity of a first apparent feature and a second apparent feature, wherein the first apparent feature is the apparent feature of the original image, and the second apparent feature is the apparent feature of at least one other original image which has the same original label as the original image in the plurality of original images;
and determining the pseudo labels of the original image and of the other original images corresponding to second apparent features whose similarity to the first apparent feature satisfies a preset condition.
11. The apparatus of claim 10, wherein the pseudo tag generation unit is further to:
and determining the pseudo labels of the original image and the k other original images according to the similarity of the first apparent feature to the second apparent feature, wherein the k other original images are the top k other original images in descending order of similarity, or other original images whose similarity is greater than a preset threshold, and k is a positive integer.
12. The apparatus according to claim 10 or 11, wherein the original image and the pseudo label of the other original image corresponding to the second apparent feature whose similarity to the first apparent feature satisfies a preset condition are the same, wherein the pseudo label is a subclass of the original label.
13. The apparatus of any of claims 10-12, wherein the pseudo tag generation unit is further to:
for each raw image, determining the first apparent feature and the second apparent feature according to the apparent feature extraction model;
determining cosine distances of the first apparent feature and the second apparent feature, and taking the cosine distances as the similarity of the first apparent feature and the second apparent feature.
14. The apparatus of any of claims 9-13, wherein the first training unit is further configured to:
and taking the plurality of original images carrying the pseudo labels as input of the first initial AI model, and performing iterative updating on network parameters in the first initial AI model by adopting a contrast loss function to obtain the pre-training AI model.
15. The apparatus of any one of claims 9-14, further comprising:
a second training unit, configured to perform self-supervised learning training on a second initial AI model using a plurality of pre-acquired images, and determine the apparent feature extraction model.
16. The apparatus of any one of claims 9-15, wherein the obtaining unit is further configured to:
acquiring a plurality of training data under an application scene;
the device further comprises:
a third training unit, configured to train the pre-trained AI model using the plurality of training data and determine a target AI model for the application scenario.
17. A computing device, comprising:
a processor and a memory;
the computing device performs the method of any of claims 1-8 when the processor reads and executes instructions stored in the memory.
18. A computer-readable storage medium having stored therein instructions that, when executed on a computing device, cause the computing device to perform the method of any of claims 1-8.
CN202110173151.7A 2020-10-31 2021-02-08 Method and device for generating pre-training artificial intelligence model Pending CN114462290A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011196989 2020-10-31
CN202011196989X 2020-10-31

Publications (1)

Publication Number Publication Date
CN114462290A true CN114462290A (en) 2022-05-10

Family

ID=81405973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110173151.7A Pending CN114462290A (en) 2020-10-31 2021-02-08 Method and device for generating pre-training artificial intelligence model

Country Status (1)

Country Link
CN (1) CN114462290A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114970716A (en) * 2022-05-26 2022-08-30 支付宝(杭州)信息技术有限公司 Method and device for training representation model, readable storage medium and computing equipment
CN115130592A (en) * 2022-07-01 2022-09-30 中昊芯英(杭州)科技有限公司 Sample generates chip
CN115294386A (en) * 2022-07-06 2022-11-04 南通大学 Image classification method based on regularization supervision loss function
CN115294386B (en) * 2022-07-06 2023-11-24 南通大学 Image classification method based on regularization supervision loss function
CN116229197A (en) * 2022-11-24 2023-06-06 浙江大学 Pre-labeling model construction method, pre-labeling method and device and electronic equipment

Similar Documents

Publication Publication Date Title
Oyedotun et al. Deep learning in vision-based static hand gesture recognition
Han et al. A unified metric learning-based framework for co-saliency detection
Hong et al. Learning transferrable knowledge for semantic segmentation with deep convolutional neural network
CN114462290A (en) Method and device for generating pre-training artificial intelligence model
CN108960080B (en) Face recognition method based on active defense image anti-attack
US20210295089A1 (en) Neural network for automatically tagging input image, computer-implemented method for automatically tagging input image, apparatus for automatically tagging input image, and computer-program product
KR101896357B1 (en) Method, device and program for detecting an object
CN111178251B (en) Pedestrian attribute identification method and system, storage medium and terminal
CN112084331A (en) Text processing method, text processing device, model training method, model training device, computer equipment and storage medium
CN113344206A (en) Knowledge distillation method, device and equipment integrating channel and relation feature learning
Wu et al. Feedback weight convolutional neural network for gait recognition
CN113191338B (en) Pedestrian re-identification method, device and equipment and readable storage medium
US11854116B2 (en) Task-based image masking
Hoffer et al. Deep unsupervised learning through spatial contrasting
US11803971B2 (en) Generating improved panoptic segmented digital images based on panoptic segmentation neural networks that utilize exemplar unknown object classes
Ramesh et al. Cell segmentation using a similarity interface with a multi-task convolutional neural network
Hsu et al. Weakly supervised salient object detection by learning a classifier-driven map generator
CN113298096A (en) Method, system, electronic device and storage medium for training zero sample classification model
Tsai et al. Deep co-saliency detection via stacked autoencoder-enabled fusion and self-trained cnns
Seo et al. Object discovery via contrastive learning for weakly supervised object detection
Irfan et al. Enhancing learning classifier systems through convolutional autoencoder to classify underwater images
Alsanad et al. Real-time fuel truck detection algorithm based on deep convolutional neural network
CN116994021A (en) Image detection method, device, computer readable medium and electronic equipment
CN113255604B (en) Pedestrian re-identification method, device, equipment and medium based on deep learning network
CN112927266A (en) Weak supervision time domain action positioning method and system based on uncertainty guide training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination