CN114092759A - Training method and device of image recognition model, electronic equipment and storage medium - Google Patents


Info

Publication number
CN114092759A
CN114092759A (application CN202111256246.1A)
Authority
CN
China
Prior art keywords
image recognition
modal
features
initial
sample images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111256246.1A
Other languages
Chinese (zh)
Inventor
张国生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111256246.1A
Publication of CN114092759A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a training method and apparatus for an image recognition model, an electronic device, and a storage medium, relating to the field of artificial intelligence, in particular to deep learning and computer vision, and applicable to image processing and image recognition scenarios. The method includes: acquiring a plurality of sample images and their corresponding modalities; determining, according to the modalities, annotation relation features corresponding to the sample images; and training an initial image recognition model according to the sample images, the annotation relation features, and annotated identification information to obtain a target image recognition model. By using the association features between different modalities of the sample images as annotation data for model training, the multi-modal advantages of the images are effectively exploited, so that the trained target image recognition model can effectively learn and model the relationships between different modalities of an image, which effectively improves the recognition performance and the recognition effect of the target image recognition model.

Description

Training method and device of image recognition model, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to deep learning and computer vision, which can be applied in image processing and image recognition scenarios, and more particularly to a training method and apparatus for an image recognition model, an electronic device, and a storage medium.
Background
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), covering both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
In the related art, the training process of an image recognition model does not fully exploit the multi-modal advantages of images, so the image recognition performance of the trained image recognition model is poor.
Disclosure of Invention
The present disclosure provides a training method for an image recognition model, an image recognition method, corresponding apparatuses, an electronic device, a storage medium, and a computer program product.
According to a first aspect of the present disclosure, there is provided a training method for an image recognition model, including: acquiring a plurality of sample images, wherein the plurality of sample images respectively correspond to a plurality of modalities and are correspondingly annotated with identification information; determining, according to the plurality of modalities, a plurality of annotation relation features respectively corresponding to the plurality of sample images, wherein an annotation relation feature describes the association between the modality of the corresponding sample image and the modalities of other sample images, and the corresponding sample image and the other sample images together form the plurality of sample images; and training an initial image recognition model according to the plurality of sample images, the plurality of annotation relation features, and the annotated identification information to obtain a target image recognition model.
According to a second aspect of the present disclosure, there is provided an image recognition method, including: acquiring a plurality of images to be recognized, wherein the plurality of images to be recognized respectively correspond to a plurality of modalities; and respectively inputting the plurality of images to be recognized into the target image recognition model trained according to the training method of the first aspect of the present disclosure, so as to obtain target recognition information output by the target image recognition model.
According to a third aspect of the present disclosure, there is provided a training apparatus for an image recognition model, comprising: a first acquisition module configured to acquire a plurality of sample images, wherein the plurality of sample images respectively correspond to a plurality of modalities and are correspondingly annotated with identification information; a determining module configured to determine, according to the plurality of modalities, a plurality of annotation relation features respectively corresponding to the plurality of sample images, wherein an annotation relation feature describes the association between the modality of the corresponding sample image and the modalities of other sample images, and the corresponding sample image and the other sample images together form the plurality of sample images; and a training module configured to train an initial image recognition model according to the plurality of sample images, the plurality of annotation relation features, and the annotated identification information to obtain a target image recognition model.
According to a fourth aspect of the present disclosure, there is provided an image recognition apparatus, comprising: a second acquisition module configured to acquire a plurality of images to be recognized, wherein the plurality of images to be recognized respectively correspond to a plurality of modalities; and an input module configured to respectively input the plurality of images to be recognized into the target image recognition model trained by the training apparatus of the third aspect of the present disclosure, so as to obtain target recognition information output by the target image recognition model.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of training an image recognition model according to the first aspect of the disclosure or to perform an image recognition method according to the second aspect of the disclosure.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the training method of an image recognition model according to the first aspect of the present disclosure or to perform the image recognition method according to the second aspect of the present disclosure.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, performs the steps of the training method of the image recognition model according to the first aspect of the present disclosure, or performs the steps of the image recognition method according to the second aspect of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of an image recognition model provided according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 7 is a schematic diagram according to a sixth embodiment of the present disclosure;
FIG. 8 is a schematic diagram according to a seventh embodiment of the present disclosure;
FIG. 9 illustrates a schematic block diagram of an example electronic device to implement the training method of the image recognition model of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure.
It should be noted that the execution subject of the training method for the image recognition model in this embodiment is a training apparatus for the image recognition model. The apparatus may be implemented in software and/or hardware and may be configured in an electronic device, and the electronic device may include, but is not limited to, a terminal, a server, and the like.
The embodiment of the disclosure relates to the technical field of artificial intelligence, in particular to the technical field of computer vision and deep learning, and can be applied to image processing and image recognition scenes.
Artificial Intelligence (abbreviated AI) is a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence.
Deep learning learns the intrinsic laws and representation levels of sample data, and the information obtained in this learning process is of great help in interpreting data such as text, images, and sounds. The ultimate goal of deep learning is to enable machines to analyze and learn like humans, and to recognize data such as text, images, and sounds.
Computer vision means using cameras and computers instead of human eyes to perform machine vision tasks such as identifying, tracking, and measuring targets, and further performing image processing, so that the processed image becomes more suitable for human observation or for transmission to instruments for detection.
An image processing and image recognition scenario may, for example, use hardware devices or software computing logic to recognize an image to be processed, so as to obtain corresponding image features, which are then used to assist subsequent detection applications.
As shown in fig. 1, the training method of the image recognition model includes:
s101: and acquiring a plurality of sample images, wherein the plurality of sample images correspond to a plurality of modalities respectively, and the plurality of sample images correspond to the label identification information.
An image used for training the image recognition model may be referred to as a sample image. A sample image may be captured by a device with an imaging function, such as a mobile phone or a camera, or may be extracted from a video; for example, a sample image may be a frame extracted from the video frames of a video, which is not limited here.
The plurality of sample images may respectively correspond to a plurality of modalities; for example, they may respectively correspond to an RGB (Red-Green-Blue) modality, a Depth modality, an Infrared (IR) modality, and the like, which is not limited here.
That is to say, in the embodiment of the present disclosure, the plurality of sample images may be obtained by acquiring an RGB image, a Depth image, and an IR image and taking the acquired images together as the sample images, which is not limited here.
For example, the plurality of sample images may be obtained by using a visible light camera, an infrared (IR) camera, and a depth camera to separately capture face images of different modalities for a sample face, with the captured face images collectively used as the sample images, which is not limited here.
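As a hedged illustration of how such multi-modal captures might be grouped into training samples, consider the sketch below. All file paths, field names, and the live/spoof label convention are illustrative assumptions, not part of the patent.

```python
# Hypothetical sketch: group per-modality captures of the same subject
# into one multi-modal training sample with its annotated label.
from dataclasses import dataclass

@dataclass
class MultiModalSample:
    rgb: str      # path to the RGB capture
    depth: str    # path to the Depth capture
    ir: str       # path to the IR capture
    label: int    # annotated identification information (e.g. live=1 / spoof=0)

def build_samples(records):
    """records: iterable of (rgb_path, depth_path, ir_path, label) tuples."""
    return [MultiModalSample(*r) for r in records]

samples = build_samples([
    ("face_001_rgb.png", "face_001_depth.png", "face_001_ir.png", 1),
    ("face_002_rgb.png", "face_002_depth.png", "face_002_ir.png", 0),
])
```

Each element then carries one image per modality plus the label, so a data loader can feed all modalities of one subject to the model together.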
That is to say, an application scenario of the embodiment of the present disclosure may be, for example, acquiring face images of multiple modalities and then training a corresponding face image recognition model from them; the face image recognition model may be used to perform face liveness detection, which is not limited here.
It should be noted that the face images in the embodiments of the present disclosure, and any other images that may involve user information, are obtained with the authorization of the relevant users; the acquisition process complies with relevant laws and regulations and does not violate public order and good customs.
In the training process of the image recognition model, the reference recognition information used to determine whether the model has converged (i.e., whether it meets the standard) may be referred to as annotated identification information. The annotated identification information may be, for example, image features of different dimensions, such as features of the semantic dimension, the color dimension, and the brightness dimension, which is not limited here.
For example, local image regions (such as the facial region, the skin region, and the facial contour region) may be identified from an obtained face image, and image analysis may then be performed on the local image regions to determine facial information and the like as the annotated identification information, which is not limited here.
S102: and determining a plurality of labeling relation characteristics respectively corresponding to the plurality of sample images according to the plurality of modalities, wherein the labeling relation characteristics describe the association between the modality of the corresponding sample image and the modalities of other sample images, and the corresponding sample image and the other sample images jointly form the plurality of sample images.
After the plurality of sample images are obtained, the plurality of annotation relation features respectively corresponding to them can be determined according to the plurality of modalities respectively corresponding to the plurality of sample images.
Among the plurality of sample images, the sample image whose annotation relation feature is currently to be determined may be referred to as the corresponding sample image; accordingly, the remaining sample images may be referred to as other sample images.
An annotation relation feature may be used to describe the association between the modality of the corresponding sample image and the modalities of other sample images. This association may specifically be, for example, an association in the semantic dimension between the modality of one sample image and the modality of another, which is not limited here.
For example, the association between the modality of the corresponding sample image and the modalities of other sample images may specifically be a similarity between different modalities, a Euclidean distance between the semantic vectors represented by different modalities, or an association in any other possible form, which is not limited here.
That is to say, in the embodiment of the present disclosure, after the plurality of modalities corresponding to the plurality of sample images are determined, the plurality of annotation relation features corresponding to the sample images may be determined according to those modalities. The association features between different modalities of the sample images are thereby used as annotation data for model training, so that the trained target image recognition model can effectively learn and model the relationships between different modalities of an image.
In some embodiments, the plurality of annotation relation features respectively corresponding to the plurality of sample images may be determined according to the plurality of modalities as follows: determine, from the plurality of sample images, the sample image whose annotation relation feature is currently to be determined; determine the other sample images corresponding to it; determine the modalities of the corresponding sample image and of the other sample images; analyze the association features between these modalities; and use the association features as the annotation relation feature of that sample image. The annotation relation feature corresponding to each sample image can be determined in the same way.
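The per-sample procedure above can be sketched as follows, using cosine similarity between per-modality semantic vectors as one illustrative association measure (the metric choice and the dictionary layout are assumptions; the text also mentions Euclidean distance as an alternative):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D semantic vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def annotation_relation_features(modal_vectors):
    """modal_vectors: dict mapping modality name -> 1-D semantic vector.

    For each sample image, the remaining samples act as the 'other sample
    images', and the pairwise associations with them form its annotation
    relation feature.
    """
    feats = {}
    for name, vec in modal_vectors.items():
        others = [v for n, v in modal_vectors.items() if n != name]
        feats[name] = [cosine(vec, o) for o in others]
    return feats

feats = annotation_relation_features({
    "rgb":   np.array([0.9, 0.1]),
    "depth": np.array([0.8, 0.2]),
    "ir":    np.array([0.1, 0.9]),
})
```

Here `feats["rgb"]` holds the associations between the RGB sample and the Depth and IR samples; these values would serve as the annotation data described above.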
S103: and training an initial image recognition model according to the plurality of sample images, the plurality of labeling relation characteristics and the labeling recognition information to obtain a target image recognition model.
After the plurality of annotation relation features respectively corresponding to the plurality of sample images are determined according to the plurality of modalities, the initial image recognition model can be trained according to the plurality of sample images, the plurality of annotation relation features, and the annotated identification information to obtain the target image recognition model.
The image recognition model obtained at the initial stage of training may be referred to as the initial image recognition model. The initial image recognition model may be an artificial intelligence model, for example a neural network model or a machine learning model, or any other artificial intelligence model capable of performing an image recognition task, which is not limited here.
For example, the plurality of sample images, the plurality of annotation relation features, and the annotated identification information may be input into the initial image recognition model to obtain predicted identification information output by the initial image recognition model; if the predicted identification information and the annotated identification information satisfy a convergence condition, it is determined that the image recognition model has converged, and the trained image recognition model may be used as the target image recognition model.
In some embodiments, a loss function may be configured in advance for the initial image recognition model. In the process of training the initial image recognition model, the predicted identification information and the annotated identification information are used as input parameters of the loss function, and the loss value output by the loss function is determined. The loss value is then compared with a preset loss threshold to determine whether the image recognition model satisfies the convergence condition (if so, the model may be considered to have converged); if the model is determined to have converged, the trained image recognition model may be used as the target image recognition model.
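A minimal sketch of this loss-threshold convergence check is given below. The mean-squared-error loss and the threshold value are illustrative assumptions; the text does not fix a particular loss function.

```python
import numpy as np

def mse_loss(predicted, annotated):
    """Loss between predicted and annotated identification information."""
    p = np.asarray(predicted, dtype=float)
    a = np.asarray(annotated, dtype=float)
    return float(np.mean((p - a) ** 2))

def has_converged(predicted, annotated, loss_threshold=1e-3):
    """The model is considered converged when the loss is below the threshold."""
    return mse_loss(predicted, annotated) < loss_threshold
```

In a training loop, `has_converged` would be evaluated after each epoch, and training would stop (yielding the target image recognition model) once it returns `True`.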
In this embodiment, a plurality of sample images are obtained, wherein the plurality of sample images respectively correspond to a plurality of modalities and are correspondingly annotated with identification information. A plurality of annotation relation features respectively corresponding to the plurality of sample images are determined according to the plurality of modalities, wherein an annotation relation feature describes the association between the modality of the corresponding sample image and the modalities of other sample images, and the corresponding sample image and the other sample images together form the plurality of sample images. An initial image recognition model is then trained according to the plurality of sample images, the plurality of annotation relation features, and the annotated identification information to obtain a target image recognition model. The multi-modal advantages of the images are thereby effectively utilized, and the association features between different modalities of the sample images are used as annotation data for model training, so that the trained target image recognition model can effectively learn and model the relationships between different modalities of an image, which effectively improves the recognition performance and the recognition effect of the target image recognition model.
Fig. 2 is a schematic diagram according to a second embodiment of the present disclosure.
As shown in fig. 2, the training method of the image recognition model includes:
s201: and acquiring a plurality of sample images, wherein the plurality of sample images correspond to a plurality of modalities respectively, and the plurality of sample images correspond to the label identification information.
S202: and determining a plurality of labeling relation characteristics respectively corresponding to the plurality of sample images according to the plurality of modalities, wherein the labeling relation characteristics describe the association between the modality of the corresponding sample image and the modalities of other sample images, and the corresponding sample image and the other sample images jointly form the plurality of sample images.
For the description of S201-S202, reference may be made to the above embodiments, which are not described herein again.
The initial image recognition model in this embodiment may include: a plurality of residual networks, a plurality of co-attention networks respectively connected to the plurality of residual networks, and an image recognition model to be trained. The model structure of the initial image recognition model is not limited to this; any other possible artificial intelligence model structure may be adopted, which is not limited here.
S203: and respectively inputting the plurality of sample images into the corresponding plurality of residual error networks to obtain a plurality of initial modal characteristics respectively output by the plurality of residual error networks, wherein the initial modal characteristics are characteristics which are obtained by prediction and describe the modal of the corresponding sample images.
An initial modal feature is a feature, obtained by prediction, that describes the modality of the corresponding sample image.
The features of a modality may be, for example, features of the RGB (Red-Green-Blue) modality, the Depth modality, or the Infrared (IR) modality, and may specifically be, for example, semantic features, illumination features, or pixel features of that modality, which is not limited here.
In the embodiment of the present disclosure, the plurality of sample images may be respectively input into the corresponding plurality of residual networks to obtain the modal features, respectively corresponding to the plurality of sample images, output by the plurality of residual networks; these modal features may be referred to as initial modal features.
For example, the embodiment of the present disclosure may be explained with reference to FIG. 3, which is a schematic structural diagram of an image recognition model provided according to an embodiment of the present disclosure. As shown in FIG. 3, the plurality of sample images may be respectively input into a plurality of parallel residual network branches, which may have a plurality of corresponding residual modules (ResNet Blocks). Accordingly, the plurality of sample images are respectively input into the plurality of corresponding residual networks, or into the plurality of residual modules of a plurality of parallel convolutional neural network branches, to obtain the plurality of initial modal features output by the plurality of residual modules. The dimensions of an initial modal feature are (C, H, W), where C is the channel dimension, H is the height of the feature, and W is the width of the feature, which is not limited here.
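As a rough numpy sketch of one such residual branch producing a (C, H, W) feature map, consider the block below. The 1x1 channel-mixing transform stands in for the real convolutions of a ResNet Block and is purely illustrative; only the skip-connection structure and the (C, H, W) shape convention follow the description above.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, weight):
    """x: (C, H, W) input; weight: (C, C) 1x1 channel-mixing transform.

    Returns relu(transform(x)) + x, i.e. a transformed feature map plus
    the identity skip connection, preserving the (C, H, W) shape.
    """
    transformed = relu(np.einsum("oc,chw->ohw", weight, x))
    return transformed + x  # skip connection

x = np.random.rand(4, 8, 8)       # C=4 channels, H=8, W=8
feat = residual_block(x, np.eye(4) * 0.5)
```

The output `feat` has the same (C, H, W) dimensions as the input, matching the initial modal feature shape described above.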
That is to say, this embodiment may also support predicting the features of the modality of the corresponding sample image by using a residual network, and then determining the association features between different modalities from the predicted modal features in combination with a co-attention network. In other words, the model is trained on the task of predicting the association features between different modalities, so that in actual image processing and image recognition scenarios it can predict the association features between different modalities directly from the images of those modalities.
S204: and respectively inputting the initial modal characteristics into a plurality of corresponding cooperative attention networks to obtain a plurality of predicted relationship characteristics respectively output by the cooperative attention networks.
After the plurality of sample images are respectively input into the corresponding plurality of residual networks to obtain the plurality of initial modal features respectively output by the residual networks, the plurality of initial modal features can be respectively input into the corresponding plurality of co-attention networks to obtain the plurality of relation features respectively output by the co-attention networks. These relation features may be called predicted relation features, and a predicted relation feature may be used to describe the association between the modality of the corresponding sample image and the modalities of other sample images.
That is, in this embodiment, as shown in FIG. 3, the corresponding co-attention networks may be connected after the plurality of residual modules (ResNet Blocks), and the obtained features of dimensions (C, H, W) may then be input into the corresponding plurality of co-attention networks to obtain the plurality of predicted relation features respectively output by the co-attention networks.
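A hedged sketch of what one such co-attention step might compute between two modalities' (C, H, W) features is shown below. This is a generic cross-attention formulation over spatial positions, offered as an assumption about the mechanism, not necessarily the exact network used in the patent.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(query_feat, context_feat):
    """query_feat, context_feat: (C, H, W) initial modal features.

    Each spatial position of the query modality attends over all
    positions of the context modality; the aggregated context is
    returned as a (C, H, W) cross-modal relation feature.
    """
    c, h, w = query_feat.shape
    q = query_feat.reshape(c, h * w).T          # (HW, C)
    k = context_feat.reshape(c, h * w).T        # (HW, C)
    attn = softmax(q @ k.T / np.sqrt(c))        # (HW, HW) cross-modal affinity
    attended = (attn @ k).T.reshape(c, h, w)    # context aggregated per position
    return attended

relation = co_attention(np.random.rand(2, 4, 4), np.random.rand(2, 4, 4))
```

For three modalities (RGB, Depth, IR), each branch would attend over the other branches' features, yielding one predicted relation feature per modality.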
S205: and when the plurality of prediction relation features and the plurality of label relation features which respectively correspond to the plurality of prediction relation features meet a first convergence condition, training the image recognition model to be trained according to the plurality of prediction relation features and the label recognition information to obtain the target image recognition model.
The image recognition model that is currently to be trained may be referred to as the image recognition model to be trained.
That is to say, after the plurality of predicted relation features respectively output by the plurality of co-attention networks are obtained, the image recognition model to be trained may be trained according to the plurality of predicted relation features and the annotated identification information to obtain the target image recognition model.
For example, as shown in FIG. 3, after the plurality of predicted relation features are obtained, they may be input into the image recognition model to be trained; the input predicted relation features are then subjected to element-wise feature addition, and the result of the addition is input into a fully connected layer (FC Layer) and an activation function (a sigmoid function) in the image recognition model to be trained, so as to train the image recognition model to be trained and obtain the target image recognition model.
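The fusion head just described (element-wise addition of the predicted relation features, followed by an FC layer and a sigmoid) can be sketched as follows; the weight shapes and the single-output head are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fusion_head(relation_feats, fc_weight, fc_bias):
    """relation_feats: list of equally-shaped predicted relation features.

    The features are fused by element-wise addition, flattened, and
    passed through a fully connected layer plus a sigmoid activation.
    """
    fused = np.sum(relation_feats, axis=0).ravel()  # element-wise addition
    return sigmoid(fc_weight @ fused + fc_bias)     # FC layer + sigmoid

score = fusion_head([np.ones((2, 2)), np.ones((2, 2))],
                    fc_weight=np.full((1, 4), 0.1),
                    fc_bias=np.zeros(1))
```

With a sigmoid output, `score` is a value in (0, 1), which suits a binary decision such as the face liveness detection scenario mentioned earlier.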
Here, the convergence condition configured in advance for each of the plurality of predicted relationship features and the plurality of labeled relationship features may be referred to as a first convergence condition.
In some embodiments, to determine whether the plurality of prediction relationship features and the plurality of label relationship features respectively corresponding thereto satisfy the first convergence condition, a corresponding loss function may be configured in advance; the plurality of prediction relationship features and the plurality of corresponding label relationship features are used as input parameters of the loss function, a loss value output by the loss function is determined, and the loss value is then compared with a set loss threshold. If the loss value is smaller than the loss threshold, it may be determined that the plurality of prediction relationship features and the plurality of label relationship features respectively corresponding thereto satisfy the first convergence condition.
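A minimal sketch of this threshold-based first convergence check; mean squared error is assumed as the loss function here purely for illustration, since the text does not fix a specific loss:

```python
def mse_loss(predicted, labeled):
    """Mean squared error between predicted and labeled relation features,
    flattened to lists of floats (an assumed stand-in loss)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, labeled)) / len(predicted)

def satisfies_first_convergence(predicted, labeled, loss_threshold=0.01):
    # First convergence condition: loss value smaller than the set threshold.
    return mse_loss(predicted, labeled) < loss_threshold

print(satisfies_first_convergence([0.98, 0.02, 0.97], [1.0, 0.0, 1.0]))  # True
print(satisfies_first_convergence([0.50, 0.50, 0.50], [1.0, 0.0, 1.0]))  # False
```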
When the plurality of prediction relation features and the plurality of label relation features respectively corresponding thereto satisfy the first convergence condition, the image recognition model to be trained may be trained according to the plurality of prediction relation features and the label recognition information to obtain the target image recognition model. Because training is carried out only once the first convergence condition is satisfied, the training cost of the image recognition model can be effectively reduced; the association features among different modalities are effectively fused during training, so that the prediction accuracy of the prediction relation features can be effectively guaranteed, and the model's ability to predict the association features among different modalities is established in advance. Therefore, in actual image processing and image recognition scenes, the association features among different modalities can be predicted directly based on the input images of different modalities, and the training efficiency of the image recognition model can be improved to a greater extent.
For example, the plurality of predicted relationship features may be input into the image recognition model to be trained to obtain a predicted recognition result output by the image recognition model; the predicted recognition result and the labeled recognition information are used as input parameters of a loss function, a loss value output by the loss function is determined, and the loss value is compared with a set loss threshold to determine whether the convergence condition is satisfied (if so, the image recognition model to be trained may be regarded as converged). If the image recognition model to be trained is determined to be converged, the trained image recognition model may be used as the target image recognition model, which is not limited thereto.
Optionally, in some embodiments, when the image recognition model to be trained is trained according to the plurality of predicted relationship features and the labeled recognition information to obtain the target image recognition model, the plurality of predicted relationship features may be input into the image recognition model to be trained to obtain the predicted recognition information output by it, and when a second convergence condition is satisfied between the predicted recognition information and the labeled recognition information, the trained image recognition model is used as the target image recognition model. This takes into account both the prediction accuracy of the predicted relationship features and the recognition accuracy of the image recognition model, makes the convergence judgment logic more targeted, effectively adapts to the performance requirements of network structures executing different recognition and prediction tasks, effectively expands the application scenarios, and also ensures the accuracy of the convergence-time judgment of the image recognition model, effectively assisting in improving the training effect of the image recognition model.
The convergence condition configured in advance for the prediction identification information and the label identification information may be referred to as a second convergence condition.
After the prediction relationship features are obtained, the plurality of prediction relationship features may be input into the to-be-trained image recognition model to obtain the recognition information output by the to-be-trained image recognition model, and the recognition information may be referred to as prediction recognition information.
In the embodiment of the disclosure, after the predicted identification information output by the image identification model to be trained is obtained, it can be judged whether the second convergence condition is satisfied between the predicted identification information and the labeled identification information, and when the second convergence condition is satisfied, the image identification model obtained by training is used as the target image identification model.
For example, a corresponding loss function may be configured in advance; the predicted identification information and the labeled identification information are used as input parameters of the loss function, a loss value output by the loss function is determined, and the loss value is then compared with a set loss threshold. If the loss value is smaller than the loss threshold, it may be determined that the second convergence condition is satisfied between the predicted identification information and the labeled identification information, and at this time the trained image identification model may be used as the target image identification model, which is not limited thereto.
In this embodiment, a plurality of sample images are obtained, wherein the plurality of sample images respectively correspond to a plurality of modalities and correspond to annotation identification information; a plurality of annotation relation features respectively corresponding to the plurality of sample images are determined according to the plurality of modalities; the plurality of sample images are respectively input into a plurality of residual networks to obtain a plurality of initial modality features respectively output by the plurality of residual networks; the plurality of initial modality features are respectively input into a plurality of collaborative attention networks to obtain a plurality of predicted relation features respectively output by the plurality of collaborative attention networks; and when a first convergence condition is satisfied between the plurality of predicted relation features and the plurality of annotation relation features respectively corresponding thereto, the image recognition model to be trained is trained according to the plurality of predicted relation features and the annotation identification information to obtain the target image recognition model. Because training is carried out only when the first convergence condition is satisfied, the training cost of the image recognition model can be effectively reduced; the association features among different modalities are effectively fused during training, so that the prediction accuracy of the predicted relation features can be effectively guaranteed, and the model's ability to predict the association features among different modalities is established in advance. Therefore, in actual image processing and image recognition scenes, the association features among different modalities can be predicted directly based on images of different modalities, and the training efficiency of the image recognition model can be greatly improved.
Fig. 4 is a schematic diagram according to a third embodiment of the present disclosure.
As shown in fig. 4, the training method of the image recognition model includes:
S401: and acquiring a plurality of sample images, wherein the plurality of sample images correspond to a plurality of modalities respectively, and the plurality of sample images correspond to the label identification information.
S402: and determining a plurality of labeling relation characteristics respectively corresponding to the plurality of sample images according to the plurality of modalities, wherein the labeling relation characteristics describe the association between the modality of the corresponding sample image and the modalities of other sample images, and the corresponding sample image and the other sample images jointly form the plurality of sample images.
S403: and respectively inputting the plurality of sample images into the corresponding plurality of residual error networks to obtain a plurality of initial modal characteristics respectively output by the plurality of residual error networks, wherein the initial modal characteristics are characteristics which are obtained by prediction and describe the modal of the corresponding sample images.
For the description of S401 to S403, reference may be made to the above embodiments, which are not described herein again.
S404: and respectively inputting the initial modal characteristics into a plurality of corresponding cooperative attention networks to obtain a plurality of attention response characteristics respectively output by the cooperative attention networks.
After the plurality of initial modal characteristics are obtained, they may be input into the corresponding plurality of collaborative attention networks to obtain a plurality of attention response features respectively output by the plurality of collaborative attention networks.
S405: and determining a plurality of reference attention response characteristics corresponding to the initial modal characteristics, wherein the reference attention response characteristics are other attention response characteristics except the attention response characteristics corresponding to the initial modal characteristics in the plurality of attention response characteristics.
Among the plurality of attention response features, the attention response features other than the attention response feature corresponding to the initial modality feature may be referred to as reference attention response features, and the reference attention response features may be used for reference processing of the initial modality feature.
That is, in the embodiment of the present disclosure, determining the plurality of reference attention response features corresponding to an initial modality feature may include determining the attention response feature corresponding to that initial modality feature from among the plurality of attention response features, and then using the other attention response features as the reference attention response features.
S406: and processing the initial modal characteristics according to the plurality of reference attention response characteristics to obtain corresponding predicted relationship characteristics.
In the embodiment of the disclosure, the reference attention response features are determined by means of the collaborative attention networks, so that the attention response features can represent feature regions with higher discriminability; therefore, when the initial modal feature is processed according to the plurality of reference attention response features, feature fusion among the multi-modal features can be effectively realized, and the feature representation capability of the predicted relationship feature is effectively improved.
In some embodiments, processing the initial modality feature according to the plurality of reference attention response features to obtain the corresponding predicted relationship feature may be performed by respectively subjecting the plurality of reference attention response features and the initial modality feature to feature connection processing or feature addition processing, and taking the feature obtained by this processing as the predicted relationship feature, without limitation.
Optionally, in other embodiments, the initial modal feature is processed according to the plurality of reference attention response features to obtain the corresponding predicted relationship feature as follows: a first modal feature and a second modal feature are obtained by parsing the initial modal feature, wherein the first modal feature is different from the second modal feature; the first modal feature and the second modal feature are connected to obtain a reference modal feature; the initial modal feature and the reference modal feature are fused to obtain a to-be-processed modal feature; and the to-be-processed modal feature is processed according to the plurality of reference attention response features to obtain the corresponding predicted relationship feature.
The first modal characteristic is a local modal characteristic corresponding to the background region in the sample image corresponding to the initial modal characteristic, the second modal characteristic is a gradient information characteristic in the sample image corresponding to the initial modal characteristic, and the first modal characteristic and the second modal characteristic are different.
Because the more discriminative local modal features and gradient information features are determined from the initial modal features, the subsequent steps of the training method of the image recognition model can focus on the more discriminative feature regions, so that the feature processing logic can be simplified to a certain extent, effectively assisting in improving the training efficiency of the image recognition model.
That is, as shown in fig. 3, after the initial modal feature with the dimension (C, H, W) is obtained, the initial modal feature may be subjected to an average pooling process in the channel dimension to obtain a local modal feature corresponding to the background region in the sample image, and the local modal feature may be used as the first modal feature (the feature dimension is (1, H, W)).
Accordingly, after the initial modality feature with the dimension (C, H, W) is obtained, the initial modality feature may be subjected to maximum pooling in the channel dimension to obtain a gradient information feature (with the feature dimension (1, H, W)) in the sample image corresponding to the initial modality feature.
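The two channel-dimension pooling operations above can be sketched with NumPy; the feature sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
initial_feature = rng.standard_normal((64, 28, 28))  # an initial modality feature (C, H, W)

# First modal feature: average pooling over the channel dimension -> (1, H, W).
first_modal = initial_feature.mean(axis=0, keepdims=True)

# Second modal feature: max pooling over the channel dimension -> (1, H, W).
second_modal = initial_feature.max(axis=0, keepdims=True)

print(first_modal.shape, second_modal.shape)  # (1, 28, 28) (1, 28, 28)
```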
After the first modal feature and the second modal feature are obtained by parsing the initial modal feature, the first modal feature and the second modal feature may be subjected to connection processing to obtain a connection-processed modal feature, which may be referred to as the reference modal feature.
That is, as shown in fig. 3, the first modal feature with feature dimension (1, H, W) and the second modal feature with feature dimension (1, H, W) may be subjected to feature connection processing to obtain the reference modal feature with feature dimension (2, H, W).
After the reference modal feature with the feature dimension of (2, H, W) is obtained, the initial modal feature and the reference modal feature may be subjected to fusion processing to obtain a modal feature after the fusion processing, and the modal feature may be referred to as a modal feature to be processed.
For example, a 1 × 1 convolution and a Rectified Linear Unit (ReLU) activation function may be adopted to perform fusion processing on the reference modal feature and the initial modal feature, and the fused feature is then input into a Sigmoid function, so as to obtain the to-be-processed modal feature output by the Sigmoid function.
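A simplified sketch of the connection and fusion steps; for brevity the 1 × 1 convolution here is applied only to the 2-channel reference modal feature (the fusion described above also involves the initial modality feature), and its weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 28, 28
first_modal = rng.standard_normal((1, H, W))   # channel-avg-pooled feature
second_modal = rng.standard_normal((1, H, W))  # channel-max-pooled feature

# Connection processing: concatenate along the channel axis -> (2, H, W).
reference_modal = np.concatenate([first_modal, second_modal], axis=0)

# Fusion: a 1x1 convolution over a 2-channel map is just a per-pixel weighted
# sum of the two channels; follow it with ReLU and then Sigmoid to get the
# to-be-processed modal feature (an attention-like map).
w = rng.standard_normal(2)
conv_out = w[0] * reference_modal[0] + w[1] * reference_modal[1]  # (H, W)
relu_out = np.maximum(conv_out, 0.0)
to_be_processed = 1.0 / (1.0 + np.exp(-relu_out))
print(to_be_processed.shape)  # (28, 28); values lie in [0.5, 1) since ReLU >= 0
```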
In the embodiment of the disclosure, after the reference modal feature and the initial modal feature are fused to obtain the to-be-processed modal feature, the to-be-processed modal feature may be processed according to the plurality of reference attention response features to obtain the corresponding predicted relationship feature. Because the reference modal feature is obtained by connecting the first modal feature and the second modal feature, it can be effectively ensured that the reference modal feature sufficiently expresses the background information and gradient information of the sample image; thus, when the corresponding predicted relationship feature is generated based on the reference modal feature, the prediction accuracy of the predicted relationship feature can be effectively guaranteed, so that the predicted relationship feature has a higher reference value.
Optionally, in other embodiments, the to-be-processed modal feature is processed according to the plurality of reference attention response features to obtain the corresponding predicted relationship feature as follows: the plurality of reference attention response features are added to obtain a target attention response feature, and the target attention response feature and the to-be-processed modal feature are multiplied to obtain the corresponding predicted relationship feature. Because the plurality of reference attention response features are added and the resulting target attention response feature is multiplied with the to-be-processed modal feature, a closed-loop complementary cooperation is formed among the plurality of modalities, complementary enhancement among the plurality of modal features is effectively realized, and the predicted relationship features obtained by prediction can effectively express the relationship between different modalities of the image.
As shown in fig. 3, after obtaining the modal feature to be processed, the plurality of reference attention response features may be subjected to an addition process to obtain a corresponding attention feature, which may be referred to as a target attention feature.
As shown in fig. 3, after the target attention response feature is obtained, it may be multiplied with the to-be-processed modal feature of each branch respectively, so as to obtain the plurality of predicted relationship features.
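The addition-then-multiplication step can be sketched as follows, assuming a three-modality setup (so each branch sees two reference attention response features); all shapes and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 64, 28, 28

# Reference attention response features from the other two modality branches.
reference_attn = [rng.random((1, H, W)) for _ in range(2)]
# The to-be-processed modal feature of the current branch.
to_be_processed = rng.standard_normal((C, H, W))

# Step 1: add the reference attention response features -> target attention.
target_attn = reference_attn[0] + reference_attn[1]  # (1, H, W)

# Step 2: multiply (broadcast over the channel dimension) with the
# to-be-processed modal feature -> predicted relationship feature.
predicted_relation = target_attn * to_be_processed  # (C, H, W)
print(predicted_relation.shape)
```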
S407: and when the plurality of prediction relation features and the plurality of label relation features which respectively correspond to the plurality of prediction relation features meet a first convergence condition, training the image recognition model to be trained according to the plurality of prediction relation features and the label recognition information to obtain the target image recognition model.
For description of S407, reference may be made to the foregoing embodiments specifically, and details are not repeated here.
To sum up, the embodiment of the present disclosure performs collaborative complementary enhancement on the different modal features of an image, so as to mine a more discriminative feature from one of the modal features and collaboratively complement it to the other modal features, so that the trained image recognition model can obtain a more discriminative fusion feature. Thus, when the image recognition model provided by the embodiment of the present disclosure is applied to face liveness detection, the performance of face liveness detection can be effectively improved, so that it can be applied to a variety of scenarios such as security, attendance, finance, and access control, effectively assisting in improving the effects and user experience of applications based on face liveness detection technology and benefiting its further popularization.
It should be noted that the face image in the embodiment of the present disclosure, or any other image that may relate to user information, is obtained after authorization by the relevant user, and the obtaining process thereof all complies with the regulations of the relevant laws and regulations, and does not violate the good customs of the public order.
In this embodiment, a plurality of sample images are obtained, wherein the plurality of sample images respectively correspond to a plurality of modalities and correspond to annotation identification information; a plurality of annotation relation features respectively corresponding to the plurality of sample images are determined according to the plurality of modalities, wherein an annotation relation feature describes the association between the modality of the corresponding sample image and the modalities of the other sample images, the corresponding sample image and the other sample images jointly forming the plurality of sample images; the plurality of sample images are respectively input into the corresponding plurality of residual networks to obtain a plurality of initial modality features respectively output by the plurality of residual networks; the plurality of initial modality features are respectively input into the corresponding plurality of collaborative attention networks to obtain a plurality of attention response features respectively output by the plurality of collaborative attention networks; a plurality of reference attention response features corresponding to each initial modality feature are determined; and the initial modality feature is processed according to the plurality of reference attention response features to obtain the corresponding predicted relation feature. When the initial modality feature is processed according to the plurality of reference attention response features, feature fusion among the multi-modal features is effectively realized, and the feature representation capability of the predicted relation features is effectively improved. When a first convergence condition is satisfied between the plurality of predicted relation features and the plurality of annotation relation features respectively corresponding thereto, the image recognition model to be trained is trained according to the plurality of predicted relation features and the annotation identification information to obtain the target image recognition model, effectively improving the recognition performance and recognition effect of the target image recognition model.
Fig. 5 is a schematic diagram according to a fourth embodiment of the present disclosure.
It should be noted that an execution subject of the image recognition method of this embodiment is an image recognition apparatus, the apparatus may be implemented by software and/or hardware, the apparatus may be configured in an electronic device, and the electronic device may include, but is not limited to, a terminal, a server, and the like.
As shown in fig. 5, the image recognition method includes:
S501: and acquiring a plurality of images to be identified, wherein the plurality of images to be identified respectively correspond to a plurality of modalities.
The image currently to be recognized may be referred to as an image to be recognized; it may be an image obtained by image capture of an arbitrary target object, or a partial video frame extracted from a plurality of video frames, which is not limited thereto.
It should be noted that the face image in the embodiment of the present disclosure, or any other image that may relate to user information, is obtained after authorization by the relevant user, and the obtaining process thereof all complies with the regulations of the relevant laws and regulations, and does not violate the good customs of the public order.
S502: and respectively inputting the images to be recognized into the target image recognition model obtained by the training method of the image recognition model to obtain the target recognition information output by the target image recognition model.
After acquiring the plurality of images to be recognized, the plurality of images to be recognized may be respectively input into the target image recognition model obtained by the training method of the image recognition model, so as to obtain the recognition information output by the target image recognition model, where the recognition information may be referred to as target recognition information.
In this embodiment, a plurality of images to be recognized are obtained, wherein the plurality of images to be recognized correspond to a plurality of modalities respectively, and then the plurality of images to be recognized are input into the target image recognition model obtained by the training method of the image recognition model, so as to obtain target recognition information output by the target image recognition model.
Fig. 6 is a schematic diagram according to a fifth embodiment of the present disclosure.
As shown in fig. 6, the training apparatus 60 for image recognition models includes:
a first obtaining module 601, configured to obtain a plurality of sample images, where the plurality of sample images correspond to multiple modalities respectively, and the plurality of sample images correspond to labeled identification information;
a determining module 602, configured to determine, according to multiple modalities, multiple annotation relation features corresponding to multiple sample images, where the annotation relation features describe association between modalities of corresponding sample images and modalities of other sample images, and the corresponding sample images and the other sample images jointly form multiple sample images; and
the training module 603 is configured to train an initial image recognition model according to the multiple sample images, the multiple labeling relationship features, and the labeling recognition information to obtain a target image recognition model.
In some embodiments of the present disclosure, as shown in fig. 7, which is a schematic diagram of a training apparatus 70 for an image recognition model according to a sixth embodiment of the present disclosure, the apparatus includes: a first obtaining module 701, a determining module 702, and a training module 703, wherein the training module 703 includes:
the first processing sub-module 7031 is configured to input the plurality of sample images into the corresponding plurality of residual error networks, respectively, so as to obtain a plurality of initial modal characteristics output by the plurality of residual error networks, where the initial modal characteristics are characteristics describing the modalities of the corresponding sample images obtained by prediction;
the second processing submodule 7032 is configured to input the multiple initial modal characteristics into the multiple corresponding collaborative attention networks, respectively, so as to obtain multiple predicted relationship characteristics output by the multiple collaborative attention networks, respectively;
the training sub-module 7033 is configured to train the to-be-trained image recognition model according to the multiple prediction relationship features and the label recognition information when the multiple prediction relationship features and the multiple label relationship features respectively corresponding to the multiple label relationship features satisfy a first convergence condition, so as to obtain a target image recognition model.
In some embodiments of the present disclosure, among others, the training submodule 7033 is specifically configured to:
inputting the plurality of prediction relation characteristics into the to-be-trained image recognition model to obtain prediction recognition information output by the to-be-trained image recognition model;
and if the predicted identification information and the labeled identification information meet a second convergence condition, taking the image identification model obtained by training as a target image identification model.
In some embodiments of the present disclosure, among others, second processing submodule 7032 includes:
an obtaining unit 70321, configured to input the initial modal characteristics into the corresponding attention coordination networks, respectively, so as to obtain attention response characteristics output by the attention coordination networks, respectively;
a determining unit 70322, configured to determine a plurality of reference attention response characteristics corresponding to the initial modal characteristic, where the reference attention response characteristics are attention response characteristics other than the attention response characteristic corresponding to the initial modal characteristic among the plurality of attention response characteristics;
a processing unit 70323, configured to process the initial modality feature according to the plurality of reference attention response features to obtain a corresponding predicted relationship feature.
In some embodiments of the present disclosure, among others, the processing unit 70323 is specifically configured to:
analyzing the initial modal characteristics to obtain first modal characteristics and second modal characteristics, wherein the first modal characteristics and the second modal characteristics are different;
connecting the first modal characteristic and the second modal characteristic to obtain a reference modal characteristic;
performing fusion processing on the initial modal characteristics and the reference modal characteristics to obtain modal characteristics to be processed;
and processing the modal characteristics to be processed according to the reference attention response characteristics to obtain corresponding prediction relation characteristics.
In some embodiments of the present disclosure, among others, the processing unit 70323 is specifically configured to:
adding the plurality of reference attention response characteristics to obtain a target attention response characteristic;
and multiplying the target attention response characteristic and the modal characteristic to be processed to obtain a corresponding prediction relation characteristic.
In some embodiments of the present disclosure, the first modality feature is a local modality feature corresponding to the background region in the sample image corresponding to the initial modality feature, and the second modality feature is a gradient information feature in the sample image corresponding to the initial modality feature.
It can be understood that the training apparatus 70 of the image recognition model in fig. 7 of the present embodiment may have the same functions and structure as the training apparatus 60 of the image recognition model in the above-described embodiment; the first obtaining module 701 may have the same functions and structure as the first obtaining module 601, the determining module 702 the same as the determining module 602, and the training module 703 the same as the training module 603 in the above-described embodiment.
The above explanation of the training method of the image recognition model is also applicable to the training apparatus of the image recognition model of the present embodiment, and is not repeated here.
In this embodiment, a plurality of sample images are obtained, where the plurality of sample images respectively correspond to a plurality of modalities and are correspondingly annotated with identification information. A plurality of annotation relation features respectively corresponding to the plurality of sample images are then determined according to the plurality of modalities, where each annotation relation feature describes the association between the modality of the corresponding sample image and the modalities of the other sample images, the corresponding sample image and the other sample images together forming the plurality of sample images. An initial image recognition model is trained according to the plurality of sample images, the plurality of annotation relation features, and the annotation identification information to obtain a target image recognition model. In this way, the multi-modal advantages of the images can be effectively exploited, and the association features between different modalities of the sample images can serve as annotation data for model training, so that the trained target image recognition model can effectively learn and model the relationships between different modalities of the images, which effectively improves the recognition performance and recognition effect of the target image recognition model.
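The training-time forward pass summarized above can be sketched as a toy computation: one residual branch per modality extracts an initial modal feature, one cooperative attention network per branch produces an attention response, and each branch's predicted relation feature is driven by the other branches' attention responses, which is what lets the model learn cross-modal association. All names and operators below are stand-ins (a linear map for the residual network, a sigmoid for the attention network), not the patent's actual networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def branch_forward(image, weight):
    """Stand-in for one residual network: sample image -> initial modal feature."""
    return image @ weight

def co_attention(feature):
    """Stand-in for one cooperative attention network: feature -> attention response."""
    return 1.0 / (1.0 + np.exp(-feature))  # sigmoid as a placeholder

# One flattened sample image per modality, and one branch per modality.
modalities = [rng.standard_normal(8) for _ in range(3)]
weights = [rng.standard_normal((8, 4)) for _ in range(3)]

initial_feats = [branch_forward(x, w) for x, w in zip(modalities, weights)]
attentions = [co_attention(f) for f in initial_feats]

predicted_relations = []
for i, feat in enumerate(initial_feats):
    # Reference attention responses: every branch's response except branch i's own.
    reference = sum(a for j, a in enumerate(attentions) if j != i)
    predicted_relations.append(reference * feat)
```

During training, these predicted relation features would be compared against the annotation relation features until the first convergence condition is met, after which the image recognition model to be trained is fitted on them together with the annotation identification information.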
Fig. 8 is a schematic diagram according to a seventh embodiment of the present disclosure.
As shown in fig. 8, the image recognition apparatus 80 includes:
a second obtaining module 801, configured to obtain a plurality of images to be identified, where the plurality of images to be identified respectively correspond to multiple modalities;
the generating module 802 is configured to input the plurality of images to be recognized into the target image recognition model obtained by training with the above training apparatus of the image recognition model, so as to obtain the target recognition information output by the target image recognition model.
It should be noted that the foregoing explanation of the image recognition method is also applicable to the image recognition apparatus of the present embodiment, and is not repeated here.
In this embodiment, a plurality of images to be recognized are obtained, where the plurality of images to be recognized respectively correspond to a plurality of modalities. The plurality of images to be recognized are then input into the target image recognition model obtained by the above training method of the image recognition model, so as to obtain the target recognition information output by the target image recognition model.
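As an illustration only (the patent names no concrete API), the inference flow of this embodiment reduces to feeding one image per modality into the trained model. `recognize` and `dummy_model` below are hypothetical names introduced for the sketch:

```python
import numpy as np

def recognize(target_model, images):
    """Feed one image per modality into the trained target image recognition
    model and return the recognition information it outputs."""
    assert len(images) >= 2, "the method expects images of multiple modalities"
    return target_model(images)

# Toy stand-in for the trained target image recognition model: it averages
# the per-modality images and thresholds the result into a binary label.
def dummy_model(images):
    return int(np.mean([im.mean() for im in images]) > 0)
```

In practice the modalities could be, for example, an RGB image paired with a depth or infrared image of the same scene, each preprocessed to the model's expected input shape.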
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device that can implement the training method of the image recognition model of embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; the storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the various methods and processes described above, such as the training method of the image recognition model or the image recognition method. For example, in some embodiments, the training method of the image recognition model or the image recognition method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the training method of the image recognition model or of the image recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured by any other suitable means (e.g., by means of firmware) to perform the training method of the image recognition model or the image recognition method.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that overcomes the defects of high management difficulty and weak service expansibility in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. A training method of an image recognition model comprises the following steps:
acquiring a plurality of sample images, wherein the plurality of sample images respectively correspond to a plurality of modalities, and the plurality of sample images are correspondingly marked with identification information;
determining a plurality of labeling relation features respectively corresponding to the plurality of sample images according to the plurality of modalities, wherein the labeling relation features describe the association between the modalities of the corresponding sample images and the modalities of other sample images, and the corresponding sample images and the other sample images jointly form the plurality of sample images; and
and training an initial image recognition model according to the plurality of sample images, the plurality of labeling relation characteristics and the labeling recognition information to obtain a target image recognition model.
2. The method of claim 1, wherein the initial image recognition model comprises: a plurality of residual error networks, a plurality of cooperative attention networks respectively connected with the residual error networks, and an image recognition model to be trained,
wherein, the training an initial image recognition model according to the plurality of sample images, the plurality of labeling relationship characteristics, and the labeling recognition information to obtain a target image recognition model comprises:
inputting the sample images into the residual error networks respectively to obtain initial modal characteristics output by the residual error networks respectively, wherein the initial modal characteristics are characteristics which are obtained by prediction and describe the modal of the corresponding sample images;
inputting the initial modal characteristics into the cooperative attention networks respectively to obtain a plurality of predicted relationship characteristics output by the cooperative attention networks respectively;
and when a first convergence condition is satisfied between the plurality of prediction relation features and the plurality of labeling relation features respectively corresponding to the plurality of prediction relation features, training the image recognition model to be trained according to the plurality of prediction relation features and the labeling recognition information to obtain the target image recognition model.
3. The method of claim 2, wherein the training the image recognition model to be trained according to the plurality of predicted relationship features and the label recognition information to obtain the target image recognition model comprises:
inputting the plurality of prediction relation characteristics into the to-be-trained image recognition model to obtain prediction recognition information output by the to-be-trained image recognition model;
and if the predicted identification information and the labeled identification information meet a second convergence condition, taking the image identification model obtained by training as the target image identification model.
4. The method according to claim 2, wherein the inputting the plurality of initial modal characteristics into the corresponding plurality of cooperative attention networks respectively to obtain a plurality of predicted relationship characteristics output by the plurality of cooperative attention networks respectively comprises:
inputting the initial modal characteristics into the cooperative attention networks respectively to obtain attention response characteristics output by the cooperative attention networks respectively;
determining a plurality of reference attention response characteristics corresponding to the initial modal characteristics, wherein the reference attention response characteristics are attention response characteristics except the attention response characteristics corresponding to the initial modal characteristics in the plurality of attention response characteristics;
and processing the initial modal characteristics according to the plurality of reference attention response characteristics to obtain corresponding predicted relationship characteristics.
5. A method according to claim 4, wherein said processing the initial modality features according to the plurality of reference attention response features to derive corresponding predicted relationship features comprises:
analyzing the initial modal characteristics to obtain first modal characteristics and second modal characteristics, wherein the first modal characteristics and the second modal characteristics are different;
connecting the first modal characteristic and the second modal characteristic to obtain a reference modal characteristic;
performing fusion processing on the initial modal characteristics and the reference modal characteristics to obtain modal characteristics to be processed;
and processing the modal characteristics to be processed according to the reference attention response characteristics to obtain the corresponding prediction relation characteristics.
6. The method according to claim 5, wherein said processing the modality feature to be processed according to the plurality of reference attention response features to obtain the corresponding predicted relationship feature comprises:
adding the plurality of reference attention response characteristics to obtain a target attention response characteristic;
and multiplying the target attention response characteristic and the modal characteristic to be processed to obtain the corresponding prediction relation characteristic.
7. The method according to claim 5, wherein the first modality feature is a local modality feature corresponding to a background region in the sample image corresponding to the initial modality feature, and the second modality feature is a gradient information feature in the sample image corresponding to the initial modality feature.
8. An image recognition method, comprising:
acquiring a plurality of images to be identified, wherein the images to be identified respectively correspond to a plurality of modalities;
respectively inputting the plurality of images to be recognized into the target image recognition model obtained by training according to the training method of the image recognition model of any one of claims 1 to 7, so as to obtain the target recognition information output by the target image recognition model.
9. An apparatus for training an image recognition model, comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a plurality of sample images, the plurality of sample images respectively correspond to a plurality of modalities, and the plurality of sample images correspond to label identification information;
a determining module, configured to determine, according to the multiple modalities, multiple annotation relation features corresponding to the multiple sample images, where the annotation relation features describe association between a modality of a corresponding sample image and modalities of other sample images, and the corresponding sample image and the other sample images together form the multiple sample images; and
and the training module is used for training an initial image recognition model according to the plurality of sample images, the plurality of labeling relation characteristics and the labeling recognition information so as to obtain a target image recognition model.
10. The apparatus of claim 9, wherein the initial image recognition model comprises: a plurality of residual error networks, a plurality of cooperative attention networks respectively connected with the residual error networks, and an image recognition model to be trained,
wherein the training module comprises:
the first processing submodule is used for respectively inputting the plurality of sample images into the corresponding plurality of residual error networks so as to obtain a plurality of initial modal characteristics respectively output by the plurality of residual error networks, wherein the initial modal characteristics are characteristics which are obtained by prediction and describe the modal of the corresponding sample images;
the second processing submodule is used for respectively inputting the initial modal characteristics into the plurality of cooperative attention networks so as to obtain a plurality of predicted relationship characteristics respectively output by the cooperative attention networks;
and the training sub-module is used for training the image recognition model to be trained according to the plurality of prediction relation features and the label recognition information when a first convergence condition is satisfied between the plurality of prediction relation features and the plurality of labeling relation features respectively corresponding thereto, so as to obtain the target image recognition model.
11. The apparatus of claim 10, wherein the training submodule is specifically configured to:
inputting the plurality of prediction relation characteristics into the to-be-trained image recognition model to obtain prediction recognition information output by the to-be-trained image recognition model;
and if the predicted identification information and the labeled identification information meet a second convergence condition, taking the image identification model obtained by training as the target image identification model.
12. The apparatus of claim 10, wherein the second processing submodule comprises:
an obtaining unit, configured to input the multiple initial modal features into the multiple coordinated attention networks, respectively, so as to obtain multiple attention response features output by the multiple coordinated attention networks, respectively;
a determining unit, configured to determine a plurality of reference attention response features corresponding to the initial modal feature, where the reference attention response features are attention response features other than the attention response feature corresponding to the initial modal feature among the plurality of attention response features;
and the processing unit is used for processing the initial modal characteristics according to the plurality of reference attention response characteristics to obtain corresponding prediction relation characteristics.
13. The apparatus according to claim 12, wherein the processing unit is specifically configured to:
analyzing the initial modal characteristics to obtain first modal characteristics and second modal characteristics, wherein the first modal characteristics and the second modal characteristics are different;
connecting the first modal characteristic and the second modal characteristic to obtain a reference modal characteristic;
performing fusion processing on the initial modal characteristics and the reference modal characteristics to obtain modal characteristics to be processed;
and processing the modal characteristics to be processed according to the reference attention response characteristics to obtain the corresponding prediction relation characteristics.
14. The apparatus according to claim 13, wherein the processing unit is specifically configured to:
adding the plurality of reference attention response characteristics to obtain a target attention response characteristic;
and multiplying the target attention response characteristic and the modal characteristic to be processed to obtain the corresponding prediction relation characteristic.
15. The apparatus according to claim 13, wherein the first modality feature is a local modality feature corresponding to a background region in the sample image corresponding to the initial modality feature, and the second modality feature is a gradient information feature in the sample image corresponding to the initial modality feature.
16. An image recognition apparatus comprising:
the second acquisition module is used for acquiring a plurality of images to be identified, wherein the images to be identified respectively correspond to a plurality of modalities;
a generating module, configured to input the multiple images to be recognized into the target image recognition models obtained by training the training apparatus of the image recognition model according to any one of claims 9 to 15, respectively, so as to obtain the target recognition information output by the target image recognition models.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7 or to perform the method of claim 8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7 or to perform the method of claim 8.
19. A computer program product comprising a computer program which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7 or carries out the steps of the method according to claim 8.
CN202111256246.1A 2021-10-27 2021-10-27 Training method and device of image recognition model, electronic equipment and storage medium Pending CN114092759A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111256246.1A CN114092759A (en) 2021-10-27 2021-10-27 Training method and device of image recognition model, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114092759A true CN114092759A (en) 2022-02-25

Family

ID=80298139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111256246.1A Pending CN114092759A (en) 2021-10-27 2021-10-27 Training method and device of image recognition model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114092759A (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829677A (en) * 2018-06-05 2018-11-16 大连理工大学 A kind of image header automatic generation method based on multi-modal attention
WO2019233421A1 (en) * 2018-06-04 2019-12-12 京东数字科技控股有限公司 Image processing method and device, electronic apparatus, and storage medium
WO2020114118A1 (en) * 2018-12-07 2020-06-11 深圳光启空间技术有限公司 Facial attribute identification method and device, storage medium and processor
CN111737458A (en) * 2020-05-21 2020-10-02 平安国际智慧城市科技股份有限公司 Intention identification method, device and equipment based on attention mechanism and storage medium
WO2020215676A1 (en) * 2019-04-26 2020-10-29 平安科技(深圳)有限公司 Residual network-based image identification method, device, apparatus, and storage medium
CN112396106A (en) * 2020-11-18 2021-02-23 腾讯科技(深圳)有限公司 Content recognition method, content recognition model training method, and storage medium
CN113159212A (en) * 2021-04-30 2021-07-23 上海云从企业发展有限公司 OCR recognition model training method, device and computer readable storage medium
CN113361363A (en) * 2021-05-31 2021-09-07 北京百度网讯科技有限公司 Training method, device and equipment for face image recognition model and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU Shangzheng; LIU Bin: "Design of a cross-modal recognition system for image category labels based on generative adversarial networks" (生成对抗网络图像类别标签跨模态识别系统设计), Modern Electronics Technique (现代电子技术), no. 08, 15 April 2020 (2020-04-15) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581838A (en) * 2022-04-26 2022-06-03 阿里巴巴达摩院(杭州)科技有限公司 Image processing method and device and cloud equipment
CN114841970A (en) * 2022-05-09 2022-08-02 北京字节跳动网络技术有限公司 Inspection image recognition method and device, readable medium and electronic equipment
CN114972910A (en) * 2022-05-20 2022-08-30 北京百度网讯科技有限公司 Image-text recognition model training method and device, electronic equipment and storage medium
CN115170449A (en) * 2022-06-30 2022-10-11 陕西科技大学 Method, system, device and medium for generating multi-mode fusion scene graph
CN115170449B (en) * 2022-06-30 2023-09-22 陕西科技大学 Multi-mode fusion scene graph generation method, system, equipment and medium
CN114998712A (en) * 2022-08-03 2022-09-02 阿里巴巴(中国)有限公司 Image recognition method, storage medium, and electronic device
CN114998712B (en) * 2022-08-03 2022-11-15 阿里巴巴(中国)有限公司 Image recognition method, storage medium, and electronic device

CN115457365A (en) Model interpretation method and device, electronic equipment and storage medium
CN113344214B (en) Training method and device of data processing model, electronic equipment and storage medium
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium
CN114863450A (en) Image processing method, image processing device, electronic equipment and storage medium
CN114842541A (en) Model training and face recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination