CN115147679B - Multi-modal image recognition method and device, model training method and device


Info

Publication number
CN115147679B
CN115147679B
Authority
CN
China
Prior art keywords
target, image, feature, online, network
Prior art date
Legal status
Active
Application number
CN202210768182.1A
Other languages
Chinese (zh)
Other versions
CN115147679A (en)
Inventor
张婉平
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210768182.1A
Publication of CN115147679A
Application granted
Publication of CN115147679B


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The disclosure provides a multi-modal image recognition model training method and device, relates to the technical field of artificial intelligence, in particular to the technical fields of deep learning, image processing and computer vision, and can be applied to scenarios such as face recognition. The specific implementation scheme is as follows: select target image samples of at least two modalities from a pre-constructed multi-target multi-modal image sample set, where the target image samples all have the same target; input any one modality image sample among the target image samples into a target network for feature extraction to obtain a target feature; input the target image samples into an online network for feature extraction to obtain first online features; input the first online features into their corresponding feature queues, respectively, to obtain one-to-one corresponding feature sequences; and train a multi-modal image recognition model corresponding to the online network based on the target feature and the feature sequences. This embodiment improves the accuracy of multi-modal image recognition.

Description

Multi-modal image recognition method and device, model training method and device
Technical Field
The present disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of deep learning, image processing and computer vision, and can be applied to scenarios such as face recognition. It particularly relates to a multi-modal image recognition model training method and device, a multi-modal image recognition method and device, an electronic device, a computer-readable medium, and a computer program product.
Background
In target recognition applications, it is sometimes necessary to be compatible with recognizing targets in both visible-light images and near-infrared images. Because the styles of visible-light images and near-infrared images differ greatly, a recognition model obtained by directly mixing the two kinds of images together performs poorly.
Disclosure of Invention
Provided are a multi-modal image recognition model training method and device, a multi-modal image recognition method and device, an electronic device, a computer-readable medium, and a computer program product.
According to a first aspect, there is provided a multi-modal image recognition model training method, the method comprising: selecting target image samples of at least two modalities from a pre-constructed multi-target multi-modal image sample set, where the target image samples all have the same target; inputting any one modality image sample among the target image samples into a target network for feature extraction to obtain a target feature; inputting the target image samples into an online network for feature extraction to obtain first online features, where the online network and the target network have the same network structure; inputting the first online features into their corresponding feature queues, respectively, to obtain one-to-one corresponding feature sequences; and training a multi-modal image recognition model corresponding to the online network based on the target feature and the feature sequences.
According to a second aspect, there is provided a multi-modal image recognition method, the method comprising: acquiring an image having at least two modalities; inputting the image into a multi-modal image recognition model generated by the method described in any implementation of the first aspect to obtain features of the image; and obtaining a recognition result for the target in the image based on the features of the image.
According to a third aspect, there is provided a multi-modal image recognition model training apparatus, the apparatus comprising: a sample selection unit configured to select target image samples of at least two modalities from a pre-constructed multi-target multi-modal image sample set, where the target image samples all have the same target; a target obtaining unit configured to input any one modality image sample among the target image samples into a target network for feature extraction to obtain a target feature of the target image samples; an online obtaining unit configured to input the target image samples into an online network for feature extraction to obtain first online features of the target image samples, where the online network and the target network have the same network structure; a sequence obtaining unit configured to input the first online features into their corresponding feature queues, respectively, to obtain one-to-one corresponding feature sequences; and a training unit configured to train a multi-modal image recognition model corresponding to the online network based on the target feature and the feature sequences.
According to a fourth aspect, there is provided a multi-modal image recognition apparatus, the apparatus comprising: an image acquisition unit configured to acquire an image having at least two modalities; a feature obtaining unit configured to input the image into a multi-modal image recognition model generated by the apparatus described in any implementation of the third aspect to obtain features of the image; and an image recognition unit configured to obtain a recognition result for the target in the image based on the features of the image.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first or second aspect.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a method as described in any implementation of the first or second aspect.
According to a seventh aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first or second aspects.
The multi-modal image recognition model training method and device provided by the embodiments of the present disclosure first select target image samples of at least two modalities from a pre-constructed multi-target multi-modal image sample set, where the target image samples all have the same target; second, input any one modality image sample among the target image samples into a target network for feature extraction to obtain a target feature; third, input the target image samples into an online network for feature extraction to obtain first online features, where the online network and the target network have the same network structure; fourth, input the first online features into their corresponding feature queues, respectively, to obtain one-to-one corresponding feature sequences; and finally, train a multi-modal image recognition model corresponding to the online network based on the target feature and the feature sequences. During training of the multi-modal image recognition model, the first online features are fully drawn upon to train the online network, so that the online network can determine the features of target image samples of every modality, which improves the accuracy with which the multi-modal image recognition model recognizes the features of targets in images of every modality. Moreover, the multi-modal image recognition model is trained using an online network and a target network with the same network structure, and contrastive learning and feature queues are applied to target feature recognition in multi-modal images, so that the multi-modal image recognition model can evenly learn the image information of all modality images during training, improving the accuracy of target recognition.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flowchart of one embodiment of a multi-modal image recognition model training method according to the present disclosure;
FIG. 2 is a schematic structural diagram of the multi-modal image recognition model training process of the present disclosure;
FIG. 3 is a flowchart of one embodiment of a multi-modal image recognition method according to the present disclosure;
FIG. 4 is a schematic structural diagram of an embodiment of a multi-modal image recognition model training apparatus according to the present disclosure;
FIG. 5 is a schematic structural diagram of an embodiment of a multi-modal image recognition apparatus according to the present disclosure;
FIG. 6 is a block diagram of an electronic device for implementing the multi-modal image recognition model training method or the multi-modal image recognition method of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In this embodiment, "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature.
In order to recognize a target (e.g., a face) in an image, it is sometimes necessary to be compatible with images of multiple modalities, but directly mixing images of multiple modalities together to train a target recognition model yields poor results. The present disclosure therefore proposes a multi-queue-based multi-modal image recognition model training method that constructs feature queues for images of different modalities, so that the multi-modal image recognition model can evenly learn the information of all modality images during training. FIG. 1 shows a flow 100 of one embodiment of the multi-modal image recognition model training method according to the present disclosure. The multi-modal image recognition model training method includes the following steps:
step 101, selecting target image samples of at least two modes from a pre-constructed multi-target multi-mode image sample set.
In this embodiment, the multi-target multi-mode image sample set includes at least one mode image sample, where each mode image sample may be an image formed by capturing objects (characters, animals, scenes, etc.) with different identities in different imaging modes and different illumination conditions, so that the multi-target multi-mode image sample set is a set of multi-mode images corresponding to different objects, for example, an image sample of one mode is an image obtained by capturing a first animal in a white light imaging mode, and an image sample of another mode is an image obtained by capturing a second animal in an infrared imaging mode. And in each iterative training process of the target network, the image sample selected from the multi-target multi-mode image sample set is a target image sample, and the target image sample is provided with a target.
In this embodiment, at least two kinds of target image samples (such as white light image and narrow-band light image) of different modes are selected from the multi-target multi-mode image sample set, and the target image samples all have the same target, which means that the pose, appearance, etc. of the targets of each mode image sample in the target image sample may be the same or different. And the multi-target multi-modal image sample set also has image samples of different modalities of other targets.
In this embodiment, the target identifier that may be used by each mode image sample in the target image sample is labeled, and when the multi-mode image recognition model is trained, the target image sample may be a plurality of images of different modes obtained from the same target identifier, for example, three images are sampled by the same target identifier (id), where the three images include: the system comprises a first mode image, a second mode image and a sampling image, wherein the sampling image is the first mode image or the second mode image.
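As an illustration, this sampling step can be sketched as follows in Python; the dictionary layout of the sample set and the modality names are assumptions, not part of the disclosure:

```python
import random

def sample_target_images(sample_set, modalities=("visible", "near_infrared")):
    """Pick one target id, one image per modality for that id, and a sampled
    image drawn at random from among the chosen modality images."""
    target_id = random.choice(list(sample_set))
    per_modality = {m: random.choice(sample_set[target_id][m]) for m in modalities}
    sampled = per_modality[random.choice(modalities)]  # the "any one modality" image
    return target_id, per_modality, sampled
```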
Step 102, inputting any one modality image sample among the target image samples into a target network for feature extraction to obtain a target feature.
In this embodiment, the "any one modality image sample among the target image samples" is the sampled image selected from the target image samples, and the target network is a network for recognizing and extracting features of the target in an image. The sampled image is input into the target network so that the target network can recognize and extract the features of the target in the sampled image. The target feature is information reflecting the region, category, etc. of the target in the sampled image.
In this embodiment, the target network is a feature extraction network: by adjusting its parameters, the target network can extract features of targets in images of different modalities at different moments. The target network may adopt network structures such as VGG (Visual Geometry Group network) or a residual network (e.g., ResNet-34).
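For illustration only, a minimal sketch of such an encoder is given below; the use of torchvision's ResNet-34 and the 128-dimensional embedding are assumptions, since the disclosure only names the backbone families:

```python
import torch.nn as nn
from torchvision.models import resnet34

def make_encoder(embed_dim=128):
    # One definition serves both networks, since the online network and the
    # target network share the same network structure (see step 103).
    net = resnet34(weights=None)
    net.fc = nn.Linear(net.fc.in_features, embed_dim)
    return net

target_net = make_encoder()
online_net = make_encoder()
online_net.load_state_dict(target_net.state_dict())  # start from identical weights
```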
Step 103, inputting the target image samples into an online network for feature extraction to obtain first online features.
Here, the online network and the target network have the same network structure. The online network is also a feature extraction network: by adjusting its parameters, the online network can simultaneously extract features of targets in images of different modalities. The online network may adopt a VGG network, a residual network, etc.
In this embodiment, the structure and function of the online network and the target network are the same, but the ways their network parameters are adjusted during training of the multi-modal image recognition model differ, so the two networks produce different recognition results for the same images. The online network is a network for recognizing and extracting features of targets in images of different modalities; the target image samples (images of multiple modalities) are input into the online network at the current moment, so that the online network can recognize and extract the first online features of the targets in the target image samples. A first online feature is a feature reflecting information such as the region and category of the target in a target image sample at the current moment.
For an image sample of the same modality, the first online feature and the target feature are both features of that modality image sample's target, and may differ in nuance because the network parameters of the online network and the target network differ.
In this embodiment, the target image samples include modality image samples of at least two modalities, and the first online features are the features output by the online network at the current moment. Thus, at the current moment, the online network performs feature recognition on the target in each modality image sample to obtain the first online feature of each modality image sample. At historical moments before the current moment, the online network performed feature recognition on the targets in the modality image samples of those rounds (which may differ from the current target), yielding historical online features: features from historical moments that reflect information such as the image region and category of the targets (which may be the same as or different from the current target) in the target image samples of those rounds.
Step 104, inputting the first online features into their corresponding feature queues, respectively, to obtain one-to-one corresponding feature sequences.
In this embodiment, a feature queue is a preset memory storing features of the same or different targets. Each modality in the multi-target multi-modal image sample set has its own feature queue; that is, the network used to train the multi-modal image recognition model maintains multiple feature queues, each corresponding to one modality, and the number of modalities in the multi-target multi-modal image sample set equals the number of feature queues. After the first online features of the target image samples are obtained, the first online feature of each modality image sample is stored in the feature queue corresponding to that modality. Because the targets in the target image samples may differ between iterations of training the multi-modal image recognition model, the targets corresponding to any two online features stored in a feature queue may be the same or different. It should be noted that the online features in a feature queue may include both the first online features stored at the current moment and the historical online features stored at historical moments; whether first or historical, they are all online features of targets output by the online network at different moments.
In this embodiment, a feature sequence is a sequence composed of multiple online features selected from a feature queue, where the selected online features include at least the first online feature. The feature sequence may therefore be all the online features in the feature queue, may include the first online feature and some of the historical online features, or may include the first online feature and the historical online features at specific positions in the feature queue.
Step 105, training a multi-modal image recognition model corresponding to the online network based on the target feature and the feature sequences.
In this embodiment, a first online feature and the target feature belonging to the same target form a positive sample pair; the feature sequence contains the first online features of that target, and may also contain online features that do not belong to that target. Online features in the feature sequence that belong to the same target as the target feature serve as positive samples, while those that do not belong to that target serve as negative samples. The recognition capability of the online network is trained using the target feature and the online samples in the feature sequences until the online network can distinguish positive samples from negative samples, thereby obtaining the multi-modal image recognition model corresponding to the online network.
In this embodiment, the multi-modal image recognition model can recognize the targets in images of all modalities in the multi-target multi-modal image sample set to obtain recognition results for each target in all modality image samples, where the recognition result for each target includes: the modality features of the image in which the target is located, the features of the target, the region of the target in images of different modalities, the category of the target, and so on.
According to the multi-modal image recognition model training method provided by this embodiment of the present disclosure, target image samples of at least two modalities are first selected from a pre-constructed multi-target multi-modal image sample set, the target image samples all having the same target; second, any one modality image sample among the target image samples is input into a target network for feature extraction to obtain a target feature; third, the target image samples are input into an online network for feature extraction to obtain first online features, where the online network and the target network have the same network structure; fourth, the first online features are input into their corresponding feature queues, respectively, to obtain one-to-one corresponding feature sequences; and finally, a multi-modal image recognition model corresponding to the online network is trained based on the target feature and the feature sequences. During training of the multi-modal image recognition model, the first online features are fully drawn upon to train the online network, so that the online network can determine the features of target image samples of every modality, improving the accuracy with which the multi-modal image recognition model recognizes the features of targets in images of every modality. Moreover, because an online network and a target network with the same network structure are used, and contrastive learning and feature queues are applied to target feature recognition in multi-modal images, the multi-modal image recognition model can evenly learn the image information of all modality images during training, improving the accuracy of target recognition.
In some optional implementations of this embodiment, step 105 may specifically include: updating a first parameter of the target network and a second parameter of the online network based on the target feature and the feature sequences; and, in response to determining that the target network meets the training completion condition, obtaining the multi-modal image recognition model corresponding to the online network, the target network being trained based on the first parameter.
In this optional implementation, in each iterative training round of the target network (one round consists of inputting the target image samples into the online network and the target network, computing the loss of the target network, and performing one iteration of training), the error of the online network and the target network can be computed based on the target feature and the feature sequences; based on this error, the first parameter of the target network and the second parameter of the online network are updated. When the target network meets the training completion condition, the target network is determined to meet the error requirement, and the multi-modal image recognition model corresponding to the online network is obtained.
In this optional implementation, the target network is trained based on the first parameter, i.e., based on the loss of the target network; the first parameter is the parameter that needs to be updated. The target network is identical in structure to the online network, and the second parameter may be the same as, or related to, the first parameter.
In this optional implementation, computing the error of the online network and the target network based on the target feature and the feature sequences includes: computing the differences against the labeled ground truth of the modality image samples corresponding to the target feature and the feature sequences, respectively, to obtain the error of the online network and the target network. Based on this error, the second parameter of the online network and the first parameter (weights and biases) of the target network are changed, completing one iterative training round of the target network; corresponding target image samples then continue to be input into the target network and the online network, and the error against the labeled ground truth of the modality image samples corresponding to the target feature and the feature sequences is computed, until the error reaches a minimum.
In this optional implementation, the training completion condition may include at least one of: the number of training iterations of the target network reaches a predetermined iteration threshold; the total loss value of the target network is smaller than a predetermined loss threshold; the classification accuracy of the target network is within a predetermined range. For example, the predetermined iteration threshold is 50,000 iterations, the predetermined loss threshold is 0.05, and the predetermined range is (70%, 100%).
In this optional implementation, after each iterative training round is completed, whether the training completion condition is satisfied is checked once; when it is not satisfied, steps 101 to 105 are re-executed until the target network satisfies the training completion condition, thereby obtaining the multi-modal image recognition model corresponding to the online network.
In the method for training a multi-modal image recognition model provided by this optional implementation, the first parameter of the target network and the second parameter of the online network are updated based on the target feature and the feature sequences, and the multi-modal image recognition model corresponding to the online network is obtained when the target network meets the training completion condition. The online network of the multi-modal image recognition model thus fully draws upon the properties of the target feature and the first online features during training, improving the accuracy of the multi-modal image recognition model in recognizing multi-modal images.
In some optional implementations of this embodiment, for "updating the first parameter of the target network and the second parameter of the online network based on the target feature and the feature sequences" in step 105, the loss values of the network for the different modality image samples need to be computed, and then the total loss value of the whole target network is determined from the loss values of all modality image samples. Specifically, updating the first parameter of the target network and the second parameter of the online network based on the target feature and the feature sequences may include: computing the loss value of each modality based on the target feature and the feature sequences; computing a total loss value based on the loss values; and updating the first parameter and the second parameter based on the total loss value.
In this optional implementation, computing the total loss value based on the loss values includes: summing the loss values corresponding to all feature sequences to obtain the total loss value.
Optionally, computing the total loss value based on the loss values may further include: based on the proportion each modality image sample among the target image samples occupies in the task, assigning a different weight to the loss value corresponding to each modality image sample, multiplying each modality's loss value by its weight to obtain a partial loss value for that modality, and adding all partial loss values to obtain the total loss value.
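A one-line sketch of this optional weighted variant; how the weights are derived from each modality's share of the task is left as an assumption:

```python
def total_loss(losses, weights):
    # Lz = sum over modalities of weight_m * loss_m
    return sum(w * l for w, l in zip(weights, losses))
```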
In this embodiment, updating the first parameter of the target network and the second parameter of the online network based on the total loss value includes: based on the total loss value, adjusting the first parameter of the target network and the second parameter of the online network using gradient descent, and detecting whether the total loss value has reached the predetermined loss threshold; if it has not, adjusting the first parameter of the target network and the second parameter of the online network, continuing to select target image samples of at least two modalities from the multi-target multi-modal image sample set, inputting any one modality image among the target image samples into the target network and the target image samples into the online network to obtain the feature sequences, and then computing the total loss value with gradient descent, until the target network meets the training completion condition.
The gradient descent method corresponds to the classification loss function that computes the total loss value, and is used to find the minimum of the classification loss function in each iteration of training. In response to the total loss value not reaching the predetermined loss threshold, the first parameter of the target network and the second parameter of the online network are adjusted, and the total loss value continues to be computed.
As an example, when the target image samples are modality image samples of two modalities (a first modality image M1 and a second modality image M2), as shown in FIG. 2, the sampled image C (the first modality image M1 or the second modality image M2) is input into the target network to obtain a target feature q; the first modality image M1 is input into the online network to obtain a first online feature k1, and the second modality image M2 is input into the online network to obtain a first online feature k2. The first online feature k1 is stored in the first feature queue D1 corresponding to the first modality image, and the first online feature k2 is stored in the second feature queue D2 corresponding to the second modality image; online features are selected from the first feature queue D1 to obtain a first feature sequence X1, and online features are selected from the second feature queue D2 to obtain a second feature sequence X2. The target feature q and the first feature sequence X1 are input into the classification loss function to obtain the loss value L1 of the first modality, and the target feature q and the second feature sequence X2 are input into the classification loss function to obtain the loss value L2 of the second modality. The loss value L1 of the first modality is added to the loss value L2 of the second modality to obtain the total loss value Lz.
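A hedged PyTorch-style sketch of this FIG. 2 training step follows; the L2 normalization, the temperature tau, and the MoCo-style logit layout (positive pair in column 0) are assumptions, not prescribed by the disclosure:

```python
import torch
import torch.nn.functional as F

def training_step(target_net, online_net, sampled_c, m1, m2, queue1, queue2, tau=0.07):
    # Target feature q from the sampled image C (target network, trained by SGD).
    q = F.normalize(target_net(sampled_c), dim=1)
    # First online features k1, k2 from the online network (no gradient here:
    # the online network is updated by exponential moving average, not backprop).
    with torch.no_grad():
        k1 = F.normalize(online_net(m1), dim=1)
        k2 = F.normalize(online_net(m2), dim=1)

    def modality_loss(k, queue):
        # Column 0 holds the positive pair (same target); the remaining columns
        # are negatives drawn from that modality's feature queue, shape (K, d).
        l_pos = (q * k).sum(dim=1, keepdim=True)
        l_neg = q @ queue.t()
        logits = torch.cat([l_pos, l_neg], dim=1) / tau
        labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
        return F.cross_entropy(logits, labels)   # softmax + cross entropy

    loss1 = modality_loss(k1, queue1)            # L1, first modality
    loss2 = modality_loss(k2, queue2)            # L2, second modality
    return loss1 + loss2, k1, k2                 # Lz = L1 + L2; k1, k2 then enter the queues
```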
In the method for updating the parameters of the target network and the online network provided by this optional implementation, the total loss value is computed from the target feature and the feature sequences, and the first parameter of the target network and the second parameter of the online network are updated based on the total loss value; because the total loss value is derived from the target feature and the feature sequences, the reliability of the parameter updates of the target network and the online network is improved.
There are various specific implementations of "updating the first parameter and the second parameter based on the total loss value" in the above optional implementation; for example, the first parameter and the second parameter may be updated simultaneously using the same update algorithm. Alternatively, the first parameter and the second parameter may be updated in sequence using different update algorithms. Specifically, in some optional implementations of this embodiment, updating the first parameter and the second parameter based on the total loss value may include: updating the first parameter of the target network using stochastic gradient descent based on the total loss value; and, after the first parameter has been adjusted by stochastic gradient descent, updating the second parameter of the online network using an exponential moving average algorithm.
In this embodiment, the online network and the target network are structurally the same network, and both have corresponding parameters; for example, a certain network position of the target network has a first parameter A, and correspondingly, the same network position of the online network has a second parameter A'. When the first parameter A is adjusted by stochastic gradient descent (for example, its value is increased or decreased), the second parameter A' of the online network is correspondingly updated using the exponential moving average algorithm. Before the online network's parameters are updated with the exponential moving average algorithm, the parameter values of each parameter of the online network in every iterative training round need to be recorded in real time.
In this embodiment, the exponential moving average algorithm (Exponential Moving Average, EMA), also called the weighted moving average (Weighted Moving Average), is an averaging method that gives more weight to recent data. Given n data points [θ_1, θ_2, …, θ_n] (n a natural number greater than zero), the exponential moving average is v_t = β · v_{t-1} + (1 − β) · θ_t, where v_t represents the average over the first t items and β is the weighting coefficient (typically 0.9 to 0.999).
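A minimal sketch of this EMA update applied parameter-wise is shown below; pairing parameters by position is valid here only because the two networks share the same structure:

```python
import torch

@torch.no_grad()
def ema_update(online_net, target_net, beta=0.999):
    # v_t = beta * v_{t-1} + (1 - beta) * theta_t, applied to every parameter:
    # the online network tracks a moving average of the SGD-trained target network.
    for v, theta in zip(online_net.parameters(), target_net.parameters()):
        v.mul_(beta).add_(theta, alpha=1.0 - beta)
```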
In this optional implementation, the parameters of the target network and the online network may be updated once per iterative training round of the target network: in each round, the online network's second parameter is updated with the exponential moving average algorithm, while the target network's first parameter is updated with stochastic gradient descent, and the network training converges on the target network. After the target network converges, the online network only needs to perform feature extraction; in network deployment, images of the various modalities are not distinguished, so modality images of multiple modalities can be input into the multi-modal image recognition model generated from the online network to recognize the targets in the multi-modal images.
In this optional implementation, the parameters of the target network are updated by stochastic gradient descent, so gradient descent updates are performed on the computed gradients of the target network, improving the convergence of target network training. Meanwhile, parameter updates via the exponential moving average ensure that, once the target network has converged, the parameters of the online network have been updated accordingly, yielding an online network suited to recognizing targets in multi-modal images and providing a reliable means of obtaining the multi-modal image recognition model.
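Putting the pieces together, a hedged sketch of the per-round update order described above (SGD on the target network, then EMA on the online network) might look like this; it reuses training_step and ema_update from the earlier sketches, and the optimizer settings, queue capacity, and data loading are assumptions:

```python
import torch
import torch.nn.functional as F

def train(target_net, online_net, loader, dim=128, capacity=4096, iters=50_000):
    opt = torch.optim.SGD(target_net.parameters(), lr=0.03, momentum=0.9)
    # one feature queue per modality, stored as plain (K, d) tensors
    q1 = F.normalize(torch.randn(capacity, dim), dim=1)
    q2 = F.normalize(torch.randn(capacity, dim), dim=1)
    for it, (sampled_c, m1, m2) in zip(range(iters), loader):
        loss, k1, k2 = training_step(target_net, online_net, sampled_c, m1, m2, q1, q2)
        opt.zero_grad(); loss.backward(); opt.step()   # stochastic gradient descent
        ema_update(online_net, target_net)             # exponential moving average
        # enqueue fresh keys at the tail, dropping the oldest at the head (FIFO)
        q1 = torch.cat([q1[k1.size(0):], k1])
        q2 = torch.cat([q2[k2.size(0):], k2])
    return online_net   # deployed as the multi-modal image recognition model
```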
Optionally, updating the first parameter and the second parameter based on the total loss value includes: updating the first parameter of the target network using stochastic gradient descent based on the total loss value; and updating the second parameter of the online network using the exponential moving average algorithm under the current total loss value.
In this optional implementation, the total loss value is computed once in each iterative training round of the target network, the first parameter of the target network is updated with stochastic gradient descent, and the second parameter of the online network is updated with the exponential moving average algorithm, which improves how promptly the online network's parameters are updated as the target network trains.
For "computing the loss value of each modality based on the target feature and the feature sequences" in the above optional implementation, the loss value is obtained from the classification behavior of the target network and may be computed with a classification loss function; for a binary classification problem, the classification loss function may use negative log-likelihood loss, cross-entropy loss, or exponential loss. Specifically, in some optional implementations of this embodiment, computing the loss value of each modality based on the target feature and the feature sequences includes: computing the loss value from the classification loss function, the target feature, and the second online features in the feature sequence, where the second online features include at least the first online feature.
In this embodiment, the feature sequence includes second online features, and their content may differ across iterative training rounds of the target network. For example, in the first iterative training round of the target network, the second online features include only the first online feature; after the target network has undergone multiple iterative training rounds, the second online features include the first online feature and at least one historical online feature. Relative to the target feature, the second online features in the feature sequence divide into positive and negative samples: when a second online feature in the feature sequence and the target feature belong to the same target, that second online feature is a positive sample of the target feature; when they do not belong to the same target, it is a negative sample. The classification loss of these positive and negative samples can therefore be computed with the classification loss function, and this classification loss is the loss value of each modality.
In this optional implementation, the classification loss function may be a cross-entropy loss function, and a multi-class output layer (the softmax shown in FIG. 2) may be added during training of the online network and the target network. The multi-class output layer may be implemented with a softmax() function, so that the features of different targets can be classified by the multi-class output layer into probabilities (p^+, p_1^-, p_2^-), and the probability of the positive sample can be maximized through the cross-entropy loss function, i.e., the negative-sample probabilities are implicitly minimized.
Specifically, the cross-entropy loss function is formulated as follows:
L_m = −(1/n) · Σ_{i=1}^{n} [ y_i · log(ŷ_i) + (1 − y_i) · log(1 − ŷ_i) ]    (1)
where in formula (1), L_m represents the loss value of each modality and m indexes the modality, n is the number of second online features compared against the target feature, y_i characterizes whether the i-th second online feature in the feature sequence is a positive sample relative to the target feature, and ŷ_i is the probability, output by the classification output layer, that the target feature and the i-th second online feature form a positive pair. For example, if the target feature and the i-th second online feature in the feature sequence belong to the same target, y_i is 1; if they do not belong to the same target, y_i is 0.
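As a worked instance of formula (1) under assumed values (not taken from the disclosure): with n = 3 second online features, one positive with ŷ_1 = 0.9 and two negatives with ŷ_2 = ŷ_3 = 0.1,

```latex
\[
L_m = -\tfrac{1}{3}\left[\log 0.9 + \log(1 - 0.1) + \log(1 - 0.1)\right]
    = -\tfrac{1}{3}\left(-0.105 - 0.105 - 0.105\right) \approx 0.105
\]
```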
In the method for computing the loss value of each modality provided by this optional implementation, the loss value of each modality is computed through the classification loss function, providing a reliable update basis for the parameter updates of the target model.
In some optional implementations of this embodiment, the second online features may further include the historical online features in the feature queue, i.e., the online features that the online network input into the feature queue at historical moments.
In the method for computing the loss value of each modality provided by this optional implementation, setting the second online features to include the historical online features ensures that the target network has sufficient positive and negative samples during training, improving the reliability of target network training.
The feature queue in step 104 may take various storage management forms, such as a last-in first-out queue, a first-in first-out queue, or a priority queue (the element with the highest priority leaves the queue first; elements of equal priority leave first-in first-out). In some optional implementations of this embodiment, the feature queue is a first-in first-out queue, and step 104 may specifically include: updating each first online feature to the tail of its corresponding first-in first-out queue; and, in response to determining that the first-in first-out queue is full, popping the historical online feature at the head of the first-in first-out queue.
As shown in FIG. 2, the first online feature k1 is updated to the tail of the first feature queue D1, behind all its historical online features (w_11, w_12); if the first feature queue D1 is full, the historical online feature w_13 at the head of the first feature queue D1 is popped. Similarly, the first online feature k2 is updated to the tail of the second feature queue D2, behind all its historical online features (w_21, w_22); if the second feature queue D2 is full, the historical online feature w_23 at the head of the second feature queue D2 is popped.
In the method for obtaining the feature sequence of each modality provided by this optional implementation, after the first online feature is obtained, it is updated to the tail of the first-in first-out queue, and when the first-in first-out queue is full, the historical online feature at the head of the queue is popped. The online features in the feature queue can thus be cleaned up effectively, providing a reliable basis for computing the loss value.
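A minimal Python sketch of such a first-in first-out feature queue follows; the capacity is an assumed hyper-parameter:

```python
from collections import deque
import torch

class FeatureQueue:
    """FIFO queue of step 104: new online features enter at the tail, and
    when the queue is full the oldest historical feature at the head is popped."""
    def __init__(self, capacity=4096):
        self.buf = deque(maxlen=capacity)    # deque drops the head when full

    def push(self, k):                       # k: (B, d) batch of online features
        for row in k.detach().cpu():
            self.buf.append(row)

    def as_tensor(self):                     # stored history as one (K, d) matrix
        return torch.stack(list(self.buf))
```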
When the storage space of each modality's feature queue is large enough, a subset of the online features in the feature queue may be selected to generate the feature sequence. Optionally, inputting the first online features into their corresponding feature queues, respectively, to obtain one-to-one corresponding feature sequences includes: storing the online features in the feature queue in chronological order; and selecting the most recent predetermined number of second online features from the feature queue to obtain the feature sequence of each modality (the larger the predetermined number, chosen according to the model training precision requirement, the higher the model training precision).
Optionally, the feature queue is a first-in first-out queue, and inputting the first online features into their corresponding feature queues, respectively, to obtain one-to-one corresponding feature sequences includes: inputting each first online feature into its feature queue, and selecting the first online feature together with a predetermined number of historical online features adjacent to it in queue order to obtain the feature sequence.
In this embodiment, the multi-modal image recognition model may be used for several different recognition tasks, for example pedestrian recognition and face recognition; the samples selected differ accordingly, so the recognition tasks to which the trained multi-modal image recognition model applies also differ.
In some optional implementations of this embodiment, the at least two modalities include a color modality and a near-infrared modality, and selecting the image samples of at least two modalities from the pre-constructed multi-target multi-modal image sample set includes: determining the multi-modal image samples of the same target from the multi-target multi-modal image sample set, and selecting a color-modality image and a near-infrared image from these multi-modal image samples.
In this optional implementation, when the target is a person, the targets in the color-modality image and the near-infrared image both carry the same person identity information. When the target is an object, the targets in the color-modality image and the near-infrared image carry the same object identifier.
In the selection of image samples of at least two modalities provided by this optional implementation, image samples of different modalities are selected according to the identity information of the same person, so the trained multi-modal image recognition model can recognize the person features of the same person, providing a reliable recognition means for the task of recognizing the same pedestrian in multi-modal images.
Optionally, when the multi-modal image recognition model is used for face recognition, face images of at least two modalities are selected from the pre-constructed multi-target multi-modal image sample set, and the selected face images belong to the same person.
FIG. 3 illustrates a flow 300 of one embodiment of a multi-modal image recognition method of the present disclosure, which includes the following steps:
Step 301, acquiring an image having at least two modalities.
In this embodiment, the image having at least two modalities is an image to be recognized that comprises images of multiple modalities; the number of modalities is not limited. The executing body of the multi-modal image recognition method may acquire the image having at least two modalities in various ways. For example, the executing body may acquire images having at least two modalities stored in a database server through a wired or wireless connection. As another example, the executing body may receive images having at least two modalities collected in real time by a terminal or other device.
In this embodiment, the image having at least two modalities may comprise visible-light images in several different forms, such as a color image or a grayscale image, and may further include a near-infrared image. The format of the image having at least two modalities is likewise not limited in this disclosure.
Step 302, inputting the image into the multi-modal image recognition model obtained by the multi-modal image recognition model training method, to obtain the features of the image.
In this embodiment, the executing body on which the multi-modal image recognition method runs may input the image having at least two modalities acquired in step 301 into the multi-modal image recognition model, thereby obtaining the features, output by the multi-modal image recognition model, of the target in the image having at least two modalities.
In this embodiment, the features of the image reflect the characteristics of each of the images of the at least two modalities, and may include: the modality to which the image belongs, information about the target in the image, the size of the image, and so on.
In this embodiment, the multi-modal image recognition model may be generated using the method described above in connection with the embodiment of FIG. 1. For the specific generation process, refer to the description of the embodiment of FIG. 1, which is not repeated here.
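A minimal sketch of step 302 at inference time; the batch layout and the L2 normalization of the output features are assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_features(model, images):
    # images: (N, 3, H, W) batch; modalities are not distinguished at deployment,
    # so color and near-infrared inputs pass through the same trained model.
    model.eval()
    return F.normalize(model(images), dim=1)
```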
Step 303, obtaining a recognition result for the target in the image based on the features of the image.
In this embodiment, the recognition result may include: the features of the target in each modality image of the at least two modalities, the category of the target, the position of the target, and other information; that is, the recognition result of the target in each modality image.
In this embodiment, the recognition result of the target output by the multi-modal image recognition model may further include: at least one region of interest in the image, the confidence of the target in each region of interest, the target category to which the target belongs, and so on.
It should be noted that the multi-modal image recognition method of this embodiment may be used to test the multi-modal image recognition model generated in the foregoing embodiments, and the multi-modal image recognition model may be continuously optimized according to the recognition results it outputs for the targets in images, where a recognition result may include the region information of the image in which the target is located and the target category, and the target may be a person, an object, scenery, etc. in the image. The method may also be a practical application of the multi-modal image recognition model generated in the foregoing embodiments. Because the multi-modal image recognition model generated in those embodiments draws on the distinction between the online network and the target network when recognizing targets in images, targets in images of different modalities can be recognized accurately, the target category of each target can be determined effectively, and the accuracy of target recognition in images of different modalities is improved.
According to the multi-modal image recognition method provided by this embodiment, an image having at least two modalities is acquired and input into the pre-trained multi-modal image recognition model, so that the targets in the image can be recognized effectively, improving target recognition efficiency.
In order to recognize the identity information of a target in images of at least two modalities, in some optional implementations of this embodiment, step 303 may specifically include: computing, one by one, the similarity between the features of the image and at least two base-library features in a database; and selecting the target corresponding to the base-library feature with the highest similarity, taking the identity information of that target as the recognition result of the target in the image.
In this optional implementation, the database pre-stores base-library features labeled with a number of different identity information entries; the base-library features are obtained by feature extraction on targets in images of different modalities. The features of the image are compared with the base-library features through a similarity comparison algorithm (various algorithms may be used, such as Euclidean distance or cosine distance) to obtain similarity values. The similarity values determine whether the features of the image are similar to the labeled base-library features; the base-library feature with the largest similarity value is then selected as the base-library feature with the highest similarity, the identity information corresponding to that base-library feature is determined, and the target in the image is assigned the identity information of the base-library feature with the highest similarity.
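A hedged sketch of this identity lookup using cosine similarity follows; the gallery layout (one base-library feature row per enrolled identity) is an assumption:

```python
import torch
import torch.nn.functional as F

def identify(image_feature, gallery_features, gallery_ids):
    # Compare the image feature against every base-library feature by cosine
    # similarity and return the identity with the highest score.
    q = F.normalize(image_feature, dim=0)     # (d,) query feature
    g = F.normalize(gallery_features, dim=1)  # (N, d) base-library features
    sims = g @ q                              # cosine similarity to each row
    best = int(sims.argmax())
    return gallery_ids[best], float(sims[best])
```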
In the alternative implementation mode, the characteristics of the image are used for determining the characteristics of the base with highest similarity, further determining the identity information of the target in the image, and providing a reliable implementation mode for the identity information labeling of the target.
In some alternative implementations of the present embodiment, the image includes: color mode image and near infrared image, the recognition result includes: features of the object in the color mode image and features of the object in the near infrared image.
In this optional implementation manner, the features of the target are used to reflect the specific features presented by the target in the image, where the features of the target may include the mode type of the image in which the target is located, the target type, the region in which the target is located, and so on. In this alternative implementation, the modality types are two types, color mode and near infrared mode, respectively.
According to the multi-mode image recognition method provided by the alternative implementation mode, when the multi-mode image comprises the color mode image and the near infrared image, the characteristics of the target in the color mode image and the characteristics of the target in the near infrared image can be respectively determined, so that the accuracy of target recognition in the color mode image and the near infrared image is improved.
In some optional implementations of this embodiment, the target is a human face, and the recognition result includes the facial features of different persons in the at least one modality image.

In this implementation, the targets are the faces of different people, and the facial features of different people in images of different modalities can be identified through the multi-modal image recognition model.

According to this optional implementation, when the target is a human face, the facial features of different people in the multi-modal images can be obtained through the multi-modal recognition model, improving the accuracy of target recognition in each modality image in the face recognition field.
With further reference to fig. 4, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of a multi-modal image recognition model training apparatus. The apparatus embodiment corresponds to the method embodiment shown in fig. 1, and the apparatus is particularly applicable to a variety of electronic devices.
As shown in fig. 4, the multi-modal image recognition model training apparatus 400 provided in this embodiment includes: a sample selection unit 401, a target obtaining unit 402, an online obtaining unit 403, a sequence obtaining unit 404, and a training unit 405.
The sample selection unit 401 may be configured to select target image samples of at least two modes from a multi-target multi-mode image sample set constructed in advance, where the target image samples all have the same target. The target obtaining unit 402 may be configured to input any one of the modal image samples in the target image sample into the target network to perform feature extraction, so as to obtain the target feature of the target image sample. The online obtaining unit 403 may be configured to input the target image sample into an online network for feature extraction, so as to obtain a first online feature of the target image sample, where the online network and the target network have the same network structure. The sequence obtaining unit 404 may be configured to input the first online feature into a corresponding feature queue to obtain a feature sequence corresponding to the first online feature. The training unit 405 may be configured to train the multimodal image recognition model of the corresponding online network based on the target feature, the feature sequence.
In this embodiment, in the multi-modal image recognition model training apparatus 400: the specific processing of the sample selection unit 401, the target obtaining unit 402, the online obtaining unit 403, the sequence obtaining unit 404, and the training unit 405 and the technical effects thereof may refer to the relevant descriptions of step 101, step 102, step 103, step 104, and step 105 in the corresponding embodiment of fig. 1, and are not described herein.
In some optional implementations of this embodiment, the training unit 405 includes an updating subunit (not shown in the figure) and an obtaining subunit (not shown in the figure). The updating subunit may be configured to update the first parameter of the target network and the second parameter of the online network based on the target feature and the feature sequence. The obtaining subunit may be configured to obtain the multi-modal image recognition model corresponding to the online network in response to determining that the target network meets the training completion condition, the target network being trained based on the first parameter.
In some optional implementations of this embodiment, the updating subunit includes: a modality calculation module (not shown in the figure), a loss calculation module (not shown in the figure), and an updating module (not shown in the figure). The modality calculation module may be configured to calculate loss values of the respective modalities based on the target features and the feature sequences. The loss calculation module may be configured to calculate a total loss value based on those loss values. The updating module may be configured to update the first parameter and the second parameter based on the total loss value; a small aggregation sketch follows.
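The disclosure does not state how the per-modality loss values are combined into the total loss value; as a hedged sketch, an unweighted sum is one simple assumption (a weighted sum would be a straightforward variant).

    import torch

    def compute_total_loss(per_modality_losses: list) -> torch.Tensor:
        # Unweighted sum of the per-modality loss values; the combination
        # rule itself is an assumption, not specified in the text above.
        return torch.stack(per_modality_losses).sum()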
In some optional implementations of this embodiment, the updating module may be configured to: update the first parameter by adopting a stochastic gradient descent method based on the total loss value; and, with the first parameter obtained by stochastic gradient descent, update the second parameter using an exponential moving average algorithm, as sketched below.
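A minimal PyTorch sketch of the exponential moving average step, following the direction stated above (the target network's first parameters are the ones trained by stochastic gradient descent, and the online network's second parameters track them); the function name ema_update and the momentum value 0.999 are assumptions.

    import torch

    @torch.no_grad()
    def ema_update(online_net: torch.nn.Module,
                   target_net: torch.nn.Module,
                   momentum: float = 0.999) -> None:
        # second parameter <- m * second parameter + (1 - m) * first parameter
        for p_online, p_target in zip(online_net.parameters(),
                                      target_net.parameters()):
            p_online.mul_(momentum).add_(p_target, alpha=1.0 - momentum)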
In some optional implementations of this embodiment, the modality calculation module is further configured to calculate the loss value according to a classification loss function, the target feature, and a second online feature in the feature sequence, where the second online feature at least comprises the first online feature.

In some optional implementations of this embodiment, the second online feature further includes the historical online features in the feature queue, which are the online features input into the feature queue by the online network at historical moments; one possible form of this loss is sketched below.
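The disclosure names only a classification loss over the target feature and the second online features; one hedged reading, borrowed from contrastive learning, is an InfoNCE-style cross-entropy in which the target feature acts as the query, the current first online feature (which the second online features at least comprise) is the positive key, and the historical online features are negatives. The pairing, the temperature value, and the function name are all assumptions.

    import torch
    import torch.nn.functional as F

    def modality_loss(target_feature: torch.Tensor,    # (D,) from the target network
                      feature_sequence: torch.Tensor,  # (K, D); the tail entry is the
                                                       # current first online feature
                      temperature: float = 0.07) -> torch.Tensor:
        q = F.normalize(target_feature, dim=0)
        keys = F.normalize(feature_sequence, dim=1)
        logits = (keys @ q) / temperature              # (K,) similarity logits
        # The newest (tail) feature belongs to the same target, so its index
        # serves as the "correct class" for the classification loss.
        label = torch.tensor([logits.numel() - 1])
        return F.cross_entropy(logits.unsqueeze(0), label)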
In some optional implementations of this embodiment, the feature queue is a first-in first-out queue, and the sequence obtaining unit 404 is further configured to: update each first online feature to the tail of its corresponding first-in first-out queue; and, in response to determining that the first-in first-out queue is full, pop the historical online feature at the head of the queue. A minimal queue sketch follows.
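A minimal sketch of such a per-modality first-in first-out queue, using a bounded deque so that the head is dropped automatically when the queue is full; the class name and capacity are illustrative.

    from collections import deque

    class FeatureQueue:
        def __init__(self, capacity: int = 4096):
            # maxlen makes the deque drop its head (the oldest historical
            # online feature) automatically once the queue is full.
            self._queue = deque(maxlen=capacity)

        def enqueue(self, first_online_feature) -> None:
            self._queue.append(first_online_feature)  # insert at the tail

        def as_sequence(self) -> list:
            return list(self._queue)  # the feature sequence used by the loss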
In some optional implementations of this embodiment, the at least two modes include: color mode and near infrared mode.
In the multi-modal image recognition model training apparatus provided by this embodiment of the present disclosure, first, the sample selection unit 401 selects target image samples of at least two modalities from a pre-constructed multi-target multi-modal image sample set, where the target image samples all have the same target; second, the target obtaining unit 402 inputs any one modality image sample among the target image samples into the target network for feature extraction to obtain the target feature; third, the online obtaining unit 403 inputs the target image samples into the online network for feature extraction to obtain the first online features, where the online network and the target network have the same network structure; then, the sequence obtaining unit 404 inputs the first online features into their corresponding feature queues to obtain the one-to-one corresponding feature sequences; finally, the training unit 405 trains the multi-modal image recognition model corresponding to the online network based on the target feature and the feature sequences. Because the first online features are fully drawn upon during training, the online network can determine the features of target image samples of every modality, which improves the accuracy with which the multi-modal image recognition model recognizes the features of targets in images of each modality. Moreover, the model is trained with an online network and a target network of identical structure, and contrastive learning together with the feature queues is applied to target feature recognition in multi-modal images, so that the model learns the image information of the various modalities uniformly during training, improving the accuracy of target recognition. One full training iteration is sketched below.
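Tying the pieces together, the following hedged sketch shows one training iteration built from the compute_total_loss, ema_update, modality_loss, and FeatureQueue sketches above. It assumes each extracted feature is a 1-D tensor, that the key 'color' names the arbitrarily chosen modality fed to the target network, and that the SGD optimizer holds the target network's (first) parameters, per the update scheme worded earlier in this disclosure; all names are illustrative.

    import torch

    def training_step(online_net, target_net, sgd_optimizer,
                      queues: dict, samples_by_modality: dict) -> float:
        # Target feature from one modality's sample; gradients are kept here
        # because the target network's parameters are the ones SGD updates.
        target_feature = target_net(samples_by_modality["color"])
        per_modality_losses = []
        for modality, sample in samples_by_modality.items():
            # First online feature, treated as a constant (no gradient).
            with torch.no_grad():
                online_feature = online_net(sample)
            # Enqueue at the tail of this modality's feature queue.
            queues[modality].enqueue(online_feature)
            sequence = torch.stack(queues[modality].as_sequence())
            # Per-modality loss from the target feature and the feature sequence.
            per_modality_losses.append(modality_loss(target_feature, sequence))
        total = compute_total_loss(per_modality_losses)
        sgd_optimizer.zero_grad()
        total.backward()
        sgd_optimizer.step()                # SGD: target network (first parameter)
        ema_update(online_net, target_net)  # EMA: online network (second parameter)
        return float(total)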
With continued reference to fig. 5, as an implementation of the method of fig. 3 described above, the present disclosure provides an embodiment of a multi-modal image recognition apparatus. The apparatus embodiment corresponds to the method embodiment shown in fig. 3, and the apparatus can be applied to various electronic devices.
As shown in fig. 5, the multi-modal image recognition apparatus 500 of this embodiment may include: an image acquisition unit 501 configured to acquire an image having at least one modality; a feature obtaining unit 502 configured to input the image into a multi-modal image recognition model generated by the apparatus described in the embodiment of fig. 4 above, to obtain features of the image; and an image recognition unit 503 configured to obtain a recognition result of the target in the image based on the features of the image.
It will be appreciated that the units recorded in the apparatus 500 correspond to the respective steps of the method described with reference to fig. 3. Thus, the operations, features, and resulting benefits described above for the method are equally applicable to the apparatus 500 and the units contained therein, and are not repeated here.
In some optional implementations of this embodiment, the image recognition unit is further configured to: calculate, one by one, the similarity between the features of the image and at least two base library features in the database; and select the target corresponding to the base library feature with the highest similarity, taking that target's identity information as the recognition result for the target in the image.
In some optional implementations of this embodiment, the image includes a color mode image and a near infrared image, and the recognition result includes the features of the target in the color mode image and the features of the target in the near infrared image.
In some optional implementations of this embodiment, the target is a human face, and the recognition result includes the facial features of different persons in the at least one modality image.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of users' personal information comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 6 illustrates a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. Various programs and data required for the operation of the device 600 may also be stored in the RAM 603. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods and processes described above, such as the multi-modal image recognition model training method. For example, in some embodiments, the multi-modal image recognition model training method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the multi-modal image recognition model training method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable multi-modal image recognition model training apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (27)

1. A multi-modal image recognition model training method, the method comprising:
selecting target image samples of at least two modes from a multi-target multi-mode image sample set constructed in advance, wherein the target image samples all have the same target;
inputting any one mode image sample in the target image sample into a target network for feature extraction to obtain target features;
inputting the target image sample into an online network for feature extraction to obtain first online features of each mode image sample in the target image sample, wherein the online network and the target network have the same network structure;
respectively inputting first online features of each mode image sample in the target image sample into a feature queue corresponding to each mode image sample to obtain a one-to-one corresponding feature sequence, wherein each feature queue in the feature queues corresponds to one mode image sample, and the number of the feature queues is the same as the number of modes in the multi-target multi-mode image sample set;
based on the target features and the feature sequences, training a multi-modal image recognition model corresponding to the online network.
2. The method of claim 1, wherein the training a multimodal image recognition model corresponding to the online network based on the target feature, the sequence of features, comprises:
updating a first parameter of the target network and a second parameter of the online network based on the target feature and the feature sequence;
in response to determining that the target network meets a training completion condition, a multimodal image recognition model corresponding to the online network is obtained, the target network being trained based on the first parameter.
3. The method of claim 2, wherein the updating the first parameter of the target network and the second parameter of the online network based on the target feature, the sequence of features, comprises:
calculating loss values of all modes based on the target features and the feature sequences;
calculating a total loss value based on the loss value;
updating the first parameter and the second parameter based on the total loss value.
4. A method according to claim 3, wherein said updating said first parameter and said second parameter based on said total loss value comprises:
updating the first parameter by adopting a stochastic gradient descent method based on the total loss value;
with the first parameter obtained by the stochastic gradient descent method, updating the second parameter by adopting an exponential moving average algorithm.
5. A method according to claim 3, wherein said calculating loss values for each modality based on said target feature and said sequence of features comprises:
and calculating the loss value according to a classification loss function, the target feature and a second online feature in the feature sequence, wherein the second online feature at least comprises the first online feature.
6. The method of claim 5, wherein the second online feature further comprises: the historical online features in the feature queue, which are the online features input into the feature queue by the online network at historical moments.
7. The method according to one of claims 1-6, wherein the feature queue is a first-in first-out queue;
the step of inputting the first online features into corresponding feature queues respectively to obtain one-to-one corresponding feature sequences, comprising the following steps:
updating the first online features to the tail of a first-in first-out queue corresponding to the first online features respectively;
in response to determining that the first-in first-out queue is full, popping the historical online feature at the head of the first-in first-out queue.
8. The method of claim 1, wherein the at least two modalities comprise: color mode and near infrared mode.
9. A multi-modal image recognition method, the method comprising:
acquiring an image having at least two modalities;
inputting the image into a multi-mode image recognition model obtained by adopting the multi-mode image recognition model training method according to any one of claims 1-8 to obtain the characteristics of the image;
and obtaining a recognition result of the target in the image based on the characteristics of the image.
10. The method of claim 9, wherein the obtaining the identification result of the object in the image based on the feature of the image comprises:
calculating the similarity between the features of the image and at least two base library features in the database one by one; and selecting the target corresponding to the base library feature with the highest similarity, and taking the identity information of the target as the recognition result of the target in the image.
11. The method of claim 9, wherein the image comprises: a color mode image and a near infrared image, the recognition result including: the characteristics of the object in the color mode image and the characteristics of the object in the near infrared image.
12. The method of claim 9, wherein the target is a human face, the recognition result comprising: and the face features of different people in the image.
13. A multi-modal image recognition model training apparatus, the apparatus comprising:
a sample selection unit configured to select target image samples of at least two modalities from a multi-target multi-modality image sample set constructed in advance, the target image samples having the same target;
the target obtaining unit is configured to input any one of the modal image samples in the target image sample into a target network for feature extraction to obtain target features of the target image sample;
the online obtaining unit is configured to input the target image sample into an online network for feature extraction to obtain first online features of each mode image sample in the target image sample, and the online network and the target network have the same network structure;
the sequence obtaining unit is configured to input first online features of each mode image sample in the target image sample into a feature queue corresponding to each mode image sample respectively to obtain a one-to-one corresponding feature sequence, each feature queue in the feature queues corresponds to one mode image sample, and the number of the feature queues is the same as the number of modes in the multi-target multi-mode image sample set;
and a training unit configured to train a multi-modal image recognition model corresponding to the online network based on the target features and the feature sequence.
14. The apparatus of claim 13, wherein the training unit comprises:
an updating subunit configured to update a first parameter of the target network and a second parameter of the online network based on the target feature, the feature sequence;
and a deriving subunit configured to derive a multimodal image recognition model corresponding to the online network in response to determining that the target network meets a training completion condition, the target network being trained based on the first parameter.
15. The apparatus of claim 14, wherein the update subunit comprises:
a modality calculation module configured to calculate loss values for respective modalities based on the target feature and the feature sequence;
a loss calculation module configured to calculate a total loss value based on the loss value;
an updating module configured to update the first parameter and the second parameter based on the total loss value.
16. The apparatus of claim 15, wherein the update module is further configured to: update the first parameter by adopting a stochastic gradient descent method based on the total loss value; and, with the first parameter obtained by the stochastic gradient descent method, update the second parameter by adopting an exponential moving average algorithm.
17. The apparatus of claim 15, wherein the modality calculation module is further configured to: and calculating the loss value according to a classification loss function, the target feature and a second online feature in the feature sequence, wherein the second online feature at least comprises the first online feature.
18. The apparatus of claim 17, wherein the second online feature further comprises: the historical online features in the feature queue, which are the online features input into the feature queue by the online network at historical moments.
19. The apparatus of one of claims 13-18, wherein the feature queue is a first-in first-out queue; the sequence obtaining unit is further configured to: update the first online features to the tails of the first-in first-out queues corresponding to the first online features respectively; and, in response to determining that a first-in first-out queue is full, pop the historical online feature at the head of that queue.
20. The apparatus of claim 13, wherein the at least two modalities comprise: color mode and near infrared mode.
21. A multi-modal image recognition apparatus, the apparatus comprising:
an image acquisition unit configured to acquire an image having at least two modalities;
a feature obtaining unit configured to input the image into a multi-modal image recognition model obtained by the multi-modal image recognition model training apparatus according to any one of claims 13-20, to obtain features of the image;
and the image recognition unit is configured to obtain a recognition result of the target in the image based on the characteristics of the image.
22. The apparatus of claim 21, wherein the image recognition unit is further configured to: calculate the similarity between the features of the image and at least two base library features in the database one by one; and select the target corresponding to the base library feature with the highest similarity, taking the identity information of the target as the recognition result of the target in the image.
23. The apparatus of claim 21, wherein the image comprises: a color mode image and a near infrared image, the recognition result including: the characteristics of the object in the color mode image and the characteristics of the object in the near infrared image.
24. The apparatus of claim 21, wherein the target is a human face, the recognition result comprising: and the face features of different people in the image.
25. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-12.
26. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-12.
27. A computer program product comprising a computer program which, when executed by a processor, implements the method of any of claims 1-12.
CN202210768182.1A 2022-06-30 2022-06-30 Multi-mode image recognition method and device, model training method and device Active CN115147679B (en)


