CN114663737A - Object identification method and device, electronic equipment and computer readable storage medium
- Publication number
- CN114663737A (application number CN202210546400.7A)
- Authority
- CN
- China
- Prior art keywords: feature, attention, image, processed, branch
- Prior art date
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/253—Fusion techniques of extracted features
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/22—Matching criteria, e.g. proximity measures
- G06N3/045—Combinations of networks
Abstract
The invention discloses an object identification method, an object identification device, an electronic device and a computer-readable storage medium, and relates to the field of pattern recognition. When a target image and a candidate image that contain an object and whose modalities are indeterminate are obtained, the method extracts interactive features of the images using self-attention feature extraction and cross-attention feature extraction to obtain an interactive feature matrix corresponding to each image, which effectively improves the pertinence of the attention mechanism in dealing with the non-deterministic cross-modal object re-identification problem. In addition, the two interactive feature matrices can be merged into a probability space, and a predicted value representing the probability that the target image and the candidate image belong to the same object is generated in that space. In other words, a probability prediction method is adopted to handle the non-deterministic cross-modal object re-identification problem in a probability space, so the accuracy of non-deterministic cross-modal object re-identification can be effectively improved.
Description
Technical Field
The present invention relates to the field of pattern recognition, and in particular, to an object recognition method, an object recognition apparatus, an electronic device, and a computer-readable storage medium.
Background
The Cross-Modal object re-identification problem can be described simply as: determining whether the objects contained in two images of different modalities are the same object, for example, whether the objects contained in a visible light image and an infrared light image are the same object. Existing cross-modal object re-identification methods are based on a deterministic cross-modal assumption, namely that the object to be identified appears in multiple modalities. In practical applications, however, the appearance of the object in the images of each modality is uncertain: the object may appear in a visible light image, in an infrared light image, or in both, so it cannot be judged in advance whether the object spans multiple modalities. Existing cross-modal object re-identification methods therefore cannot fundamentally solve the non-deterministic cross-modal object re-identification problem. From another perspective, existing cross-modal object re-identification methods generally rely on feature-space distance, for example extracting features of a visible light image and features of an infrared light image and calculating the distance between the two features, so as to determine from that distance whether the two images belong to the same object. However, since the distance between same-modality images is naturally shorter than the distance between different-modality images, non-deterministic cross-modal object re-identification is difficult to realize in a re-identification architecture based on feature-space distance. For example, for a pedestrian A and a pedestrian B who both wear red clothes, the feature-space distance between their visible light images may be shorter than the feature-space distance between pedestrian A's own cross-modal images, which easily leads to re-identification errors. Clearly, existing cross-modal object re-identification methods cannot fundamentally solve the non-deterministic cross-modal object re-identification problem.
Therefore, how to effectively improve the accuracy of non-deterministic cross-modal object re-identification is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide an object identification method, an object identification device, an electronic device and a computer-readable storage medium that handle the non-deterministic cross-modal object re-identification problem in a probability space through self-attention feature extraction, cross-attention feature extraction and probability prediction, so that the accuracy of non-deterministic cross-modal object re-identification can be effectively improved.
In order to solve the above technical problem, the present invention provides an object identification method, including:
performing cropping, feature extraction and encoding processing on an obtained object image to be processed to obtain an encoding matrix corresponding to the object image to be processed; the object image to be processed comprises a target image and a candidate image, and the encoding matrix contains various features of the object image to be processed;
inputting the encoding matrix into two feature interaction branches in parallel, so that the feature interaction branches perform self-attention feature extraction and cross-attention feature extraction on the encoding matrix to obtain an interactive feature matrix corresponding to the object image to be processed;
inputting the interactive feature matrix into two feature extraction branches in parallel, so that the feature extraction branches perform self-attention feature extraction on the interactive feature matrix to obtain intermediate features corresponding to the object image to be processed;
inputting an initial prediction vector generated by using the interactive feature matrix, together with the intermediate features, into a prediction branch, so that the prediction branch performs self-attention feature extraction on the initial prediction vector and performs cross-attention feature extraction using the obtained intermediate prediction feature and the intermediate features to obtain a prediction vector;
and judging whether the target image and the candidate image belong to the same object by using a predicted value obtained by dimensionality reduction of the prediction vector.
Optionally, the feature interaction branch performing self-attention feature extraction and cross-attention feature extraction on the encoding matrix to obtain the interactive feature matrix corresponding to the object image to be processed includes:
the feature interaction branch performing self-attention feature extraction on the encoding matrix, and adding the obtained self-attention feature and the encoding matrix to obtain a local-end intermediate feature;
sending the local-end intermediate feature to the other feature interaction branch, and receiving an opposite-end intermediate feature sent by the other feature interaction branch;
and performing cross-attention feature extraction on the local-end intermediate feature and the opposite-end intermediate feature, and adding the obtained cross-attention feature and the local-end intermediate feature to obtain the interactive feature matrix.
Optionally, the performing cross-attention feature extraction on the local-end intermediate feature and the opposite-end intermediate feature includes:
the feature interaction branch performing cross-attention feature extraction on the local-end intermediate feature and the opposite-end intermediate feature as follows:

$F = \mathrm{softmax}\!\left(\frac{(X_l W_Q)(X_o W_K)^{\top}}{\sqrt{d_o}}\right) X_o W_V$

where $F$ denotes the initial cross-attention feature, $X_l$ denotes the local-end intermediate feature, $X_o$ denotes the opposite-end intermediate feature, $\mathrm{softmax}$ denotes the normalization function, $W_Q$, $W_K$ and $W_V$ denote pre-trained weight matrices, $\top$ denotes the transpose operation, and $d_o$ denotes the dimension of the opposite-end intermediate feature;
and performing random deletion processing and normalization processing on the initial cross-attention feature to obtain the cross-attention feature.
Optionally, the feature interaction branch has a multilayer structure, and after the interactive feature matrix is obtained, the method further includes:
judging whether a next layer of feature interaction branches exists;
if yes, inputting the interactive feature matrix into the next layer of feature interaction branches for processing;
and if not, inputting the interactive feature matrix into the two feature extraction branches in parallel.
Optionally, the performing cropping, feature extraction and encoding processing on the obtained object image to be processed to obtain the encoding matrix corresponding to the object image to be processed includes:
cropping the object image to be processed to obtain image blocks corresponding to the object image to be processed;
generating an image set from the object image to be processed and the image blocks, and performing the feature extraction on the image set using a neural network corresponding to the modality category of the object image to be processed to obtain a feature matrix corresponding to the object image to be processed;
and performing the encoding processing on the feature matrix using the modality category and the cropping feature information of each image block to obtain the encoding matrix corresponding to the object image to be processed.
Optionally, the cropping the object image to be processed to obtain the image blocks corresponding to the object image to be processed includes:
cropping the object image to be processed into a first preset number of rows in a horizontal cropping mode to obtain first image blocks;
cropping the object image to be processed into a second preset number of rows and a preset number of columns in a horizontal-and-vertical cropping mode to obtain second image blocks;
and setting the first image blocks and the second image blocks as the image blocks.
Optionally, the performing the encoding processing on the feature matrix using the modality category and the cropping feature information of each image block to obtain the encoding matrix corresponding to the object image to be processed includes:
acquiring the cropping mode corresponding to each image block, its relative position in the object image to be processed, and its corresponding feature vector in the feature matrix;
generating a position code using the relative position and the feature codes in the feature vector, and generating a cropping code and a modality code using the cropping mode and the modality category respectively;
and generating the encoding matrix using the feature codes, the position code, the cropping code and the modality code.
Optionally, the performing the feature extraction on the image set using the neural network corresponding to the modality category of the object image to be processed includes:
scaling each image block in the image set to a preset size, and inputting the scaled image set into the neural network for feature extraction.
Optionally, before the initial prediction vector generated by using the interactive feature matrix is input, together with the intermediate features, into the prediction branch, the method further includes:
calculating the cosine similarity between the interactive feature matrices, and generating the initial prediction vector using the cosine similarity.
Optionally, the prediction branch performing self-attention feature extraction on the initial prediction vector and performing cross-attention feature extraction using the obtained intermediate prediction feature and the intermediate features to obtain the prediction vector includes:
the prediction branch inputting the initial prediction vector into a self-attention module for self-attention feature extraction, and adding the obtained self-attention feature and the initial prediction vector to obtain an intermediate prediction feature;
receiving a first intermediate feature sent by the first feature extraction branch and a second intermediate feature sent by the second feature extraction branch;
performing cross-attention feature extraction on the intermediate prediction feature and the first intermediate feature to obtain a first cross-attention feature, and performing cross-attention feature extraction on the intermediate prediction feature and the second intermediate feature to obtain a second cross-attention feature;
summing the first cross-attention feature and the second cross-attention feature, and averaging the summation result to obtain a fusion feature;
and inputting the fusion feature into a normalization layer for processing to obtain the prediction vector.
Optionally, the feature extraction branches and the prediction branch have a multilayer structure, and after the intermediate prediction feature is obtained, the method further includes:
the prediction branch sending the intermediate prediction feature to the feature extraction branches;
correspondingly, after the intermediate features corresponding to the object image to be processed are obtained, the method further includes:
the feature extraction branch performing cross-attention feature extraction using its intermediate feature and the received intermediate prediction feature to obtain an interlayer feature;
sending the interlayer feature to the next layer of feature extraction branches;
correspondingly, after the prediction vector is obtained, the method further includes:
the prediction branch sending the prediction vector to the next layer of prediction branch.
The present invention also provides an object identification device, including:
a feature extraction module, configured to perform cropping, feature extraction and encoding processing on the obtained object image to be processed to obtain an encoding matrix corresponding to the object image to be processed, where the object image to be processed comprises a target image and a candidate image, and the encoding matrix contains various features of the object image to be processed;
a feature interaction module, configured to input the encoding matrix into two feature interaction branches in parallel, so that the feature interaction branches perform self-attention feature extraction and cross-attention feature extraction on the encoding matrix to obtain an interactive feature matrix corresponding to the object image to be processed;
a feature extraction branch module, configured to input the interactive feature matrix into two feature extraction branches in parallel, so that the feature extraction branches perform self-attention feature extraction on the interactive feature matrix to obtain intermediate features corresponding to the object image to be processed;
a prediction branch module, configured to input an initial prediction vector generated by using the interactive feature matrix, together with the intermediate features, into a prediction branch, so that the prediction branch performs self-attention feature extraction on the initial prediction vector and performs cross-attention feature extraction using the obtained intermediate prediction feature and the intermediate features to obtain a prediction vector;
and a judging module, configured to judge whether the target image and the candidate image belong to the same object by using a predicted value obtained by dimensionality reduction of the prediction vector.
The present invention also provides an electronic device comprising:
a memory for storing a computer program;
a processor for implementing the object identification method as described above when executing the computer program.
The present invention also provides a computer-readable storage medium, in which computer-executable instructions are stored, and when the computer-executable instructions are loaded and executed by a processor, the object identification method as described above is implemented.
The invention provides an object identification method, including: performing cropping, feature extraction and encoding processing on an obtained object image to be processed to obtain an encoding matrix corresponding to the object image to be processed, where the object image to be processed comprises a target image and a candidate image, and the encoding matrix contains various features of the object image to be processed; inputting the encoding matrix into two feature interaction branches in parallel, so that the feature interaction branches perform self-attention feature extraction and cross-attention feature extraction on the encoding matrix to obtain an interactive feature matrix corresponding to the object image to be processed; inputting the interactive feature matrix into two feature extraction branches in parallel, so that the feature extraction branches perform self-attention feature extraction on the interactive feature matrix to obtain intermediate features corresponding to the object image to be processed; inputting an initial prediction vector generated by using the interactive feature matrix, together with the intermediate features, into a prediction branch, so that the prediction branch performs self-attention feature extraction on the initial prediction vector and performs cross-attention feature extraction using the obtained intermediate prediction feature and the intermediate features to obtain a prediction vector; and judging whether the target image and the candidate image belong to the same object by using a predicted value obtained by dimensionality reduction of the prediction vector.
It can be seen that, when a target image and a candidate image containing an object are obtained, the invention performs cropping, feature extraction and encoding processing on these object images to be processed to obtain the corresponding encoding matrices, each of which contains various kinds of feature information of the corresponding image. The encoding matrices are then input into two identical feature interaction branches for self-attention feature extraction and cross-attention feature extraction, yielding the interactive feature matrix corresponding to each object image to be processed; each branch performs cross-attention feature extraction on the encoding matrix received at its local end using the self-attention features of both encoding matrices, which effectively improves the pertinence of the attention mechanism in dealing with the non-deterministic cross-modal object re-identification problem. Furthermore, the invention generates an initial prediction vector from the interactive feature matrices of the two object images to be processed, inputs the interactive feature matrices into two identical feature extraction branches in parallel, and inputs the initial prediction vector into the prediction branch of a dual-cross predictor, so that the feature extraction branches use a self-attention mechanism to extract the self-attention features of the interactive feature matrices, and the prediction branch fuses these self-attention features using a cross-attention mechanism and generates a prediction vector from the obtained fusion feature. Finally, the prediction vector is reduced in dimension to obtain a predicted value representing the probability that the target image and the candidate image belong to the same object. In other words, a probability prediction method is adopted to handle the non-deterministic cross-modal object re-identification problem in a probability space, so the accuracy of non-deterministic cross-modal object re-identification can be effectively improved. The invention also provides an object identification device, an electronic device and a computer-readable storage medium with the above beneficial effects.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only embodiments of the present invention, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a flowchart of an object identification method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a feature interactor according to an embodiment of the present invention;
FIG. 3 is a block diagram of a dual-cross predictor according to an embodiment of the present invention;
FIG. 4 is a block diagram of another dual-cross predictor according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of image cropping according to an embodiment of the present invention;
fig. 6 is a block diagram of an object identification device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. It is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Existing cross-modal object re-identification methods are based on a deterministic cross-modal assumption, namely that the object to be identified appears in multiple modalities, so they cannot fundamentally solve the non-deterministic cross-modal object re-identification problem. In view of this, the present invention provides an object identification method that handles the non-deterministic cross-modal object re-identification problem in a probability space through self-attention feature extraction, cross-attention feature extraction and probability prediction, so that the accuracy of non-deterministic cross-modal object re-identification can be effectively improved. Referring to fig. 1, fig. 1 is a flowchart of an object identification method according to an embodiment of the present invention. The method includes:
s101, performing cutting, feature extraction and coding processing on the obtained object image to be processed to obtain a coding matrix corresponding to the object image to be processed; the object image to be processed comprises a target image and a candidate image, and the coding matrix comprises various characteristics of the object image to be processed.
It is understood that both the target image and the candidate image should contain the object to be identified, and the final purpose of the embodiment of the present invention is to determine whether the object in the candidate image is the same as the object in the target image. The present invention does not limit the specific object to be identified, which may be, for example, a pedestrian, a vehicle, or the like. It should be noted that the target image and the candidate image in the embodiment of the present invention are in an indeterminate modality state; in brief, the acquisition manner of each image is uncertain. For example, the target image and the candidate image may both be acquired as visible light images, both as infrared light images, or one as a visible light image and the other as an infrared light image. Of course, acquisition is not limited to these two modalities; more modalities may be adopted and selected according to actual application requirements. For convenience of description, the target image and the candidate image are collectively referred to as the object image to be processed in the embodiments of the present invention.
Further, when the object images to be processed are obtained, the embodiment of the present invention performs cropping, feature extraction and encoding processing on the images to obtain the encoding matrix corresponding to each object image to be processed, where the encoding matrix contains various features of the object image to be processed, such as image features, modality features and cropping features. The processes are briefly introduced below:
the purpose of the cropping processing is to segment the object image to be processed into a plurality of image blocks so as to extract the local features of the object image to be processed by using each image block. The embodiment of the present invention does not limit a specific cutting manner as long as the object of the cutting process can be satisfied.
The feature extraction is conventional image feature extraction processing: the object image to be processed and its corresponding image blocks are input into a neural network model for image feature extraction, yielding a feature matrix corresponding to each image. The embodiment of the present invention does not limit the specific neural network model used in the feature extraction process; any existing convolutional neural network model (such as ResNeSt, ResNeXt or EfficientNet) can meet the application requirements and can be selected according to actual needs. In addition, because the modality states of the object images to be processed differ, each image needs to be input into the neural network corresponding to its modality category for feature extraction; for example, a visible light image is input into the neural network corresponding to the visible light category, and an infrared light image is input into the neural network corresponding to the infrared light category. The embodiment of the present invention does not limit the specific neural network corresponding to each modality, which may be set according to actual application requirements.
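For illustration, a minimal sketch of this modality-routed feature extraction in PyTorch; the backbone choice and all names here are assumptions, since the patent leaves the concrete network open:

```python
import torch
import torch.nn as nn
import torchvision.models as models

def make_backbone() -> nn.Module:
    """ResNet-50 with its classification head dropped so the output is a
    feature vector; purely a stand-in, the patent leaves the backbone open
    (e.g. ResNeSt, ResNeXt, EfficientNet)."""
    resnet = models.resnet50(weights=None)
    return nn.Sequential(*list(resnet.children())[:-1], nn.Flatten())

# One backbone per modality category, as the embodiment requires.
backbones = {"visible": make_backbone(), "infrared": make_backbone()}

def extract_features(image_batch: torch.Tensor, modality: str) -> torch.Tensor:
    """Route a batch of image blocks to the backbone matching its modality."""
    return backbones[modality](image_batch)
```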
The purpose of the encoding processing is to further fuse other features of the object image to be processed, such as modality features and cropping features, into the feature matrix obtained by the feature extraction. The embodiment of the present invention sets a corresponding code value for each modality category and each cropping mode, and fuses these code values into the feature matrix of the object image to be processed. The embodiment of the present invention does not limit the specific code values, as long as modality categories and cropping modes can be distinguished; they can be set according to actual application requirements. Of course, the code values may be further set in combination with the feature values in the feature matrix.
S102, inputting the encoding matrix into the two feature interaction branches in parallel, so that the feature interaction branches perform self-attention feature extraction and cross-attention feature extraction on the encoding matrix to obtain an interactive feature matrix corresponding to the object image to be processed.
For ease of understanding, please refer to fig. 2, which is a block diagram of a feature interactor according to an embodiment of the present invention. In the figure, the area enclosed by the dashed box represents a cross-attention layer, which contains left and right feature interaction branches. Each branch contains a self-attention subunit composed of a self-attention module (Self-attention), a random deletion module (Dropout), a layer normalization module (Normalization) and an addition module, and a cross-attention subunit composed of a cross-attention module (Cross-attention), a random deletion module, a layer normalization module and an addition module. For convenience of description, the feature interaction branch on the Q encoding feature side is hereinafter referred to as feature interaction branch 1, and the branch on the G encoding feature side as feature interaction branch 2. Taking a single cross-attention layer as an example: after receiving the Q encoding feature, feature interaction branch 1 first performs self-attention feature extraction on the matrix using the self-attention module, the random deletion module and the layer normalization module, and inputs the obtained self-attention feature together with the encoding matrix into the addition module for addition, obtaining intermediate feature 1. Feature interaction branch 2 performs the same intermediate-feature generation process to obtain intermediate feature 2. Subsequently, feature interaction branch 1 acquires intermediate feature 2 from feature interaction branch 2, inputs intermediate feature 1 and intermediate feature 2 together into the cross-attention module, the random deletion module and the layer normalization module to extract the cross-attention feature of intermediate feature 1, and inputs the obtained cross-attention feature together with intermediate feature 1 into the addition module for addition, obtaining the Q interactive feature corresponding to the target image. Feature interaction branch 2 performs the same interactive-feature generation process to obtain the G interactive feature corresponding to the candidate image.
In a possible case, the feature interaction branch performing self-attention feature extraction and cross-attention feature extraction on the encoding matrix to obtain the interactive feature matrix corresponding to the object image to be processed may include:
Step 11: the feature interaction branch performs self-attention feature extraction on the encoding matrix, and adds the obtained self-attention feature and the encoding matrix to obtain a local-end intermediate feature;
Step 12: the local-end intermediate feature is sent to the other feature interaction branch, and an opposite-end intermediate feature sent by the other feature interaction branch is received;
Step 13: cross-attention feature extraction is performed on the local-end intermediate feature and the opposite-end intermediate feature, and the obtained cross-attention feature is added to the local-end intermediate feature to obtain the interactive feature matrix.
It can be understood that, for feature interaction branch 1, the local-end intermediate feature is intermediate feature 1 and the opposite-end intermediate feature is intermediate feature 2; for feature interaction branch 2, the correspondence is reversed.
Further, for the extraction of the self-attention feature, the existing self-attention mechanism can be adopted for extraction:
$A = \mathrm{softmax}\!\left(\frac{(X W_Q)(X W_K)^{\top}}{\sqrt{d}}\right) X W_V$

where $A$ denotes the self-attention feature, $X$ denotes the matrix input into the self-attention module, $\mathrm{softmax}$ denotes the normalization function, $W_Q$, $W_K$ and $W_V$ denote pre-trained weight matrices, $\top$ denotes the transpose operation, and $d$ denotes the dimension of the matrix input into the self-attention module.
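A minimal sketch of this scaled dot-product self-attention in PyTorch (single head, unbatched; the weight matrices are assumed to be given):

```python
import math
import torch

def self_attention(x: torch.Tensor, w_q: torch.Tensor,
                   w_k: torch.Tensor, w_v: torch.Tensor) -> torch.Tensor:
    """x: (n, d) matrix input to the module; w_*: (d, d) pre-trained weights."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(x.shape[-1])
    return torch.softmax(scores, dim=-1) @ v  # A in the formula above
```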
For the cross-attention feature extraction, the embodiment of the present invention may perform the following processing:
in one possible case, the performing the cross-attention feature extraction on the local-end intermediate feature and the opposite-end intermediate feature may include:
Step 21: the feature interaction branch performs cross-attention feature extraction on the local-end intermediate feature and the opposite-end intermediate feature as follows:

$F = \mathrm{softmax}\!\left(\frac{(X_l W_Q)(X_o W_K)^{\top}}{\sqrt{d_o}}\right) X_o W_V$

where $F$ denotes the initial cross-attention feature, $X_l$ denotes the local-end intermediate feature, $X_o$ denotes the opposite-end intermediate feature, $\mathrm{softmax}$ denotes the normalization function, $W_Q$, $W_K$ and $W_V$ denote pre-trained weight matrices, $\top$ denotes the transpose operation, and $d_o$ denotes the dimension of the opposite-end intermediate feature;
step 22: and carrying out random deletion processing and normalization processing on the initial cross-attention features to obtain the cross-attention features.
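A sketch of steps 21 and 22, with queries taken from the local-end feature and keys/values from the opposite-end feature; the dropout rate and the feature dimension are assumed placeholders, not values fixed by the patent:

```python
import math
import torch
import torch.nn as nn

D = 512                       # feature dimension; an assumed placeholder
dropout = nn.Dropout(p=0.1)   # random deletion module; rate assumed
layer_norm = nn.LayerNorm(D)  # layer normalization module

def cross_attention(x_local: torch.Tensor, x_peer: torch.Tensor,
                    w_q: torch.Tensor, w_k: torch.Tensor,
                    w_v: torch.Tensor) -> torch.Tensor:
    """Step 21: queries from the local end, keys/values from the opposite end."""
    q, k, v = x_local @ w_q, x_peer @ w_k, x_peer @ w_v
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(x_peer.shape[-1])
    initial = torch.softmax(scores, dim=-1) @ v  # initial cross-attention feature
    return layer_norm(dropout(initial))          # step 22: deletion + normalization
```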
It should be noted that the embodiments of the present invention do not limit the specific structures and uses of the random deletion module, the layer normalization module and the addition module; these modules are all common in the field of neural networks, and reference may be made to related neural network technology.
Further, it should be noted that in the embodiment of the present invention the feature interaction branch (i.e., the cross-attention layer) may have a single-layer structure or a multilayer structure as shown in fig. 2. Setting multiple layers of feature interaction branches effectively increases the number of feature extractions and thus improves the recognition performance of the re-identification system. The embodiment of the present invention does not limit the specific number of layers of feature interaction branches, which may be set according to actual application requirements.
In one possible case, the feature interaction branch has a multilayer structure, and after the interactive feature matrix is obtained, the method further includes:
Step 31: judging whether a next layer of feature interaction branches exists; if yes, proceeding to step 32; if not, proceeding to step 33;
Step 32: inputting the interactive feature matrix into the next layer of feature interaction branches for processing;
Step 33: proceeding to the step of inputting the interactive feature matrix into the two feature extraction branches in parallel.
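For illustration, the layer-by-layer routing of steps 31 to 33 reduces to a simple loop; a minimal sketch, assuming the stacked cross-attention layers are given as a list of callables (the names are illustrative, not from the patent):

```python
def run_feature_interactor(q_code, g_code, layers):
    """Steps 31-33: pass the pair of matrices through each cross-attention
    layer in turn; the last layer's outputs go to the two feature extraction
    branches in parallel."""
    for layer in layers:
        q_code, g_code = layer(q_code, g_code)
    return q_code, g_code
```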
S103, inputting the interactive feature matrix into the two feature extraction branches in parallel, so that the feature extraction branches perform self-attention feature extraction on the interactive feature matrix to obtain intermediate features corresponding to the object image to be processed.
As mentioned above, for the non-deterministic cross-modal re-identification problem, performing distance calculation on features in a feature space is unreasonable, since the feature distance between images of the same modality is obviously shorter than the feature distance between images of different modalities, which seriously affects the handling of the non-deterministic cross-modal re-identification problem. Therefore, the embodiment of the present invention abandons the classical feature metric learning methods of the re-identification field and instead uses a probability prediction method to handle the non-deterministic cross-modal re-identification problem in a probability space. The probability space is independent of the modalities, so differences between modalities need not be considered, and a better re-identification effect can be achieved compared with existing schemes.
For ease of understanding, please refer to fig. 3, which is a block diagram of a dual-cross predictor according to an embodiment of the present invention, where the area enclosed by the dashed box represents a single dual-cross prediction layer, the leftmost and rightmost branches are the feature extraction branches, and the middle branch is the prediction branch. In the embodiment of the present invention, each feature extraction branch contains a self-attention module, a layer normalization module and an addition module, and is configured to perform self-attention feature extraction on the received interactive feature matrix, add the obtained self-attention feature to the interactive feature to obtain an intermediate feature, and finally send the intermediate feature to the cross-attention layer of the prediction branch. For the process of self-attention feature extraction by the feature extraction branch, reference may be made to the foregoing embodiments, which is not repeated here.
S104, inputting the initial prediction vector generated by using the interactive feature matrix, together with the intermediate features, into the prediction branch, so that the prediction branch performs self-attention feature extraction on the initial prediction vector and performs cross-attention feature extraction using the obtained intermediate prediction feature and the intermediate features to obtain the prediction vector.
Referring also to fig. 3, it can be seen that the prediction branch consists of a self-attention unit composed of a self-attention module, a layer normalization module and an addition module, and a dual-cross-attention unit composed of a cross-attention module, an averaging module, a layer normalization module and an addition module. The input of the self-attention unit is an initial prediction vector of size $1 \times d$, where $d$ is the feature dimension output by the convolutional neural network model for a single image in the feature extraction processing. The embodiment of the present invention does not limit the manner of generating the initial prediction vector; for example, a zero vector of size $1 \times d$ may be set as the initial prediction vector, or the cosine similarity between the two interactive feature vectors may be calculated and copied $d$ times to obtain the initial prediction vector. In the embodiment of the present invention, to improve calculation efficiency, the initial prediction vector can be generated using the cosine similarity between the interactive feature matrices.
In one possible case, before the initial prediction vector generated by using the interactive feature matrix is input, together with the intermediate features, into the prediction branch, the method may further include:
Step 41: calculating the cosine similarity between the interactive feature matrices, and generating the initial prediction vector using the cosine similarity.
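A minimal sketch of step 41, assuming the two interactive feature matrices are flattened before computing the similarity and that the scalar is copied $d$ times into a $(1, d)$ vector:

```python
import torch
import torch.nn.functional as F

def initial_prediction_vector(q_feat: torch.Tensor, g_feat: torch.Tensor,
                              d: int) -> torch.Tensor:
    """Step 41: cosine similarity of the two interactive feature matrices,
    copied d times to form a (1, d) initial prediction vector."""
    sim = F.cosine_similarity(q_feat.flatten(), g_feat.flatten(), dim=0)
    return sim.repeat(1, d)
```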
Further, when the prediction branch acquires the initial prediction vector, it performs self-attention feature extraction on it and adds the obtained self-attention feature to the initial prediction vector to obtain an intermediate prediction feature. Subsequently, the branch acquires the intermediate features generated by the two feature extraction branches, and uses the two intermediate features respectively to perform cross-attention feature extraction on the intermediate prediction feature, obtaining two cross-attention features. The embodiment of the present invention performs prediction by fusing the two cross-attention features: specifically, the two cross-attention features are summed and averaged to obtain a fusion feature, and the fusion feature is normalized to obtain the prediction vector.
In a possible case, the performing, by the prediction branch, self-attention feature extraction on the initial prediction vector and performing cross-attention feature extraction by using the obtained intermediate prediction features and the intermediate features to obtain the prediction vector may include:
step 51: the prediction branch inputs the initial prediction vector to a self-attention module for self-attention feature extraction, and adds the obtained self-attention feature and the initial prediction vector to obtain an intermediate prediction feature;
step 52: receiving a first intermediate feature sent by the first feature extraction branch and a second intermediate feature sent by the second feature extraction branch;
step 53: performing cross-attention feature extraction on the intermediate prediction feature and the first intermediate feature to obtain a first cross-attention feature, and performing cross-attention feature extraction on the intermediate prediction feature and the second intermediate feature to obtain a second cross-attention feature;
step 54: and summing the first cross-attention feature and the second cross-attention feature, and averaging the summation result to obtain a fusion feature.
The fusion processing of the cross-attention features can be expressed as:

$F = \frac{(F_1 + P) + (F_2 + P)}{2}$

where $F_1$ denotes the first cross-attention feature, $F_2$ denotes the second cross-attention feature, $F$ on the left of the equation denotes the fusion feature, and $P$, which appears in both terms on the right of the equation, denotes the intermediate prediction feature.
Step 55: and inputting the fusion features into a normalization layer for processing to obtain a prediction vector.
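Putting steps 51 to 55 together, a sketch of one pass through the prediction branch, reusing the self_attention and cross_attention helpers and the layer_norm from the sketches above; the weight names are assumptions, and the residual form of the fusion follows the reconstructed formula above:

```python
def prediction_layer(p0, h1, h2, w):
    """One pass through the prediction branch (steps 51-55).
    p0: initial prediction vector; h1/h2: intermediate features from the two
    feature extraction branches; w: dict of assumed weight matrices."""
    # Step 51: self-attention plus residual gives the intermediate prediction P.
    p = p0 + self_attention(p0, w["q"], w["k"], w["v"])
    # Steps 52-53: cross-attention against each intermediate feature.
    f1 = cross_attention(p, h1, w["cq"], w["ck"], w["cv"])
    f2 = cross_attention(p, h2, w["cq"], w["ck"], w["cv"])
    # Step 54: sum with residuals and average to obtain the fusion feature.
    fused = ((f1 + p) + (f2 + p)) / 2
    # Step 55: the normalization layer yields the prediction vector.
    return layer_norm(fused)
```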
Of course, the dual-cross predictor may also have a multilayer structure. Referring to fig. 4, fig. 4 is a block diagram of another dual-cross predictor according to an embodiment of the present invention. In the multilayer dual-cross predictor, a cross-attention module, a layer normalization module and an addition module are additionally added to each feature extraction branch for cross-attention feature extraction, and the prediction branch additionally sends the intermediate prediction feature to the feature extraction branches so that they can perform cross-attention feature extraction. Because the workflow of the feature extraction branch is similar to that of the feature interaction branch, reference may be made to the description of the feature interaction branch for the workflow of a feature extraction branch with a multilayer structure.
In a possible case, the feature extraction branches and the prediction branch have a multilayer structure, and after the intermediate prediction feature is obtained, the method may further include:
Step 61: the prediction branch sends the intermediate prediction feature to the feature extraction branches;
correspondingly, after the intermediate features corresponding to the object image to be processed are obtained, the method further includes:
Step 62: the feature extraction branch performs cross-attention feature extraction using its intermediate feature and the received intermediate prediction feature to obtain an interlayer feature;
Step 63: the interlayer feature is sent to the next layer of feature extraction branches;
correspondingly, after the prediction vector is obtained, the method further includes:
Step 64: the prediction branch sends the prediction vector to the next layer of prediction branch.
S105, judging whether the target image and the candidate image belong to the same object by using a predicted value obtained by dimensionality reduction of the prediction vector.
It should be noted that the embodiment of the present invention does not limit the specific dimensionality reduction manner for the prediction vector; reference may be made to related neural network technology. It can be understood that, after the predicted value is obtained, it can be compared with a preset threshold, and whether the target image and the candidate image belong to the same object is determined according to the comparison result.
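The decision of S105 then reduces to a dimensionality-reduction head plus a threshold comparison; a minimal sketch, assuming a linear layer with sigmoid and a threshold of 0.5 (both are illustrative choices the patent leaves open):

```python
import torch
import torch.nn as nn

# Dimensionality-reduction head; a linear layer + sigmoid is one common
# choice, used here as an assumption since the patent leaves it open.
head = nn.Sequential(nn.Linear(512, 1), nn.Sigmoid())  # dimension assumed

def same_object(prediction_vector: torch.Tensor, threshold: float = 0.5) -> bool:
    """Reduce the (1, d) prediction vector to a scalar predicted value and
    compare it with the preset threshold."""
    predicted_value = head(prediction_vector).item()
    return predicted_value > threshold
```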
Based on the above embodiment, when a target image and a candidate image containing an object are obtained, the object images to be processed are cropped, feature-extracted and encoded to obtain the corresponding encoding matrices, each of which contains various kinds of feature information of the corresponding image. The encoding matrices are then input into two identical feature interaction branches for self-attention feature extraction and cross-attention feature extraction, yielding the interactive feature matrix corresponding to each object image to be processed; each branch performs cross-attention feature extraction on the encoding matrix received at its local end using the self-attention features of both encoding matrices, which effectively improves the pertinence of the attention mechanism in dealing with the non-deterministic cross-modal object re-identification problem. Furthermore, an initial prediction vector is generated from the interactive feature matrices of the two object images to be processed, the interactive feature matrices are input into two identical feature extraction branches in parallel, and the initial prediction vector is input into the prediction branch of the dual-cross predictor, so that the feature extraction branches use a self-attention mechanism to extract the self-attention features of the interactive feature matrices, and the prediction branch fuses these self-attention features using a cross-attention mechanism and generates a prediction vector from the obtained fusion feature. Finally, the prediction vector is reduced in dimension to obtain a predicted value representing the probability that the target image and the candidate image belong to the same object. In other words, a probability prediction method is adopted to handle the non-deterministic cross-modal object re-identification problem in a probability space, and the accuracy of non-deterministic cross-modal object re-identification can be effectively improved.
Based on the above embodiment, the processes of cropping, feature extraction and encoding of the object image to be processed are described in detail below. In a possible case, the performing cropping, feature extraction and encoding processing on the obtained object image to be processed to obtain the encoding matrix corresponding to the object image to be processed may include:
s201, cutting the object image to be processed to obtain an image block corresponding to the object image to be processed.
It should be noted that the embodiment of the present invention does not limit the specific cropping manner. For example, the object image to be processed may be cropped horizontally into a preset number of rows of image blocks, or cropped both horizontally and vertically according to a preset number of rows and a preset number of columns. Of course, the object image to be processed may also be cropped in multiple cropping modes, so that local features can be extracted from image blocks of different sizes.
In a possible case, the cropping the object image to be processed to obtain the image blocks corresponding to the object image to be processed may include:
Step 71: cropping the object image to be processed into a first preset number of rows in the horizontal cropping mode to obtain first image blocks;
Step 72: cropping the object image to be processed into a second preset number of rows and a preset number of columns in the horizontal-and-vertical cropping mode to obtain second image blocks;
Step 73: setting the first image blocks and the second image blocks as the image blocks.
It should be noted that the embodiment of the present invention does not limit the specific values of the first preset number of rows, the second preset number of rows and the preset number of columns, which may be set according to actual application requirements. In addition, multiple groups of second preset row numbers and preset column numbers may be set for horizontal-and-vertical cropping of the object image to be processed, so as to obtain second image blocks of different sizes. Referring to fig. 5, fig. 5 is a schematic diagram of image cropping according to an embodiment of the present invention, in which the leftmost image is the original image, the middle image is the cropped image, 1-8 mark image blocks, the horizontal strips shown at the right are the first image blocks, and grid crop A and grid crop B are second image blocks obtained by cropping with different preset numbers of rows and columns. Of course, the image blocks marked 1-8 are only used to illustrate that the size of an image block can be adjusted at will and do not limit the cropping mode; the specific cropping mode can be set according to actual application requirements. After the above cropping, adding the original image itself yields

$N = 1 + m + \sum_{k} r_k c_k$

image blocks, where $m$ denotes the number of rows of the horizontal cropping, and $r_k$ and $c_k$ respectively denote the preset number of rows and the preset number of columns corresponding to the $k$-th horizontal-and-vertical cropping mode.
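A minimal sketch of the cropping scheme of steps 71 to 73, assuming PIL-style images and illustrative settings (m = 4 horizontal strips, grid crops of 2×2 and 3×3; the patent does not fix these values):

```python
from PIL import Image

def crop_blocks(img: Image.Image, m: int = 4, grids=((2, 2), (3, 3))):
    """Return [original] + m horizontal strips + sum(r*c) grid blocks,
    i.e. N = 1 + m + sum(r_k * c_k) image blocks in total."""
    w, h = img.size
    blocks = [img]
    for i in range(m):  # horizontal cropping: m strips
        blocks.append(img.crop((0, i * h // m, w, (i + 1) * h // m)))
    for rows, cols in grids:  # horizontal-and-vertical cropping
        for r in range(rows):
            for c in range(cols):
                blocks.append(img.crop((c * w // cols, r * h // rows,
                                        (c + 1) * w // cols, (r + 1) * h // rows)))
    return blocks
```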
S202, generating an image set from the object image to be processed and the image blocks, and performing feature extraction on the image set using the neural network corresponding to the modality category of the object image to be processed to obtain the feature matrix corresponding to the object image to be processed.
It can be understood that the image set contains $N$ image blocks. After extracting the features of the image blocks, the neural network generates a feature matrix of size $N \times d$, where $d$ is the output feature dimension of the neural network for a single image block. It should be noted that, due to the input constraints of the neural network, each image block in the image set needs to be scaled to a uniform preset size before feature extraction. The embodiment of the present invention does not limit the specific preset size, which can be set according to actual application requirements.
In one possible case, the performing feature extraction on the image set using the neural network corresponding to the modality category of the object image to be processed may include:
Step 81: scaling each image block in the image set to a preset size, and inputting the scaled image set into the neural network for feature extraction.
S203, performing the encoding processing on the feature matrix using the modality category and the cropping feature information of each image block to obtain the encoding matrix corresponding to the object image to be processed.
The encoding matrix has the same size as the feature matrix and can be generated as follows:

$E_i = f_i + p_i + s_i + m_i$

where $E_i$ denotes the coding sequence corresponding to the $i$-th image block, $f_i$ denotes the feature vector corresponding to the image block in the feature matrix, $p_i$ denotes the position code of the image block, $s_i$ denotes the cropping code of the image block, and $m_i$ denotes the modality code of the image block. The generation of each type of code is as follows:
In a possible case, the performing encoding processing on the feature matrix using the modality category and the cropping feature information of each image block to obtain the encoding matrix corresponding to the object image to be processed may include:
Step 91: acquiring the cropping mode corresponding to each image block, its relative position in the object image to be processed, and its corresponding feature vector in the feature matrix.
Step 92: generating the position code using the relative position and the feature codes in the feature vector, and generating the cropping code and the modality code using the cropping mode and the modality category respectively.
Specifically, the position code can be generated as follows:
$P_i = P_i^{v} + P_i^{h}$

where $P_i^{v}$ and $P_i^{h}$ respectively correspond to the codes of the image block in the vertical and horizontal spatial directions, and the formula is as follows:

$PE(pos, 2c) = \sin\left(pos / 10000^{2c/d}\right), \quad c \in N$

$PE(pos, 2c+1) = \cos\left(pos / 10000^{2c/d}\right), \quad c \in N$

where $PE(pos, 2c)$ is the encoding formula of the image block when the index value is even, $PE(pos, 2c+1)$ is the encoding formula when the index value is odd, $d$ is the output feature dimension of the encoder, $N$ represents the set of natural numbers, $pos$ represents the position index of the image block in the horizontal/vertical direction, and $c$ indexes the $c$-th feature code of the image block in the feature vector.
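Under the reconstruction above, the position code for one spatial direction can be sketched as follows; `position_code` and `block_position_code` are illustrative names:

```python
import numpy as np

def position_code(pos: int, d: int) -> np.ndarray:
    """Sinusoidal code along one direction: sin at even feature
    indices, cos at odd, with exponent 2c/d in both cases."""
    i = np.arange(d)
    angles = pos / 10000.0 ** ((i - i % 2) / d)   # (i - i % 2) equals 2c
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def block_position_code(row: int, col: int, d: int) -> np.ndarray:
    """Per the text, the block's position code is the sum of its
    vertical and horizontal codes."""
    return position_code(row, d) + position_code(col, d)
```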
Further, the crop code may be generated as follows:
$C_i \in \{0, 1, 2\}$, where 0, 1 and 2 are the code values respectively corresponding to the three cutting modes: no cutting, transverse cutting, and horizontal-and-vertical cutting. It should be noted that the code value corresponding to each cutting mode can be set according to actual application requirements, as long as the cutting modes can be distinguished.
Further, the modal encoding may be generated as follows:
$M_i \in \{0, 1\}$, where the modal code of an image block $I$ is 0 if $I$ comes from a visible-light image and 1 if $I$ comes from an infrared image; that is, 0 and 1 are the code values of the visible and infrared modalities, respectively. It should be noted that the modal code can be set according to actual application requirements, as long as the modalities can be distinguished.
Step 93: generate the encoding matrix by using the feature code, the position code, the cropping code, and the modal code.
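Putting the four codes together, Step 93 can be sketched as below, reusing `block_position_code` from the earlier sketch. Broadcasting the scalar cropping and modal code values across the feature dimension is an assumption; the patent only requires the codes to be distinguishable:

```python
import numpy as np

def encode_matrix(feats, positions, crop_modes, modality) -> np.ndarray:
    """Build E with rows E_i = F_i + P_i + C_i + M_i."""
    n, d = feats.shape
    E = np.empty_like(feats)
    for i in range(n):
        P_i = block_position_code(*positions[i], d)   # position code
        C_i = float(crop_modes[i])   # 0 no cut, 1 transverse, 2 horiz-and-vert
        M_i = float(modality)        # 0 visible light, 1 infrared
        E[i] = feats[i] + P_i + C_i + M_i
    return E                         # same size as the feature matrix
```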
Based on the above embodiment, the present invention can generate, for the object image to be processed, an encoding matrix containing global features, local features, cropping features and modal features through cropping, feature extraction and encoding, so that multiple kinds of features are fused, further improving the accuracy of cross-modal object re-identification.
In the following, the object recognition device, the electronic device, and the computer-readable storage medium provided by the embodiments of the present invention are introduced; the object recognition device, the electronic device, and the computer-readable storage medium described below and the object recognition method described above may be referred to correspondingly.
Referring to fig. 6, fig. 6 is a block diagram of an object recognition apparatus according to an embodiment of the present invention, where the apparatus may include:
the feature extraction module 601 is configured to perform cutting, feature extraction, and encoding on the obtained object image to be processed to obtain an encoding matrix corresponding to the object image to be processed; the object image to be processed comprises a target image and a candidate image, and the coding matrix comprises various characteristics of the object image to be processed;
the feature interaction module 602 is configured to input the coding matrix to the two feature interaction branches in parallel, so that the feature interaction branches perform self-attention feature extraction and cross-attention feature extraction on the coding matrix to obtain an interaction feature matrix corresponding to the object image to be processed;
the feature extraction branch module 603 is configured to input the interaction feature matrix to the two feature extraction branches in parallel, so that the feature extraction branch performs self-attention feature extraction on the interaction feature matrix to obtain an intermediate feature corresponding to the object image to be processed;
a prediction branch module 604, configured to input an initial prediction vector generated by using the interaction feature matrix, together with the intermediate features, to a prediction branch, so that the prediction branch performs self-attention feature extraction on the initial prediction vector, and performs cross-attention feature extraction by using the obtained intermediate prediction feature and the intermediate features to obtain a prediction vector;
and a determining module 605, configured to determine whether the target image and the candidate image belong to the same object by using a predicted value obtained by dimensionality reduction of the prediction vector.
Optionally, the feature interaction module 602 may include:
the characteristic interaction branch is used for extracting the self-attention characteristic of the coding matrix and adding the obtained self-attention characteristic and the coding matrix to obtain a local end intermediate characteristic; sending the local terminal intermediate feature to another feature interaction branch, and receiving the opposite terminal intermediate feature sent by another feature interaction branch; and performing cross-attention feature extraction on the local terminal intermediate feature and the opposite terminal intermediate feature, and adding the obtained cross-attention feature and the local terminal intermediate feature to obtain an interactive feature matrix.
Optionally, the feature interaction branch is specifically configured to perform cross-attention feature extraction on the local-end intermediate feature and the opposite-end intermediate feature in the following way:
$F_{cross} = \sigma\!\left(\frac{(F_{loc} W_Q)(F_{opp} W_K)^{T}}{\sqrt{d_{opp}}}\right)(F_{opp} W_V)$

where $F_{cross}$ represents the initial cross-attention feature, $F_{loc}$ represents the local-end intermediate feature, $F_{opp}$ represents the opposite-end intermediate feature, $\sigma$ represents the normalization function, $W_Q$, $W_K$ and $W_V$ represent pre-trained weight matrices, $T$ represents the transpose operation, and $d_{opp}$ represents the dimension of the opposite-end intermediate feature; and carrying out random deletion processing and normalization processing on the initial cross-attention feature to obtain the cross-attention feature.
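A sketch of this computation, together with one full interaction step (self-attention with a residual connection, then cross-attention against the peer branch). The shared weight triple and the inference-style dropout are simplifying assumptions, not the patent's exact implementation:

```python
import numpy as np

def cross_attention(F_loc, F_opp, W_q, W_k, W_v, drop_rate=0.1):
    """Queries from the local end, keys/values from the opposite end,
    softmax normalization, random deletion, then layer normalization."""
    d_opp = F_opp.shape[-1]
    scores = (F_loc @ W_q) @ (F_opp @ W_k).T / np.sqrt(d_opp)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)            # softmax
    out = attn @ (F_opp @ W_v)                          # initial cross-attention
    mask = np.random.rand(*out.shape) >= drop_rate      # random deletion
    out = out * mask / (1.0 - drop_rate)
    mu, sd = out.mean(-1, keepdims=True), out.std(-1, keepdims=True)
    return (out - mu) / (sd + 1e-6)                     # normalization

def interaction_step(E_local, E_peer, W):
    """One feature-interaction-branch layer; W = (W_q, W_k, W_v)."""
    mid_local = E_local + cross_attention(E_local, E_local, *W)  # self-attention
    mid_peer = E_peer + cross_attention(E_peer, E_peer, *W)      # peer's mid feature
    return mid_local + cross_attention(mid_local, mid_peer, *W)  # interaction matrix
```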
Optionally, the feature interaction branch has a multi-layer structure, and the feature interaction module 602 may further include:
the judging submodule is used for judging whether a next layer of characteristic interaction branch exists or not; if yes, inputting the interactive feature matrix into a next layer of feature interactive branch for processing; and if not, inputting the interactive feature matrix into the two feature extraction branches in parallel.
Optionally, the feature extraction module 601 may include:
the cutting submodule is used for cutting the object image to be processed to obtain an image block corresponding to the object image to be processed;
the feature extraction submodule is used for generating an image set by using the object image to be processed and the image blocks, and performing feature extraction on the image set by using a neural network corresponding to the modality category of the object image to be processed to obtain a feature matrix corresponding to the object image to be processed;
and the encoding submodule is used for encoding the feature matrix by using the modality category and the cropping feature information of each image block to obtain an encoding matrix corresponding to the object image to be processed.
Optionally, the cutting submodule may include:
the first cutting unit, used for cutting the object image to be processed according to a first preset cutting row number and a transverse cutting mode to obtain first image blocks;
the second cutting unit, used for cutting the object image to be processed according to a second preset cutting row number, a preset cutting column number and a horizontal-and-vertical cutting mode to obtain second image blocks;
and the setting unit, used for setting the first image blocks and the second image blocks as the image blocks.
Optionally, the encoding submodule may include:
the acquisition unit is used for acquiring the cropping mode corresponding to an image block, its relative position in the object image to be processed, and its corresponding feature vector in the feature matrix;
the code generation unit is used for generating a position code by using the relative position and the feature codes in the feature vector, and generating a cropping code and a modal code by using the cropping mode and the modality category, respectively;
and the encoding matrix generation unit is used for generating the encoding matrix by using the feature code, the position code, the cropping code and the modal code.
Optionally, the feature extraction sub-module is specifically configured to scale each image block in the image set to a preset size, and input the image set subjected to scaling processing to the neural network for feature extraction.
Optionally, the apparatus may further include:
and the initial prediction vector generation module is used for calculating the cosine similarity between the two interaction feature matrices and generating an initial prediction vector by using the cosine similarity.
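A sketch of this module; using the row-wise cosine similarities as the vector itself is an assumption, since the text only states that cosine similarity is used to generate it:

```python
import numpy as np

def initial_prediction_vector(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cosine similarity between matching rows of the two branches'
    interaction feature matrices."""
    num = (A * B).sum(axis=-1)
    den = np.linalg.norm(A, axis=-1) * np.linalg.norm(B, axis=-1)
    return num / (den + 1e-12)
```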
Optionally, the predicting branch module 604 may include:
the prediction branch is used for inputting the initial prediction vector to the self-attention module for self-attention feature extraction, and adding the obtained self-attention feature to the initial prediction vector to obtain an intermediate prediction feature; receiving a first intermediate feature sent by the first feature extraction branch and a second intermediate feature sent by the second feature extraction branch; performing cross-attention feature extraction on the intermediate prediction feature and the first intermediate feature to obtain a first cross-attention feature, and performing cross-attention feature extraction on the intermediate prediction feature and the second intermediate feature to obtain a second cross-attention feature; summing the first cross-attention feature and the second cross-attention feature, and averaging the summation result to obtain a fusion feature; and inputting the fusion feature into a normalization layer for processing to obtain a prediction vector.
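One prediction-branch layer can then be sketched by reusing `cross_attention` from the earlier sketch; sharing one weight triple across all attention modules is an assumption made for brevity:

```python
import numpy as np

def prediction_branch_step(p, feat1, feat2, W):
    """Self-attention on the prediction vector with a residual, two
    cross-attentions against the branch features, sum-and-average
    fusion, then a normalization layer; W = (W_q, W_k, W_v)."""
    p_mid = p + cross_attention(p, p, *W)          # intermediate prediction feature
    c1 = cross_attention(p_mid, feat1, *W)         # first cross-attention feature
    c2 = cross_attention(p_mid, feat2, *W)         # second cross-attention feature
    fused = (c1 + c2) / 2.0                        # average of the summation
    mu, sd = fused.mean(-1, keepdims=True), fused.std(-1, keepdims=True)
    return (fused - mu) / (sd + 1e-6)              # prediction vector
```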
Optionally, the feature extraction branch and the prediction branch have a multilayer structure, and after the intermediate prediction feature is obtained, the prediction branch can be further used for sending the intermediate prediction feature to the feature extraction branch;
correspondingly, after the intermediate features corresponding to the object image to be processed are obtained, the feature extraction branch in the feature extraction branch module 603 may also be used to perform cross-attention feature extraction by using the intermediate features and the received intermediate prediction features to obtain interlayer features, and to send the interlayer features to the next layer of feature extraction branch;
correspondingly, after the prediction vector is obtained, the prediction branch can also be used for sending the prediction vector to the next layer of prediction branch.
An embodiment of the present invention further provides an electronic device, including:
a memory for storing a computer program;
a processor for implementing the steps of the object identification method as described above when executing the computer program.
Since the embodiment of the electronic device portion corresponds to the embodiment of the object identification method portion, please refer to the description of the embodiment of the object identification method portion for the embodiment of the electronic device portion, which is not described herein again.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the object identification method in any of the above embodiments are implemented.
Since the embodiment of the computer-readable storage medium portion corresponds to the embodiment of the object identification method portion, please refer to the description of the embodiment of the object identification method portion for the embodiment of the storage medium portion, which is not described herein again.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The object identification method, the object identification device, the electronic device, and the computer-readable storage medium according to the present invention are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.
Claims (14)
1. An object recognition method, comprising:
cutting, feature extraction and coding processing are carried out on the obtained object image to be processed, and a coding matrix corresponding to the object image to be processed is obtained; the object image to be processed comprises a target image and a candidate image, and the coding matrix comprises various characteristics of the object image to be processed;
inputting the coding matrix into two feature interaction branches in parallel, so that the feature interaction branches extract self-attention features and cross-attention features of the coding matrix to obtain an interaction feature matrix corresponding to the object image to be processed;
inputting the interactive feature matrix into two feature extraction branches in parallel, so that the feature extraction branches extract the self-attention features of the interactive feature matrix to obtain intermediate features corresponding to the object image to be processed;
inputting an initial prediction vector generated by using the interaction feature matrix, together with the intermediate features, into a prediction branch, so that the prediction branch performs self-attention feature extraction on the initial prediction vector, and performs cross-attention feature extraction by using the obtained intermediate prediction feature and the intermediate features to obtain a prediction vector;
and judging whether the target image and the candidate image belong to the same object or not by using a predicted value obtained by dimensionality reduction of the prediction vector.
2. The object identification method according to claim 1, wherein the feature interaction branch performs self-attention feature extraction and cross-attention feature extraction on the coding matrix to obtain an interaction feature matrix corresponding to the object image to be processed, and includes:
the feature interaction branch performing self-attention feature extraction on the coding matrix, and adding the obtained self-attention feature to the coding matrix to obtain a local-end intermediate feature;
sending the local-end intermediate feature to the other feature interaction branch, and receiving an opposite-end intermediate feature sent by the other feature interaction branch;
and performing cross-attention feature extraction on the local-end intermediate feature and the opposite-end intermediate feature, and adding the obtained cross-attention feature to the local-end intermediate feature to obtain the interaction feature matrix.
3. The object recognition method according to claim 2, wherein the performing of the cross-attention feature extraction on the local-end intermediate feature and the opposite-end intermediate feature comprises:
the feature interaction branch performing cross-attention feature extraction on the local-end intermediate feature and the opposite-end intermediate feature in the following way:
whereinRepresenting an initial cross-attention feature, saidRepresenting said local intermediate feature, saidRepresenting said peer intermediate feature, saidRepresents a normalization function, saidThe above-mentionedAnd saidRepresenting a pre-trained weight matrix, saidRepresents a transpose operation, theRepresenting dimensions of the opposite-end intermediate features;
and carrying out random deletion processing and normalization processing on the initial cross-attention feature to obtain the cross-attention feature.
4. The object recognition method according to claim 2, wherein the feature interaction branch has a multi-layer structure, and after obtaining the interaction feature matrix, further comprises:
judging whether a next layer of feature interaction branch exists or not;
if yes, inputting the interactive feature matrix into the next layer of feature interactive branch for processing;
and if not, inputting the interactive feature matrix into the two feature extraction branches in parallel.
5. The object identification method according to claim 1, wherein the obtaining of the encoding matrix corresponding to the object image to be processed by performing cropping, feature extraction, and encoding on the obtained object image to be processed includes:
cutting the object image to be processed to obtain an image block corresponding to the object image to be processed;
generating an image set by using the object image to be processed and the image block, and performing feature extraction on the image set by using a neural network corresponding to a modality category of the object image to be processed to obtain a feature matrix corresponding to the object image to be processed;
and performing the coding processing on the characteristic matrix by using the modal class and the cropping characteristic information of each image block to obtain a coding matrix corresponding to the object image to be processed.
6. The object identification method according to claim 5, wherein the cropping the object image to be processed to obtain an image block corresponding to the object image to be processed comprises:
cutting the object image to be processed according to a first preset cutting row number and a transverse cutting mode to obtain first image blocks;
cutting the object image to be processed according to a second preset cutting row number, a preset cutting column number and a horizontal-and-vertical cutting mode to obtain second image blocks;
setting the first image blocks and the second image blocks as the image blocks.
7. The object recognition method according to claim 6, wherein the encoding processing on the feature matrix by using the modality category and the cropping feature information of each image block to obtain an encoding matrix corresponding to the object image to be processed comprises:
acquiring a cutting mode corresponding to the image block, a relative position in the object image to be processed and a corresponding feature vector in the feature matrix;
generating a position code by using the relative position and a feature code in the feature vector, and generating a cropping code and a mode code by using the cropping mode and the mode category respectively;
and generating the coding matrix by utilizing the feature code, the position code, the cropping code and the modal code.
8. The object recognition method according to claim 5, wherein the performing the feature extraction on the image set by using a neural network corresponding to a modality class of the object image to be processed comprises:
and scaling each image block in the image set to a preset size, and inputting the image set subjected to scaling processing into the neural network for feature extraction.
9. The object recognition method according to claim 1, further comprising, before inputting the initial prediction vector generated by using the interaction feature matrix, together with the intermediate features, into the prediction branch:
calculating the cosine similarity between the two interaction feature matrices, and generating the initial prediction vector by using the cosine similarity.
10. The object recognition method according to any one of claims 1 to 9, wherein the prediction branch performs self-attention feature extraction on the initial prediction vector and performs cross-attention feature extraction using the obtained intermediate prediction features and the intermediate features to obtain a prediction vector, and the method comprises:
the prediction branch inputs the initial prediction vector to a self-attention module for self-attention feature extraction, and adds the obtained self-attention feature and the initial prediction vector to obtain an intermediate prediction feature;
receiving a first intermediate characteristic sent by the first characteristic extraction branch and a second intermediate characteristic sent by the second characteristic extraction branch;
performing cross-attention feature extraction on the intermediate prediction feature and the first intermediate feature to obtain a first cross-attention feature, and performing cross-attention feature extraction on the intermediate prediction feature and the second intermediate feature to obtain a second cross-attention feature;
summing the first cross-attention feature and the second cross-attention feature, and averaging the summation result to obtain a fusion feature;
and inputting the fusion features into a normalization layer for processing to obtain the prediction vector.
11. The object recognition method according to claim 10, wherein the feature extraction branch and the prediction branch have a multilayer structure, the method further comprising, after the intermediate prediction feature is obtained:
the prediction branch sends the intermediate prediction features to the feature extraction branch;
correspondingly, after obtaining the intermediate features corresponding to the object image to be processed, the method further includes:
the feature extraction branch performing cross-attention feature extraction by using the intermediate features and the received intermediate prediction feature to obtain interlayer features;
sending the interlayer features to a next layer of feature extraction branch;
correspondingly, after obtaining the prediction vector, the method further includes:
and the prediction branch sends the prediction vector to a next layer prediction branch.
12. An object recognition device, comprising:
the characteristic extraction module is used for cutting, extracting characteristics and coding the obtained object image to be processed to obtain a coding matrix corresponding to the object image to be processed; the object image to be processed comprises a target image and a candidate image, and the coding matrix comprises various features of the object image to be processed;
the feature interaction module is used for inputting the coding matrix into two feature interaction branches in parallel so that the feature interaction branches can perform self-attention feature extraction and cross-attention feature extraction on the coding matrix to obtain an interaction feature matrix corresponding to the object image to be processed;
the feature extraction branch module is used for inputting the interactive feature matrix into two feature extraction branches in parallel so that the feature extraction branches can extract the self-attention features of the interactive feature matrix to obtain intermediate features corresponding to the object image to be processed;
the prediction branch module is used for inputting an initial prediction vector generated by using the interactive feature matrix and the intermediate features into a prediction branch so as to enable the prediction branch to extract self-attention features of the initial prediction vector and extract cross-attention features by using the obtained intermediate prediction features and the intermediate features to obtain a prediction vector;
and the judging module is used for judging whether the target image and the candidate image belong to the same object or not by using a predicted value obtained by dimensionality reduction of the prediction vector.
13. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the object identification method as claimed in any one of claims 1 to 11 when executing the computer program.
14. A computer-readable storage medium having computer-executable instructions stored thereon which, when loaded and executed by a processor, carry out a method of object identification according to any one of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210546400.7A CN114663737B (en) | 2022-05-20 | 2022-05-20 | Object identification method and device, electronic equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114663737A true CN114663737A (en) | 2022-06-24 |
CN114663737B CN114663737B (en) | 2022-12-02 |
Family
ID=82037221
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210546400.7A Active CN114663737B (en) | 2022-05-20 | 2022-05-20 | Object identification method and device, electronic equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114663737B (en) |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112651262A (en) * | 2019-10-09 | 2021-04-13 | 四川大学 | Cross-modal pedestrian re-identification method based on self-adaptive pedestrian alignment |
CN110909605A (en) * | 2019-10-24 | 2020-03-24 | 西北工业大学 | Cross-modal pedestrian re-identification method based on contrast correlation |
US20210263961A1 (en) * | 2020-02-26 | 2021-08-26 | Samsung Electronics Co., Ltd. | Coarse-to-fine multimodal gallery search system with attention-based neural network models |
US20210349954A1 (en) * | 2020-04-14 | 2021-11-11 | Naver Corporation | System and method for performing cross-modal information retrieval using a neural network using learned rank images |
WO2022027986A1 (en) * | 2020-08-04 | 2022-02-10 | 杰创智能科技股份有限公司 | Cross-modal person re-identification method and device |
CN111931637A (en) * | 2020-08-07 | 2020-11-13 | 华南理工大学 | Cross-modal pedestrian re-identification method and system based on double-current convolutional neural network |
CN112488111A (en) * | 2020-12-18 | 2021-03-12 | 贵州大学 | Instruction expression understanding method based on multi-level expression guide attention network |
CN112819011A (en) * | 2021-01-28 | 2021-05-18 | 北京迈格威科技有限公司 | Method and device for identifying relationships between objects and electronic system |
CN113449770A (en) * | 2021-05-18 | 2021-09-28 | 科大讯飞股份有限公司 | Image detection method, electronic device and storage device |
CN114201621A (en) * | 2021-11-24 | 2022-03-18 | 人民网股份有限公司 | Cross-modal retrieval model construction and retrieval method based on image-text cooperative attention |
CN114359838A (en) * | 2022-01-14 | 2022-04-15 | 北京理工大学重庆创新中心 | Cross-modal pedestrian detection method based on Gaussian cross attention network |
Non-Patent Citations (3)
Title |
---|
SEN ZHANG ET AL.: "Cross-model identity correlation mining for visible-thermal person re-identification", 《MULTIMEDIA TOOLS AND APPLICATIONS》 * |
XI WEI ET AL.: "Multi-modality cross attention network for image and sentence matching", 《THE COMPUTER VISION FOUNDATION》 * |
ZHANG YUKANG ET AL.: "Cross-modal person re-identification based on joint constraints of image and feature", 《自动化学报》 (ACTA AUTOMATICA SINICA) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116740662A (en) * | 2023-08-15 | 2023-09-12 | 贵州中南锦天科技有限责任公司 | Axle recognition method and system based on laser radar |
CN116740662B (en) * | 2023-08-15 | 2023-11-21 | 贵州中南锦天科技有限责任公司 | Axle recognition method and system based on laser radar |
Also Published As
Publication number | Publication date |
---|---|
CN114663737B (en) | 2022-12-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ke et al. | Mask transfiner for high-quality instance segmentation | |
Chen et al. | Remote sensing image change detection with transformers | |
CN113902926B (en) | General image target detection method and device based on self-attention mechanism | |
CN112232164B (en) | Video classification method and device | |
CN108171663B (en) | Image filling system of convolutional neural network based on feature map nearest neighbor replacement | |
Duan et al. | Compact descriptors for visual search | |
CN112200041B (en) | Video motion recognition method and device, storage medium and electronic equipment | |
CN112801063B (en) | Neural network system and image crowd counting method based on neural network system | |
CN114663737B (en) | Object identification method and device, electronic equipment and computer readable storage medium | |
US20240087343A1 (en) | License plate classification method, license plate classification apparatus, and computer-readable storage medium | |
CN113537254B (en) | Image feature extraction method and device, electronic equipment and readable storage medium | |
CN116258850A (en) | Image semantic segmentation method, electronic device and computer readable storage medium | |
CN114140831B (en) | Human body posture estimation method and device, electronic equipment and storage medium | |
CN112001931A (en) | Image segmentation method, device, equipment and storage medium | |
CN114241499A (en) | Table picture identification method, device and equipment and readable storage medium | |
CN111639230B (en) | Similar video screening method, device, equipment and storage medium | |
EP4174789B1 (en) | Method and apparatus of processing image, and storage medium | |
CN115310611A (en) | Figure intention reasoning method and related device | |
CN113759338A (en) | Target detection method and device, electronic equipment and storage medium | |
CN114821169A (en) | Method-level non-intrusive call link tracking method under micro-service architecture | |
CN118152594A (en) | News detection method, device and equipment containing misleading information | |
CN117115584A (en) | Target detection method, device and server | |
CN116994264A (en) | Text recognition method, chip and terminal | |
CN115690795A (en) | Resume information extraction method and device, electronic equipment and storage medium | |
CN115631343A (en) | Image generation method, device and equipment based on full pulse network and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||