CN114663737A - Object identification method and device, electronic equipment and computer readable storage medium - Google Patents


Info

Publication number
CN114663737A
CN114663737A (application CN202210546400.7A)
Authority
CN
China
Prior art keywords
feature
attention
image
processed
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210546400.7A
Other languages
Chinese (zh)
Other versions
CN114663737B (en)
Inventor
李晓川 (Li Xiaochuan)
赵雅倩 (Zhao Yaqian)
李仁刚 (Li Rengang)
郭振华 (Guo Zhenhua)
范宝余 (Fan Baoyu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN202210546400.7A
Publication of CN114663737A
Application granted
Publication of CN114663737B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an object identification method, an object identification device, an electronic device and a computer-readable storage medium, and relates to the field of pattern recognition. When a target image and a candidate image that contain an object and whose modalities are uncertain are obtained, the method extracts interaction features of the images by means of self-attention feature extraction and cross-attention feature extraction to obtain an interaction feature matrix corresponding to each image, which effectively improves the pertinence of the attention mechanism when dealing with the non-deterministic cross-modal object re-identification problem. In addition, the two interaction feature matrices are merged into a probability space, and a predicted value representing the probability that the target image and the candidate image belong to the same object is generated in that space; that is, a probability prediction method is adopted to handle the non-deterministic cross-modal object re-identification problem in a probability space, so that the accuracy of non-deterministic cross-modal object re-identification can be effectively improved.

Description

Object identification method and device, electronic equipment and computer readable storage medium
Technical Field
The present invention relates to the field of pattern recognition, and in particular, to an object recognition method, an object recognition apparatus, an electronic device, and a computer-readable storage medium.
Background
The cross-modal (Cross Modal) object re-identification problem can be described simply as: determining whether the objects contained in two images of different modalities are the same object, for example, determining whether the objects contained in a visible light image and an infrared light image are the same object. Existing cross-modal object re-identification methods are based on the deterministic cross-modal assumption, that is, the object to be identified is assumed to appear in multiple modalities. In practical applications, however, the appearance of the object in the images of each modality is uncertain; for example, the object may appear in a visible light image, in an infrared light image, or in both, so it cannot be determined in advance whether the object spans multiple modalities. Consequently, existing cross-modal object re-identification methods cannot fundamentally solve the non-deterministic cross-modal object re-identification problem. From another perspective, existing cross-modal object re-identification methods generally use a feature-space distance to perform re-identification, for example, extracting the features of a visible light image and the features of an infrared light image and calculating the distance between the two feature vectors, so as to determine whether the two images belong to the same object. However, since the distance between images of the same modality is naturally shorter than the distance between images of different modalities, the non-deterministic cross-modal re-identification problem is difficult to handle within a re-identification architecture based on feature-space distance. For example, for a pedestrian A and a pedestrian B who both wear red clothes, the feature-space distance between their visible light images may be smaller than the feature-space distance between the cross-modal images of pedestrian A himself, which easily leads to re-identification errors. Obviously, existing cross-modal object re-identification methods cannot fundamentally solve the non-deterministic cross-modal object re-identification problem.
Therefore, how to effectively improve the accuracy of non-deterministic cross-modal object re-identification is a technical problem that those skilled in the art need to solve.
Disclosure of Invention
The invention aims to provide an object identification method, an object identification device, an electronic device and a computer-readable storage medium, which handle the non-deterministic cross-modal object re-identification problem in a probability space through self-attention feature extraction, cross-attention feature extraction and a probability prediction method, so that the accuracy of non-deterministic cross-modal object re-identification can be effectively improved.
In order to solve the above technical problem, the present invention provides an object identification method, including:
cutting, feature extraction and coding processing are carried out on the obtained object image to be processed, and a coding matrix corresponding to the object image to be processed is obtained; the object image to be processed comprises a target image and a candidate image, and the coding matrix comprises various features of the object image to be processed;
inputting the coding matrix into two feature interaction branches in parallel, so that the feature interaction branches extract self-attention features and cross-attention features of the coding matrix to obtain an interaction feature matrix corresponding to the object image to be processed;
inputting the interactive feature matrix into two feature extraction branches in parallel, so that the feature extraction branches extract the self-attention features of the interactive feature matrix to obtain intermediate features corresponding to the object image to be processed;
inputting an initial prediction vector generated by using the interactive feature matrix and the intermediate features into a prediction branch so that the prediction branch performs self-attention feature extraction on the initial prediction vector, and performs cross-attention feature extraction by using the obtained intermediate prediction features and the intermediate features to obtain a prediction vector;
and judging whether the target image and the candidate image belong to the same object or not by using a predicted value obtained by dimensionality reduction of the prediction vector.
Optionally, the performing, by the feature interaction branch, self-attention feature extraction and cross-attention feature extraction on the coding matrix to obtain an interaction feature matrix corresponding to the object image to be processed includes:
the feature interaction branch performs self-attention feature extraction on the coding matrix, and adds the obtained self-attention features to the coding matrix to obtain a local-end intermediate feature;
sending the local-end intermediate feature to the other feature interaction branch, and receiving the opposite-end intermediate feature sent by the other feature interaction branch;
and performing cross-attention feature extraction on the local-end intermediate feature and the opposite-end intermediate feature, and adding the obtained cross-attention features to the local-end intermediate feature to obtain the interaction feature matrix.
Optionally, the performing cross-attention feature extraction on the local-end intermediate feature and the opposite-end intermediate feature includes:
the feature interaction branch performs cross-attention feature extraction on the local-end intermediate feature and the opposite-end intermediate feature in the following manner:
F = σ( (X_loc W_q)(X_opp W_k)^T / √d ) (X_opp W_v)

wherein F represents an initial cross-attention feature, X_loc represents the local-end intermediate feature, X_opp represents the opposite-end intermediate feature, σ represents a normalization function, W_q, W_k and W_v represent pre-trained weight matrices, (·)^T represents a transpose operation, and d represents the dimension of the opposite-end intermediate feature;
and carrying out random deletion processing and normalization processing on the initial cross-attention feature to obtain the cross-attention feature.
Optionally, the feature interaction branch has a multilayer structure, and after obtaining the interaction feature matrix, the method further includes:
judging whether a next layer of feature interaction branch exists or not;
if yes, inputting the interactive feature matrix into the next layer of feature interactive branch for processing;
and if not, inputting the interactive feature matrix into the two feature extraction branches in parallel.
Optionally, the cutting, feature extraction, and encoding the obtained image of the object to be processed to obtain an encoding matrix corresponding to the image of the object to be processed includes:
cutting the object image to be processed to obtain an image block corresponding to the object image to be processed;
generating an image set by using the object image to be processed and the image blocks, and performing the feature extraction on the image set by using a neural network corresponding to the modal class of the object image to be processed to obtain a feature matrix corresponding to the object image to be processed;
and performing the coding processing on the characteristic matrix by using the modal class and the cropping characteristic information of each image block to obtain a coding matrix corresponding to the object image to be processed.
Optionally, the cutting the object image to be processed to obtain an image block corresponding to the object image to be processed includes:
cutting the object image to be processed according to a first preset cutting line number and a transverse cutting mode to obtain a first image block;
cutting the object image to be processed according to a second preset cutting line number, a preset cutting column number and a horizontal and vertical cutting mode to obtain a second image block;
setting the first image block and the second image block as the image blocks.
Optionally, the encoding processing performed on the feature matrix by using the modality category and the cropping feature information of each image block to obtain the encoding matrix corresponding to the object image to be processed includes:
acquiring a cutting mode corresponding to the image block, a relative position in the object image to be processed and a corresponding feature vector in the feature matrix;
generating a position code by using the relative position and a feature code in the feature vector, and generating a cropping code and a modality code by using the cropping mode and the modality category respectively;
and generating the coding matrix by utilizing the feature code, the position code, the cropping code and the modal code.
Optionally, the performing, by using a neural network corresponding to a modality category of the object image to be processed, the feature extraction on the image set includes:
and scaling each image block in the image set to a preset size, and inputting the image set subjected to scaling processing into the neural network for feature extraction.
Optionally, before inputting the initial prediction vector generated by using the interaction feature matrix and the intermediate feature into the prediction branch, the method further includes:
and calculating cosine similarity between the interactive feature matrixes, and generating the initial prediction vector by using the cosine similarity.
Optionally, the performing, by the prediction branch, self-attention feature extraction on the initial prediction vector, and performing cross-attention feature extraction by using the obtained intermediate prediction feature and the intermediate feature to obtain a prediction vector, includes:
the prediction branch inputs the initial prediction vector to a self-attention module for self-attention feature extraction, and adds the obtained self-attention feature and the initial prediction vector to obtain an intermediate prediction feature;
receiving a first intermediate feature sent by the first feature extraction branch and a second intermediate feature sent by the second feature extraction branch;
performing cross-attention feature extraction on the intermediate prediction feature and the first intermediate feature to obtain a first cross-attention feature, and performing cross-attention feature extraction on the intermediate prediction feature and the second intermediate feature to obtain a second cross-attention feature;
summing the first cross-attention feature and the second cross-attention feature, and averaging the summation result to obtain a fusion feature;
and inputting the fusion features into a normalization layer for processing to obtain the prediction vector.
Optionally, the feature extraction branch and the prediction branch have a multilayer structure, and after obtaining the intermediate prediction feature, the method further includes:
the prediction branch sends the intermediate prediction features to the feature extraction branch;
correspondingly, after obtaining the intermediate features corresponding to the object image to be processed, the method further includes:
the feature extraction branch performs cross-attention feature extraction by using the intermediate features and the received intermediate prediction features to obtain interlayer features;
sending the interlayer features to a next layer of feature extraction branch;
correspondingly, after obtaining the prediction vector, the method further includes:
and the prediction branch sends the prediction vector to a next layer prediction branch.
The present invention also provides an object recognition apparatus comprising:
the feature extraction module is used for cutting, feature extraction and coding of the obtained object image to be processed to obtain a coding matrix corresponding to the object image to be processed; the object image to be processed comprises a target image and a candidate image, and the coding matrix comprises various features of the object image to be processed;
the feature interaction module is used for inputting the coding matrix into two feature interaction branches in parallel so that the feature interaction branches can perform self-attention feature extraction and cross-attention feature extraction on the coding matrix to obtain an interaction feature matrix corresponding to the object image to be processed;
the feature extraction branch module is used for inputting the interactive feature matrix into two feature extraction branches in parallel so that the feature extraction branches can extract the self-attention features of the interactive feature matrix to obtain intermediate features corresponding to the object image to be processed;
the prediction branch module is used for inputting an initial prediction vector generated by using the interactive feature matrix and the intermediate features into a prediction branch so as to enable the prediction branch to extract self-attention features of the initial prediction vector and extract cross-attention features by using the obtained intermediate prediction features and the intermediate features to obtain a prediction vector;
and the judging module is used for judging whether the target image and the candidate image belong to the same object or not by using a predicted value obtained by dimensionality reduction of the prediction vector.
The present invention also provides an electronic device comprising:
a memory for storing a computer program;
a processor for implementing the object identification method as described above when executing the computer program.
The present invention also provides a computer-readable storage medium, in which computer-executable instructions are stored, and when the computer-executable instructions are loaded and executed by a processor, the object identification method as described above is implemented.
The invention provides an object identification method, which comprises the following steps: cutting, feature extraction and coding processing are carried out on the obtained object image to be processed, and a coding matrix corresponding to the object image to be processed is obtained; the object image to be processed comprises a target image and a candidate image, and the coding matrix comprises various features of the object image to be processed; inputting the coding matrix into two feature interaction branches in parallel, so that the feature interaction branches extract self-attention features and cross-attention features of the coding matrix to obtain an interaction feature matrix corresponding to the object image to be processed; inputting the interactive feature matrix into two feature extraction branches in parallel so that the feature extraction branches can extract self-attention features of the interactive feature matrix to obtain intermediate features corresponding to the to-be-processed object image; inputting an initial prediction vector generated by using the interactive feature matrix and the intermediate features into a prediction branch so that the prediction branch performs self-attention feature extraction on the initial prediction vector, and performs cross-attention feature extraction by using the obtained intermediate prediction features and the intermediate features to obtain a prediction vector; and judging whether the target image and the candidate image belong to the same object or not by using a predicted value obtained by dimensionality reduction of the prediction vector.
Therefore, when a target image and a candidate image containing an object are obtained, the invention crops, extracts features from and encodes each object image to be processed to obtain the corresponding coding matrix, which contains various kinds of feature information of the image. The coding matrices are then input into two identical feature interaction branches for self-attention feature extraction and cross-attention feature extraction to obtain the interaction feature matrix corresponding to each object image to be processed; each branch performs cross-attention feature extraction on the coding matrix received at its own end by using the self-attention features of both coding matrices, which effectively improves the pertinence of the attention mechanism when dealing with the non-deterministic cross-modal object re-identification problem. Furthermore, the invention generates an initial prediction vector from the interaction feature matrices of the two object images to be processed, inputs the interaction feature matrices into two identical feature extraction branches in parallel, and inputs the initial prediction vector into the prediction branch of a dual-span predictor, so that the feature extraction branches extract the self-attention features of the interaction feature matrices using a self-attention mechanism, the prediction branch fuses these self-attention features using a cross-attention mechanism and generates a prediction vector from the resulting fusion features, and finally the prediction vector is reduced in dimension to obtain a predicted value representing the probability that the target image and the candidate image belong to the same object. In other words, a probability prediction method is adopted to handle the non-deterministic cross-modal object re-identification problem in a probability space, so that the accuracy of non-deterministic cross-modal object re-identification can be effectively improved. The invention also provides an object recognition device, an electronic device and a computer-readable storage medium, which have the same beneficial effects.
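For orientation only, the following Python (PyTorch) sketch outlines how the five stages fit together; every name in it (encoder_q, interactor, predictor, reduce_head, the 0.5 threshold) is an illustrative assumption and not an element defined by the invention.

    import torch

    def same_object_probability(img_q, img_g, encoder_q, encoder_g, interactor,
                                init_vector_fn, predictor, reduce_head, threshold=0.5):
        # 1. Crop, feature-extract and encode each image into a coding matrix.
        enc_q = encoder_q(img_q)            # coding matrix of the target image
        enc_g = encoder_g(img_g)            # coding matrix of the candidate image
        # 2. Two parallel feature interaction branches (self-attention + cross-attention).
        inter_q, inter_g = interactor(enc_q, enc_g)
        # 3./4. Dual-span predictor: feature extraction branches + prediction branch.
        init_vec = init_vector_fn(inter_q, inter_g)
        pred_vec = predictor(init_vec, inter_q, inter_g)
        # 5. Reduce the prediction vector to a scalar and interpret it as a probability.
        prob = torch.sigmoid(reduce_head(pred_vec)).mean().item()
        return prob, prob >= threshold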
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a flowchart of an object identification method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a feature interactor according to an embodiment of the present invention;
FIG. 3 is a block diagram of a dual-span predictor according to an embodiment of the present invention;
FIG. 4 is a block diagram of another dual-span predictor provided by the embodiment of the present invention;
FIG. 5 is a schematic diagram of image cropping according to an embodiment of the present invention;
fig. 6 is a block diagram of an object recognition apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Existing cross-modal object re-identification methods are based on the deterministic cross-modal assumption, that is, the object to be identified is assumed to appear in multiple modalities, so these methods cannot fundamentally solve the non-deterministic cross-modal object re-identification problem. In view of this, the present invention provides an object identification method which handles the non-deterministic cross-modal object re-identification problem in a probability space through self-attention feature extraction, cross-attention feature extraction and a probability prediction method, so that the accuracy of non-deterministic cross-modal object re-identification can be effectively improved. Referring to fig. 1, fig. 1 is a flowchart of an object identification method according to an embodiment of the present invention, where the method includes:
s101, performing cutting, feature extraction and coding processing on the obtained object image to be processed to obtain a coding matrix corresponding to the object image to be processed; the object image to be processed comprises a target image and a candidate image, and the coding matrix comprises various characteristics of the object image to be processed.
It is understood that the target image and the candidate image should include the object to be measured, and the final purpose of the embodiment of the present invention is to determine whether the object to be measured in the candidate image is the same as the object to be measured in the target image. The present invention is not limited to specific objects to be measured, and may be, for example, pedestrians, vehicles, and the like. It should be noted that the target image and the candidate image in the embodiment of the present invention are in a non-specific modality state, that is, in brief, the corresponding acquisition manner of the target image and the candidate image is uncertain. For example, the target image and the candidate image may be acquired by a visible light image acquisition method, an infrared light image acquisition method, or a visible light image acquisition method and an infrared light image acquisition method, respectively. Of course, the collection mode is not limited to the two modes, and more modes can be adopted for collection, and the collection mode can be selected according to the actual application requirements. It should be noted that, for convenience of description, the target image and the candidate image are collectively referred to as an object image to be processed in the embodiment of the present invention.
Further, when the object images to be processed are obtained, the embodiments of the present invention perform cropping, feature extraction, and encoding on the images to obtain an encoding matrix corresponding to each object image to be processed, where the encoding matrix includes various features of the object image to be processed, such as image features, modal features, and cropping features. The following briefly introduces various processes:
the purpose of the cropping processing is to segment the object image to be processed into a plurality of image blocks so as to extract the local features of the object image to be processed by using each image block. The embodiment of the present invention does not limit a specific cutting manner as long as the object of the cutting process can be satisfied.
The feature extraction is conventional image feature extraction, that is, the object image to be processed and each of its image blocks are input into a neural network model for image feature extraction, and a feature matrix corresponding to each image is obtained. It should be noted that the embodiment of the present invention does not limit the specific neural network model used in the feature extraction process; any existing convolutional neural network model (such as ResNeSt, ResNeXt or EfficientNet) can meet the application requirements and can be selected according to actual needs. In addition, because the modality states of the object images to be processed differ, each image needs to be input into the neural network corresponding to its modality category for feature extraction; for example, a visible light image is input into the neural network corresponding to the visible light category, and an infrared light image is input into the neural network corresponding to the infrared light category. The embodiment of the present invention does not limit the specific neural network corresponding to each modality, which may be set according to the actual application requirements.
The purpose of the encoding processing is to further fuse other features of the object image to be processed, such as modal features and cropping features, into the feature matrix obtained by feature extraction. The embodiment of the invention sets a corresponding code value for each modality category and each cropping mode, and fuses these code values into the feature matrix of the object image to be processed. The embodiment of the present invention does not limit the specific code values, as long as the modality categories and cropping modes can be distinguished, and they can be set according to the actual application requirements. Of course, the code values may also be set in combination with the feature values in the feature matrix.
S102, inputting the coding matrix into the two feature interaction branches in parallel, so that the feature interaction branches extract self-attention features and cross-attention features of the coding matrix, and an interaction feature matrix corresponding to the to-be-processed object image is obtained.
For easy understanding, please refer to fig. 2, which is a block diagram of a feature interactor according to an embodiment of the present invention. In the figure, the area enclosed by the dashed box represents a cross-attention layer, which includes a left and a right feature interaction branch; each branch includes a self-attention subunit composed of a self-attention module (Self-attention), a random deletion module (Dropout), a layer normalization module (Normalization) and an addition module, and a cross-attention subunit composed of a cross-attention module (Cross-attention), a random deletion module, a layer normalization module and an addition module. For convenience of description, the feature interaction branch on the Q coding feature side is hereinafter referred to as feature interaction branch 1, and the feature interaction branch on the G coding feature side as feature interaction branch 2. Taking a single cross-attention layer as an example, after receiving the Q coding features, feature interaction branch 1 first performs self-attention feature extraction on the matrix using the self-attention module, the random deletion module and the layer normalization module, and inputs the obtained self-attention features together with the coding matrix into the addition module for addition, so as to obtain intermediate feature 1. Feature interaction branch 2 performs the same intermediate-feature generation process to obtain intermediate feature 2. Subsequently, feature interaction branch 1 acquires intermediate feature 2 from feature interaction branch 2, inputs intermediate feature 1 and intermediate feature 2 together into the cross-attention module, the random deletion module and the layer normalization module to extract the cross-attention features of intermediate feature 1, and inputs the obtained cross-attention features together with intermediate feature 1 into the addition module for addition, obtaining the Q interaction features corresponding to the target image. Feature interaction branch 2 performs the same interaction-feature generation process to obtain the G interaction features corresponding to the candidate image.
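As a concrete illustration, a single cross-attention layer with its two symmetric feature interaction branches could be sketched in Python (PyTorch) as below; torch.nn.MultiheadAttention is used here only as a stand-in for the attention formulas given later, and the head count, dropout rate and all module names are assumptions.

    import torch
    import torch.nn as nn

    class InteractionBranch(nn.Module):
        # one feature interaction branch: self-attention subunit + cross-attention subunit
        def __init__(self, d, n_heads=4, p_drop=0.1):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d, n_heads, dropout=p_drop, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(d, n_heads, dropout=p_drop, batch_first=True)
            self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

        def intermediate(self, x):
            a, _ = self.self_attn(x, x, x)          # self-attention (with dropout) and normalization
            return x + self.norm1(a)                # addition -> local-end intermediate feature

        def interact(self, h_local, h_peer):
            a, _ = self.cross_attn(h_local, h_peer, h_peer)   # cross-attention with the peer branch
            return h_local + self.norm2(a)          # addition -> interaction feature matrix

    class CrossAttentionLayer(nn.Module):
        # two branches processing the Q and G coding matrices in parallel and exchanging features
        def __init__(self, d):
            super().__init__()
            self.branch_q, self.branch_g = InteractionBranch(d), InteractionBranch(d)

        def forward(self, enc_q, enc_g):
            h_q, h_g = self.branch_q.intermediate(enc_q), self.branch_g.intermediate(enc_g)
            return self.branch_q.interact(h_q, h_g), self.branch_g.interact(h_g, h_q)

    layer = CrossAttentionLayer(64)
    q_feat, g_feat = layer(torch.randn(1, 10, 64), torch.randn(1, 12, 64))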
In a possible case, the feature interaction branch performs self-attention feature extraction and cross-attention feature extraction on the encoding matrix to obtain an interaction feature matrix corresponding to the object image to be processed, and may include:
step 11: the feature interaction branch performs self-attention feature extraction on the coding matrix, and adds the obtained self-attention features to the coding matrix to obtain a local-end intermediate feature;
step 12: sending the local-end intermediate feature to the other feature interaction branch, and receiving the opposite-end intermediate feature sent by the other feature interaction branch;
step 13: performing cross-attention feature extraction on the local-end intermediate feature and the opposite-end intermediate feature, and adding the obtained cross-attention features to the local-end intermediate feature to obtain the interaction feature matrix.
It can be understood that, for feature interaction branch 1 described above, the local-end intermediate feature is intermediate feature 1 and the opposite-end intermediate feature is intermediate feature 2; for feature interaction branch 2, the correspondence is reversed.
Further, for the extraction of the self-attention feature, the existing self-attention mechanism can be adopted for extraction:
F_self = σ( (X W_q)(X W_k)^T / √d ) (X W_v)

where F_self denotes the self-attention feature, X denotes the matrix input to the self-attention module, σ denotes the normalization function, W_q, W_k and W_v denote pre-trained weight matrices, (·)^T denotes the transpose operation, and d denotes the dimension of the matrix input to the self-attention module.
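Assuming the normalization function σ is softmax, the formula above is the familiar scaled dot-product attention and can be written directly; the tensor shapes and random weights below are purely illustrative.

    import math
    import torch

    def self_attention(x, w_q, w_k, w_v):
        # F_self = softmax((X W_q)(X W_k)^T / sqrt(d)) (X W_v); x: (n, d), w_*: (d, d)
        d = x.shape[-1]
        scores = (x @ w_q) @ (x @ w_k).transpose(-2, -1) / math.sqrt(d)
        return torch.softmax(scores, dim=-1) @ (x @ w_v)

    x = torch.randn(10, 64)                          # ten encoded image blocks, 64-dim each
    w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
    out = self_attention(x, w_q, w_k, w_v)           # same shape as x: (10, 64)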
For the cross-attention feature extraction, the embodiment of the present invention may perform the following processing:
in one possible case, the performing the cross-attention feature extraction on the local-end intermediate feature and the opposite-end intermediate feature may include:
step 21: the feature interaction branch performs cross-attention feature extraction on the local-end intermediate feature and the opposite-end intermediate feature in the following manner:
F_cross = σ( (X_loc W_q)(X_opp W_k)^T / √d ) (X_opp W_v)

where F_cross denotes the initial cross-attention feature, X_loc denotes the local-end intermediate feature, X_opp denotes the opposite-end intermediate feature, σ denotes the normalization function, W_q, W_k and W_v denote pre-trained weight matrices, (·)^T denotes the transpose operation, and d denotes the dimension of the opposite-end intermediate feature;
step 22: and carrying out random deletion processing and normalization processing on the initial cross-attention features to obtain the cross-attention features.
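Following steps 21 and 22, the cross-attention subunit of a feature interaction branch could look as follows; the softmax normalization, the dropout rate and the learned linear layers standing in for the pre-trained weight matrices are assumptions, while the order (cross-attention, random deletion, normalization, addition) follows the description.

    import math
    import torch
    import torch.nn as nn

    class CrossAttentionSubunit(nn.Module):
        def __init__(self, d, p_drop=0.1):
            super().__init__()
            self.w_q = nn.Linear(d, d, bias=False)   # stands in for W_q
            self.w_k = nn.Linear(d, d, bias=False)   # stands in for W_k
            self.w_v = nn.Linear(d, d, bias=False)   # stands in for W_v
            self.drop = nn.Dropout(p_drop)           # random deletion module
            self.norm = nn.LayerNorm(d)              # layer normalization module

        def forward(self, h_local, h_peer):
            d = h_peer.shape[-1]                     # dimension of the opposite-end intermediate feature
            scores = self.w_q(h_local) @ self.w_k(h_peer).transpose(-2, -1) / math.sqrt(d)
            cross = torch.softmax(scores, dim=-1) @ self.w_v(h_peer)   # initial cross-attention feature
            cross = self.norm(self.drop(cross))      # random deletion + normalization
            return h_local + cross                   # addition module -> interaction feature matrix

    sub = CrossAttentionSubunit(64)
    out = sub(torch.randn(10, 64), torch.randn(12, 64))   # local-end (10, 64), opposite-end (12, 64)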
It should be noted that, the embodiments of the present invention do not limit the specific structures and uses of the random deletion module, the layer normalization module, and the addition module, and these modules are all common modules in the field of neural networks, and refer to related technologies of neural networks.
Further, it should be noted that, in the embodiment of the present invention, the feature interaction branch (i.e., the cross-attention layer) may have a single-layer structure, or may have a multilayer structure as shown in fig. 2; setting multilayer feature interaction branches effectively increases the number of feature extractions, thereby improving the recognition performance of the re-identification system. The embodiment of the present invention does not limit the specific number of layers of the feature interaction branches, which may be set according to the actual application requirements.
In one possible case, the feature interaction branch has a multi-layer structure, and after obtaining the interaction feature matrix, the method further includes:
step 31: judging whether a next layer of feature interaction branch exists or not; if yes, go to step 32; if not, go to step 33;
step 32: inputting the interactive feature matrix into a next layer of feature interactive branch for processing;
step 33: and entering a step of inputting the interactive feature matrix into the two feature extraction branches in parallel.
S103, inputting the interaction feature matrices into the two feature extraction branches in parallel, so that the feature extraction branches perform self-attention feature extraction on the interaction feature matrices to obtain the intermediate features corresponding to the object images to be processed.
As mentioned before, for the non-deterministic cross-modal re-identification problem it is not reasonable to perform distance calculation on the features in a feature space, since the feature distance between images of the same modality is obviously shorter than the feature distance between images of different modalities, which seriously affects the processing of the non-deterministic cross-modal re-identification problem. Therefore, the embodiment of the invention abandons the classical feature metric learning method used in the re-identification field and instead uses a probability prediction method to handle the non-deterministic cross-modal re-identification problem in a probability space. The probability space is independent of the modalities, so the differences between modalities need not be considered, and a better re-identification effect can be achieved compared with existing schemes.
For easy understanding, please refer to fig. 3, which is a block diagram of a dual-span predictor according to an embodiment of the present invention, where the area enclosed by the dashed box represents a single dual-span prediction layer, the branches on the left and right are the two feature extraction branches, and the middle branch is the prediction branch. In the embodiment of the invention, each feature extraction branch includes a self-attention module, a layer normalization module and an addition module, and is used for performing self-attention feature extraction on the received interaction feature matrix, adding the obtained self-attention features to the interaction features to obtain an intermediate feature, and finally sending the intermediate feature to the cross-attention layer of the prediction branch. For the process of self-attention feature extraction performed by the feature extraction branch, reference may be made to the foregoing description, which is not repeated here.
And S104, inputting the initial prediction vector generated from the interaction feature matrices, together with the intermediate features, into the prediction branch, so that the prediction branch performs self-attention feature extraction on the initial prediction vector and performs cross-attention feature extraction using the obtained intermediate prediction feature and the intermediate features to obtain the prediction vector.
Referring also to fig. 3, it can be seen that the prediction branch consists of a self-attention unit composed of a self-attention module, a layer normalization module and an addition module, and a dual-cross-attention unit composed of cross-attention modules, an attention-averaging module, a layer normalization module and an addition module. The input of the self-attention unit is an initial prediction vector whose size is determined by d, the feature dimension output by the convolutional neural network model for a single image in the feature extraction processing. It should be noted that the embodiment of the present invention does not limit the way the initial prediction vector is generated; for example, a zero vector of that size may be set as the initial prediction vector, or the cosine similarity between the two interaction feature matrices may be calculated and replicated to that size to obtain the initial prediction vector. In the embodiment of the invention, in order to improve calculation efficiency, the initial prediction vector is generated using the cosine similarity between the interaction feature matrices.
In one possible case, before inputting the initial prediction vector generated by using the interactive feature matrix and the intermediate features into the prediction branch, the method may further include:
step 41: and calculating cosine similarity between the interactive feature matrixes, and generating an initial prediction vector by using the cosine similarity.
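A minimal sketch of step 41; flattening the matrices before the cosine similarity and replicating the scalar d times are assumptions about details the text leaves to the original formulas.

    import torch
    import torch.nn.functional as F

    def initial_prediction_vector(inter_q, inter_g, d):
        # cosine similarity between the two interaction feature matrices, replicated to a d-dim vector
        sim = F.cosine_similarity(inter_q.flatten(), inter_g.flatten(), dim=0)
        return sim * torch.ones(d)          # a zero vector of size d would also be a valid seed

    p0 = initial_prediction_vector(torch.randn(10, 64), torch.randn(10, 64), d=64)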
Further, after the prediction branch acquires the initial prediction vector, it performs self-attention feature extraction on the vector and adds the obtained self-attention features to the initial prediction vector to obtain the intermediate prediction feature. Subsequently, the branch obtains the intermediate features generated by the two feature extraction branches and uses each of them to perform cross-attention feature extraction on the intermediate prediction feature, obtaining two cross-attention features. The embodiment of the invention performs prediction by fusing these two cross-attention features; specifically, the two cross-attention features are summed and averaged to obtain a fusion feature, and the fusion feature is normalized to obtain the prediction vector.
In a possible case, the performing, by the prediction branch, self-attention feature extraction on the initial prediction vector and performing cross-attention feature extraction by using the obtained intermediate prediction features and the intermediate features to obtain the prediction vector may include:
step 51: the prediction branch inputs the initial prediction vector to a self-attention module for self-attention feature extraction, and adds the obtained self-attention feature and the initial prediction vector to obtain an intermediate prediction feature;
step 52: receiving a first intermediate feature sent by the first feature extraction branch and a second intermediate feature sent by the second feature extraction branch;
step 53: performing cross-attention feature extraction on the intermediate prediction feature and the first intermediate feature to obtain a first cross-attention feature, and performing cross-attention feature extraction on the intermediate prediction feature and the second intermediate feature to obtain a second cross-attention feature;
step 54: and summing the first cross-attention feature and the second cross-attention feature, and averaging the summation result to obtain a fusion feature.
The fusion process for the cross-attention feature can be expressed as:
p = ( CrossAttn(p, h_1) + CrossAttn(p, h_2) ) / 2

where CrossAttn(p, h_1) denotes the first cross-attention feature, CrossAttn(p, h_2) denotes the second cross-attention feature, h_1 and h_2 denote the intermediate features of the two feature extraction branches, the p on the left of the equation denotes the fusion feature, and the p appearing on the right of the equation denotes the intermediate prediction feature.
Step 55: and inputting the fusion features into a normalization layer for processing to obtain a prediction vector.
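Steps 51 to 55 can be sketched as one prediction-branch layer as follows; torch.nn.MultiheadAttention again replaces the explicit attention formulas, and the head count and dimensions are assumptions.

    import torch
    import torch.nn as nn

    class PredictionBranchLayer(nn.Module):
        def __init__(self, d, n_heads=4):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
            self.cross_q = nn.MultiheadAttention(d, n_heads, batch_first=True)
            self.cross_g = nn.MultiheadAttention(d, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d)

        def forward(self, p, h_q, h_g):
            a, _ = self.self_attn(p, p, p)           # step 51: self-attention on the prediction vector
            p_mid = p + a                            # ... plus addition -> intermediate prediction feature
            a_q, _ = self.cross_q(p_mid, h_q, h_q)   # step 53: first cross-attention feature
            a_g, _ = self.cross_g(p_mid, h_g, h_g)   #          second cross-attention feature
            fused = (a_q + a_g) / 2                  # step 54: sum and average -> fusion feature
            return self.norm(fused)                  # step 55: normalization layer -> prediction vector

    layer = PredictionBranchLayer(64)
    pred = layer(torch.zeros(1, 1, 64), torch.randn(1, 10, 64), torch.randn(1, 10, 64))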
Of course, the dual-span predictor may also have a multilayer structure. Referring to fig. 4, fig. 4 is a block diagram of another dual-span predictor according to an embodiment of the present invention. It can be seen that, in the multilayer dual-span predictor, a cross-attention module, a layer normalization module and an addition module are additionally added to each feature extraction branch for cross-attention feature extraction, and the prediction branch additionally sends the intermediate prediction feature to the feature extraction branches so that they can perform cross-attention feature extraction. Because the workflow of the feature extraction branch is similar to that of the feature interaction branch, reference may be made to the related description of the feature interaction branch for the workflow of a feature extraction branch with a multilayer structure.
In a possible case, the feature extraction branch and the prediction branch have a multi-layer structure, and after obtaining the intermediate prediction feature, the method may further include:
step 61: the prediction branch sends the intermediate prediction features to the feature extraction branch;
correspondingly, after obtaining the intermediate features corresponding to the object image to be processed, the method further includes:
step 62: the feature extraction branch performs cross-attention feature extraction using the intermediate feature and the received intermediate prediction feature to obtain an interlayer feature;
and step 63: sending the interlayer characteristics to a next layer of characteristic extraction branch;
correspondingly, after obtaining the prediction vector, the method further comprises:
step 64: and the prediction branch sends the prediction vector to the next layer of prediction branch.
And S105, judging whether the target image and the candidate image belong to the same object or not by using a predicted value obtained by dimensionality reduction of the prediction vector.
It should be noted that, the embodiment of the present invention does not limit the specific dimension reduction manner of the prediction vector, and may refer to the related technology of the neural network. It can be understood that, after the predicted value is obtained, a preset threshold value can be used for comparing with the predicted value, and whether the target image and the candidate image belong to the same object or not is determined according to the comparison result.
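For example, assuming a learned linear layer followed by a sigmoid as the dimension-reduction head and 0.5 as the preset threshold (both are illustrative choices; the patent leaves them open):

    import torch
    import torch.nn as nn

    reduce_head = nn.Linear(64, 1)                              # reduces a 64-dim prediction vector to one value
    prob = torch.sigmoid(reduce_head(torch.randn(64))).item()   # predicted probability of "same object"
    same_object = prob >= 0.5                                   # compare with the preset threshold
    print(prob, same_object)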
Based on the above embodiment, when a target image and a candidate image containing an object are obtained, the object images to be processed are cropped, feature-extracted and encoded to obtain the corresponding coding matrices, each of which contains various kinds of feature information of the corresponding image. The coding matrices are then input into two identical feature interaction branches for self-attention feature extraction and cross-attention feature extraction to obtain the interaction feature matrices corresponding to the object images to be processed; each branch performs cross-attention feature extraction on the coding matrix received at its own end by using the self-attention features of both coding matrices, which effectively improves the pertinence of the attention mechanism when dealing with the non-deterministic cross-modal object re-identification problem. Furthermore, an initial prediction vector is generated from the interaction feature matrices of the two object images to be processed, the interaction feature matrices are input into two identical feature extraction branches in parallel, and the initial prediction vector is input into the prediction branch of the dual-span predictor, so that the feature extraction branches extract the self-attention features of the interaction feature matrices using a self-attention mechanism, the prediction branch fuses these self-attention features using a cross-attention mechanism and generates a prediction vector from the resulting fusion features, and finally the prediction vector is reduced in dimension to obtain a predicted value representing the probability that the target image and the candidate image belong to the same object. In other words, a probability prediction method is adopted to handle the non-deterministic cross-modal object re-identification problem in a probability space, so that the accuracy of non-deterministic cross-modal object re-identification can be effectively improved.
Based on the above-described embodiment, the following describes in detail the processes of cropping, feature extraction, and encoding of an image of an object to be processed. In a possible case, the cutting, feature extraction, and encoding processing are performed on the obtained object image to be processed to obtain an encoding matrix corresponding to the object image to be processed, and the method may include:
s201, cutting the object image to be processed to obtain an image block corresponding to the object image to be processed.
It should be noted that, the embodiment of the present invention does not limit a specific cropping manner, for example, the object image to be processed may be transversely cropped to obtain an image block with a preset number of cropping lines; or performing horizontal and vertical cutting on the object image to be processed according to the preset number of rows and the preset number of columns to obtain an image block. Of course, the object image to be processed may also be cropped in a plurality of cropping manners, so as to extract the local features of the object image to be processed by using the image blocks with different sizes.
In a possible case, the cutting the object image to be processed to obtain an image block corresponding to the object image to be processed may include:
step 71: cutting the object image to be processed according to the first preset cutting line number and the transverse cutting mode to obtain a first image block;
step 72: cutting the object image to be processed according to a second preset cutting line number, a preset cutting column number and a horizontal-vertical cutting mode to obtain a second image block;
step 73: the first image block and the second image block are set as image blocks.
It should be noted that the embodiment of the present invention does not limit the specific values of the first preset cutting line number, the second preset cutting line number and the preset cutting column number, which may be set according to the actual application requirements. In addition, several groups of second preset cutting line numbers and preset cutting column numbers may be set for horizontal and vertical cutting of the object image to be processed, so as to obtain second image blocks of different sizes. Referring to fig. 5, fig. 5 is a schematic diagram of image cropping according to an embodiment of the present invention, in which the leftmost image is the original image, the middle image is the cropped image, 1-8 mark the image blocks, the transverse cropping shown on the right yields the first image blocks, and cropping A and cropping B are second image blocks obtained with different preset numbers of cutting rows and columns. Of course, it should be noted that the image blocks marked 1-8 are only used to illustrate that the size of the image blocks can be adjusted at will and are not intended to limit the cropping mode; the specific cropping mode can be set according to the actual application requirements. After the above cropping, and adding the original image itself, a total of

N = 1 + m + Σ_k ( r_k × c_k )

image blocks is obtained, where m denotes the number of transverse cuts, and r_k and c_k respectively denote the preset number of rows and the preset number of columns corresponding to the k-th horizontal and vertical cutting mode.
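A simple sketch of the two cropping modes; the tensor layout, the number of strips m = 4 and the grid sizes are illustrative assumptions.

    import torch

    def crop_blocks(img, m=4, grids=((2, 2), (3, 3))):
        # img: (C, H, W); returns the original image, m transverse strips (first image blocks)
        # and rows*cols horizontal-and-vertical blocks per grid (second image blocks)
        c, h, w = img.shape
        blocks = [img]
        blocks += [img[:, i * h // m:(i + 1) * h // m, :] for i in range(m)]
        for rows, cols in grids:
            for i in range(rows):
                for j in range(cols):
                    blocks.append(img[:, i * h // rows:(i + 1) * h // rows,
                                      j * w // cols:(j + 1) * w // cols])
        return blocks                                # 1 + m + sum(rows*cols) blocks in total

    blocks = crop_blocks(torch.randn(3, 256, 128))
    print(len(blocks))                               # 1 + 4 + 4 + 9 = 18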
S202, generating an image set by using the object image to be processed and the image block, and performing feature extraction on the image set by using a neural network corresponding to the modality category of the object image to be processed to obtain a feature matrix corresponding to the object image to be processed.
It can be understood that the image set contains all N image blocks (the original image plus the image blocks obtained by cropping). After feature extraction, the neural network generates a feature matrix of size N × d, where d is the output feature dimension of the neural network for a single image block. It should be noted that, due to the limitations of the neural network, before feature extraction each image block in the image set needs to be scaled to a uniform preset size. The embodiment of the present invention does not limit the specific preset size, which can be set according to the actual application requirements.
In one possible case, the feature extraction of the image set by using the neural network corresponding to the modality category of the object image to be processed may include:
step 81: and zooming each image block in the image set to a preset size, and inputting the image set subjected to zooming into a neural network for feature extraction.
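A sketch of step 81; the 224 x 224 target size, bilinear interpolation and the toy backbone standing in for a modality-specific convolutional network (e.g. ResNeSt, ResNeXt or EfficientNet) are all assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def extract_features(blocks, backbone, size=(224, 224)):
        # scale every image block to a common preset size, then run the modality-specific backbone
        resized = [F.interpolate(b.unsqueeze(0), size=size, mode='bilinear', align_corners=False)
                   for b in blocks]
        return backbone(torch.cat(resized, dim=0))       # (N, d) feature matrix

    toy_backbone = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.AdaptiveAvgPool2d(1),
                                 nn.Flatten(), nn.Linear(8, 64))
    feats = extract_features([torch.randn(3, 64, 32) for _ in range(5)], toy_backbone)   # (5, 64)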
And S203, coding the characteristic matrix by using the mode type and the cropping characteristic information of each image block to obtain a coding matrix corresponding to the object image to be processed.
The coding matrix has the same size as the feature matrix and can be generated in the following way:

e_i = f_i + p_i + c_i + m_i

where e_i denotes the coding sequence corresponding to the i-th image block, f_i denotes the feature vector corresponding to that image block in the feature matrix, p_i denotes the position code of the image block, c_i denotes the cropping code of the image block, and m_i denotes the modal code of the image block. The generation of each type of code is as follows:
in a possible case, the encoding processing of the feature matrix by using the modality category and the cropping feature information of each image block to obtain an encoding matrix corresponding to the object image to be processed may include:
step 91: and acquiring a cutting mode corresponding to the image block, a relative position in the object image to be processed and a corresponding feature vector in the feature matrix.
And step 92: and generating a position code by using the relative position and the feature codes in the feature vectors, and generating a cropping code and a mode code by using a cropping mode and a mode category respectively.
Specifically, the position code can be generated as follows:

P_i = P_i^ver + P_i^hor

where P_i^ver and P_i^hor respectively correspond to the codes of the image block in the vertical and horizontal spatial directions, both computed with the sinusoidal formula:

PE(pos, c) = sin(pos / 10000^(c/d)),       c even
PE(pos, c) = cos(pos / 10000^((c-1)/d)),   c odd

where the first line is the coding formula used when the index value c is even and the second line is the coding formula used when the index value c is odd, d is the output feature dimension of the encoder, the index values range over the natural numbers N, pos denotes the position index of the image block in the horizontal/vertical direction, and c denotes the c-th feature code within the feature vector of the image block.
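The formulas above correspond to the standard sinusoidal positional encoding; the following Python/NumPy sketch evaluates it for one image block. Combining the vertical and horizontal codes by addition, and the function names used here, are assumptions made for illustration.

import numpy as np

def sinusoidal_encoding(pos, d):
    """PE(pos, c) = sin(pos / 10000^(c/d)) for even c,
    PE(pos, c) = cos(pos / 10000^((c-1)/d)) for odd c."""
    c = np.arange(d)
    angles = pos / np.power(10000.0, (c - c % 2) / d)
    return np.where(c % 2 == 0, np.sin(angles), np.cos(angles))

def position_code(row_idx, col_idx, d):
    # vertical and horizontal codes combined by addition (assumption)
    return sinusoidal_encoding(row_idx, d) + sinusoidal_encoding(col_idx, d)

p = position_code(row_idx=2, col_idx=3, d=256)  # position code of one image block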
Further, the cropping code may be generated as follows:

C_i = 0 (no cropping), 1 (horizontal cropping), 2 (horizontal-and-vertical cropping)

where no cropping, horizontal cropping and horizontal-and-vertical cropping are the three cropping modes, and 0, 1 and 2 are the code values corresponding to these three modes respectively. It should be noted that the code value corresponding to each cropping mode can be set according to the actual application requirements, as long as the cropping modes can be distinguished from one another.
Further, the modal code may be generated as follows:

M_i = 0 if image block I comes from a visible-light image, and M_i = 1 if image block I comes from an infrared-light image

where I denotes an image block, and 0 and 1 are the code values of the visible-light modality and the infrared-light modality respectively. It should be noted that the modal code can likewise be set according to the actual application requirements, as long as the different modalities can be distinguished.
Step 93: and generating a coding matrix by using the characteristic coding, the position coding, the cropping coding and the modal coding.
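As an illustration of steps 91 to 93, the following Python/NumPy sketch assembles the coding matrix by adding the feature code, position code, cropping code and modal code of each image block. Expanding the scalar cropping and modal code values across the feature dimension is an assumption of this sketch; the embodiment does not prescribe their exact form.

import numpy as np

NO_CROP, HORIZONTAL, HOR_AND_VER = 0, 1, 2  # cropping-code values from the text above
VISIBLE, INFRARED = 0, 1                    # modal-code values from the text above

def encode_block(feature_vec, pos_code, crop_mode, modality):
    """Coding sequence of one image block: feature code + position code
    + cropping code + modal code, all of the same length d."""
    d = feature_vec.shape[0]
    crop = np.full(d, float(crop_mode))   # scalar code expanded to length d (assumption)
    modal = np.full(d, float(modality))
    return feature_vec + pos_code + crop + modal

def encode_matrix(feature_matrix, pos_codes, crop_modes, modalities):
    """Stack the per-block coding sequences; the result has the same size
    as the feature matrix."""
    return np.stack([encode_block(f, p, c, m) for f, p, c, m
                     in zip(feature_matrix, pos_codes, crop_modes, modalities)])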
Based on the above embodiment, the present invention can generate, for the object image to be processed, a coding matrix that contains global features, local features, cropping features and modal features through cropping, feature extraction and encoding, so that multiple kinds of features are fused and the accuracy of non-determined cross-modal object re-identification is further improved.
The following describes the object recognition apparatus, the electronic device, and the computer-readable storage medium provided by the embodiments of the present invention; the apparatus, device, and storage medium described below may be cross-referenced with the object recognition method described above.
Referring to fig. 6, fig. 6 is a block diagram of an object recognition apparatus according to an embodiment of the present invention, where the apparatus may include:
the feature extraction module 601 is configured to perform cutting, feature extraction, and encoding on the obtained object image to be processed to obtain an encoding matrix corresponding to the object image to be processed; the object image to be processed comprises a target image and a candidate image, and the coding matrix comprises various characteristics of the object image to be processed;
the feature interaction module 602 is configured to input the coding matrix to the two feature interaction branches in parallel, so that the feature interaction branches perform self-attention feature extraction and cross-attention feature extraction on the coding matrix to obtain an interaction feature matrix corresponding to the object image to be processed;
the feature extraction branch module 603 is configured to input the interaction feature matrix to the two feature extraction branches in parallel, so that the feature extraction branch performs self-attention feature extraction on the interaction feature matrix to obtain an intermediate feature corresponding to the object image to be processed;
a prediction branch module 604, configured to input, to a prediction branch, an initial prediction vector generated by using the interaction feature matrix together with the intermediate features, so that the prediction branch performs self-attention feature extraction on the initial prediction vector, and performs cross-attention feature extraction by using the obtained intermediate prediction feature and the intermediate features to obtain a prediction vector;
and a determining module 605, configured to determine whether the target image and the candidate image belong to the same object by using a predicted value obtained by dimensionality reduction of the prediction vector.
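As an illustration of the determining module 605, the following Python/NumPy sketch reduces the prediction vector to a scalar predicted value and compares it with a decision threshold. The linear-plus-sigmoid reduction, the weight names w and b, and the 0.5 threshold are illustrative assumptions; the embodiment only requires that the prediction vector be reduced to a predicted value.

import numpy as np

def same_object(pred_vector, w, b, threshold=0.5):
    """Reduce the prediction vector to a scalar predicted value and decide
    whether the target image and the candidate image show the same object."""
    score = 1.0 / (1.0 + np.exp(-(pred_vector @ w + b)))  # dimensionality reduction
    return bool(score >= threshold), float(score)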
Optionally, the feature interaction module 602 may include:
the feature interaction branch, configured to perform self-attention feature extraction on the coding matrix, and add the obtained self-attention feature to the coding matrix to obtain a local-end intermediate feature; send the local-end intermediate feature to the other feature interaction branch, and receive the opposite-end intermediate feature sent by the other feature interaction branch; and perform cross-attention feature extraction on the local-end intermediate feature and the opposite-end intermediate feature, and add the obtained cross-attention feature to the local-end intermediate feature to obtain the interaction feature matrix.
Optionally, the feature interaction branch is specifically configured to perform cross-attention feature extraction on the local-end intermediate feature and the opposite-end intermediate feature in the following manner:
A0 = softmax( (X_loc · W_Q) · (X_opp · W_K)^T / sqrt(d_opp) ) · (X_opp · W_V)

where A0 denotes the initial cross-attention feature, X_loc denotes the local-end intermediate feature, X_opp denotes the opposite-end intermediate feature, softmax denotes the normalization function, W_Q, W_K and W_V denote pre-trained weight matrices, the superscript T denotes the transpose operation, and d_opp denotes the dimension of the opposite-end intermediate feature; the initial cross-attention feature is then subjected to random deletion processing and normalization processing to obtain the cross-attention feature.
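The cross-attention computation described above can be sketched in Python/NumPy as follows. The weight matrices are random stand-ins for the pre-trained W_Q, W_K and W_V, random deletion is realised as dropout-style masking, and the final normalization is written as a layer-norm style operation; these choices are assumptions for illustration only.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(local, peer, w_q, w_k, w_v, drop_p=0.1, rng=None):
    """Queries from the local-end intermediate feature, keys/values from the
    opposite-end intermediate feature, scaled by the opposite-end dimension,
    followed by random deletion and normalization (sketch only)."""
    q, k, v = local @ w_q, peer @ w_k, peer @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v   # initial cross-attention feature
    if rng is not None:                                   # random deletion processing
        attn *= (rng.random(attn.shape) >= drop_p) / (1.0 - drop_p)
    mu, sigma = attn.mean(-1, keepdims=True), attn.std(-1, keepdims=True)
    return (attn - mu) / (sigma + 1e-6)                   # normalization processing

d = 256
rng = np.random.default_rng(0)
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
cross_feat = cross_attention(rng.standard_normal((18, d)),
                             rng.standard_normal((18, d)), w_q, w_k, w_v, rng=rng)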
Optionally, the feature interaction branch has a multi-layer structure, and the feature interaction module 602 may further include:
the judging submodule is used for judging whether a next layer of characteristic interaction branch exists or not; if yes, inputting the interactive feature matrix into a next layer of feature interactive branch for processing; and if not, inputting the interactive feature matrix into the two feature extraction branches in parallel.
Optionally, the feature extraction module 601 may include:
the cutting submodule is used for cutting the object image to be processed to obtain an image block corresponding to the object image to be processed;
the characteristic extraction submodule is used for generating an image set by using the object image to be processed and the image block, and extracting the characteristics of the image set by using a neural network corresponding to the modal class of the object image to be processed to obtain a characteristic matrix corresponding to the object image to be processed;
and the coding submodule is used for coding the characteristic matrix by using the mode type and the cutting characteristic information of each image block to obtain a coding matrix corresponding to the object image to be processed.
Optionally, the trimming sub-module may include:
the first cutting unit is used for cutting the object image to be processed according to the first preset cutting line number and the transverse cutting mode to obtain a first image block;
the second cutting unit is used for cutting the object image to be processed according to a second preset cutting line number, a preset cutting column number and a horizontal and vertical cutting mode to obtain a second image block;
a setting unit for setting the first image block and the second image block as image blocks.
Optionally, the encoding submodule may include:
the acquisition unit is used for acquiring the cutting mode corresponding to the image block, the relative position in the object image to be processed and the corresponding characteristic vector in the characteristic matrix;
the code generation unit is used for generating a position code by using the relative position and the feature codes in the feature vectors, and generating a cutting code and a mode code by using a cutting mode and a mode category respectively;
and the coding matrix generating unit is used for generating a coding matrix by utilizing the characteristic coding, the position coding, the cropping coding and the mode coding.
Optionally, the feature extraction sub-module is specifically configured to scale each image block in the image set to a preset size, and input the image set subjected to scaling processing to the neural network for feature extraction.
Optionally, the apparatus may further include:
and the initial prediction vector generation module is used for calculating cosine similarity between the interactive feature matrixes and generating an initial prediction vector by utilizing the cosine similarity.
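One possible reading of the initial prediction vector generation is a row-wise cosine similarity between the two interaction feature matrices, as in the following Python/NumPy sketch; the exact construction is not fixed by this embodiment, so this is an assumption.

import numpy as np

def initial_prediction_vector(feat_a, feat_b, eps=1e-8):
    """Row-wise cosine similarity between the two interaction feature matrices,
    used here as the initial prediction vector (one possible construction)."""
    a = feat_a / (np.linalg.norm(feat_a, axis=-1, keepdims=True) + eps)
    b = feat_b / (np.linalg.norm(feat_b, axis=-1, keepdims=True) + eps)
    return (a * b).sum(axis=-1)  # one similarity value per image block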
Optionally, the predicting branch module 604 may include:
the prediction branch is used for inputting the initial prediction vector to the self-attention module to extract self-attention characteristics, and adding the obtained self-attention characteristics and the initial prediction vector to obtain intermediate prediction characteristics; receiving a first intermediate characteristic sent by the first characteristic extraction branch and a second intermediate characteristic sent by the second characteristic extraction branch; performing cross-attention feature extraction on the intermediate prediction feature and the first intermediate feature to obtain a first cross-attention feature, and performing cross-attention feature extraction on the intermediate prediction feature and the second intermediate feature to obtain a second cross-attention feature; summing the first cross-attention feature and the second cross-attention feature, and averaging the summation result to obtain a fusion feature; and inputting the fusion features into a normalization layer for processing to obtain a prediction vector.
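The data flow of the prediction branch can be sketched as follows, reusing the cross_attention helper defined after the feature interaction branch above. Realising self-attention by applying cross-attention of the prediction feature with itself, and sharing one set of weight matrices across all attention steps, are simplifying assumptions of this sketch.

import numpy as np

def prediction_branch_step(pred, feat_1, feat_2, weights, rng=None):
    """One prediction-branch step: self-attention on the initial prediction vector
    (with a residual connection), cross-attention with both intermediate features,
    sum-and-average fusion, then a normalization layer (sketch of the data flow)."""
    # self-attention realised as cross-attention of pred with itself (simplification)
    inter_pred = pred + cross_attention(pred, pred, *weights, rng=rng)
    ca_1 = cross_attention(inter_pred, feat_1, *weights, rng=rng)  # first cross-attention feature
    ca_2 = cross_attention(inter_pred, feat_2, *weights, rng=rng)  # second cross-attention feature
    fused = (ca_1 + ca_2) / 2.0                                    # sum and average -> fusion feature
    mu, sigma = fused.mean(-1, keepdims=True), fused.std(-1, keepdims=True)
    return (fused - mu) / (sigma + 1e-6)                           # normalization layer -> prediction vector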
Optionally, the feature extraction branch and the prediction branch have a multilayer structure, and after the intermediate prediction feature is obtained, the prediction branch can be further used for sending the intermediate prediction feature to the feature extraction branch;
correspondingly, after the intermediate features corresponding to the object image to be processed are obtained, the feature extraction branch in the feature extraction branch module 603 may also be used to perform cross-attention feature extraction by using the intermediate features and the received intermediate prediction features to obtain inter-layer features; sending the interlayer features to a next layer of feature extraction branch;
correspondingly, after the prediction vector is obtained, the prediction branch can also be used for sending the prediction vector to the next layer of prediction branch.
An embodiment of the present invention further provides an electronic device, including:
a memory for storing a computer program;
a processor for implementing the steps of the object identification method as described above when executing the computer program.
Since the embodiment of the electronic device portion corresponds to the embodiment of the object identification method portion, please refer to the description of the embodiment of the object identification method portion for the embodiment of the electronic device portion, which is not described herein again.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the object identification method in any of the above embodiments are implemented.
Since the embodiment of the computer-readable storage medium portion corresponds to the embodiment of the object identification method portion, please refer to the description of the embodiment of the object identification method portion for the embodiment of the storage medium portion, which is not described herein again.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The object identification method, the object identification device, the electronic device, and the computer-readable storage medium according to the present invention are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims (14)

1. An object recognition method, comprising:
cutting, feature extraction and coding processing are carried out on the obtained object image to be processed, and a coding matrix corresponding to the object image to be processed is obtained; the object image to be processed comprises a target image and a candidate image, and the coding matrix comprises various characteristics of the object image to be processed;
inputting the coding matrix into two feature interaction branches in parallel, so that the feature interaction branches extract self-attention features and cross-attention features of the coding matrix to obtain an interaction feature matrix corresponding to the object image to be processed;
inputting the interactive feature matrix into two feature extraction branches in parallel, so that the feature extraction branches extract the self-attention features of the interactive feature matrix to obtain intermediate features corresponding to the object image to be processed;
inputting an initial prediction vector generated by using the interactive feature matrix and the intermediate features into a prediction branch so that the prediction branch performs self-attention feature extraction on the initial prediction vector, and performs cross-attention feature extraction by using the obtained intermediate prediction features and the intermediate features to obtain a prediction vector;
and judging whether the target image and the candidate image belong to the same object or not by using a predicted value obtained by dimensionality reduction of the prediction vector.
2. The object identification method according to claim 1, wherein the feature interaction branch performs self-attention feature extraction and cross-attention feature extraction on the coding matrix to obtain an interaction feature matrix corresponding to the object image to be processed, and includes:
the feature interaction branch performs self-attention feature extraction on the coding matrix, and adds the obtained self-attention feature and the coding matrix to obtain a local-end intermediate feature;
sending the local-end intermediate feature to the other feature interaction branch, and receiving an opposite-end intermediate feature sent by the other feature interaction branch;
and performing cross-attention feature extraction on the local-end intermediate feature and the opposite-end intermediate feature, and adding the obtained cross-attention feature and the local-end intermediate feature to obtain the interactive feature matrix.
3. The object recognition method according to claim 2, wherein the performing of the cross-attention feature extraction on the local-end intermediate feature and the opposite-end intermediate feature comprises:
the feature interaction branch performs the cross-attention feature extraction on the local-end intermediate feature and the opposite-end intermediate feature in the following manner:
A0 = softmax( (X_loc · W_Q) · (X_opp · W_K)^T / sqrt(d_opp) ) · (X_opp · W_V)

wherein A0 represents an initial cross-attention feature, X_loc represents the local-end intermediate feature, X_opp represents the opposite-end intermediate feature, softmax represents a normalization function, W_Q, W_K and W_V represent pre-trained weight matrices, the superscript T represents a transpose operation, and d_opp represents the dimension of the opposite-end intermediate feature;
and carrying out random deletion processing and normalization processing on the initial cross-attention feature to obtain the cross-attention feature.
4. The object recognition method according to claim 2, wherein the feature interaction branch has a multi-layer structure, and after obtaining the interaction feature matrix, further comprises:
judging whether a next layer of feature interaction branch exists or not;
if yes, inputting the interactive feature matrix into the next layer of feature interactive branch for processing;
and if not, inputting the interactive feature matrix into the two feature extraction branches in parallel.
5. The object identification method according to claim 1, wherein the obtaining of the encoding matrix corresponding to the object image to be processed by performing cropping, feature extraction, and encoding on the obtained object image to be processed includes:
cutting the object image to be processed to obtain an image block corresponding to the object image to be processed;
generating an image set by using the object image to be processed and the image block, and performing feature extraction on the image set by using a neural network corresponding to a modality category of the object image to be processed to obtain a feature matrix corresponding to the object image to be processed;
and performing the coding processing on the characteristic matrix by using the modal class and the cropping characteristic information of each image block to obtain a coding matrix corresponding to the object image to be processed.
6. The object identification method according to claim 5, wherein the cropping the object image to be processed to obtain an image block corresponding to the object image to be processed comprises:
cutting the object image to be processed according to a first preset cutting line number and a transverse cutting mode to obtain a first image block;
cutting the object image to be processed according to a second preset cutting line number, a preset cutting column number and a horizontal and vertical cutting mode to obtain a second image block;
setting the first image block and the second image block as the image blocks.
7. The object recognition method according to claim 6, wherein the encoding processing on the feature matrix by using the modality category and the cropping feature information of each image block to obtain an encoding matrix corresponding to the object image to be processed comprises:
acquiring a cutting mode corresponding to the image block, a relative position in the object image to be processed and a corresponding feature vector in the feature matrix;
generating a position code by using the relative position and a feature code in the feature vector, and generating a cropping code and a mode code by using the cropping mode and the mode category respectively;
and generating the coding matrix by utilizing the feature code, the position code, the cropping code and the modal code.
8. The object recognition method according to claim 5, wherein the performing the feature extraction on the image set by using a neural network corresponding to a modality class of the object image to be processed comprises:
and scaling each image block in the image set to a preset size, and inputting the image set subjected to scaling processing into the neural network for feature extraction.
9. The object recognition method according to claim 1, further comprising, before inputting the initial prediction vector generated using the inter-feature matrix and the intermediate features into a prediction branch:
and calculating cosine similarity between the interactive feature matrixes, and generating the initial prediction vector by using the cosine similarity.
10. The object recognition method according to any one of claims 1 to 9, wherein the prediction branch performs self-attention feature extraction on the initial prediction vector and performs cross-attention feature extraction using the obtained intermediate prediction features and the intermediate features to obtain a prediction vector, and the method comprises:
the prediction branch inputs the initial prediction vector to a self-attention module for self-attention feature extraction, and adds the obtained self-attention feature and the initial prediction vector to obtain an intermediate prediction feature;
receiving a first intermediate characteristic sent by the first characteristic extraction branch and a second intermediate characteristic sent by the second characteristic extraction branch;
performing cross-attention feature extraction on the intermediate prediction feature and the first intermediate feature to obtain a first cross-attention feature, and performing cross-attention feature extraction on the intermediate prediction feature and the second intermediate feature to obtain a second cross-attention feature;
summing the first cross-attention feature and the second cross-attention feature, and averaging the summation result to obtain a fusion feature;
and inputting the fusion features into a normalization layer for processing to obtain the prediction vector.
11. The object recognition method according to claim 10, wherein the feature extraction branch and the prediction branch have a multilayer structure, and further comprise, after obtaining the intermediate prediction feature:
the prediction branch sends the intermediate prediction features to the feature extraction branch;
correspondingly, after obtaining the intermediate features corresponding to the object image to be processed, the method further includes:
the feature extraction branch performs cross-attention feature extraction by using the intermediate features and the received intermediate prediction features to obtain interlayer features;
sending the interlayer features to a next layer of feature extraction branch;
correspondingly, after obtaining the prediction vector, the method further includes:
and the prediction branch sends the prediction vector to a next layer prediction branch.
12. An object recognition device, comprising:
the characteristic extraction module is used for cutting, extracting characteristics and coding the obtained object image to be processed to obtain a coding matrix corresponding to the object image to be processed; the object image to be processed comprises a target image and a candidate image, and the coding matrix comprises various features of the object image to be processed;
the feature interaction module is used for inputting the coding matrix into two feature interaction branches in parallel so that the feature interaction branches can perform self-attention feature extraction and cross-attention feature extraction on the coding matrix to obtain an interaction feature matrix corresponding to the object image to be processed;
the feature extraction branch module is used for inputting the interactive feature matrix into two feature extraction branches in parallel so that the feature extraction branches can extract the self-attention features of the interactive feature matrix to obtain intermediate features corresponding to the object image to be processed;
the prediction branch module is used for inputting an initial prediction vector generated by using the interactive feature matrix and the intermediate features into a prediction branch so as to enable the prediction branch to extract self-attention features of the initial prediction vector and extract cross-attention features by using the obtained intermediate prediction features and the intermediate features to obtain a prediction vector;
and the judging module is used for judging whether the target image and the candidate image belong to the same object or not by using a predicted value obtained by dimensionality reduction of the prediction vector.
13. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the object identification method as claimed in any one of claims 1 to 11 when executing the computer program.
14. A computer-readable storage medium having computer-executable instructions stored thereon which, when loaded and executed by a processor, carry out a method of object identification according to any one of claims 1 to 11.
CN202210546400.7A 2022-05-20 2022-05-20 Object identification method and device, electronic equipment and computer readable storage medium Active CN114663737B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210546400.7A CN114663737B (en) 2022-05-20 2022-05-20 Object identification method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210546400.7A CN114663737B (en) 2022-05-20 2022-05-20 Object identification method and device, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN114663737A true CN114663737A (en) 2022-06-24
CN114663737B CN114663737B (en) 2022-12-02

Family

ID=82037221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210546400.7A Active CN114663737B (en) 2022-05-20 2022-05-20 Object identification method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114663737B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651262A (en) * 2019-10-09 2021-04-13 四川大学 Cross-modal pedestrian re-identification method based on self-adaptive pedestrian alignment
CN110909605A (en) * 2019-10-24 2020-03-24 西北工业大学 Cross-modal pedestrian re-identification method based on contrast correlation
US20210263961A1 (en) * 2020-02-26 2021-08-26 Samsung Electronics Co., Ltd. Coarse-to-fine multimodal gallery search system with attention-based neural network models
US20210349954A1 (en) * 2020-04-14 2021-11-11 Naver Corporation System and method for performing cross-modal information retrieval using a neural network using learned rank images
WO2022027986A1 (en) * 2020-08-04 2022-02-10 杰创智能科技股份有限公司 Cross-modal person re-identification method and device
CN111931637A (en) * 2020-08-07 2020-11-13 华南理工大学 Cross-modal pedestrian re-identification method and system based on double-current convolutional neural network
CN112488111A (en) * 2020-12-18 2021-03-12 贵州大学 Instruction expression understanding method based on multi-level expression guide attention network
CN112819011A (en) * 2021-01-28 2021-05-18 北京迈格威科技有限公司 Method and device for identifying relationships between objects and electronic system
CN113449770A (en) * 2021-05-18 2021-09-28 科大讯飞股份有限公司 Image detection method, electronic device and storage device
CN114201621A (en) * 2021-11-24 2022-03-18 人民网股份有限公司 Cross-modal retrieval model construction and retrieval method based on image-text cooperative attention
CN114359838A (en) * 2022-01-14 2022-04-15 北京理工大学重庆创新中心 Cross-modal pedestrian detection method based on Gaussian cross attention network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SEN ZHANG ET AL.: "Cross-model identity correlation mining for visible-thermal person re-identification", 《MULTIMEDIA TOOLS AND APPLICATIONS》 *
XI WEI ET AL.: "Multi-modality cross attention network for image and sentence matching", 《THE COMPUTER VISION FOUNDATION》 *
ZHANG YUKANG ET AL.: "Cross-modality person re-identification based on joint image and feature constraints", 《自动化学报》(ACTA AUTOMATICA SINICA) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740662A (en) * 2023-08-15 2023-09-12 贵州中南锦天科技有限责任公司 Axle recognition method and system based on laser radar
CN116740662B (en) * 2023-08-15 2023-11-21 贵州中南锦天科技有限责任公司 Axle recognition method and system based on laser radar

Also Published As

Publication number Publication date
CN114663737B (en) 2022-12-02

Similar Documents

Publication Publication Date Title
Ke et al. Mask transfiner for high-quality instance segmentation
Chen et al. Remote sensing image change detection with transformers
CN113902926B (en) General image target detection method and device based on self-attention mechanism
CN112232164B (en) Video classification method and device
CN108171663B (en) Image filling system of convolutional neural network based on feature map nearest neighbor replacement
Duan et al. Compact descriptors for visual search
CN112200041B (en) Video motion recognition method and device, storage medium and electronic equipment
CN112801063B (en) Neural network system and image crowd counting method based on neural network system
CN114663737B (en) Object identification method and device, electronic equipment and computer readable storage medium
US20240087343A1 (en) License plate classification method, license plate classification apparatus, and computer-readable storage medium
CN113537254B (en) Image feature extraction method and device, electronic equipment and readable storage medium
CN116258850A (en) Image semantic segmentation method, electronic device and computer readable storage medium
CN114140831B (en) Human body posture estimation method and device, electronic equipment and storage medium
CN112001931A (en) Image segmentation method, device, equipment and storage medium
CN114241499A (en) Table picture identification method, device and equipment and readable storage medium
CN111639230B (en) Similar video screening method, device, equipment and storage medium
EP4174789B1 (en) Method and apparatus of processing image, and storage medium
CN115310611A (en) Figure intention reasoning method and related device
CN113759338A (en) Target detection method and device, electronic equipment and storage medium
CN114821169A (en) Method-level non-intrusive call link tracking method under micro-service architecture
CN118152594A (en) News detection method, device and equipment containing misleading information
CN117115584A (en) Target detection method, device and server
CN116994264A (en) Text recognition method, chip and terminal
CN115690795A (en) Resume information extraction method and device, electronic equipment and storage medium
CN115631343A (en) Image generation method, device and equipment based on full pulse network and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant