CN113743544A - Cross-modal neural network construction method, pedestrian retrieval method and system - Google Patents
- Publication number
- CN113743544A (application CN202111302766.1A)
- Authority
- CN
- China
- Prior art keywords
- image
- visible light
- infrared
- cross-modal
- Prior art date
- Legal status
- Pending
Classifications
- G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/22 — Pattern recognition; matching criteria, e.g. proximity measures
- G06F18/253 — Pattern recognition; fusion techniques of extracted features
- G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
Abstract
The invention discloses a cross-modal neural network construction method, a pedestrian retrieval method and a system, belongs to the technical field of data analysis and retrieval, and can solve the problems of low cross-modal recognition accuracy and poor pedestrian recognition performance. The method comprises the following steps: acquiring a visible light sample image and a near-infrared sample image, and preprocessing them to obtain visible light block sequence data and near-infrared block sequence data; inputting the visible light block sequence data into a first self-attention mechanism module to obtain a visible light global feature and a visible light local feature; inputting the near-infrared block sequence data into a second self-attention mechanism module to obtain a near-infrared global feature and a near-infrared local feature; and training a first neural network with the visible light global feature, the visible light local feature, the near-infrared global feature and the near-infrared local feature to obtain the cross-modal neural network. The method is used for cross-modal image recognition.
Description
Technical Field
The invention relates to a cross-modal neural network construction method, a pedestrian retrieval method and a system, and belongs to the technical field of data analysis and retrieval.
Background
In recent years, continuous progress in artificial intelligence has driven development in both academia and industry, especially in computer vision, which has advanced from traditional feature extraction to today's deep learning techniques. Pedestrian re-identification is another important human-centered research field following face recognition, and it has great practical significance and prospects for commercial application. Pedestrian re-identification (person re-identification) aims to extract and retrieve pedestrian features across cameras by relying on monitoring equipment distributed over all regions and scenes.
Most traditional pedestrian re-identification research focuses on problems such as human pose, background and illumination under visible light, and realizes re-identification mainly through pedestrian feature extraction or generation-based approaches. In practical monitoring systems, however, especially under insufficient light or in darkness, the camera usually has to switch to an infrared mode to capture pedestrian or target images, so the problem of re-identifying pedestrians across the daily visible light modality and the near-infrared modality must be faced. Cross-modal pedestrian retrieval identifies and compares pedestrians between the visible light state (natural state) and the near-infrared state (the state in which the camera captures a different spectrum of the pedestrian). At present there are two main approaches. The first extracts pedestrian features from the near-infrared and visible light modalities separately, for example with several sub-networks each responsible for one modality's image input, which are then fused into a shared network that learns fusion features. The second uses Generative Adversarial Networks (GANs) to convert pedestrian images of the two modalities into the same modality, reducing the task to single-modality pedestrian re-identification. In practical applications, however, the recognition accuracy of both methods is low, so the pedestrian recognition performance is poor.
Disclosure of Invention
The invention provides a cross-modal neural network construction method, a pedestrian retrieval method and a system, which can solve the problems of low cross-modal identification precision and poor pedestrian identification effect in the prior art.
In one aspect, the present invention provides a method for constructing a cross-modal neural network, the method comprising:

step 11, acquiring a visible light sample image and a near-infrared sample image, and preprocessing the visible light sample image and the near-infrared sample image to obtain visible light block sequence data and near-infrared block sequence data;

step 12, inputting the visible light block sequence data into a first self-attention mechanism module to obtain a visible light global feature and a visible light local feature; and inputting the near-infrared block sequence data into a second self-attention mechanism module to obtain a near-infrared global feature and a near-infrared local feature;

and step 13, training a first neural network by using the visible light global feature, the visible light local feature, the near-infrared global feature and the near-infrared local feature to obtain a cross-modal neural network.
Optionally, the preprocessing the visible light sample image and the near-infrared sample image to obtain visible light block sequence data and near-infrared block sequence data specifically includes:
splitting the visible light sample image and the near-infrared sample image into a plurality of image blocks respectively to form a visible light block sequence set and a near-infrared block sequence set; the image block splitting rules of the visible light sample image and the near infrared sample image are the same;
inputting the visible light block sequence set into a first linear projection module to obtain visible light block sequence data containing position information of each visible light image block; and inputting the near-infrared block sequence set into a second linear projection module to obtain near-infrared block sequence data containing position information of each near-infrared image block.
Optionally, step 13 specifically includes:
inputting the visible light global features into a first neural network, training a first preset mapping matrix, and obtaining a visible light global mapping matrix;
inputting the visible light local features into a first neural network, training a second preset mapping matrix, and obtaining a visible light local mapping matrix;
inputting the near-infrared global features into a first neural network, and training a third preset mapping matrix to obtain a near-infrared global mapping matrix;
inputting the near-infrared local features into a first neural network, and training a fourth preset mapping matrix to obtain a near-infrared local mapping matrix;
and constructing a cross-modal neural network according to the visible light global mapping matrix, the visible light local mapping matrix, the near-infrared global mapping matrix and the near-infrared local mapping matrix.
Optionally, the loss function of the cross-modal neural network is:

$$L_{cross} = L2\left(W^{g}_{vis} f^{g}_{vis},\; W^{g}_{nir} f^{g}_{nir}\right) + \sum_{j=1}^{k} L2\left(W^{l}_{vis} f^{l}_{vis,j},\; W^{l}_{nir} f^{l}_{nir,j}\right)$$

wherein $L2$ denotes the L2 loss between two input vectors; $W^{g}_{vis}$ is the visible light global mapping matrix; $f^{g}_{vis}$ is the visible light global feature; $W^{g}_{nir}$ is the near-infrared global mapping matrix; $f^{g}_{nir}$ is the near-infrared global feature; $W^{l}_{vis}$ is the visible light local mapping matrix; $f^{l}_{vis,j}$ is a visible light local feature; $W^{l}_{nir}$ is the near-infrared local mapping matrix; $f^{l}_{nir,j}$ is a near-infrared local feature; $j = 1, \ldots, k$; and $k$ is the number of split image blocks.
Optionally, the loss function of the first self-attention mechanism module is:
$$L_{vis} = L^{g}_{vis} + L^{l}_{vis}$$

wherein $L^{g}_{vis}$ is the global loss of visible light and $L^{l}_{vis}$ is the local loss of visible light;
the loss function of the second self-attention mechanism module is:
$$L_{nir} = L^{g}_{nir} + L^{l}_{nir}$$

wherein $L^{g}_{nir}$ is the near-infrared global loss and $L^{l}_{nir}$ is the near-infrared local loss.
Optionally, the near-infrared global loss, the near-infrared local loss, the visible light global loss and the visible light local loss are calculated using cross-entropy loss or local triplet loss.
In another aspect, the present invention provides a cross-modal pedestrian retrieval method, including:
step 21, acquiring a target pedestrian image, and preprocessing the target pedestrian image to obtain target image block sequence data; the target pedestrian image is a visible light image or a near infrared image;
step 22, inputting the target image block sequence data into a self-attention mechanism module corresponding to the image type of the target pedestrian image to obtain a target image global feature and a target image local feature;
step 23, inputting the target image global features and the target image local features into a cross-modal neural network to obtain cross-modal global features and cross-modal local features; wherein the cross-modal neural network is any one of the above cross-modal neural networks;
and step 24, performing feature matching retrieval on the video set containing the target pedestrian by using the cross-modal global features and/or the cross-modal local features to obtain a cross-modal retrieval result.
Optionally, the preprocessing the target pedestrian image to obtain target image block sequence data specifically includes:
splitting the target pedestrian image into a plurality of image blocks to form a target image block sequence set;
and inputting the target image block sequence set into a linear projection module corresponding to the image type of the target pedestrian image to obtain target image block sequence data containing the position information of each target image block.
Optionally, if the target pedestrian image is a visible light image, the step 23 specifically includes:
inputting the global features of the target image into the cross-modal neural network, and mapping the global features of the target image by using a visible light global mapping matrix to obtain the cross-modal global features;
inputting the local features of the target image into the cross-modal neural network, and mapping the local features of the target image by using a visible light local mapping matrix to obtain the cross-modal local features;
if the target pedestrian image is a near-infrared image, the step 23 specifically includes:
inputting the global features of the target image into the cross-modal neural network, and mapping the global features of the target image by using a near-infrared global mapping matrix to obtain the cross-modal global features;
inputting the local features of the target image into the cross-modal neural network, and mapping the local features of the target image by using a near-infrared local mapping matrix to obtain the cross-modal local features.
In yet another aspect, the present invention provides a cross-modal pedestrian retrieval system, the system comprising:
the preprocessing module is used for acquiring a target pedestrian image and preprocessing the target pedestrian image to obtain target image block sequence data; the target pedestrian image is a visible light image or a near infrared image;
the first feature extraction module is used for inputting the target image block sequence data into a self-attention mechanism module corresponding to the image type of the target pedestrian image to obtain a target image global feature and a target image local feature;
the second feature extraction module is used for inputting the target image global features and the target image local features into a cross-modal neural network to obtain cross-modal global features and cross-modal local features; wherein the cross-modal neural network is any one of the above cross-modal neural networks;
and the retrieval module is used for performing feature matching retrieval on the video set containing the target pedestrian by using the cross-modal global feature and/or the cross-modal local feature to obtain a cross-modal retrieval result.
Optionally, the preprocessing module is specifically configured to:
splitting the target pedestrian image into a plurality of image blocks to form a target image block sequence set;
and inputting the target image block sequence set into a linear projection module corresponding to the image type of the target pedestrian image to obtain target image block sequence data containing the position information of each target image block.
The invention can produce the following beneficial effects:
according to the cross-modal pedestrian retrieval method provided by the invention, the serialized image blocks are sent to the transform module to obtain the global features and the local features of respective modes, and then the features are sent to the feature extraction network, so that the cross-modal global features and the cross-modal local features of cross-modal pedestrians can be simultaneously obtained, the cross-modal global features and the cross-modal local features are utilized to retrieve target pedestrians, and the performance of cross-modal retrieval can be improved. In addition, the learned local features provide convenience for pedestrian retrieval in specific scenes, for example, in the scenes with incomplete pedestrian images such as occlusion and blurring, the use of the local features is beneficial to further improvement of cross-modal retrieval accuracy.
Drawings
FIG. 1 is a flowchart of a cross-modal neural network construction method according to an embodiment of the present invention;

FIG. 2 is a flowchart of a cross-modal pedestrian retrieval method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating the principle of a cross-modal pedestrian retrieval method according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the cross-modal neural network feature extraction principle according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to examples, but the present invention is not limited to these examples.
The embodiment of the invention provides a cross-modal neural network construction method, as shown in fig. 1, the method comprises the following steps:
and 11, acquiring a visible light sample image and a near-infrared sample image, and preprocessing the visible light sample image and the near-infrared sample image to obtain visible light block sequence data and near-infrared block sequence data.
The method specifically comprises the following steps: respectively splitting the visible light sample image and the near-infrared sample image into a plurality of image blocks to form a visible light block sequence set and a near-infrared block sequence set;
inputting the visible light block sequence set into a first linear projection module to obtain visible light block sequence data containing position information of each visible light image block; and inputting the near-infrared block sequence set into a second linear projection module to obtain near-infrared block sequence data containing the position information of each near-infrared image block.
And the splitting rule of the image blocks of the visible light sample image and the near infrared sample image is the same.
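A minimal sketch of this preprocessing step follows, assuming non-overlapping square blocks and a PyTorch-style implementation; the module name, image size and dimensions are illustrative choices, not taken from the patent:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size blocks, linearly project each block,
    and add a learnable position embedding (one per block)."""
    def __init__(self, img_size=256, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to "split into non-overlapping
        # blocks, then apply a shared linear projection".
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, D, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)     # (B, num_patches, D)
        return x + self.pos_embed            # block sequence data with position info

# One linear projection module per modality, sharing the same split rule:
visible_embed = PatchEmbedding()
infrared_embed = PatchEmbedding()
```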
The self-attention mechanism module (i.e. the Transformer structure) was proposed by Google in the 2017 paper "Attention Is All You Need" and performs very well on many natural language processing tasks: each word segment (token) in a sentence is transformed into an embedded feature, the self-attention mechanism then produces attention-weighted information, and multi-layer stacked Transformer structures form the encoder and decoder that complete the corresponding task. In recent years the Transformer has also been introduced into computer vision, the Vision Transformer (ViT) approach being the most effective: a Transformer structure pre-trained on a large dataset is fine-tuned on ImageNet, achieving remarkable performance.
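As an illustrative sketch of how such a ViT-style module can yield one global feature (the class-token output) and k local features (the block-token outputs) at the same time — dimensions and layer counts below are assumptions:

```python
import torch
import torch.nn as nn

class ModalityTransformer(nn.Module):
    """ViT-style encoder: prepend a class token to the block sequence;
    after encoding, the class-token output is the global feature and the
    remaining k token outputs are position-aware local features."""
    def __init__(self, embed_dim=768, depth=6, num_heads=8):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, block_seq):                    # (B, k, D) block sequence data
        cls = self.cls_token.expand(block_seq.size(0), -1, -1)
        tokens = torch.cat([cls, block_seq], dim=1)  # (B, k+1, D)
        out = self.encoder(tokens)
        return out[:, 0], out[:, 1:]                 # global feature, k local features
```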
Step 12, inputting the visible light block sequence data into the first self-attention mechanism module to obtain a visible light global feature and a visible light local feature; and inputting the near-infrared block sequence data into the second self-attention mechanism module to obtain a near-infrared global feature and a near-infrared local feature.

Step 13, training the first neural network by using the visible light global feature, the visible light local feature, the near-infrared global feature and the near-infrared local feature to obtain the cross-modal neural network.
Specifically, the method comprises the following steps: inputting the global visible light features into a first neural network, training a first preset mapping matrix, and obtaining a global visible light mapping matrix;
inputting the local characteristics of the visible light into a first neural network, and training a second preset mapping matrix to obtain a local mapping matrix of the visible light;
inputting the near-infrared global features into the first neural network, and training a third preset mapping matrix to obtain a near-infrared global mapping matrix;
inputting the near-infrared local features into the first neural network, and training a fourth preset mapping matrix to obtain a near-infrared local mapping matrix;
and constructing a cross-modal neural network according to the visible light global mapping matrix, the visible light local mapping matrix, the near-infrared global mapping matrix and the near-infrared local mapping matrix.
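A minimal sketch of the four trainable mapping matrices described above, modelled here as bias-free linear layers; the shared dimensions d1 and d2 are illustrative assumptions:

```python
import torch.nn as nn

class CrossModalMapping(nn.Module):
    """Four preset mapping matrices, trained to project each modality's
    global/local features into shared cross-modal spaces."""
    def __init__(self, feat_dim=768, d1=512, d2=256):
        super().__init__()
        self.W_g_vis = nn.Linear(feat_dim, d1, bias=False)  # visible light global mapping
        self.W_g_nir = nn.Linear(feat_dim, d1, bias=False)  # near-infrared global mapping
        self.W_l_vis = nn.Linear(feat_dim, d2, bias=False)  # visible light local mapping
        self.W_l_nir = nn.Linear(feat_dim, d2, bias=False)  # near-infrared local mapping

    def forward(self, g_vis, l_vis, g_nir, l_nir):
        # Linear layers apply to the last dimension, so local features of
        # shape (B, k, feat_dim) are mapped block by block.
        return (self.W_g_vis(g_vis), self.W_l_vis(l_vis),
                self.W_g_nir(g_nir), self.W_l_nir(l_nir))
```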
Another embodiment of the present invention provides a cross-modal pedestrian retrieval method, as shown in fig. 2, the method includes:
Step 21, acquiring a target pedestrian image, and preprocessing the target pedestrian image to obtain target image block sequence data; the target pedestrian image is a visible light image or a near-infrared image.
The method specifically comprises the following steps: splitting the target pedestrian image into a plurality of image blocks to form a target image block sequence set;
and inputting the target image block sequence set into a linear projection module corresponding to the image type of the target pedestrian image to obtain target image block sequence data containing the position information of each target image block.
For example, when the target pedestrian image is a visible light image, the target image block sequence set is input into the first linear projection module; and when the target pedestrian image is a near-infrared image, inputting the target image block sequence set into the second linear projection module.
Step 22, inputting the target image block sequence data into the self-attention mechanism module corresponding to the image type of the target pedestrian image to obtain the global feature and the local feature of the target image.
Illustratively, when the target pedestrian image is a visible light image, the target image block sequence data is input into the first self-attention mechanism module; and when the target pedestrian image is a near-infrared image, inputting the target image block sequence data into a second self-attention mechanism module.
Step 23, inputting the global features and the local features of the target image into a cross-modal neural network to obtain the cross-modal global features and the cross-modal local features; wherein the cross-modal neural network is any one of the above cross-modal neural networks.
For example, if the target pedestrian image is a visible light image, step 23 specifically includes:
inputting the global features of the target image into a cross-modal neural network, and mapping the global features of the target image by using a visible light global mapping matrix to obtain the cross-modal global features;
inputting the local features of the target image into a cross-modal neural network, and mapping the local features of the target image by using a visible light local mapping matrix to obtain the cross-modal local features;
if the target pedestrian image is a near-infrared image, step 23 specifically includes:
inputting the global features of the target image into a cross-modal neural network, and mapping the global features of the target image by using a near-infrared global mapping matrix to obtain the cross-modal global features;
inputting the local features of the target image into a cross-modal neural network, and mapping the local features of the target image by using a near-infrared local mapping matrix to obtain the cross-modal local features.
Step 24, performing feature matching retrieval on the video set containing the target pedestrian by using the cross-modal global features and/or the cross-modal local features to obtain a cross-modal retrieval result.
Referring to fig. 3, in the image preprocessing stage, the images of the different modalities, i.e. the visible light image (shown in gray in fig. 3, but a color image in practice) and the near-infrared image, are split into a plurality of blocks; the blocks may or may not overlap, but the splitting rules of the two modalities must be consistent, forming a block sequence for each modality's image. Following the standard Transformer approach, the sequence image sets of the two modalities are fed into their respective linear projection modules, the position information of each image block is added, and the result is fed into the respective Transformer modules to obtain the global and local features of each modality. The block-embedding module is learnable: the output of the classification token after the encoding layers serves as the global feature of the whole image, and the output of each block token after the encoding layers serves as a local feature that carries its position within the whole image. The visible light and near-infrared global and local features are then fed into the cross-modal neural network for feature learning, which further extracts a feature set suitable for cross-modal retrieval, namely the cross-modal global features and cross-modal local features.
In this process, first, in the training of each modality, the global feature and the local feature of that modality need to converge, that is:

$$L_{vis} = L^{g}_{vis} + L^{l}_{vis}$$

where $L^{g}_{vis}$ is the global loss of visible light and $L^{l}_{vis}$ is the local loss of visible light; and

$$L_{nir} = L^{g}_{nir} + L^{l}_{nir}$$

where $L^{g}_{nir}$ is the near-infrared global loss and $L^{l}_{nir}$ is the near-infrared local loss.

Here the subscript "$nir$" denotes the near-infrared modality, "$vis$" denotes the visible light modality, and $k$ denotes the number of image blocks. The loss calculation, denoted "$L$", can combine cross-entropy loss, local triplet loss and the like to improve the representational capability of the features.
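One possible form of these intra-modality losses, combining cross-entropy on identity labels with a triplet term as the text suggests; the margin value and the way triplets are mined (e.g. batch-hard) are assumptions:

```python
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()
tri_loss = nn.TripletMarginLoss(margin=0.3)   # margin value is illustrative

def intra_modality_loss(logits, labels, anchor, positive, negative):
    """Global or local loss for one modality: identity classification
    (cross-entropy) plus metric learning (triplet)."""
    return ce_loss(logits, labels) + tri_loss(anchor, positive, negative)

# L_vis (or L_nir) = intra_modality_loss(...) on the global feature
#                  + intra_modality_loss(...) on the local features
```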
In the cross-modal feature set, the question is how to map shared, better global and local features from the global and local features of the two modalities; that is, a global mapping matrix is needed for each modality to project $f^{g}_{vis}$ and $f^{g}_{nir}$ into the same space, and the local features are mapped analogously. As shown in fig. 4, $f^{g}_{cross} \in \mathbb{R}^{d1}$ is the learned cross-modal global feature adapted to cross-modal retrieval, where $d1$ is the dimension of the global feature, and $f^{l}_{cross} \in \mathbb{R}^{d2}$ is the learned cross-modal local feature adapted to cross-modal retrieval, where $d2$ is the dimension of the local feature.

The loss function of the cross-modal neural network can then be defined as:

$$L_{cross} = L2\left(W^{g}_{vis} f^{g}_{vis},\; W^{g}_{nir} f^{g}_{nir}\right) + \sum_{j=1}^{k} L2\left(W^{l}_{vis} f^{l}_{vis,j},\; W^{l}_{nir} f^{l}_{nir,j}\right)$$

where $L2$ denotes the L2 loss between two input vectors; $W^{g}_{vis}$ is the visible light global mapping matrix; $f^{g}_{vis}$ is the visible light global feature; $W^{g}_{nir}$ is the near-infrared global mapping matrix; $f^{g}_{nir}$ is the near-infrared global feature; $W^{l}_{vis}$ is the visible light local mapping matrix; $f^{l}_{vis,j}$ is a visible light local feature; $W^{l}_{nir}$ is the near-infrared local mapping matrix; $f^{l}_{nir,j}$ is a near-infrared local feature; $j = 1, \ldots, k$; and $k$ is the number of split image blocks.
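A direct sketch of this loss, taking mean squared error as the L2 loss and reusing the mapped features from the earlier mapping sketch; shapes are illustrative:

```python
import torch.nn.functional as F

def cross_modal_loss(g_vis_m, g_nir_m, l_vis_m, l_nir_m):
    """g_*_m: (B, d1) mapped global features; l_*_m: (B, k, d2) mapped local
    features. Global term plus a sum of per-block local terms, following the
    formula above."""
    loss = F.mse_loss(g_vis_m, g_nir_m)            # L2 on mapped global features
    k = l_vis_m.size(1)
    for j in range(k):                             # sum over the k image blocks
        loss = loss + F.mse_loss(l_vis_m[:, j], l_nir_m[:, j])
    return loss
```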
By optimizing this objective function, the corresponding mapping matrices are obtained and a better feature subset is learned. It should be noted that in this cross-modal feature learning, the global and local features of the visible light and near-infrared images are learned cooperatively, so a feature set better adapted to cross-modal recognition can be obtained, improving cross-modal retrieval performance.
The overall cross-modal feature learning must also take into account the learning of the Transformer network inside each modality, so the overall loss function $L_{total}$ is defined as follows:

$$L_{total} = L_{vis} + L_{nir} + \lambda L_{cross}$$

where $\lambda$ is a loss weight used to balance and adjust the magnitudes of the intra-modality losses and the cross-modal loss.
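Combining the terms (the value of the weight below is purely illustrative):

```python
lam = 0.5  # illustrative value of the loss weight lambda

def total_loss(loss_vis, loss_nir, loss_cross):
    """L_total = L_vis + L_nir + lambda * L_cross: intra-modality convergence
    plus weighted cross-modal alignment."""
    return loss_vis + loss_nir + lam * loss_cross
```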
The invention retrieves with the global feature and the local features simultaneously, which can improve the accuracy of cross-modal retrieval. In addition, when occlusion, blurring or other conditions in complex scenes leave the pedestrian's global information incomplete and the global feature fails, local features can be used for targeted retrieval. That is, when a retrieval image is obtained, a particular local area of it may be specified; after the global and local features of the image are extracted, the corresponding local features are used to query the cross-modal database for an image set with similar local features, and the query result is returned. The invention can thus apply local features flexibly for cross-modal retrieval, improving the retrieval precision and flexibility of the whole system.
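A sketch of the feature-matching step; cosine similarity and the gallery layout are assumptions, and the same routine serves both global retrieval and single-block local retrieval:

```python
import torch
import torch.nn.functional as F

def retrieve(query_feat, gallery_feats, top_k=10):
    """Rank gallery images by cosine similarity to the query's cross-modal
    feature; works identically for a global feature or one local feature."""
    q = F.normalize(query_feat.unsqueeze(0), dim=1)   # (1, d)
    g = F.normalize(gallery_feats, dim=1)             # (N, d)
    scores = (q @ g.t()).squeeze(0)                   # (N,) cosine similarities
    return scores.topk(top_k).indices                 # indices of best matches

# For an occluded query, one may retrieve with a single local feature instead
# of the global one, e.g. retrieve(local_feats[j], gallery_local_feats[:, j]).
```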
Another embodiment of the present invention provides a cross-modal pedestrian retrieval system, including:
the preprocessing module is used for acquiring a target pedestrian image and preprocessing the target pedestrian image to obtain target image block sequence data; the target pedestrian image is a visible light image or a near infrared image;
the first feature extraction module is used for inputting the target image block sequence data into a self-attention mechanism module corresponding to the image type of the target pedestrian image to obtain a target image global feature and a target image local feature;
the second feature extraction module is used for inputting the global features and the local features of the target image into the cross-modal neural network to obtain the cross-modal global features and the cross-modal local features; wherein the cross-modal neural network is any one of the above cross-modal neural networks;
and the retrieval module is used for performing feature matching retrieval on the video set containing the target pedestrian by utilizing the cross-modal global features and/or the cross-modal local features to obtain a cross-modal retrieval result.
Further, the preprocessing module is specifically configured to:
splitting the target pedestrian image into a plurality of image blocks to form a target image block sequence set;
and inputting the target image block sequence set into a linear projection module corresponding to the image type of the target pedestrian image to obtain target image block sequence data containing the position information of each target image block.
The functional explanation of each module in the above retrieval system may refer to the explanation of each step in the retrieval method, and is not described herein again.
Although the present application has been described with reference to a few embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the application as defined by the appended claims.
Claims (10)
1. A method for constructing a cross-modal neural network, the method comprising:
step 11, acquiring a visible light sample image and a near-infrared sample image, and preprocessing the visible light sample image and the near-infrared sample image to obtain visible light block sequence data and near-infrared block sequence data;
step 12, inputting the visible light block sequence data into a first self-attention mechanism module to obtain a visible light global feature and a visible light local feature; inputting the near-infrared block sequence data into a second self-attention mechanism module to obtain a near-infrared global feature and a near-infrared local feature;
and step 13, training a first neural network by using the visible light global feature, the visible light local feature, the near-infrared global feature and the near-infrared local feature to obtain a cross-modal neural network.
2. The method according to claim 1, wherein the preprocessing the visible light sample image and the near-infrared sample image to obtain visible light block sequence data and near-infrared block sequence data comprises:
splitting the visible light sample image and the near-infrared sample image into a plurality of image blocks respectively to form a visible light block sequence set and a near-infrared block sequence set; the image block splitting rules of the visible light sample image and the near infrared sample image are the same;
inputting the visible light block sequence set into a first linear projection module to obtain visible light block sequence data containing position information of each visible light image block; and inputting the near-infrared block sequence set into a second linear projection module to obtain near-infrared block sequence data containing position information of each near-infrared image block.
3. The method according to claim 1, wherein the step 13 specifically includes:
inputting the visible light global features into a first neural network, training a first preset mapping matrix, and obtaining a visible light global mapping matrix;
inputting the visible light local features into a first neural network, training a second preset mapping matrix, and obtaining a visible light local mapping matrix;
inputting the near-infrared global features into a first neural network, and training a third preset mapping matrix to obtain a near-infrared global mapping matrix;
inputting the near-infrared local features into a first neural network, and training a fourth preset mapping matrix to obtain a near-infrared local mapping matrix;
and constructing a cross-modal neural network according to the visible light global mapping matrix, the visible light local mapping matrix, the near-infrared global mapping matrix and the near-infrared local mapping matrix.
4. The method of claim 1, wherein the loss function of the cross-modal neural network is:

$$L_{cross} = L2\left(W^{g}_{vis} f^{g}_{vis},\; W^{g}_{nir} f^{g}_{nir}\right) + \sum_{j=1}^{k} L2\left(W^{l}_{vis} f^{l}_{vis,j},\; W^{l}_{nir} f^{l}_{nir,j}\right)$$

wherein $L2$ denotes the L2 loss between two input vectors; $W^{g}_{vis}$ is the visible light global mapping matrix; $f^{g}_{vis}$ is the visible light global feature; $W^{g}_{nir}$ is the near-infrared global mapping matrix; $f^{g}_{nir}$ is the near-infrared global feature; $W^{l}_{vis}$ is the visible light local mapping matrix; $f^{l}_{vis,j}$ is a visible light local feature; $W^{l}_{nir}$ is the near-infrared local mapping matrix; $f^{l}_{nir,j}$ is a near-infrared local feature; $j = 1, \ldots, k$; and $k$ is the number of split image blocks.
5. The method of claim 1, wherein the loss function of the first self-attention mechanism module is:
$$L_{vis} = L^{g}_{vis} + L^{l}_{vis}$$

wherein $L^{g}_{vis}$ is the global loss of visible light and $L^{l}_{vis}$ is the local loss of visible light;

and the loss function of the second self-attention mechanism module is:

$$L_{nir} = L^{g}_{nir} + L^{l}_{nir}$$

wherein $L^{g}_{nir}$ is the near-infrared global loss and $L^{l}_{nir}$ is the near-infrared local loss.
6. A cross-modal pedestrian retrieval method, comprising:
step 21, acquiring a target pedestrian image, and preprocessing the target pedestrian image to obtain target image block sequence data; the target pedestrian image is a visible light image or a near infrared image;
step 22, inputting the target image block sequence data into a self-attention mechanism module corresponding to the image type of the target pedestrian image to obtain a target image global feature and a target image local feature;
step 23, inputting the target image global features and the target image local features into a cross-modal neural network to obtain cross-modal global features and cross-modal local features; wherein the cross-modal neural network is the cross-modal neural network of any one of claims 1 to 5;
and step 24, performing feature matching retrieval on the video set containing the target pedestrian by using the cross-modal global features and/or the cross-modal local features to obtain a cross-modal retrieval result.
7. The method according to claim 6, wherein the preprocessing the image of the target pedestrian to obtain the target image block sequence data specifically comprises:
splitting the target pedestrian image into a plurality of image blocks to form a target image block sequence set;
and inputting the target image block sequence set into a linear projection module corresponding to the image type of the target pedestrian image to obtain target image block sequence data containing the position information of each target image block.
8. The method according to claim 6, wherein if the target pedestrian image is a visible light image, the step 23 specifically includes:
inputting the global features of the target image into the cross-modal neural network, and mapping the global features of the target image by using a visible light global mapping matrix to obtain the cross-modal global features;
inputting the local features of the target image into the cross-modal neural network, and mapping the local features of the target image by using a visible light local mapping matrix to obtain the cross-modal local features;
if the target pedestrian image is a near-infrared image, the step 23 specifically includes:
inputting the global features of the target image into the cross-modal neural network, and mapping the global features of the target image by using a near-infrared global mapping matrix to obtain the cross-modal global features;
inputting the local features of the target image into the cross-modal neural network, and mapping the local features of the target image by using a near-infrared local mapping matrix to obtain the cross-modal local features.
9. A cross-modal pedestrian retrieval system, the system comprising:
the preprocessing module is used for acquiring a target pedestrian image and preprocessing the target pedestrian image to obtain target image block sequence data; the target pedestrian image is a visible light image or a near infrared image;
the first feature extraction module is used for inputting the target image block sequence data into a self-attention mechanism module corresponding to the image type of the target pedestrian image to obtain a target image global feature and a target image local feature;
the second feature extraction module is used for inputting the target image global features and the target image local features into a cross-modal neural network to obtain cross-modal global features and cross-modal local features; wherein the cross-modal neural network is the cross-modal neural network of any one of claims 1 to 5;
and the retrieval module is used for performing feature matching retrieval on the video set containing the target pedestrian by using the cross-modal global feature and/or the cross-modal local feature to obtain a cross-modal retrieval result.
10. The system of claim 9, wherein the preprocessing module is specifically configured to:
splitting the target pedestrian image into a plurality of image blocks to form a target image block sequence set;
and inputting the target image block sequence set into a linear projection module corresponding to the image type of the target pedestrian image to obtain target image block sequence data containing the position information of each target image block.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202111302766.1A | 2021-11-05 | 2021-11-05 | Cross-modal neural network construction method, pedestrian retrieval method and system
Publications (1)
Publication Number | Publication Date |
---|---|
CN113743544A (en) | 2021-12-03
Family
ID=78727537
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210056764A1 (en) * | 2018-05-22 | 2021-02-25 | Magic Leap, Inc. | Transmodal input fusion for a wearable system |
CN109255047A (en) * | 2018-07-18 | 2019-01-22 | 西安电子科技大学 | Based on the complementary semantic mutual search method of image-text being aligned and symmetrically retrieve |
CN110598654A (en) * | 2019-09-18 | 2019-12-20 | 合肥工业大学 | Multi-granularity cross modal feature fusion pedestrian re-identification method and re-identification system |
CN112434796A (en) * | 2020-12-09 | 2021-03-02 | 同济大学 | Cross-modal pedestrian re-identification method based on local information learning |
CN112528866A (en) * | 2020-12-14 | 2021-03-19 | 奥比中光科技集团股份有限公司 | Cross-modal face recognition method, device, equipment and storage medium |
CN113487609A (en) * | 2021-09-06 | 2021-10-08 | 北京字节跳动网络技术有限公司 | Tissue cavity positioning method and device, readable medium and electronic equipment |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115050044A (en) * | 2022-04-02 | 2022-09-13 | 广西科学院 | Cross-modal pedestrian re-identification method based on MLP-Mixer |
CN114663839A (en) * | 2022-05-12 | 2022-06-24 | 中科智为科技(天津)有限公司 | Method and system for re-identifying blocked pedestrians |
CN114663839B (en) * | 2022-05-12 | 2022-11-04 | 中科智为科技(天津)有限公司 | Method and system for re-identifying blocked pedestrians |
CN114694185A (en) * | 2022-05-31 | 2022-07-01 | 浪潮电子信息产业股份有限公司 | Cross-modal target re-identification method, device, equipment and medium |
CN114694185B (en) * | 2022-05-31 | 2022-11-04 | 浪潮电子信息产业股份有限公司 | Cross-modal target re-identification method, device, equipment and medium |
CN117576520A (en) * | 2024-01-16 | 2024-02-20 | 中国科学技术大学 | Training method of target detection model, target detection method and electronic equipment |
CN117576520B (en) * | 2024-01-16 | 2024-05-17 | 中国科学技术大学 | Training method of target detection model, target detection method and electronic equipment |
CN117934309A (en) * | 2024-03-18 | 2024-04-26 | 昆明理工大学 | Unregistered infrared visible image fusion method based on modal dictionary and feature matching |
CN117934309B (en) * | 2024-03-18 | 2024-05-24 | 昆明理工大学 | Unregistered infrared visible image fusion method based on modal dictionary and feature matching |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20211203