CN113743544A - Cross-modal neural network construction method, pedestrian retrieval method and system - Google Patents

Cross-modal neural network construction method, pedestrian retrieval method and system

Info

Publication number
CN113743544A
Authority
CN
China
Prior art keywords
image
visible light
infrared
cross
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111302766.1A
Other languages
Chinese (zh)
Inventor
张德馨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongkezhiwei Technology Tianjin Co ltd
Original Assignee
Zhongkezhiwei Technology Tianjin Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongkezhiwei Technology Tianjin Co ltd filed Critical Zhongkezhiwei Technology Tianjin Co ltd
Priority to CN202111302766.1A priority Critical patent/CN113743544A/en
Publication of CN113743544A publication Critical patent/CN113743544A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a cross-modal neural network construction method, a pedestrian retrieval method and a system, belongs to the technical field of data analysis and retrieval, and can solve the problems of low cross-modal recognition accuracy and poor pedestrian recognition performance. The method comprises the following steps: acquiring a visible light sample image and a near-infrared sample image, and preprocessing them to obtain visible light block sequence data and near-infrared block sequence data; inputting the visible light block sequence data into a first self-attention mechanism module to obtain a visible light global feature and a visible light local feature; inputting the near-infrared block sequence data into a second self-attention mechanism module to obtain a near-infrared global feature and a near-infrared local feature; and training a first neural network with the visible light global feature, the visible light local feature, the near-infrared global feature and the near-infrared local feature to obtain the cross-modal neural network. The method is used for cross-modal image recognition.

Description

Cross-modal neural network construction method, pedestrian retrieval method and system
Technical Field
The invention relates to a cross-modal neural network construction method, a pedestrian retrieval method and a system, and belongs to the technical field of data analysis and retrieval.
Background
In recent years, continuous progress in artificial intelligence, especially in computer vision, has kept driving development in both academia and industry, from traditional feature extraction to today's deep learning techniques. Pedestrian re-identification is another important human-centered research field following face recognition, and it has very important practical significance and commercial prospects in the real world. Person re-identification aims to extract and retrieve pedestrian features across cameras by relying on monitoring equipment distributed over all regions and scenes.
Most traditional pedestrian re-identification research focuses on problems such as human body posture, background and illumination under visible light, and mainly realizes re-identification through pedestrian feature extraction or generation-based approaches. In practical monitoring systems, especially under insufficient light or in darkness, the camera usually has to switch to an infrared mode to acquire pedestrian or target images, so the problem of re-identifying pedestrians between images captured under ordinary visible light and images captured in the near-infrared modality has to be faced. Cross-modal pedestrian retrieval identifies and compares pedestrians between the visible light state (the natural state) and the near-infrared state (a different spectrum of the pedestrian captured by the camera). Current methods mainly follow two ideas. One extracts pedestrian features from the near-infrared and visible light modalities, for example with separate sub-networks responsible for near-infrared and visible light image input whose outputs are fused in a shared network to learn fusion features. The other uses Generative Adversarial Networks (GANs) to convert pedestrian images of the two modalities into the same modality and then performs single-modality pedestrian re-identification. However, in practical applications, the recognition accuracy of both methods is not high, so the pedestrian recognition effect is poor.
Disclosure of Invention
The invention provides a cross-modal neural network construction method, a pedestrian retrieval method and a system, which can solve the problems of low cross-modal identification precision and poor pedestrian identification effect in the prior art.
In one aspect, the present invention provides a method for constructing a cross-modal neural network, the method comprising:
step 11, acquiring a visible light sample image and a near-infrared sample image, and preprocessing the visible light sample image and the near-infrared sample image to obtain visible light block sequence data and near-infrared block sequence data;
step 12, inputting the visible light block sequence data into a first self-attention mechanism module to obtain a visible light global feature and a visible light local feature; inputting the near-infrared block sequence data into a second self-attention mechanism module to obtain a near-infrared global feature and a near-infrared local feature;
and step 13, training a first neural network by using the visible light global feature, the visible light local feature, the near-infrared global feature and the near-infrared local feature to obtain a cross-modal neural network.
Optionally, the preprocessing the visible light sample image and the near-infrared sample image to obtain visible light block sequence data and near-infrared block sequence data specifically includes:
splitting the visible light sample image and the near-infrared sample image into a plurality of image blocks respectively to form a visible light block sequence set and a near-infrared block sequence set; the image block splitting rules of the visible light sample image and the near infrared sample image are the same;
inputting the visible light block sequence set into a first linear projection module to obtain visible light block sequence data containing position information of each visible light image block; and inputting the near-infrared block sequence set into a second linear projection module to obtain near-infrared block sequence data containing position information of each near-infrared image block.
Optionally, step 13 specifically includes:
inputting the visible light global features into a first neural network, training a first preset mapping matrix, and obtaining a visible light global mapping matrix;
inputting the visible light local features into a first neural network, training a second preset mapping matrix, and obtaining a visible light local mapping matrix;
inputting the near-infrared global features into a first neural network, and training a third preset mapping matrix to obtain a near-infrared global mapping matrix;
inputting the near-infrared local features into a first neural network, and training a fourth preset mapping matrix to obtain a near-infrared local mapping matrix;
and constructing a cross-modal neural network according to the visible light global mapping matrix, the visible light local mapping matrix, the near-infrared global mapping matrix and the near-infrared local mapping matrix.
Optionally, the loss function of the cross-modal neural network is:

$$L_{cross} = L2\!\left(W_g^v f_g^v,\ W_g^n f_g^n\right) + \sum_{j=1}^{k} L2\!\left(W_l^v f_{l,j}^v,\ W_l^n f_{l,j}^n\right)$$

where $L2(\cdot,\cdot)$ denotes the L2 loss between two input vectors; $W_g^v$ is the visible light global mapping matrix; $f_g^v$ is the visible light global feature; $W_g^n$ is the near-infrared global mapping matrix; $f_g^n$ is the near-infrared global feature; $W_l^v$ is the visible light local mapping matrix; $f_{l,j}^v$ is the $j$-th visible light local feature; $W_l^n$ is the near-infrared local mapping matrix; $f_{l,j}^n$ is the $j$-th near-infrared local feature; $j = 1, \dots, k$, and $k$ is the number of split image blocks.
Optionally, the loss function of the first self-attention mechanism module is:

$$L^v = L_g^v + L_l^v$$

where $L_g^v$ is the visible light global loss and $L_l^v$ is the visible light local loss;

the loss function of the second self-attention mechanism module is:

$$L^n = L_g^n + L_l^n$$

where $L_g^n$ is the near-infrared global loss and $L_l^n$ is the near-infrared local loss.
Optionally, the near-infrared global loss, the near-infrared local loss, the visible light global loss and the visible light local loss are calculated using a cross-entropy loss or a triplet loss.
In another aspect, the present invention provides a cross-modal pedestrian retrieval method, including:
step 21, acquiring a target pedestrian image, and preprocessing the target pedestrian image to obtain target image block sequence data; the target pedestrian image is a visible light image or a near infrared image;
step 22, inputting the target image block sequence data into a self-attention mechanism module corresponding to the image type of the target pedestrian image to obtain a target image global feature and a target image local feature;
step 23, inputting the target image global features and the target image local features into a cross-modal neural network to obtain cross-modal global features and cross-modal local features; wherein the cross-modal neural network is any one of the above cross-modal neural networks;
and 24, performing feature matching retrieval on the video set containing the target pedestrian by using the cross-modal global features and/or the cross-modal local features to obtain a cross-modal retrieval result.
Optionally, the preprocessing the target pedestrian image to obtain target image block sequence data specifically includes:
splitting the target pedestrian image into a plurality of image blocks to form a target image block sequence set;
and inputting the target image block sequence set into a linear projection module corresponding to the image type of the target pedestrian image to obtain target image block sequence data containing the position information of each target image block.
Optionally, if the target pedestrian image is a visible light image, the step 23 specifically includes:
inputting the global features of the target image into the cross-modal neural network, and mapping the global features of the target image by using a visible light global mapping matrix to obtain the cross-modal global features;
inputting the local features of the target image into the cross-modal neural network, and mapping the local features of the target image by using a visible light local mapping matrix to obtain the cross-modal local features;
if the target pedestrian image is a near-infrared image, the step 23 specifically includes:
inputting the global features of the target image into the cross-modal neural network, and mapping the global features of the target image by using a near-infrared global mapping matrix to obtain the cross-modal global features;
inputting the local features of the target image into the cross-modal neural network, and mapping the local features of the target image by using a near-infrared local mapping matrix to obtain the cross-modal local features.
In yet another aspect, the present invention provides a cross-modal pedestrian retrieval system, the system comprising:
the preprocessing module is used for acquiring a target pedestrian image and preprocessing the target pedestrian image to obtain target image block sequence data; the target pedestrian image is a visible light image or a near infrared image;
the first feature extraction module is used for inputting the target image block sequence data into a self-attention mechanism module corresponding to the image type of the target pedestrian image to obtain a target image global feature and a target image local feature;
the second feature extraction module is used for inputting the target image global features and the target image local features into a cross-modal neural network to obtain cross-modal global features and cross-modal local features; wherein the cross-modal neural network is any one of the above cross-modal neural networks;
and the retrieval module is used for performing feature matching retrieval on the video set containing the target pedestrian by using the cross-modal global feature and/or the cross-modal local feature to obtain a cross-modal retrieval result.
Optionally, the preprocessing module is specifically configured to:
splitting the target pedestrian image into a plurality of image blocks to form a target image block sequence set;
and inputting the target image block sequence set into a linear projection module corresponding to the image type of the target pedestrian image to obtain target image block sequence data containing the position information of each target image block.
The invention can produce the following beneficial effects:
according to the cross-modal pedestrian retrieval method provided by the invention, the serialized image blocks are sent to the transform module to obtain the global features and the local features of respective modes, and then the features are sent to the feature extraction network, so that the cross-modal global features and the cross-modal local features of cross-modal pedestrians can be simultaneously obtained, the cross-modal global features and the cross-modal local features are utilized to retrieve target pedestrians, and the performance of cross-modal retrieval can be improved. In addition, the learned local features provide convenience for pedestrian retrieval in specific scenes, for example, in the scenes with incomplete pedestrian images such as occlusion and blurring, the use of the local features is beneficial to further improvement of cross-modal retrieval accuracy.
Drawings
FIG. 1 is a flowchart of a cross-modal neural network construction method according to an embodiment of the present invention;
fig. 2 is a flowchart of a cross-modal pedestrian retrieval method according to an embodiment of the present invention;
fig. 3 is a schematic diagram illustrating a principle of a cross-modal pedestrian retrieval method according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a cross-modal neural network feature extraction principle provided in the embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to examples, but the present invention is not limited to these examples.
The embodiment of the invention provides a cross-modal neural network construction method, as shown in fig. 1, the method comprises the following steps:
and 11, acquiring a visible light sample image and a near-infrared sample image, and preprocessing the visible light sample image and the near-infrared sample image to obtain visible light block sequence data and near-infrared block sequence data.
The method specifically comprises the following steps: respectively splitting the visible light sample image and the near-infrared sample image into a plurality of image blocks to form a visible light block sequence set and a near-infrared block sequence set;
inputting the visible light block sequence set into a first linear projection module to obtain visible light block sequence data containing position information of each visible light image block; and inputting the near-infrared block sequence set into a second linear projection module to obtain near-infrared block sequence data containing the position information of each near-infrared image block.
The image block splitting rules for the visible light sample image and the near-infrared sample image are the same.
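For illustration, a minimal sketch (in PyTorch, not part of the original disclosure) of this preprocessing step is given below; the image size, patch size, embedding dimension and module names are assumptions made here for clarity.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into blocks, linearly project each block, and add learnable
    position information (a sketch of one 'linear projection module')."""
    def __init__(self, img_size=256, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution implements "split into blocks + linear projection" in one step.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learnable position embedding: the position information of each image block.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, embed_dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)     # (B, num_patches, embed_dim): block sequence
        return x + self.pos_embed            # block sequence data with position information

# One instance per modality, with the same splitting rule for both branches.
first_linear_projection = PatchEmbedding()   # for visible light sample images
second_linear_projection = PatchEmbedding()  # for near-infrared sample images
```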
Step 12, inputting the visible light block sequence data into a first self-attention mechanism module to obtain a visible light global feature and a visible light local feature; and inputting the near-infrared block sequence data into a second self-attention mechanism module to obtain a near-infrared global feature and a near-infrared local feature.
The self-attention mechanism module (i.e., the Transformer structure) was proposed by Google in the 2017 paper "Attention Is All You Need" and works very well on many natural language processing tasks: each word token in a sentence is transformed into an embedded feature, the self-attention mechanism then aggregates information across tokens, and multi-layer stacked Transformer blocks form the encoder and decoder structures used in natural language processing to complete the corresponding tasks. In recent years the Transformer has also been introduced into computer vision, where the Vision Transformer approach is particularly effective: a Transformer structure pre-trained on a large dataset is fine-tuned on ImageNet and achieves remarkable performance.
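As a minimal sketch of how such a self-attention mechanism module can yield a global feature and local features (a classification token for the global feature of the full image, and the encoded block tokens for the local features), consider the following PyTorch code; the depth, head count and dimensions are assumptions, not values specified by the patent.

```python
import torch
import torch.nn as nn

class ModalityTransformer(nn.Module):
    """A sketch of one self-attention mechanism (Transformer) module."""
    def __init__(self, embed_dim=768, depth=6, num_heads=8):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, block_sequence):        # (B, k, embed_dim) block sequence data
        cls = self.cls_token.expand(block_sequence.size(0), -1, -1)
        x = torch.cat([cls, block_sequence], dim=1)   # prepend the classification token
        x = self.encoder(x)
        global_feature = x[:, 0]              # global feature of the full image
        local_features = x[:, 1:]             # one local feature per image block
        return global_feature, local_features

first_attention_module = ModalityTransformer()   # visible light branch
second_attention_module = ModalityTransformer()  # near-infrared branch
```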
Step 13, training the first neural network by using the visible light global feature, the visible light local feature, the near-infrared global feature and the near-infrared local feature to obtain the cross-modal neural network.
Specifically, the method comprises the following steps: inputting the global visible light features into a first neural network, training a first preset mapping matrix, and obtaining a global visible light mapping matrix;
inputting the local characteristics of the visible light into a first neural network, and training a second preset mapping matrix to obtain a local mapping matrix of the visible light;
inputting the near-infrared global features into the first neural network, and training a third preset mapping matrix to obtain a near-infrared global mapping matrix;
inputting the near-infrared local features into the first neural network, and training a fourth preset mapping matrix to obtain a near-infrared local mapping matrix;
and constructing a cross-modal neural network according to the visible light global mapping matrix, the visible light local mapping matrix, the near-infrared global mapping matrix and the near-infrared local mapping matrix.
Another embodiment of the present invention provides a cross-modal pedestrian retrieval method, as shown in fig. 2, the method includes:
step 21, acquiring a target pedestrian image, and preprocessing the target pedestrian image to obtain target image block sequence data; the target pedestrian image is a visible light image or a near infrared image.
The method specifically comprises the following steps: splitting the target pedestrian image into a plurality of image blocks to form a target image block sequence set;
and inputting the target image block sequence set into a linear projection module corresponding to the image type of the target pedestrian image to obtain target image block sequence data containing the position information of each target image block.
For example, when the target pedestrian image is a visible light image, the target image block sequence set is input into the first linear projection module; and when the target pedestrian image is a near-infrared image, inputting the target image block sequence set into the second linear projection module.
Step 22, inputting the target image block sequence data into a self-attention mechanism module corresponding to the image type of the target pedestrian image to obtain the global feature and the local features of the target image.
Illustratively, when the target pedestrian image is a visible light image, the target image block sequence data is input into the first self-attention mechanism module; and when the target pedestrian image is a near-infrared image, inputting the target image block sequence data into a second self-attention mechanism module.
Step 23, inputting the global features and the local features of the target image into a cross-modal neural network to obtain the cross-modal global features and the cross-modal local features; wherein the cross-modal neural network is any one of the above cross-modal neural networks.
For example, if the target pedestrian image is a visible light image, step 23 specifically includes:
inputting the global features of the target image into a cross-modal neural network, and mapping the global features of the target image by using a visible light global mapping matrix to obtain the cross-modal global features;
inputting the local features of the target image into a cross-modal neural network, and mapping the local features of the target image by using a visible light local mapping matrix to obtain the cross-modal local features;
if the target pedestrian image is a near-infrared image, step 23 specifically includes:
inputting the global features of the target image into a cross-modal neural network, and mapping the global features of the target image by using a near-infrared global mapping matrix to obtain the cross-modal global features;
inputting the local features of the target image into a cross-modal neural network, and mapping the local features of the target image by using a near-infrared local mapping matrix to obtain the cross-modal local features.
Step 24, performing feature matching retrieval on the video set containing the target pedestrian by using the cross-modal global features and/or the cross-modal local features to obtain a cross-modal retrieval result.
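A minimal sketch of the feature matching retrieval in step 24 is given below, assuming the gallery features have already been extracted from the video set and that cosine similarity is used as the matching criterion (the similarity measure is an assumption; the patent only requires feature matching).

```python
import torch
import torch.nn.functional as F

def retrieve(query_feature, gallery_features, top_k=10):
    """Rank gallery entries by cosine similarity to the query's cross-modal feature.
    query_feature: (d,); gallery_features: (N, d) extracted from the video set."""
    q = F.normalize(query_feature, dim=-1)
    g = F.normalize(gallery_features, dim=-1)
    scores = g @ q                                    # cosine similarity per gallery entry
    values, indices = torch.topk(scores, k=min(top_k, g.size(0)))
    return values, indices                            # best-matching frames and their scores
```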
Referring to fig. 3, in the image preprocessing stage, the images of the two modalities, i.e. the visible light image (shown in fig. 3 as a grayscale image, although it is a color image in practice) and the near-infrared image, are each split into a plurality of blocks. The blocks may be non-overlapping or overlapping, but the splitting rules of the two modalities must be consistent, forming a block sequence for the image of each modality. Following the conventional Transformer approach, the block sequence sets of the two modalities are fed into their respective linear projection modules, the position information of each image block is added, and the results are fed into the respective Transformer modules to obtain the global features and local features of each modality. Here, the block-embedding module is learnable: the output corresponding to the classification (class token) encoding layer is the global feature of the full image, and the output corresponding to each block after the encoding layer is a local feature containing its position information within the full image. The global and local features of the visible light and near-infrared modalities are then fed into the cross-modal neural network for feature learning, so that a feature set suitable for cross-modal retrieval, namely the cross-modal global features and the cross-modal local features, is further extracted.
In this process, during training within each modality, the global feature and the local features of that modality first need to converge, that is:

the loss function of the first self-attention mechanism module is

$$L^v = L_g^v + L_l^v$$

where $L_g^v$ is the visible light global loss and $L_l^v$ is the visible light local loss; the loss function of the second self-attention mechanism module is

$$L^n = L_g^n + L_l^n$$

where $L_g^n$ is the near-infrared global loss and $L_l^n$ is the near-infrared local loss. Here the superscript $n$ denotes the near-infrared modality, the superscript $v$ denotes the visible light modality, $k$ denotes the number of image blocks, and $L$ denotes a loss term. The loss terms can be calculated using cross-entropy loss, triplet loss, or a combination of the two to improve the representational capability of the features.
For the cross-modal feature set, it is necessary to consider how to map the global features and local features of the two modalities into better shared global and local features; that is, a global mapping matrix per modality, $W_g^v$ and $W_g^n$, is needed to map the global features into the same space, and similarly local mapping matrices $W_l^v$ and $W_l^n$ are needed for the local features. Referring to fig. 4, $f_g \in \mathbb{R}^{d_1}$ denotes the global feature learned across modalities and adapted to cross-modal retrieval, where $d_1$ is the dimension of the global feature, and $f_{l,j} \in \mathbb{R}^{d_2}$ denotes a local feature learned across modalities and adapted to cross-modal retrieval, where $d_2$ is the dimension of the local feature. The mapped global and local features in the shared space can then be obtained, and the loss function of the cross-modal neural network can be defined as:

$$L_{cross} = L2\!\left(W_g^v f_g^v,\ W_g^n f_g^n\right) + \sum_{j=1}^{k} L2\!\left(W_l^v f_{l,j}^v,\ W_l^n f_{l,j}^n\right)$$

where $L2(\cdot,\cdot)$ denotes the L2 loss between two input vectors; $W_g^v$ is the visible light global mapping matrix; $f_g^v$ is the visible light global feature; $W_g^n$ is the near-infrared global mapping matrix; $f_g^n$ is the near-infrared global feature; $W_l^v$ is the visible light local mapping matrix; $f_{l,j}^v$ is the $j$-th visible light local feature; $W_l^n$ is the near-infrared local mapping matrix; $f_{l,j}^n$ is the $j$-th near-infrared local feature; $j = 1, \dots, k$, and $k$ is the number of split image blocks.
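Using the CrossModalMapping sketch from earlier, this cross-modal loss could be written as follows; this is illustrative, and averaging over the batch is an assumption.

```python
import torch

def cross_modal_loss(mapping, vis_global, vis_local, ir_global, ir_local):
    """L_cross = L2(W_g^v f_g^v, W_g^n f_g^n) + sum_j L2(W_l^v f_{l,j}^v, W_l^n f_{l,j}^n).
    vis_local / ir_local: (B, k, d); vis_global / ir_global: (B, d)."""
    g_v, l_v = mapping.forward_visible(vis_global, vis_local)    # map into the shared space
    g_n, l_n = mapping.forward_infrared(ir_global, ir_local)
    global_term = torch.sum((g_v - g_n) ** 2, dim=-1).mean()     # L2 loss on global features
    local_term = torch.sum((l_v - l_n) ** 2, dim=-1).sum(dim=1).mean()  # summed over the k blocks
    return global_term + local_term
```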
By optimizing this objective function, the corresponding mapping matrices are obtained and a better feature subset is learned. It should be noted that in cross-modal feature learning, the global and local features of the visible light and near-infrared images are learned cooperatively, so that a feature set better adapted to cross-modal recognition is obtained and the cross-modal retrieval performance is improved.
The overall cross-modal feature learning also takes into account the learning of the Transformer network inside each modality, so the overall loss function $L_{total}$ is defined as:

$$L_{total} = L^v + L^n + \lambda \, L_{cross}$$

where $\lambda$ is a loss weight used to balance and adjust the contributions of the intra-modality losses and the cross-modal loss.
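Putting the pieces together, the overall objective could be computed as below; this is a sketch, and the value of the weight is an assumption that would be tuned in practice.

```python
def total_loss(loss_visible, loss_infrared, loss_cross, lam=0.5):
    """L_total = L^v + L^n + lambda * L_cross, where lambda balances the
    intra-modality losses against the cross-modal loss."""
    return loss_visible + loss_infrared + lam * loss_cross
```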
The invention retrieves with the global feature and the local features simultaneously, which improves the accuracy of cross-modal retrieval. In addition, when the global features become unreliable because the global information of a pedestrian is incomplete, e.g. due to occlusion or blurring in complex scenes, the local features can be used for targeted retrieval. That is, when a query image is obtained, a certain local area of the query image may be specified; after the global and local features of the image are extracted, the corresponding local features are used to query a cross-modal database for an image set similar to those local features, and the query result is returned. The invention can thus flexibly apply local features to cross-modal retrieval, improving the retrieval precision and flexibility of the whole system.
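For the occlusion scenario described above, a sketch of local-feature-only matching might look like this; it is illustrative, and it assumes the indices of the local blocks corresponding to the specified region are already known.

```python
import torch
import torch.nn.functional as F

def retrieve_by_local(query_locals, gallery_locals, block_ids, top_k=10):
    """Match only the specified local blocks of the query against the gallery.
    query_locals: (k, d); gallery_locals: (N, k, d); block_ids: indices of the
    local area specified by the user."""
    q = F.normalize(query_locals[block_ids], dim=-1)          # (m, d) selected query blocks
    g = F.normalize(gallery_locals[:, block_ids], dim=-1)     # (N, m, d) same blocks in gallery
    scores = (g * q.unsqueeze(0)).sum(dim=-1).mean(dim=-1)    # mean cosine similarity per entry
    return torch.topk(scores, k=min(top_k, scores.size(0)))
```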
Another embodiment of the present invention provides a cross-modal pedestrian retrieval system, including:
the preprocessing module is used for acquiring a target pedestrian image and preprocessing the target pedestrian image to obtain target image block sequence data; the target pedestrian image is a visible light image or a near infrared image;
the first feature extraction module is used for inputting the target image block sequence data into a self-attention mechanism module corresponding to the image type of the target pedestrian image to obtain a target image global feature and a target image local feature;
the second feature extraction module is used for inputting the global features and the local features of the target image into the cross-modal neural network to obtain the cross-modal global features and the cross-modal local features; wherein the cross-modal neural network is any one of the above cross-modal neural networks;
and the retrieval module is used for performing feature matching retrieval on the video set containing the target pedestrian by utilizing the cross-modal global features and/or the cross-modal local features to obtain a cross-modal retrieval result.
Further, the preprocessing module is specifically configured to:
splitting the target pedestrian image into a plurality of image blocks to form a target image block sequence set;
and inputting the target image block sequence set into a linear projection module corresponding to the image type of the target pedestrian image to obtain target image block sequence data containing the position information of each target image block.
The functional explanation of each module in the above retrieval system may refer to the explanation of each step in the retrieval method, and is not described herein again.
Although the present application has been described with reference to a few embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the application as defined by the appended claims.

Claims (10)

1. A method for constructing a cross-modal neural network, the method comprising:
step 11, acquiring a visible light sample image and a near-infrared sample image, and preprocessing the visible light sample image and the near-infrared sample image to obtain visible light block sequence data and near-infrared block sequence data;
step 12, inputting the visible light block sequence data into a first self-attention mechanism module to obtain a visible light global feature and a visible light local feature; inputting the near-infrared block sequence data into a second self-attention mechanism module to obtain a near-infrared global feature and a near-infrared local feature;
and step 13, training a first neural network by using the visible light global feature, the visible light local feature, the near-infrared global feature and the near-infrared local feature to obtain a cross-modal neural network.
2. The method according to claim 1, wherein the preprocessing the visible light sample image and the near-infrared sample image to obtain visible light block sequence data and near-infrared block sequence data comprises:
splitting the visible light sample image and the near-infrared sample image into a plurality of image blocks respectively to form a visible light block sequence set and a near-infrared block sequence set; the image block splitting rules of the visible light sample image and the near infrared sample image are the same;
inputting the visible light block sequence set into a first linear projection module to obtain visible light block sequence data containing position information of each visible light image block; and inputting the near-infrared block sequence set into a second linear projection module to obtain near-infrared block sequence data containing position information of each near-infrared image block.
3. The method according to claim 1, characterized in that said step 13 comprises in particular:
inputting the visible light global features into a first neural network, training a first preset mapping matrix, and obtaining a visible light global mapping matrix;
inputting the visible light local features into a first neural network, training a second preset mapping matrix, and obtaining a visible light local mapping matrix;
inputting the near-infrared global features into a first neural network, and training a third preset mapping matrix to obtain a near-infrared global mapping matrix;
inputting the near-infrared local features into a first neural network, and training a fourth preset mapping matrix to obtain a near-infrared local mapping matrix;
and constructing a cross-modal neural network according to the visible light global mapping matrix, the visible light local mapping matrix, the near-infrared global mapping matrix and the near-infrared local mapping matrix.
4. The method of claim 1, wherein the loss function of the cross-modal neural network is:

$$L_{cross} = L2\!\left(W_g^v f_g^v,\ W_g^n f_g^n\right) + \sum_{j=1}^{k} L2\!\left(W_l^v f_{l,j}^v,\ W_l^n f_{l,j}^n\right)$$

wherein $L2(\cdot,\cdot)$ denotes the L2 loss between two input vectors; $W_g^v$ is the visible light global mapping matrix; $f_g^v$ is the visible light global feature; $W_g^n$ is the near-infrared global mapping matrix; $f_g^n$ is the near-infrared global feature; $W_l^v$ is the visible light local mapping matrix; $f_{l,j}^v$ is the $j$-th visible light local feature; $W_l^n$ is the near-infrared local mapping matrix; $f_{l,j}^n$ is the $j$-th near-infrared local feature; $j = 1, \dots, k$, and $k$ is the number of split image blocks.
5. The method of claim 1, wherein the loss function of the first self-attention mechanism module is:

$$L^v = L_g^v + L_l^v$$

wherein $L_g^v$ is the visible light global loss and $L_l^v$ is the visible light local loss;

the loss function of the second self-attention mechanism module is:

$$L^n = L_g^n + L_l^n$$

wherein $L_g^n$ is the near-infrared global loss and $L_l^n$ is the near-infrared local loss.
6. A cross-modal pedestrian retrieval method, comprising:
step 21, acquiring a target pedestrian image, and preprocessing the target pedestrian image to obtain target image block sequence data; the target pedestrian image is a visible light image or a near infrared image;
step 22, inputting the target image block sequence data into a self-attention mechanism module corresponding to the image type of the target pedestrian image to obtain a target image global feature and a target image local feature;
step 23, inputting the target image global features and the target image local features into a cross-modal neural network to obtain cross-modal global features and cross-modal local features; wherein the cross-modal neural network is the cross-modal neural network of any one of claims 1 to 5;
and 24, performing feature matching retrieval on the video set containing the target pedestrian by using the cross-modal global features and/or the cross-modal local features to obtain a cross-modal retrieval result.
7. The method according to claim 6, wherein the preprocessing the image of the target pedestrian to obtain the target image block sequence data specifically comprises:
splitting the target pedestrian image into a plurality of image blocks to form a target image block sequence set;
and inputting the target image block sequence set into a linear projection module corresponding to the image type of the target pedestrian image to obtain target image block sequence data containing the position information of each target image block.
8. The method according to claim 6, wherein if the target pedestrian image is a visible light image, the step 23 specifically includes:
inputting the global features of the target image into the cross-modal neural network, and mapping the global features of the target image by using a visible light global mapping matrix to obtain the cross-modal global features;
inputting the local features of the target image into the cross-modal neural network, and mapping the local features of the target image by using a visible light local mapping matrix to obtain the cross-modal local features;
if the target pedestrian image is a near-infrared image, the step 23 specifically includes:
inputting the global features of the target image into the cross-modal neural network, and mapping the global features of the target image by using a near-infrared global mapping matrix to obtain the cross-modal global features;
inputting the local features of the target image into the cross-modal neural network, and mapping the local features of the target image by using a near-infrared local mapping matrix to obtain the cross-modal local features.
9. A cross-modal pedestrian retrieval system, the system comprising:
the preprocessing module is used for acquiring a target pedestrian image and preprocessing the target pedestrian image to obtain target image block sequence data; the target pedestrian image is a visible light image or a near infrared image;
the first feature extraction module is used for inputting the target image block sequence data into a self-attention mechanism module corresponding to the image type of the target pedestrian image to obtain a target image global feature and a target image local feature;
the second feature extraction module is used for inputting the target image global features and the target image local features into a cross-modal neural network to obtain cross-modal global features and cross-modal local features; wherein the cross-modal neural network is the cross-modal neural network of any one of claims 1 to 5;
and the retrieval module is used for performing feature matching retrieval on the video set containing the target pedestrian by using the cross-modal global feature and/or the cross-modal local feature to obtain a cross-modal retrieval result.
10. The system of claim 9, wherein the preprocessing module is specifically configured to:
splitting the target pedestrian image into a plurality of image blocks to form a target image block sequence set;
and inputting the target image block sequence set into a linear projection module corresponding to the image type of the target pedestrian image to obtain target image block sequence data containing the position information of each target image block.
CN202111302766.1A 2021-11-05 2021-11-05 Cross-modal neural network construction method, pedestrian retrieval method and system Pending CN113743544A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111302766.1A CN113743544A (en) 2021-11-05 2021-11-05 Cross-modal neural network construction method, pedestrian retrieval method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111302766.1A CN113743544A (en) 2021-11-05 2021-11-05 Cross-modal neural network construction method, pedestrian retrieval method and system

Publications (1)

Publication Number Publication Date
CN113743544A true CN113743544A (en) 2021-12-03

Family

ID=78727537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111302766.1A Pending CN113743544A (en) 2021-11-05 2021-11-05 Cross-modal neural network construction method, pedestrian retrieval method and system

Country Status (1)

Country Link
CN (1) CN113743544A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210056764A1 (en) * 2018-05-22 2021-02-25 Magic Leap, Inc. Transmodal input fusion for a wearable system
CN109255047A (en) * 2018-07-18 2019-01-22 西安电子科技大学 Based on the complementary semantic mutual search method of image-text being aligned and symmetrically retrieve
CN110598654A (en) * 2019-09-18 2019-12-20 合肥工业大学 Multi-granularity cross modal feature fusion pedestrian re-identification method and re-identification system
CN112434796A (en) * 2020-12-09 2021-03-02 同济大学 Cross-modal pedestrian re-identification method based on local information learning
CN112528866A (en) * 2020-12-14 2021-03-19 奥比中光科技集团股份有限公司 Cross-modal face recognition method, device, equipment and storage medium
CN113487609A (en) * 2021-09-06 2021-10-08 北京字节跳动网络技术有限公司 Tissue cavity positioning method and device, readable medium and electronic equipment

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115050044A (en) * 2022-04-02 2022-09-13 广西科学院 Cross-modal pedestrian re-identification method based on MLP-Mixer
CN114663839A (en) * 2022-05-12 2022-06-24 中科智为科技(天津)有限公司 Method and system for re-identifying blocked pedestrians
CN114663839B (en) * 2022-05-12 2022-11-04 中科智为科技(天津)有限公司 Method and system for re-identifying blocked pedestrians
CN114694185A (en) * 2022-05-31 2022-07-01 浪潮电子信息产业股份有限公司 Cross-modal target re-identification method, device, equipment and medium
CN114694185B (en) * 2022-05-31 2022-11-04 浪潮电子信息产业股份有限公司 Cross-modal target re-identification method, device, equipment and medium
CN117576520A (en) * 2024-01-16 2024-02-20 中国科学技术大学 Training method of target detection model, target detection method and electronic equipment
CN117576520B (en) * 2024-01-16 2024-05-17 中国科学技术大学 Training method of target detection model, target detection method and electronic equipment
CN117934309A (en) * 2024-03-18 2024-04-26 昆明理工大学 Unregistered infrared visible image fusion method based on modal dictionary and feature matching
CN117934309B (en) * 2024-03-18 2024-05-24 昆明理工大学 Unregistered infrared visible image fusion method based on modal dictionary and feature matching


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20211203)