CN113743544A - Cross-modal neural network construction method, pedestrian retrieval method and system - Google Patents

Cross-modal neural network construction method, pedestrian retrieval method and system

Info

Publication number
CN113743544A
Authority
CN
China
Prior art keywords
image
visible light
infrared
cross
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111302766.1A
Other languages
Chinese (zh)
Inventor
张德馨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongkezhiwei Technology Tianjin Co ltd
Original Assignee
Zhongkezhiwei Technology Tianjin Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongkezhiwei Technology Tianjin Co ltd filed Critical Zhongkezhiwei Technology Tianjin Co ltd
Priority to CN202111302766.1A priority Critical patent/CN113743544A/en
Publication of CN113743544A publication Critical patent/CN113743544A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a cross-modal neural network construction method, a pedestrian retrieval method and a system, belongs to the technical field of data analysis and retrieval, and can solve the problems of low cross-modal recognition accuracy and poor pedestrian recognition performance. The method comprises the following steps: acquiring a visible light sample image and a near-infrared sample image, and preprocessing them to obtain visible light block sequence data and near-infrared block sequence data; inputting the visible light block sequence data into a first self-attention mechanism module to obtain a visible light global feature and a visible light local feature; inputting the near-infrared block sequence data into a second self-attention mechanism module to obtain a near-infrared global feature and a near-infrared local feature; and training a first neural network with the visible light global feature, the visible light local feature, the near-infrared global feature and the near-infrared local feature to obtain the cross-modal neural network. The method is used for cross-modal image recognition.

Description

Cross-modal neural network construction method, pedestrian retrieval method and system
Technical Field
The invention relates to a cross-modal neural network construction method, a pedestrian retrieval method and a system, and belongs to the technical field of data analysis and retrieval.
Background
In recent years, continuous progress in artificial intelligence, especially in computer vision, has kept driving development in both academia and industry, from traditional feature extraction to today's deep learning techniques. Pedestrian re-identification is another important human-centered research field following face recognition, and it has very important practical significance and commercial prospects in the real world. Person re-identification aims to extract and retrieve pedestrian features across cameras by relying on monitoring equipment distributed over all regions and scenes.
Most traditional pedestrian re-identification research focuses on problems such as human body posture, background and illumination under visible light, and mainly realizes re-identification through pedestrian feature extraction or generation-based approaches. In practical monitoring systems, especially under insufficient light or in darkness, the camera usually has to switch to an infrared mode to acquire pedestrian or target images, so the problem of re-identifying pedestrians between images captured under ordinary visible light and images captured in the near-infrared modality has to be faced. Cross-modal pedestrian retrieval identifies and compares pedestrians between the visible light state (the natural state) and the near-infrared state (a different spectrum of the pedestrian captured by the camera). Current methods mainly follow two ideas. One extracts pedestrian features from the near-infrared and visible light modalities, for example with separate sub-networks responsible for near-infrared and visible light image input whose outputs are fused in a shared network to learn fusion features. The other uses Generative Adversarial Networks (GANs) to convert pedestrian images of the two modalities into the same modality and then performs single-modality pedestrian re-identification. However, in practical applications, the recognition accuracy of both methods is not high, so the pedestrian recognition effect is poor.
Disclosure of Invention
The invention provides a cross-modal neural network construction method, a pedestrian retrieval method and a system, which can solve the problems of low cross-modal identification precision and poor pedestrian identification effect in the prior art.
In one aspect, the present invention provides a method for constructing a cross-modal neural network, the method comprising:
step 11, acquiring a visible light sample image and a near-infrared sample image, and preprocessing the visible light sample image and the near-infrared sample image to obtain visible light block sequence data and near-infrared block sequence data;
step 12, inputting the visible light block sequence data into a first self-attention mechanism module to obtain a visible light global feature and a visible light local feature; inputting the near-infrared block sequence data into a second self-attention mechanism module to obtain a near-infrared global feature and a near-infrared local feature;
and step 13, training a first neural network by using the visible light global feature, the visible light local feature, the near-infrared global feature and the near-infrared local feature to obtain a cross-modal neural network.
Optionally, the preprocessing the visible light sample image and the near-infrared sample image to obtain visible light block sequence data and near-infrared block sequence data specifically includes:
splitting the visible light sample image and the near-infrared sample image into a plurality of image blocks respectively to form a visible light block sequence set and a near-infrared block sequence set; the image block splitting rules of the visible light sample image and the near infrared sample image are the same;
inputting the visible light block sequence set into a first linear projection module to obtain visible light block sequence data containing position information of each visible light image block; and inputting the near-infrared block sequence set into a second linear projection module to obtain near-infrared block sequence data containing position information of each near-infrared image block.
Optionally, step 13 specifically includes:
inputting the visible light global features into a first neural network, training a first preset mapping matrix, and obtaining a visible light global mapping matrix;
inputting the visible light local features into a first neural network, training a second preset mapping matrix, and obtaining a visible light local mapping matrix;
inputting the near-infrared global features into a first neural network, and training a third preset mapping matrix to obtain a near-infrared global mapping matrix;
inputting the near-infrared local features into a first neural network, and training a fourth preset mapping matrix to obtain a near-infrared local mapping matrix;
and constructing a cross-modal neural network according to the visible light global mapping matrix, the visible light local mapping matrix, the near-infrared global mapping matrix and the near-infrared local mapping matrix.
Optionally, the loss function of the cross-modal neural network is:

$$L_{cross} = L2\!\left(W_g^v f_g^v,\ W_g^n f_g^n\right) + \sum_{j=1}^{k} L2\!\left(W_l^v f_{l,j}^v,\ W_l^n f_{l,j}^n\right)$$

where $L2(\cdot,\cdot)$ denotes the L2 loss between two input vectors; $W_g^v$ is the visible light global mapping matrix; $f_g^v$ is the visible light global feature; $W_g^n$ is the near-infrared global mapping matrix; $f_g^n$ is the near-infrared global feature; $W_l^v$ is the visible light local mapping matrix; $f_{l,j}^v$ is the $j$-th visible light local feature; $W_l^n$ is the near-infrared local mapping matrix; $f_{l,j}^n$ is the $j$-th near-infrared local feature; $j = 1, \dots, k$, and $k$ is the number of split image blocks.
Optionally, the loss function of the first self-attention mechanism module is:

$$L^v = L_g^v + L_l^v$$

where $L_g^v$ is the visible light global loss and $L_l^v$ is the visible light local loss;

the loss function of the second self-attention mechanism module is:

$$L^n = L_g^n + L_l^n$$

where $L_g^n$ is the near-infrared global loss and $L_l^n$ is the near-infrared local loss.
Optionally, the near-infrared global loss, the near-infrared local loss, the visible light global loss and the visible light local loss are calculated using a cross-entropy loss or a triplet loss.
In another aspect, the present invention provides a cross-modal pedestrian retrieval method, including:
step 21, acquiring a target pedestrian image, and preprocessing the target pedestrian image to obtain target image block sequence data; the target pedestrian image is a visible light image or a near infrared image;
step 22, inputting the target image block sequence data into a self-attention mechanism module corresponding to the image type of the target pedestrian image to obtain a target image global feature and a target image local feature;
step 23, inputting the target image global features and the target image local features into a cross-modal neural network to obtain cross-modal global features and cross-modal local features; wherein the cross-modal neural network is any one of the above cross-modal neural networks;
and 24, performing feature matching retrieval on the video set containing the target pedestrian by using the cross-modal global features and/or the cross-modal local features to obtain a cross-modal retrieval result.
Optionally, the preprocessing the target pedestrian image to obtain target image block sequence data specifically includes:
splitting the target pedestrian image into a plurality of image blocks to form a target image block sequence set;
and inputting the target image block sequence set into a linear projection module corresponding to the image type of the target pedestrian image to obtain target image block sequence data containing the position information of each target image block.
Optionally, if the target pedestrian image is a visible light image, the step 23 specifically includes:
inputting the global features of the target image into the cross-modal neural network, and mapping the global features of the target image by using a visible light global mapping matrix to obtain the cross-modal global features;
inputting the local features of the target image into the cross-modal neural network, and mapping the local features of the target image by using a visible light local mapping matrix to obtain the cross-modal local features;
if the target pedestrian image is a near-infrared image, the step 23 specifically includes:
inputting the global features of the target image into the cross-modal neural network, and mapping the global features of the target image by using a near-infrared global mapping matrix to obtain the cross-modal global features;
inputting the local features of the target image into the cross-modal neural network, and mapping the local features of the target image by using a near-infrared local mapping matrix to obtain the cross-modal local features.
In yet another aspect, the present invention provides a cross-modal pedestrian retrieval system, the system comprising:
the preprocessing module is used for acquiring a target pedestrian image and preprocessing the target pedestrian image to obtain target image block sequence data; the target pedestrian image is a visible light image or a near infrared image;
the first feature extraction module is used for inputting the target image block sequence data into a self-attention mechanism module corresponding to the image type of the target pedestrian image to obtain a target image global feature and a target image local feature;
the second feature extraction module is used for inputting the target image global features and the target image local features into a cross-modal neural network to obtain cross-modal global features and cross-modal local features; wherein the cross-modal neural network is any one of the above cross-modal neural networks;
and the retrieval module is used for performing feature matching retrieval on the video set containing the target pedestrian by using the cross-modal global feature and/or the cross-modal local feature to obtain a cross-modal retrieval result.
Optionally, the preprocessing module is specifically configured to:
splitting the target pedestrian image into a plurality of image blocks to form a target image block sequence set;
and inputting the target image block sequence set into a linear projection module corresponding to the image type of the target pedestrian image to obtain target image block sequence data containing the position information of each target image block.
The invention can produce the following beneficial effects:
according to the cross-modal pedestrian retrieval method provided by the invention, the serialized image blocks are sent to the transform module to obtain the global features and the local features of respective modes, and then the features are sent to the feature extraction network, so that the cross-modal global features and the cross-modal local features of cross-modal pedestrians can be simultaneously obtained, the cross-modal global features and the cross-modal local features are utilized to retrieve target pedestrians, and the performance of cross-modal retrieval can be improved. In addition, the learned local features provide convenience for pedestrian retrieval in specific scenes, for example, in the scenes with incomplete pedestrian images such as occlusion and blurring, the use of the local features is beneficial to further improvement of cross-modal retrieval accuracy.
Drawings
FIG. 1 is a flowchart of a cross-modal neural network construction method according to an embodiment of the present invention;
fig. 2 is a flowchart of a cross-modal pedestrian retrieval method according to an embodiment of the present invention;
fig. 3 is a schematic diagram illustrating a principle of a cross-modal pedestrian retrieval method according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a cross-modal neural network feature extraction principle provided in the embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to examples, but the present invention is not limited to these examples.
The embodiment of the invention provides a cross-modal neural network construction method, as shown in fig. 1, the method comprises the following steps:
and 11, acquiring a visible light sample image and a near-infrared sample image, and preprocessing the visible light sample image and the near-infrared sample image to obtain visible light block sequence data and near-infrared block sequence data.
The method specifically comprises the following steps: respectively splitting the visible light sample image and the near-infrared sample image into a plurality of image blocks to form a visible light block sequence set and a near-infrared block sequence set;
inputting the visible light block sequence set into a first linear projection module to obtain visible light block sequence data containing position information of each visible light image block; and inputting the near-infrared block sequence set into a second linear projection module to obtain near-infrared block sequence data containing the position information of each near-infrared image block.
The image block splitting rules for the visible light sample image and the near-infrared sample image are the same.
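For illustration, a minimal sketch (in PyTorch, not part of the original disclosure) of this preprocessing step is given below; the image size, patch size, embedding dimension and module names are assumptions made here for clarity.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into blocks, linearly project each block, and add learnable
    position information (a sketch of one 'linear projection module')."""
    def __init__(self, img_size=256, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution implements "split into blocks + linear projection" in one step.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learnable position embedding: the position information of each image block.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, embed_dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)     # (B, num_patches, embed_dim): block sequence
        return x + self.pos_embed            # block sequence data with position information

# One instance per modality, with the same splitting rule for both branches.
first_linear_projection = PatchEmbedding()   # for visible light sample images
second_linear_projection = PatchEmbedding()  # for near-infrared sample images
```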
Step 12, inputting the visible light block sequence data into a first self-attention mechanism module to obtain a visible light global feature and a visible light local feature; and inputting the near-infrared block sequence data into a second self-attention mechanism module to obtain a near-infrared global feature and a near-infrared local feature.
The self-attention mechanism module (i.e., the Transformer structure) was proposed by Google in the 2017 paper "Attention Is All You Need" and works very well on many natural language processing tasks: each word token in a sentence is transformed into an embedded feature, the self-attention mechanism then aggregates information across tokens, and multi-layer stacked Transformer blocks form the encoder and decoder structures used in natural language processing to complete the corresponding tasks. In recent years the Transformer has also been introduced into computer vision, where the Vision Transformer approach is particularly effective: a Transformer structure pre-trained on a large dataset is fine-tuned on ImageNet and achieves remarkable performance.
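As a minimal sketch of how such a self-attention mechanism module can yield a global feature and local features (a classification token for the global feature of the full image, and the encoded block tokens for the local features), consider the following PyTorch code; the depth, head count and dimensions are assumptions, not values specified by the patent.

```python
import torch
import torch.nn as nn

class ModalityTransformer(nn.Module):
    """A sketch of one self-attention mechanism (Transformer) module."""
    def __init__(self, embed_dim=768, depth=6, num_heads=8):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, block_sequence):        # (B, k, embed_dim) block sequence data
        cls = self.cls_token.expand(block_sequence.size(0), -1, -1)
        x = torch.cat([cls, block_sequence], dim=1)   # prepend the classification token
        x = self.encoder(x)
        global_feature = x[:, 0]              # global feature of the full image
        local_features = x[:, 1:]             # one local feature per image block
        return global_feature, local_features

first_attention_module = ModalityTransformer()   # visible light branch
second_attention_module = ModalityTransformer()  # near-infrared branch
```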
Step 13, training the first neural network by using the visible light global feature, the visible light local feature, the near-infrared global feature and the near-infrared local feature to obtain the cross-modal neural network.
Specifically, the method comprises the following steps: inputting the global visible light features into a first neural network, training a first preset mapping matrix, and obtaining a global visible light mapping matrix;
inputting the local characteristics of the visible light into a first neural network, and training a second preset mapping matrix to obtain a local mapping matrix of the visible light;
inputting the near-infrared global features into the first neural network, and training a third preset mapping matrix to obtain a near-infrared global mapping matrix;
inputting the near-infrared local features into the first neural network, and training a fourth preset mapping matrix to obtain a near-infrared local mapping matrix;
and constructing a cross-modal neural network according to the visible light global mapping matrix, the visible light local mapping matrix, the near-infrared global mapping matrix and the near-infrared local mapping matrix.
Another embodiment of the present invention provides a cross-modal pedestrian retrieval method, as shown in fig. 2, the method includes:
step 21, acquiring a target pedestrian image, and preprocessing the target pedestrian image to obtain target image block sequence data; the target pedestrian image is a visible light image or a near infrared image.
The method specifically comprises the following steps: splitting the target pedestrian image into a plurality of image blocks to form a target image block sequence set;
and inputting the target image block sequence set into a linear projection module corresponding to the image type of the target pedestrian image to obtain target image block sequence data containing the position information of each target image block.
For example, when the target pedestrian image is a visible light image, the target image block sequence set is input into the first linear projection module; and when the target pedestrian image is a near-infrared image, inputting the target image block sequence set into the second linear projection module.
Step 22, inputting the target image block sequence data into a self-attention mechanism module corresponding to the image type of the target pedestrian image to obtain the global feature and the local features of the target image.
Illustratively, when the target pedestrian image is a visible light image, the target image block sequence data is input into the first self-attention mechanism module; and when the target pedestrian image is a near-infrared image, inputting the target image block sequence data into a second self-attention mechanism module.
Step 23, inputting the global features and the local features of the target image into a cross-modal neural network to obtain the cross-modal global features and the cross-modal local features; wherein the cross-modal neural network is any one of the above cross-modal neural networks.
For example, if the target pedestrian image is a visible light image, step 23 specifically includes:
inputting the global features of the target image into a cross-modal neural network, and mapping the global features of the target image by using a visible light global mapping matrix to obtain the cross-modal global features;
inputting the local features of the target image into a cross-modal neural network, and mapping the local features of the target image by using a visible light local mapping matrix to obtain the cross-modal local features;
if the target pedestrian image is a near-infrared image, step 23 specifically includes:
inputting the global features of the target image into a cross-modal neural network, and mapping the global features of the target image by using a near-infrared global mapping matrix to obtain the cross-modal global features;
inputting the local features of the target image into a cross-modal neural network, and mapping the local features of the target image by using a near-infrared local mapping matrix to obtain the cross-modal local features.
Step 24, performing feature matching retrieval on the video set containing the target pedestrian by using the cross-modal global features and/or the cross-modal local features to obtain a cross-modal retrieval result.
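A minimal sketch of the feature matching retrieval in step 24 is given below, assuming the gallery features have already been extracted from the video set and that cosine similarity is used as the matching criterion (the similarity measure is an assumption; the patent only requires feature matching).

```python
import torch
import torch.nn.functional as F

def retrieve(query_feature, gallery_features, top_k=10):
    """Rank gallery entries by cosine similarity to the query's cross-modal feature.
    query_feature: (d,); gallery_features: (N, d) extracted from the video set."""
    q = F.normalize(query_feature, dim=-1)
    g = F.normalize(gallery_features, dim=-1)
    scores = g @ q                                    # cosine similarity per gallery entry
    values, indices = torch.topk(scores, k=min(top_k, g.size(0)))
    return values, indices                            # best-matching frames and their scores
```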
Referring to fig. 3, in the image preprocessing stage, the images of the two modalities, i.e. the visible light image (shown in fig. 3 as a grayscale image, although it is a color image in practice) and the near-infrared image, are each split into a plurality of blocks. The blocks may be non-overlapping or overlapping, but the splitting rules of the two modalities must be consistent, forming a block sequence for the image of each modality. Following the conventional Transformer approach, the block sequence sets of the two modalities are fed into their respective linear projection modules, the position information of each image block is added, and the results are fed into the respective Transformer modules to obtain the global features and local features of each modality. Here, the block-embedding module is learnable: the output corresponding to the classification (class token) encoding layer is the global feature of the full image, and the output corresponding to each block after the encoding layer is a local feature containing its position information within the full image. The global and local features of the visible light and near-infrared modalities are then fed into the cross-modal neural network for feature learning, so that a feature set suitable for cross-modal retrieval, namely the cross-modal global features and the cross-modal local features, is further extracted.
In this process, during training within each modality, the global feature and the local features of that modality first need to converge, that is:

the loss function of the first self-attention mechanism module is

$$L^v = L_g^v + L_l^v$$

where $L_g^v$ is the visible light global loss and $L_l^v$ is the visible light local loss; the loss function of the second self-attention mechanism module is

$$L^n = L_g^n + L_l^n$$

where $L_g^n$ is the near-infrared global loss and $L_l^n$ is the near-infrared local loss. Here the superscript $n$ denotes the near-infrared modality, the superscript $v$ denotes the visible light modality, $k$ denotes the number of image blocks, and $L$ denotes a loss term. The loss terms can be calculated using cross-entropy loss, triplet loss, or a combination of the two to improve the representational capability of the features.
For the cross-modal feature set, it is necessary to consider how to map the global features and local features of the two modalities into better shared global and local features; that is, a global mapping matrix per modality, $W_g^v$ and $W_g^n$, is needed to map the global features into the same space, and similarly local mapping matrices $W_l^v$ and $W_l^n$ are needed for the local features. Referring to fig. 4, $f_g \in \mathbb{R}^{d_1}$ denotes the global feature learned across modalities and adapted to cross-modal retrieval, where $d_1$ is the dimension of the global feature, and $f_{l,j} \in \mathbb{R}^{d_2}$ denotes a local feature learned across modalities and adapted to cross-modal retrieval, where $d_2$ is the dimension of the local feature. The mapped global and local features in the shared space can then be obtained, and the loss function of the cross-modal neural network can be defined as:

$$L_{cross} = L2\!\left(W_g^v f_g^v,\ W_g^n f_g^n\right) + \sum_{j=1}^{k} L2\!\left(W_l^v f_{l,j}^v,\ W_l^n f_{l,j}^n\right)$$

where $L2(\cdot,\cdot)$ denotes the L2 loss between two input vectors; $W_g^v$ is the visible light global mapping matrix; $f_g^v$ is the visible light global feature; $W_g^n$ is the near-infrared global mapping matrix; $f_g^n$ is the near-infrared global feature; $W_l^v$ is the visible light local mapping matrix; $f_{l,j}^v$ is the $j$-th visible light local feature; $W_l^n$ is the near-infrared local mapping matrix; $f_{l,j}^n$ is the $j$-th near-infrared local feature; $j = 1, \dots, k$, and $k$ is the number of split image blocks.
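Using the CrossModalMapping sketch from earlier, this cross-modal loss could be written as follows; this is illustrative, and averaging over the batch is an assumption.

```python
import torch

def cross_modal_loss(mapping, vis_global, vis_local, ir_global, ir_local):
    """L_cross = L2(W_g^v f_g^v, W_g^n f_g^n) + sum_j L2(W_l^v f_{l,j}^v, W_l^n f_{l,j}^n).
    vis_local / ir_local: (B, k, d); vis_global / ir_global: (B, d)."""
    g_v, l_v = mapping.forward_visible(vis_global, vis_local)    # map into the shared space
    g_n, l_n = mapping.forward_infrared(ir_global, ir_local)
    global_term = torch.sum((g_v - g_n) ** 2, dim=-1).mean()     # L2 loss on global features
    local_term = torch.sum((l_v - l_n) ** 2, dim=-1).sum(dim=1).mean()  # summed over the k blocks
    return global_term + local_term
```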
By optimizing this objective function, the corresponding mapping matrices are obtained and a better feature subset is learned. It should be noted that in cross-modal feature learning, the global and local features of the visible light and near-infrared images are learned cooperatively, so that a feature set better adapted to cross-modal recognition is obtained and the cross-modal retrieval performance is improved.
The overall cross-modal feature learning also takes into account the learning of the Transformer network inside each modality, so the overall loss function $L_{total}$ is defined as:

$$L_{total} = L^v + L^n + \lambda \, L_{cross}$$

where $\lambda$ is a loss weight used to balance and adjust the contributions of the intra-modality losses and the cross-modal loss.
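Putting the pieces together, the overall objective could be computed as below; this is a sketch, and the value of the weight is an assumption that would be tuned in practice.

```python
def total_loss(loss_visible, loss_infrared, loss_cross, lam=0.5):
    """L_total = L^v + L^n + lambda * L_cross, where lambda balances the
    intra-modality losses against the cross-modal loss."""
    return loss_visible + loss_infrared + lam * loss_cross
```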
The invention retrieves with the global feature and the local features simultaneously, which improves the accuracy of cross-modal retrieval. In addition, when the global features become unreliable because the global information of a pedestrian is incomplete, e.g. due to occlusion or blurring in complex scenes, the local features can be used for targeted retrieval. That is, when a query image is obtained, a certain local area of the query image may be specified; after the global and local features of the image are extracted, the corresponding local features are used to query a cross-modal database for an image set similar to those local features, and the query result is returned. The invention can thus flexibly apply local features to cross-modal retrieval, improving the retrieval precision and flexibility of the whole system.
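For the occlusion scenario described above, a sketch of local-feature-only matching might look like this; it is illustrative, and it assumes the indices of the local blocks corresponding to the specified region are already known.

```python
import torch
import torch.nn.functional as F

def retrieve_by_local(query_locals, gallery_locals, block_ids, top_k=10):
    """Match only the specified local blocks of the query against the gallery.
    query_locals: (k, d); gallery_locals: (N, k, d); block_ids: indices of the
    local area specified by the user."""
    q = F.normalize(query_locals[block_ids], dim=-1)          # (m, d) selected query blocks
    g = F.normalize(gallery_locals[:, block_ids], dim=-1)     # (N, m, d) same blocks in gallery
    scores = (g * q.unsqueeze(0)).sum(dim=-1).mean(dim=-1)    # mean cosine similarity per entry
    return torch.topk(scores, k=min(top_k, scores.size(0)))
```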
Another embodiment of the present invention provides a cross-modal pedestrian retrieval system, including:
the preprocessing module is used for acquiring a target pedestrian image and preprocessing the target pedestrian image to obtain target image block sequence data; the target pedestrian image is a visible light image or a near infrared image;
the first feature extraction module is used for inputting the target image block sequence data into a self-attention mechanism module corresponding to the image type of the target pedestrian image to obtain a target image global feature and a target image local feature;
the second feature extraction module is used for inputting the global features and the local features of the target image into the cross-modal neural network to obtain the cross-modal global features and the cross-modal local features; wherein the cross-modal neural network is any one of the above cross-modal neural networks;
and the retrieval module is used for performing feature matching retrieval on the video set containing the target pedestrian by utilizing the cross-modal global features and/or the cross-modal local features to obtain a cross-modal retrieval result.
Further, the preprocessing module is specifically configured to:
splitting the target pedestrian image into a plurality of image blocks to form a target image block sequence set;
and inputting the target image block sequence set into a linear projection module corresponding to the image type of the target pedestrian image to obtain target image block sequence data containing the position information of each target image block.
The functional explanation of each module in the above retrieval system may refer to the explanation of each step in the retrieval method, and is not described herein again.
Although the present application has been described with reference to a few embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the application as defined by the appended claims.

Claims (10)

1. A method for constructing a cross-modal neural network, the method comprising:
step 11, acquiring a visible light sample image and a near-infrared sample image, and preprocessing the visible light sample image and the near-infrared sample image to obtain visible light block sequence data and near-infrared block sequence data;
step 12, inputting the visible light block sequence data into a first self-attention mechanism module to obtain a visible light global feature and a visible light local feature; inputting the near-infrared block sequence data into a second self-attention mechanism module to obtain a near-infrared global feature and a near-infrared local feature;
and step 13, training a first neural network by using the visible light global feature, the visible light local feature, the near-infrared global feature and the near-infrared local feature to obtain a cross-modal neural network.
2. The method according to claim 1, wherein the preprocessing the visible light sample image and the near-infrared sample image to obtain visible light block sequence data and near-infrared block sequence data comprises:
splitting the visible light sample image and the near-infrared sample image into a plurality of image blocks respectively to form a visible light block sequence set and a near-infrared block sequence set; the image block splitting rules of the visible light sample image and the near infrared sample image are the same;
inputting the visible light block sequence set into a first linear projection module to obtain visible light block sequence data containing position information of each visible light image block; and inputting the near-infrared block sequence set into a second linear projection module to obtain near-infrared block sequence data containing position information of each near-infrared image block.
3. The method according to claim 1, characterized in that said step 13 comprises in particular:
inputting the visible light global features into a first neural network, training a first preset mapping matrix, and obtaining a visible light global mapping matrix;
inputting the visible light local features into a first neural network, training a second preset mapping matrix, and obtaining a visible light local mapping matrix;
inputting the near-infrared global features into a first neural network, and training a third preset mapping matrix to obtain a near-infrared global mapping matrix;
inputting the near-infrared local features into a first neural network, and training a fourth preset mapping matrix to obtain a near-infrared local mapping matrix;
and constructing a cross-modal neural network according to the visible light global mapping matrix, the visible light local mapping matrix, the near-infrared global mapping matrix and the near-infrared local mapping matrix.
4. The method of claim 1, wherein the loss function of the cross-modal neural network is:

$$L_{cross} = L2\!\left(W_g^v f_g^v,\ W_g^n f_g^n\right) + \sum_{j=1}^{k} L2\!\left(W_l^v f_{l,j}^v,\ W_l^n f_{l,j}^n\right)$$

wherein $L2(\cdot,\cdot)$ denotes the L2 loss between two input vectors; $W_g^v$ is the visible light global mapping matrix; $f_g^v$ is the visible light global feature; $W_g^n$ is the near-infrared global mapping matrix; $f_g^n$ is the near-infrared global feature; $W_l^v$ is the visible light local mapping matrix; $f_{l,j}^v$ is the $j$-th visible light local feature; $W_l^n$ is the near-infrared local mapping matrix; $f_{l,j}^n$ is the $j$-th near-infrared local feature; $j = 1, \dots, k$, and $k$ is the number of split image blocks.
5. The method of claim 1, wherein the loss function of the first self-attention mechanism module is:

$$L^v = L_g^v + L_l^v$$

wherein $L_g^v$ is the visible light global loss and $L_l^v$ is the visible light local loss;

the loss function of the second self-attention mechanism module is:

$$L^n = L_g^n + L_l^n$$

wherein $L_g^n$ is the near-infrared global loss and $L_l^n$ is the near-infrared local loss.
6. A cross-modal pedestrian retrieval method, comprising:
step 21, acquiring a target pedestrian image, and preprocessing the target pedestrian image to obtain target image block sequence data; the target pedestrian image is a visible light image or a near infrared image;
step 22, inputting the target image block sequence data into a self-attention mechanism module corresponding to the image type of the target pedestrian image to obtain a target image global feature and a target image local feature;
step 23, inputting the target image global features and the target image local features into a cross-modal neural network to obtain cross-modal global features and cross-modal local features; wherein the cross-modal neural network is the cross-modal neural network of any one of claims 1 to 5;
and 24, performing feature matching retrieval on the video set containing the target pedestrian by using the cross-modal global features and/or the cross-modal local features to obtain a cross-modal retrieval result.
7. The method according to claim 6, wherein the preprocessing the image of the target pedestrian to obtain the target image block sequence data specifically comprises:
splitting the target pedestrian image into a plurality of image blocks to form a target image block sequence set;
and inputting the target image block sequence set into a linear projection module corresponding to the image type of the target pedestrian image to obtain target image block sequence data containing the position information of each target image block.
8. The method according to claim 6, wherein if the target pedestrian image is a visible light image, the step 23 specifically includes:
inputting the global features of the target image into the cross-modal neural network, and mapping the global features of the target image by using a visible light global mapping matrix to obtain the cross-modal global features;
inputting the local features of the target image into the cross-modal neural network, and mapping the local features of the target image by using a visible light local mapping matrix to obtain the cross-modal local features;
if the target pedestrian image is a near-infrared image, the step 23 specifically includes:
inputting the global features of the target image into the cross-modal neural network, and mapping the global features of the target image by using a near-infrared global mapping matrix to obtain the cross-modal global features;
inputting the local features of the target image into the cross-modal neural network, and mapping the local features of the target image by using a near-infrared local mapping matrix to obtain the cross-modal local features.
9. A cross-modal pedestrian retrieval system, the system comprising:
the preprocessing module is used for acquiring a target pedestrian image and preprocessing the target pedestrian image to obtain target image block sequence data; the target pedestrian image is a visible light image or a near infrared image;
the first feature extraction module is used for inputting the target image block sequence data into a self-attention mechanism module corresponding to the image type of the target pedestrian image to obtain a target image global feature and a target image local feature;
the second feature extraction module is used for inputting the target image global features and the target image local features into a cross-modal neural network to obtain cross-modal global features and cross-modal local features; wherein the cross-modal neural network is the cross-modal neural network of any one of claims 1 to 5;
and the retrieval module is used for performing feature matching retrieval on the video set containing the target pedestrian by using the cross-modal global feature and/or the cross-modal local feature to obtain a cross-modal retrieval result.
10. The system of claim 9, wherein the preprocessing module is specifically configured to:
splitting the target pedestrian image into a plurality of image blocks to form a target image block sequence set;
and inputting the target image block sequence set into a linear projection module corresponding to the image type of the target pedestrian image to obtain target image block sequence data containing the position information of each target image block.
CN202111302766.1A 2021-11-05 2021-11-05 Cross-modal neural network construction method, pedestrian retrieval method and system Pending CN113743544A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111302766.1A CN113743544A (en) 2021-11-05 2021-11-05 Cross-modal neural network construction method, pedestrian retrieval method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111302766.1A CN113743544A (en) 2021-11-05 2021-11-05 Cross-modal neural network construction method, pedestrian retrieval method and system

Publications (1)

Publication Number Publication Date
CN113743544A true CN113743544A (en) 2021-12-03

Family

ID=78727537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111302766.1A Pending CN113743544A (en) 2021-11-05 2021-11-05 Cross-modal neural network construction method, pedestrian retrieval method and system

Country Status (1)

Country Link
CN (1) CN113743544A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210056764A1 (en) * 2018-05-22 2021-02-25 Magic Leap, Inc. Transmodal input fusion for a wearable system
CN109255047A (en) * 2018-07-18 2019-01-22 西安电子科技大学 Based on the complementary semantic mutual search method of image-text being aligned and symmetrically retrieve
CN110598654A (en) * 2019-09-18 2019-12-20 合肥工业大学 Multi-granularity cross modal feature fusion pedestrian re-identification method and re-identification system
CN112434796A (en) * 2020-12-09 2021-03-02 同济大学 Cross-modal pedestrian re-identification method based on local information learning
CN112528866A (en) * 2020-12-14 2021-03-19 奥比中光科技集团股份有限公司 Cross-modal face recognition method, device, equipment and storage medium
CN113487609A (en) * 2021-09-06 2021-10-08 北京字节跳动网络技术有限公司 Tissue cavity positioning method and device, readable medium and electronic equipment

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115050044A (en) * 2022-04-02 2022-09-13 广西科学院 Cross-modal pedestrian re-identification method based on MLP-Mixer
CN114663839A (en) * 2022-05-12 2022-06-24 中科智为科技(天津)有限公司 Method and system for re-identifying blocked pedestrians
CN114663839B (en) * 2022-05-12 2022-11-04 中科智为科技(天津)有限公司 Method and system for re-identifying blocked pedestrians
CN114694185A (en) * 2022-05-31 2022-07-01 浪潮电子信息产业股份有限公司 Cross-modal target re-identification method, device, equipment and medium
CN114694185B (en) * 2022-05-31 2022-11-04 浪潮电子信息产业股份有限公司 Cross-modal target re-identification method, device, equipment and medium
CN117576520A (en) * 2024-01-16 2024-02-20 中国科学技术大学 Training method of target detection model, target detection method and electronic equipment
CN117576520B (en) * 2024-01-16 2024-05-17 中国科学技术大学 Training method of target detection model, target detection method and electronic equipment
CN117934309A (en) * 2024-03-18 2024-04-26 昆明理工大学 Unregistered infrared visible image fusion method based on modal dictionary and feature matching
CN117934309B (en) * 2024-03-18 2024-05-24 昆明理工大学 Unregistered infrared visible image fusion method based on modal dictionary and feature matching


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20211203)