WO2024060321A1 - Joint modeling method and apparatus for enhancing local features of pedestrians - Google Patents

Joint modeling method and apparatus for enhancing local features of pedestrians

Info

Publication number
WO2024060321A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
neural network
pedestrians
pedestrian
enhanced
Prior art date
Application number
PCT/CN2022/124009
Other languages
English (en)
French (fr)
Inventor
王宏升
陈光
Original Assignee
之江实验室
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 之江实验室
Priority to US18/072,002 (US11810366B1)
Publication of WO2024060321A1

Links

Images

Classifications

    • G - PHYSICS
      • G06 - COMPUTING; CALCULATING OR COUNTING
        • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 - Computing arrangements based on biological models
            • G06N 3/02 - Neural networks
              • G06N 3/08 - Learning methods
        • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 10/00 - Arrangements for image or video recognition or understanding
            • G06V 10/40 - Extraction of image or video features
              • G06V 10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
            • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
              • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
          • G06V 20/00 - Scenes; Scene-specific elements
            • G06V 20/40 - Scenes; Scene-specific elements in video content
              • G06V 20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
              • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
              • G06V 20/49 - Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
            • G06V 20/50 - Context or environment of the image
              • G06V 20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
          • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
            • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Definitions

  • The invention relates to the field of computer vision, and in particular to a joint modeling method and device for enhancing local features of pedestrians.
  • Pedestrian re-identification is a technology that uses computer vision to detect and identify specific pedestrians in surveillance videos or images. Pedestrians captured by different cameras differ in posture, appearance, shooting distance, and clarity, and in most cases a usable face image cannot be obtained, so pedestrians cannot be identified by face alone. In such cases, pedestrian re-identification based on the joint modeling method and device for enhancing local features of pedestrians realizes pedestrian recognition and tracking, and is widely used in video surveillance and security.
  • The purpose of the present invention is to provide a joint modeling method and device for enhancing local features of pedestrians to overcome the shortcomings of the existing technology.
  • The invention discloses a joint modeling method for enhancing local characteristics of pedestrians, which includes the following steps (a pipeline sketch follows the list):
  • S1 Obtain the original surveillance video image data set, and divide the original surveillance video image data set into the training set and the test set in proportion;
  • S2 Cut the surveillance video image training set to obtain an image block vector sequence;
  • S3 Construct a multi-head attention neural network, input the image block vector sequence into the multi-head attention neural network, and extract local features of pedestrians;
  • S4 Construct an enhanced channel feature neural network, input the image into the enhanced channel feature neural network, and use three-channel image convolution to capture the differential features between pedestrian image channels;
  • S5 Construct an enhanced spatial feature neural network, input the image into the enhanced spatial feature neural network, use spatial convolution, and scan to obtain the spatial difference features of the pedestrian image;
  • S6 Interactively splice the pedestrian local features of the multi-head attention neural network, the pedestrian image channel difference features of the enhanced channel feature neural network, and the pedestrian image spatial difference features of the enhanced spatial feature neural network, and perform joint modeling to enhance the pedestrian local features;
  • S7 Input the enhanced local features of pedestrians into the feedforward neural network to identify pedestrians in the image;
  • S8 Iteratively train the neural network obtained by joint modeling to obtain a joint pedestrian re-identification model and identify pedestrians.
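For orientation, the sketch below shows how steps S3 to S7 could compose in code. It is a minimal sketch, assuming PyTorch; the embedding width, head count, pooling choices, and the two-class head are illustrative assumptions, not values taken from the disclosure.

```python
import torch
import torch.nn as nn

class JointPedestrianModel(nn.Module):
    def __init__(self, embed_dim=64, num_heads=4):
        super().__init__()
        # S3: multi-head attention over the image-block vector sequence
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # S4: three-channel convolution, one 1x1 kernel per image channel
        self.channel_conv = nn.Conv2d(3, 3, kernel_size=1, groups=3)
        # S5: 3x3 spatial convolution split into 3x1 and 1x3 sub-kernels
        self.conv3x1 = nn.Conv2d(3, embed_dim, (3, 1), padding=(1, 0))
        self.conv1x3 = nn.Conv2d(3, embed_dim, (1, 3), padding=(0, 1))
        # S6/S7: feedforward head mapping the spliced features to class scores
        self.head = nn.Linear(2 * embed_dim + 3, 2)

    def forward(self, patches, image):
        # patches: (B, N, embed_dim) block-vector sequence; image: (B, 3, H, W)
        local, _ = self.attn(patches, patches, patches)                       # S3
        local = local.mean(dim=1)                                             # pool over blocks
        chan = self.channel_conv(image).mean(dim=(2, 3))                      # S4: (B, 3)
        spat = (self.conv3x1(image) * self.conv1x3(image)).mean(dim=(2, 3))   # S5
        fused = torch.cat([local, chan, spat], dim=1)                         # S6: splice
        return self.head(fused).softmax(dim=-1)                               # S7: probability

model = JointPedestrianModel()
probs = model(torch.randn(2, 196, 64), torch.randn(2, 3, 224, 224))
print(probs.shape)  # torch.Size([2, 2])
```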
  • The original surveillance video image data set in step S1 includes image annotation samples, image annotation sample coordinate files, and unlabeled samples.
  • The step S2 includes the following sub-steps:
  • S21 Segment the surveillance video images according to the number of image channels to obtain image blocks;
  • S22 Convert the height and width of the image blocks into the fixed input size of the multi-head attention neural network;
  • S23 Tile the image blocks into a sequence to obtain the image block vector sequence.
  • The length of the image block vector sequence is equal to the image size multiplied by the image height multiplied by the image width.
  • The image block vector sequence contains the image block position coordinates.
  • The sequence is converted into a matrix, and the matrix is used as the input of the multi-head attention neural network.
  • The step S3 includes the following sub-steps (see the sketch after this list):
  • S31 Calculate a single attention: for the query matrix, key matrix, and value matrix of the image block vector sequence in step S3, multiply the query matrix by the key matrix to obtain the attention score matrix; apply the attention score matrix to the value matrix, and pass the product of the two matrices through the activation function to obtain a single attention;
  • S32 Construct multi-head attention: for the image block vector sequence, calculate a single attention for each image block vector sequence, and interactively combine the single attentions calculated for each image block vector sequence to obtain multi-head attention;
  • S33 Use multi-head attention to extract local features of pedestrians: input the image block vector sequence into the constructed multi-head attention neural network, use the local multi-head self-attention mechanism to calculate the local self-attention between the pixels of each image and the pixels of adjacent images, and extract local features of pedestrians through parallel matrix multiplication.
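The following is a minimal sketch of S31 and S32 with plain tensor operations, assuming PyTorch. The standard Transformer formulation applies the softmax to the score matrix before multiplying by the value matrix; that ordering, the per-head split, and the 196-block sequence length are assumptions, and the 1/√d scaling of standard attention is omitted because the text does not mention it.

```python
import torch

def single_attention(q, k, v):
    # S31: attention score matrix from query x key^T
    scores = q @ k.transpose(-2, -1)
    # activation function (softmax) on the scores, then acting on the value matrix
    return torch.softmax(scores, dim=-1) @ v

def multi_head(q, k, v, num_heads):
    # S32: compute a single attention per head and concatenate the results
    n, d = q.shape
    hd = d // num_heads
    heads = [single_attention(q[:, i * hd:(i + 1) * hd],
                              k[:, i * hd:(i + 1) * hd],
                              v[:, i * hd:(i + 1) * hd])
             for i in range(num_heads)]
    return torch.cat(heads, dim=-1)  # MultiHead = Concat(Head_1 ... Head_n)

seq = torch.randn(196, 64)           # image block vector sequence (assumed sizes)
out = multi_head(seq, seq, seq, 4)   # Query = Key = Value = X
print(out.shape)                     # torch.Size([196, 64])
```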
  • The step S4 includes the following sub-steps (see the sketch after this list):
  • S41 Construct a three-channel enhanced image convolutional neural network for the three channels of the input image; the network includes three convolution kernels, and the three convolution kernels correspond to the three channels of the image respectively;
  • S42 The three convolution kernels separately learn the weight parameters of their corresponding image channels and output three different sets of weight parameters;
  • S43 The three convolution kernels are calculated independently, learning the difference parameter weights between the three channels, to obtain three channel feature space maps; the three channel feature space maps are interactively calculated to obtain the pedestrian image channel features.
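One plausible reading of S41 to S43, as a minimal sketch assuming PyTorch: a grouped 1×1 convolution with groups=3 gives each image channel its own independently learned kernel. The final "interactive calculation" across the three feature maps is not specified by the text; the elementwise product used here is an assumption.

```python
import torch
import torch.nn as nn

# Three convolution kernels, one per image channel, computed independently (S41-S43)
channel_conv = nn.Conv2d(in_channels=3, out_channels=3, kernel_size=1, groups=3)

img = torch.randn(1, 3, 224, 224)            # one RGB surveillance frame
feature_maps = channel_conv(img)             # three per-channel feature space maps
# Interactive calculation across the three maps (assumption: elementwise product)
channel_feature = feature_maps.prod(dim=1)   # shape (1, 224, 224)
print(channel_feature.shape)
```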
  • The step S5 includes the following sub-steps (see the sketch after this list):
  • S51 Define a two-dimensional convolution and split it spatially into two sub-convolution kernels;
  • S52 Use the two sub-convolution kernels to scan the image spatial features respectively to obtain two spatial features, and matrix-multiply the two spatial features to obtain the spatial difference features of the pedestrian image.
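A minimal sketch of S51 and S52, assuming PyTorch: the two-dimensional convolution is split into a 3×1 and a 1×3 sub-kernel, and the two scanned feature maps are combined by multiplication. Reading the "matrix multiplication" of the two aligned feature maps as an elementwise product is an assumption; channel counts are illustrative.

```python
import torch
import torch.nn as nn

conv3x1 = nn.Conv2d(3, 8, kernel_size=(3, 1), padding=(1, 0))  # vertical sub-kernel
conv1x3 = nn.Conv2d(3, 8, kernel_size=(1, 3), padding=(0, 1))  # horizontal sub-kernel

img = torch.randn(1, 3, 224, 224)
feat_a = conv3x1(img)           # first spatial feature map
feat_b = conv1x3(img)           # second spatial feature map
spatial_diff = feat_a * feat_b  # combined spatial difference features (assumed op)
print(spatial_diff.shape)       # torch.Size([1, 8, 224, 224])
```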
  • The step S6 includes the following sub-steps (see the sketch after this list):
  • S61 Interactive splicing of the enhanced channel feature neural network to the multi-head attention neural network: first, the output of the convolutional network passes through the global average pooling layer; the first layer of three-channel convolution learns the weight parameters between image channels; after the first-layer activation function, the second layer of three-channel convolution transforms the dimension; finally, the activation function converts the feature values into a probability distribution, which is input into the multi-head self-attention branch for calculation;
  • S62 Interactive splicing of the multi-head attention neural network to the enhanced channel feature neural network: the output of the multi-head attention calculation passes through the first layer of three-channel convolution, which learns the different weight parameters between the three channels and converts the number of image channels to one; after the first-layer activation function, the second layer of three-channel convolution reduces the learned weight parameters; after the second-layer activation function, the result becomes a probability distribution in the spatial dimension, which serves as the output of the enhanced channel feature convolutional network branch;
  • S63 Interactive splicing of the enhanced spatial feature neural network to the enhanced channel feature neural network: the two sub-convolutions of the two-dimensional convolution of the enhanced spatial feature neural network output the pedestrian multi-dimensional convolutional spatial feature matrix, which is converted into a two-dimensional spatial feature matrix and, through matrix multiplication followed by an activation function, serves as the output of the enhanced channel feature neural network;
  • S64 Input the output of the multi-head attention, the output of the enhanced channel feature convolution, and the output of the spatial convolution into the multi-layer perceptron; the pedestrian local features are mapped to the parallel branches through the linear layer for feature fusion calculation, obtaining the enhanced pedestrian local features.
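As an illustration of the S61 splice, here is a minimal sketch, assuming PyTorch; the hidden width, the embedding width, and the GELU/Softmax placement follow the two-convolution, two-activation pattern described above, but the exact dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ConvToAttentionSplice(nn.Module):
    """S61: convolution branch output -> probability distribution for the
    multi-head self-attention branch (dimensions are illustrative)."""
    def __init__(self, channels=3, hidden=8, embed_dim=64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # global average pooling
        self.conv1 = nn.Conv2d(channels, hidden, 1)   # learn inter-channel weights
        self.act = nn.GELU()                          # first-layer activation
        self.conv2 = nn.Conv2d(hidden, embed_dim, 1)  # transform the dimension

    def forward(self, conv_out):
        x = self.pool(conv_out)                       # (B, C, 1, 1)
        x = self.conv2(self.act(self.conv1(x)))       # (B, embed_dim, 1, 1)
        x = x.flatten(1)                              # (B, embed_dim)
        return torch.softmax(x, dim=-1)               # probability distribution

splice = ConvToAttentionSplice()
probs = splice(torch.randn(2, 3, 56, 56))
print(probs.shape)  # torch.Size([2, 64]); each row sums to 1
```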
  • The step S7 includes the following sub-steps (see the sketch after this list):
  • S71 Use a feedforward neural network with an activation function: input the obtained enhanced local features of pedestrians into the feedforward neural network, apply a linear-layer transformation, and use the activation function to map the pedestrian probability distribution into classes, identifying pedestrians;
  • S72 Based on the coordinates of the identified pedestrians and of the image annotation samples in the original surveillance video image data set, calculate the intersection-over-union of the two sets of coordinates, and calculate the precision and recall rates.
  • The precision rate concerns the identified pedestrians and is the proportion of real pedestrians among the samples predicted as positive;
  • The recall rate concerns the image annotation samples in the original surveillance video image data set and is the proportion of positive examples in the samples that are correctly identified as pedestrians.
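A minimal sketch of the S72 metrics in plain Python; the corner-format bounding boxes (x1, y1, x2, y2) and the counting convention are assumptions.

```python
def iou(box_a, box_b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2)
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

def precision_recall(true_positives, num_predictions, num_annotations):
    # Precision: real pedestrians among the samples predicted as positive.
    # Recall: annotated pedestrians that were correctly identified.
    precision = true_positives / num_predictions if num_predictions else 0.0
    recall = true_positives / num_annotations if num_annotations else 0.0
    return precision, recall

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 = 0.1428...
print(precision_recall(8, 10, 12))          # (0.8, 0.666...)
```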
  • The step S8 includes the following sub-steps:
  • S81 Apply residual connections to the neural network obtained by joint modeling to accelerate model convergence; train iteratively and adjust the training parameters to obtain the joint pedestrian re-identification model;
  • S82 According to the joint pedestrian re-identification model trained in step S81, input the original surveillance video image test set for prediction, and box-select pedestrians in the image to achieve pedestrian re-identification.
  • The invention discloses a joint modeling device for enhancing local characteristics of pedestrians, which includes the following modules:
  • An original surveillance video image sample set acquisition module, used to obtain the original data set;
  • An image segmentation module, which segments the image according to its channels to obtain image blocks;
  • A pedestrian local feature module, used to build a multi-head attention neural network and extract pedestrian local features;
  • An inter-channel difference feature module, used to build an enhanced channel feature neural network, using a convolutional neural network to capture the difference features between pedestrian image channels;
  • A pedestrian image spatial difference feature module, used to construct an enhanced spatial feature neural network and scan the spatial difference features of pedestrian images;
  • An enhanced pedestrian local feature module, used to interactively splice the pedestrian local features of the multi-head attention neural network, the inter-channel difference features of the enhanced channel feature neural network, and the spatial difference features of pedestrian images of the enhanced spatial feature neural network, and to perform joint modeling;
  • A pedestrian recognition module, used to build a feedforward neural network that maps the enhanced local features of pedestrians into a pedestrian probability output through linear transformation;
  • A model training module, used to iteratively train the neural network obtained by joint modeling, updating the model parameters until the model training converges, to obtain the pedestrian recognition model;
  • An image pedestrian recognition module, which uses the pedestrian recognition model to identify pedestrians in the test set.
  • The present invention also discloses a joint modeling device for enhancing local features of pedestrians.
  • The device includes a memory and one or more processors.
  • The memory stores executable code.
  • When the one or more processors execute the executable code, they implement the above joint modeling method for enhancing local features of pedestrians.
  • Beneficial effects: the present invention achieves pedestrian re-identification with a joint modeling method and device for enhancing local features of pedestrians. A multi-head attention neural network extracts local features of pedestrians from video images; channel convolution kernels learn image channel weight parameters; spatial convolution scans spatial features on the image; and enhancing the local features of pedestrians improves the pedestrian recognition rate. A feedforward neural network with an activation function is used: the enhanced features input into the feedforward network undergo a linear-layer transformation, and the activation function maps the pedestrian probability distribution into classes, identifying pedestrians, outputting their position coordinates in the image, and box-selecting them, thereby achieving pedestrian re-identification and making it possible to obtain usable face images.
  • Figure 1 is an overall flow chart of an embodiment of the present invention
  • Figure 2 is a schematic diagram of pedestrian local feature extraction from surveillance video images according to an embodiment of the present invention;
  • Figure 3 is a schematic diagram of pedestrian image channel feature capture in an embodiment of the present invention;
  • Figure 4 is a schematic diagram of pedestrian image spatial feature scanning in an embodiment of the present invention;
  • Figure 5 is a schematic diagram of enhanced local features of pedestrians according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a device according to an embodiment of the present invention.
  • Referring to Figure 1, the present invention provides a pedestrian re-identification method based on joint modeling for enhancing local features of pedestrians.
  • Video images are segmented to obtain image blocks; the image block sequence is input into a multi-head attention neural network to extract local features of pedestrians;
  • The image blocks are input into a three-channel convolutional neural network to capture pedestrian image channel features; an enhanced channel feature neural network is constructed to capture the difference features between pedestrian image channels; local features, image channel features, and spatial features are interactively spliced and jointly modeled;
  • The enhanced local features of pedestrians are input into the feedforward neural network to identify pedestrians in the image; the multi-head attention neural network and the convolutional neural networks are iteratively trained to obtain a joint pedestrian re-identification model; the test set is input into the joint pedestrian re-identification model to output the pedestrian recognition results.
  • The method and device can be used to monitor videos and images across multiple cameras, and to track and identify target pedestrians.
  • This invention is a joint modeling method for enhancing local characteristics of pedestrians. The whole process is divided into eight stages:
  • In the first stage, the original surveillance video image data set is obtained and divided into a training set and a test set in proportion;
  • The second stage is surveillance video image segmentation: the images of the original surveillance video image training set are segmented according to image channels to obtain image blocks;
  • The third stage is pedestrian local feature extraction from surveillance video images: a multi-head attention neural network (Transformer) is used to extract features of the image blocks;
  • The fourth stage is pedestrian image channel feature capture: three-channel image convolution is used to capture image channel features;
  • The fifth stage is pedestrian image spatial feature scanning: spatial convolution is used to scan the image spatial features;
  • The sixth stage is enhancing the local characteristics of pedestrians: local features, image channel features, and spatial features are interactively spliced for joint modeling to enhance the local characteristics of pedestrians;
  • The seventh stage is identifying pedestrians in the image: using a feedforward neural network and an activation function, the obtained enhanced local features of pedestrians are input into the feedforward neural network, and after a linear-layer transformation, the activation function maps the pedestrian probability distribution into classes to identify pedestrians;
  • The eighth stage is the joint pedestrian re-identification model and pedestrian recognition: the joint model is iteratively trained to obtain the joint pedestrian re-identification model and identify pedestrians.
  • The original surveillance video image data set in the first stage includes image annotation samples, image annotation sample coordinate files, and unlabeled samples.
  • The second stage is specifically as follows: for each video surveillance image in the training set, a count is obtained by multiplying the image height by the width by the number of channels, and the image is segmented according to that count, with each image block having a unique identifier; a linear transformation maps image blocks of different sizes to the specified input size of the multi-head attention neural network; each uniquely identified image block is tiled to form a sequence, yielding the image block sequence, whose length equals the number of image blocks multiplied by the image block height multiplied by the image block width; the sequence contains the image block position coordinates and is then converted into a matrix, which serves as the input of the multi-head attention neural network (Transformer). A patch-embedding sketch follows.
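A minimal sketch of this second stage, assuming PyTorch: the frame is cut into fixed-size blocks, each flattened block is linearly mapped to the attention network's input width, and each block receives a unique positional identifier. The 16×16 block size and the 64-dimensional embedding are assumptions for illustration.

```python
import torch
import torch.nn as nn

patch, embed_dim = 16, 64
to_embed = nn.Linear(3 * patch * patch, embed_dim)   # map blocks to the input size

img = torch.randn(1, 3, 224, 224)                    # one surveillance frame
blocks = img.unfold(2, patch, patch).unfold(3, patch, patch)  # (1, 3, 14, 14, 16, 16)
blocks = blocks.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)
seq = to_embed(blocks)                               # (1, 196, 64) block vector sequence
positions = torch.arange(seq.shape[1])               # unique identifier per block
print(seq.shape, positions.shape)
```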
  • The third stage is specifically: the matrix is input into the multi-head attention neural network (Transformer) to extract pedestrian local features, referring to Figure 2, with the following steps:
  • Step 1 First calculate a single attention: for the image block vector sequence there are a query (Query) matrix, a key (Key) matrix, and a value (Value) matrix. The attention score matrix is obtained by multiplying the query (Query) matrix and the key (Key) matrix; the attention score matrix is applied to the value (Value) matrix, the two matrices are multiplied, and the activation function yields a single attention. Then calculate the multi-head attention: for the image block vector sequence, calculate the single attention of each image block vector sequence separately, and interactively combine the single attentions calculated for each image block vector sequence to obtain multi-head attention.
  • Step 2 Input the image block vector sequence into the multi-head attention neural network, calculate the local self-attention between the pixels of each image and those of adjacent images, and capture the local characteristics of pedestrians through parallel matrix multiplication.
  • The calculation method is as follows (restated in standard notation after the list):
  • 1. Input the vector features Query, Key, and Value into the multi-head layer, with X = [x_1, x_2, x_3 ... x_n] denoting the input weight vectors; Query and Key are matrix-multiplied, and the vector attention distribution is computed through the activation function (Softmax);
  • 2. With Query = Key = Value = X, the multi-head attention weights are computed through the activation function (Softmax);
  • 3. α_i = Softmax(s(k_i, q)) = Softmax(s(x_i, q)), where α_i is the attention probability distribution and s(x_i, q) is the attention score;
  • 4. Calculate a single attention: Head = Attention(Query, Key, Value);
  • 5. Multi-head attention: MultiHead(Query, Key, Value) = Concat(Head_1, Head_2 ... Head_n);
  • where Concat(Head_1, Head_2 ... Head_n) denotes the concatenation of multiple attention heads.
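Restated in standard Transformer notation, the same computation reads as follows; the 1/√d_k scaling belongs to the standard formulation and is an assumption here, since the text above omits it:

```latex
\mathrm{Head}_i = \mathrm{Attention}(Q_i, K_i, V_i)
               = \mathrm{Softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i,
\qquad
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{Head}_1, \ldots, \mathrm{Head}_n)
```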
  • The fourth stage is specifically: the image is input into a three-channel image convolutional neural network to capture pedestrian image channel features, referring to Figure 3, divided into the following sub-steps:
  • Step 1 Construct a three-channel image convolutional neural network for the three channels of the input image. The network includes three convolution kernels, which correspond to the three channels of the image respectively; the three convolution kernels separately learn the weight parameters of their corresponding image channels and output three different sets of weight parameters. The size of each convolution kernel is 1 × 1 × 3, where 3 is the number of channels of the input image. The image is input into the three-channel image convolutional neural network, where it is weighted and combined along the convolution depth direction; after the three 1 × 1 × 3 convolution kernels, three local features are output, and the local features contain the weight parameters between the three channels. The calculation formula is as follows:
  • O(i, j) = Σ_m Σ_n I(i+m, j+n) · K(m, n);
  • where O(i, j) is the output matrix, I is the input matrix, and K is the convolution kernel matrix of shape m × n; I(i+m, j+n)K(m, n) means that the element I(i+m, j+n) of the input matrix is multiplied by the element K(m, n) of the kernel matrix, and the products are accumulated over the horizontal and vertical directions of the matrix respectively (a worked numeric instance follows).
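A worked numeric instance of the formula on a toy 3×3 input and 2×2 kernel (values chosen only for illustration):

```python
# O(i, j) = sum over m and n of I(i+m, j+n) * K(m, n)
I = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
K = [[1, 0],
     [0, 1]]

def conv_at(I, K, i, j):
    # accumulate the products over the kernel's rows (m) and columns (n)
    return sum(I[i + m][j + n] * K[m][n]
               for m in range(len(K)) for n in range(len(K[0])))

O = [[conv_at(I, K, i, j) for j in range(2)] for i in range(2)]
print(O)  # [[6, 8], [12, 14]]
```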
  • Step 2 The three convolution kernels are calculated independently, learning the difference parameter weights between the three channels, to obtain three channel feature space maps; the three channel feature space maps are interactively calculated to obtain the pedestrian image channel features.
  • The fifth stage is specifically: construct an enhanced spatial feature neural network and scan the spatial difference features of pedestrian images, referring to Figure 4, divided into the following sub-steps:
  • Step 1 Divide the two-dimensional 3 × 3 convolution spatially into two sub-convolution kernels: the first convolution kernel is of size 3 × 1, and the second is of size 1 × 3;
  • Step 2 Use the two sub-convolution kernels to scan the image spatial features respectively to obtain two spatial feature maps, and multiply the two sub-convolution matrices to obtain the image spatial features.
  • The sixth stage is specifically: the output of the multi-head attention neural network, the output of the channel convolutional neural network, and the output of the enhanced spatial feature neural network are interactively spliced and jointly modeled, referring to Figure 5, divided into the following sub-steps:
  • Step 1 Interactive splicing of the convolution branch to multi-head attention: the output of the convolutional network first passes through the global average pooling layer; it then passes through the first layer of three-channel convolution, which uses a 1 × 1 convolution kernel to extract inter-channel weight features with the activation function (GELU); next, the second layer of 1 × 1 three-channel convolution transforms the dimension to reduce parameters; finally, the activation function (Softmax) converts the feature values into a probability distribution, which is input as the multi-head self-attention Value and calculated.
  • Step 2 Interactive splicing of multi-head attention to the convolution branch: the output of the multi-head attention calculation passes through the first layer of three-channel 1 × 1 convolution to capture local features with the activation function (GELU); it then passes through the second layer of 1 × 1 three-channel convolution, which transforms the dimension to reduce parameters and converts the number of image channels to one; after the activation function (Softmax) it becomes a probability distribution in the spatial dimension, which serves as the output in the convolution branch.
  • Step 3 Interactive splicing of the enhanced spatial feature neural network to the enhanced channel feature neural network: the two sub-convolutions of the two-dimensional convolution of the enhanced spatial feature neural network output the pedestrian multi-dimensional convolutional spatial feature matrix, which is converted into a two-dimensional spatial feature matrix and, through matrix multiplication followed by the activation function (Softmax), serves as the output of the enhanced channel feature neural network.
  • Step 4 Input the outputs of multi-head attention, channel convolution, and spatial convolution into the multi-layer perceptron; the pedestrian local features are mapped to the parallel branches through the linear layer for feature fusion calculation, obtaining the enhanced pedestrian local features. The calculation formulas are as follows (a sketch follows):
  • X = Concat(LN(x), W-Loss, ConV) + x;
  • X′ = MLP(LN(x′)) + x′;
  • where X is the multi-head attention output, X′ is the convolution output, Concat denotes concatenation, W is the weight, Loss is the loss, ConV is the convolution, x and x′ are feature vectors, LN is the linear layer, and MLP is the multi-layer perceptron.
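A minimal sketch of this Step 4 fusion, assuming PyTorch and reading LN as the linear layer named in the text; the branch widths are assumptions, and the W-Loss term is treated as a training-time weighting that is not modeled here.

```python
import torch
import torch.nn as nn

dim = 64
ln = nn.Linear(3 * dim, dim)                                  # linear layer "LN"
mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                    nn.Linear(4 * dim, dim))                  # multi-layer perceptron

x_attn = torch.randn(2, dim)   # multi-head attention output
x_chan = torch.randn(2, dim)   # channel convolution output
x_spat = torch.randn(2, dim)   # spatial convolution output

# X = Concat(...) + x : concatenate the parallel branches, project, add residual
fused = ln(torch.cat([x_attn, x_chan, x_spat], dim=-1)) + x_attn
# X' = MLP(LN(x')) + x' : feature fusion through the perceptron with residual
enhanced = mlp(fused) + fused
print(enhanced.shape)  # torch.Size([2, 64])
```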
  • The seventh stage is specifically: identifying pedestrians in the image, which is divided into the following sub-steps:
  • Step 1 Use the feedforward neural network and the activation function (Softmax): input the obtained enhanced local features of pedestrians into the feedforward neural network, apply the linear-layer transformation, and use the activation function (Softmax) to map the pedestrian probability distribution into classes, identifying pedestrians;
  • Step 2 Based on the identified pedestrians and the image annotation sample coordinates in the original surveillance video image data set, calculate the intersection-over-union of the two sets of coordinates; calculate the precision and recall rates.
  • The precision rate concerns the identified pedestrians and indicates how many of the samples predicted as positive are real pedestrians; the recall rate concerns the image annotation samples in the original surveillance video image data set and indicates how many pedestrians among the positive examples in the samples are correctly identified.
  • The eighth stage is specifically: the joint pedestrian re-identification model and pedestrian recognition, divided into the following sub-steps:
  • Step 1 To prevent gradient explosion and gradient vanishing of the joint pedestrian re-identification model during training, use residual connections to accelerate model convergence; train iteratively and adjust the training parameters to obtain the joint pedestrian re-identification model (a sketch of the residual connection follows this list);
  • Step 2 Based on the joint pedestrian re-identification model trained in Step 1, input the original surveillance video image test set for prediction and box-select pedestrians in the image to achieve pedestrian re-identification.
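The residual connection named in Step 1, as a minimal sketch assuming PyTorch: adding the block's input back to its output gives gradients a direct path, which is what counters gradient explosion and vanishing during iterative training.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                  nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.body(x)   # identity shortcut stabilizes training

x = torch.randn(8, 64)
print(ResidualBlock()(x).shape)   # torch.Size([8, 64])
```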
  • Embodiments of the present invention also provide a joint modeling device for enhancing local characteristics of pedestrians, including the following modules: an original surveillance video image sample set acquisition module for acquiring the original data set; an image segmentation module for segmenting the image according to its channels to obtain image blocks; a pedestrian local feature module that builds a multi-head attention neural network to extract local features of pedestrians; an inter-channel difference feature module that builds an enhanced channel feature neural network to capture the difference features between pedestrian image channels; a pedestrian image spatial difference feature module that constructs an enhanced spatial feature neural network to scan the spatial difference features of pedestrian images; an enhanced pedestrian local feature module that interactively splices the pedestrian local features of the multi-head attention neural network, the inter-channel difference features of the enhanced channel feature neural network, and the spatial difference features of the enhanced spatial feature neural network for joint modeling; a pedestrian recognition module that constructs a feedforward neural network and maps the enhanced local features of pedestrians into a pedestrian probability output through linear transformation; a model training module that iteratively trains the convolutional neural networks and the multi-head attention neural network, updating the model parameters until training converges, to obtain the pedestrian recognition model; and an image pedestrian recognition module that uses the pedestrian recognition model to identify pedestrians in the test set.
  • Referring to Figure 6, an embodiment of the present invention also provides a joint modeling device for enhancing local features of pedestrians, which further includes a memory and one or more processors.
  • The memory stores executable code.
  • When the one or more processors execute the executable code, they implement the joint modeling method for enhancing local features of pedestrians of the above embodiment.
  • The embodiment of the joint modeling device for enhancing the local characteristics of pedestrians of the present invention can be applied to any device with data processing capabilities, such as a computer.
  • The device embodiments may be implemented by software, or by hardware or a combination of software and hardware. Taking software implementation as an example, as a logical device it is formed by the processor of the device with data processing capabilities reading the corresponding computer program instructions from non-volatile memory into memory and running them. At the hardware level, as shown in Figure 6, a hardware structure diagram is given of a device with data processing capabilities on which the joint modeling device for enhancing local characteristics of pedestrians is located.
  • In addition to the processor, memory, network interface, and non-volatile memory shown in Figure 6, the device with data processing capabilities in the embodiment may also include other hardware according to its actual functions; this will not be described again.
  • For the implementation process of the functions and effects of each unit in the above device, refer to the implementation process of the corresponding steps in the above method, which will not be repeated here.
  • As for the device embodiment, since it basically corresponds to the method embodiment, refer to the partial description of the method embodiment for relevant details.
  • The device embodiments described above are only illustrative.
  • The units described as separate components may or may not be physically separated.
  • The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the present invention. Persons of ordinary skill in the art can understand and implement this without creative effort.
  • Embodiments of the present invention also provide a computer-readable storage medium on which a program is stored.
  • When the program is executed by a processor, the joint modeling method for enhancing local features of pedestrians in the above embodiments is implemented.
  • The computer-readable storage medium may be an internal storage unit of any device with data processing capabilities described in any of the foregoing embodiments, such as a hard disk or a memory.
  • The computer-readable storage medium can also be an external storage device of any device with data processing capabilities, such as a plug-in hard disk, smart media card (Smart Media Card, SMC), SD card, or flash card (Flash Card) provided on the device.
  • The computer-readable storage medium may also include both an internal storage unit and an external storage device of any device with data processing capabilities.
  • The computer-readable storage medium is used to store the computer program and other programs and data required by any device with data processing capabilities, and can also be used to temporarily store data that has been output or is to be output.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a joint modeling method and apparatus for enhancing local features of pedestrians, comprising the following steps: S1: obtain an original surveillance video image data set, and divide the original surveillance video image data set into a training set and a test set in proportion; S2: cut the surveillance video image training set to obtain an image block vector sequence. The invention uses a multi-head attention neural network to extract local features of pedestrians from video images, uses channel convolution kernels to learn image channel weight parameters, and uses spatial convolution to scan spatial features on the image, enhancing the local features of pedestrians to improve the pedestrian recognition rate. A feedforward neural network and an activation function are used: the input to the feedforward neural network undergoes a linear-layer transformation, and the activation function maps the pedestrian probability distribution into classes, identifying pedestrians, outputting the pedestrians' position coordinates in the image, and box-selecting the pedestrians, thereby realizing pedestrian re-identification and making it possible to obtain usable face images.

Description

Joint modeling method and apparatus for enhancing local features of pedestrians
Cross-reference to Related Applications
This application claims priority to the Chinese patent application No. CN202211155651.9, entitled "Joint modeling method and apparatus for enhancing local features of pedestrians", filed with the China National Intellectual Property Administration on September 22, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The invention relates to the field of computer vision, and in particular to a joint modeling method and apparatus for enhancing local features of pedestrians.
Background Art
Pedestrian re-identification is a technology that uses computer vision to detect and identify whether a specific pedestrian is present in surveillance videos or images. Pedestrians captured by different cameras differ in posture, appearance, and shooting distance; typically the posture, appearance, distance, and clarity of pedestrians captured by cameras vary, and in most cases a usable face image cannot be obtained, so pedestrians cannot be identified by face. In such cases, pedestrian re-identification with the joint modeling method and apparatus for enhancing local features of pedestrians realizes pedestrian recognition and tracking, and is widely used in video surveillance and security.
Summary of the Invention
The purpose of the present invention is to provide a joint modeling method and apparatus for enhancing local features of pedestrians, so as to overcome the deficiencies in the prior art.
To achieve the above purpose, the present invention provides the following technical solution:
The present invention discloses a joint modeling method for enhancing local features of pedestrians, comprising the following steps:
S1: obtaining an original surveillance video image data set, and dividing the original surveillance video image data set into a training set and a test set in proportion;
S2: cutting the surveillance video image training set to obtain an image block vector sequence;
S3: constructing a multi-head attention neural network, inputting the image block vector sequence into the multi-head attention neural network, and extracting local features of pedestrians;
S4: constructing an enhanced channel feature neural network, inputting the image into the enhanced channel feature neural network, and using three-channel image convolution to capture difference features between pedestrian image channels;
S5: constructing an enhanced spatial feature neural network, inputting the image into the enhanced spatial feature neural network, and using spatial convolution to scan for spatial difference features of the pedestrian image;
S6: interactively splicing the pedestrian local features of the multi-head attention neural network, the difference features between pedestrian image channels of the enhanced channel feature neural network, and the spatial difference features of pedestrian images of the enhanced spatial feature neural network, and performing joint modeling to enhance the local features of pedestrians;
S7: inputting the enhanced local features of pedestrians into a feedforward neural network to identify pedestrians in the image;
S8: iteratively training the neural network obtained by joint modeling to obtain a joint pedestrian re-identification model and identify pedestrians.
Preferably, the original surveillance video image data set in step S1 includes image annotation samples, image annotation sample coordinate files, and unlabeled samples.
Preferably, step S2 includes the following sub-steps:
S21: segmenting the surveillance video images according to the number of image channels to obtain image blocks;
S22: converting the height and width of the image blocks into the fixed input size of the multi-head attention neural network;
S23: tiling the image blocks into a sequence to obtain the image block vector sequence.
Preferably, the length of the image block vector sequence equals the image size multiplied by the image height multiplied by the image width; the image block vector sequence contains the image block position coordinates; the sequence is converted into a matrix, and the matrix serves as the input of the multi-head attention neural network.
Preferably, S31: calculating a single attention: for the query matrix, key matrix, and value matrix of the image block vector sequence in step S3, the query matrix is multiplied by the key matrix to obtain the attention score matrix; the attention score matrix is applied to the value matrix, and the product of the two matrices is passed through an activation function to obtain a single attention;
S32: constructing multi-head attention: for the image block vector sequence, a single attention is calculated for each image block vector sequence, and the single attentions calculated for each image block vector sequence are interactively combined to obtain the multi-head attention;
S33: extracting pedestrian local features with multi-head attention: the image block vector sequence is input into the constructed multi-head attention neural network, a local multi-head self-attention mechanism computes the local self-attention between the pixels of each image and the pixels of adjacent images, and pedestrian local features are extracted through parallel matrix multiplication.
Preferably, step S4 includes the following sub-steps:
S41: for the three channels of the input image, constructing a three-channel enhanced image convolutional neural network, which includes three convolution kernels respectively corresponding to the three channels of the image;
S42: the three convolution kernels separately learn the weight parameters of their corresponding image channels and output three different sets of weight parameters;
S43: the three convolution kernels are calculated independently, learning the difference parameter weights between the three channels, to obtain three channel feature space maps, which are interactively calculated to obtain the pedestrian image channel features.
Preferably, step S5 includes the following sub-steps:
S51: defining a two-dimensional convolution and splitting it spatially into two sub-convolution kernels;
S52: using the two sub-convolution kernels to scan the image spatial features respectively to obtain two spatial features, and matrix-multiplying the two spatial features to obtain the spatial difference features of the pedestrian image.
Preferably, step S6 includes the following sub-steps:
S61: interactive splicing of the enhanced channel feature neural network to the multi-head attention neural network: first, the output of the convolutional network passes through a global average pooling layer; a first layer of three-channel convolution learns the weight parameters between image channels; after the first-layer activation function, a second layer of three-channel convolution transforms the dimension; finally, an activation function converts the feature values into a probability distribution, which is input into the multi-head self-attention branch for calculation;
S62: interactive splicing of the multi-head attention neural network to the enhanced channel feature neural network: the output of the multi-head attention calculation passes through a first layer of three-channel convolution, which learns the different weight parameters between the three channels and converts the number of image channels to one; after the first-layer activation function, a second layer of three-channel convolution reduces the learned weight parameters; after the second-layer activation function, the result becomes a probability distribution in the spatial dimension, serving as the output of the enhanced channel feature convolutional network branch;
S63: interactive splicing of the enhanced spatial feature neural network to the enhanced channel feature neural network: the two sub-convolutions of the two-dimensional convolution of the enhanced spatial feature neural network output a pedestrian multi-dimensional convolutional spatial feature matrix, which is converted into a two-dimensional spatial feature matrix and, through matrix multiplication followed by an activation function, serves as the output of the enhanced channel feature neural network;
S64: the output of the multi-head attention, the output of the enhanced channel feature convolution, and the output of the spatial convolution are input into a multi-layer perceptron; the pedestrian local features are mapped to parallel branches through a linear layer for feature fusion calculation, obtaining the enhanced pedestrian local features.
Preferably, step S7 includes the following sub-steps:
S71: using a feedforward neural network with an activation function, the obtained enhanced local features of pedestrians are input into the feedforward neural network, transformed by a linear layer, and the activation function maps the pedestrian probability distribution into classes, identifying pedestrians;
S72: based on the identified pedestrians and the image annotation sample coordinates in the original surveillance video image data set, the intersection-over-union of the two sets of coordinates is calculated, and the precision and recall rates are calculated, where the precision rate concerns the identified pedestrians and represents the proportion of real pedestrians among the samples predicted as positive, and the recall rate concerns the image annotation samples in the original surveillance video image data set and represents the proportion of correctly identified pedestrians among the positive examples in the samples.
Preferably, step S8 includes the following sub-steps:
S81: applying residual connections to the neural network obtained by joint modeling to accelerate model convergence, training iteratively, and adjusting the training parameters to obtain the joint pedestrian re-identification model;
S82: according to the joint pedestrian re-identification model trained in step S81, inputting the original surveillance video image test set for prediction, and box-selecting pedestrians in the image to achieve pedestrian re-identification.
The present invention discloses a joint modeling apparatus for enhancing local features of pedestrians, including the following modules:
an original surveillance video image sample set acquisition module, used to obtain the original data set;
an image segmentation module, which segments the image according to its channels to obtain image blocks;
a pedestrian local feature module, used to construct a multi-head attention neural network and extract local features of pedestrians;
an inter-channel difference feature module, used to construct an enhanced channel feature neural network that uses a convolutional neural network to capture difference features between pedestrian image channels;
a pedestrian image spatial difference feature module, used to construct an enhanced spatial feature neural network and scan the spatial difference features of pedestrian images; an enhanced pedestrian local feature module, used to interactively splice the pedestrian local features of the multi-head attention neural network, the inter-channel difference features of the enhanced channel feature neural network, and the spatial difference features of pedestrian images of the enhanced spatial feature neural network for joint modeling;
a pedestrian recognition module, used to construct a feedforward neural network that maps the enhanced local features of pedestrians into a pedestrian probability output through linear transformation;
a model training module, used to iteratively train the neural network obtained by joint modeling, updating the model parameters until training converges, to obtain the pedestrian recognition model;
an image pedestrian recognition module, which uses the pedestrian recognition model to identify pedestrians in the test set.
The present invention further discloses a joint modeling apparatus for enhancing local features of pedestrians; the apparatus includes a memory and one or more processors, the memory stores executable code, and when the one or more processors execute the executable code, they implement the above joint modeling method for enhancing local features of pedestrians.
Beneficial effects of the present invention: the joint modeling method and apparatus for enhancing local features of pedestrians achieve pedestrian re-identification. A multi-head attention neural network extracts local features of pedestrians from video images; channel convolution kernels learn image channel weight parameters; spatial convolution scans spatial features on the image; enhancing the local features of pedestrians improves the pedestrian recognition rate. A feedforward neural network with an activation function is used: the input to the feedforward network undergoes a linear-layer transformation, and the activation function maps the pedestrian probability distribution into classes, identifying pedestrians, outputting their position coordinates in the image, and box-selecting them, thereby achieving pedestrian re-identification and making it possible to obtain usable face images.
Brief Description of the Drawings
Figure 1 is an overall flow chart of an embodiment of the present invention;
Figure 2 is a schematic diagram of pedestrian local feature extraction from surveillance video images according to an embodiment of the present invention;
Figure 3 is a schematic diagram of pedestrian image channel feature capture according to an embodiment of the present invention;
Figure 4 is a schematic diagram of pedestrian image spatial feature scanning according to an embodiment of the present invention;
Figure 5 is a schematic diagram of enhancing local features of pedestrians according to an embodiment of the present invention;
Figure 6 is a schematic diagram of an apparatus according to an embodiment of the present invention.
Detailed Description of Embodiments
To make the purpose, technical solution, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the drawings and embodiments. It should be understood, however, that the specific embodiments described here are only intended to explain the present invention and are not intended to limit its scope. In addition, descriptions of well-known structures and technologies are omitted below to avoid unnecessarily obscuring the concepts of the present invention.
Referring to Figure 1, the present invention is a pedestrian re-identification method based on joint modeling for enhancing local features of pedestrians: a video image is segmented to obtain image blocks; the image block sequence is input into a multi-head attention neural network to extract local features of pedestrians; the image blocks are input into a three-channel convolutional neural network to capture pedestrian image channel features; an enhanced channel feature neural network is constructed to capture difference features between pedestrian image channels; local features, image channel features, and spatial features are interactively spliced and jointly modeled; the enhanced local features of pedestrians are input into a feedforward neural network to identify pedestrians in the image; the multi-head attention neural network and the convolutional neural networks are iteratively trained to obtain a joint pedestrian re-identification model; the test set is input into the joint pedestrian re-identification model to output pedestrian recognition results. With this method and apparatus, surveillance videos and images across multiple cameras can be monitored, and target pedestrians can be tracked and identified.
The present invention is described in detail through the following steps.
The present invention is a joint modeling method for enhancing local features of pedestrians; the whole process is divided into eight stages:
In the first stage, an original surveillance video image data set is obtained and divided into a training set and a test set in proportion;
In the second stage, surveillance video image segmentation: the images of the original surveillance video image training set are segmented according to image channels to obtain image blocks;
In the third stage, pedestrian local feature extraction from surveillance video images: a multi-head attention neural network (Transformer) extracts features of the image blocks;
In the fourth stage, pedestrian image channel feature capture: three-channel image convolution captures image channel features;
In the fifth stage, pedestrian image spatial feature scanning: spatial convolution scans the image spatial features;
In the sixth stage, enhancing local features of pedestrians: local features, image channel features, and spatial features are interactively spliced for joint modeling to enhance the local features of pedestrians;
In the seventh stage, identifying pedestrians in the image: using a feedforward neural network with an activation function, the obtained enhanced local features of pedestrians are input into the feedforward neural network, transformed by a linear layer, and the activation function maps the pedestrian probability distribution into classes, identifying pedestrians;
In the eighth stage, the joint pedestrian re-identification model and pedestrian recognition: the joint pedestrian re-identification model is iteratively trained to obtain the joint pedestrian re-identification model and identify pedestrians.
Further, the original surveillance video image data set in the first stage includes image annotation samples, image annotation sample coordinate files, and unlabeled samples.
Further, the second stage is specifically as follows: for each video surveillance image in the training set, a count is obtained by multiplying the image height by the width by the number of channels, and the image is segmented according to this count, with each image block having a unique identifier. A linear transformation maps image blocks of different sizes to the specified input size of the multi-head attention neural network; each uniquely identified image block is tiled to form a sequence, yielding the image block sequence, whose length equals the number of image blocks multiplied by the image block height multiplied by the image block width; the sequence contains the image block position coordinates and is then converted into a matrix, which serves as the input of the multi-head attention neural network (Transformer).
Further, the third stage is specifically: the matrix is input into the multi-head attention neural network (Transformer) for pedestrian local feature extraction; referring to Figure 2, this includes the following sub-steps:
Step 1: first calculate a single attention. For the image block vector sequence there are a query (Query) matrix, a key (Key) matrix, and a value (Value) matrix; the query (Query) matrix is multiplied by the key (Key) matrix to obtain the attention score matrix; the attention score matrix is applied to the value (Value) matrix, the two matrices are multiplied, and an activation function yields a single attention. Then calculate the multi-head attention: for the image block vector sequence, the single attention of each image block vector sequence is calculated separately, and the single attentions calculated for each image block vector sequence are interactively combined to obtain the multi-head attention.
Step 2: the image block vector sequence is input into the multi-head attention neural network, the local self-attention between the pixels of each image and the pixels of adjacent images is calculated, and the local features of pedestrians are captured through parallel matrix multiplication. The calculation is as follows:
1. Input the vector features Query, Key, and Value into the multi-head layer, with X = [x_1, x_2, x_3 ... x_n] denoting the input weight vectors; Query and Key are matrix-multiplied, and the vector attention distribution is computed through the activation function (Softmax);
2. With Query = Key = Value = X, the multi-head attention weights are computed through the activation function (Softmax);
3. α_i = Softmax(s(k_i, q)) = Softmax(s(x_i, q)), where α_i is the attention probability distribution and s(x_i, q) is the attention score;
4. Calculate a single attention: Head = Attention(Query, Key, Value);
5. Multi-head attention:
MultiHead(Query, Key, Value) = Concat(Head_1, Head_2 ... Head_n);
where Concat(Head_1, Head_2 ... Head_n) denotes the concatenation of multiple attention heads.
Further, the fourth stage is specifically: the image is input into a three-channel image convolutional neural network to capture pedestrian image channel features; referring to Figure 3, this is divided into the following sub-steps:
Step 1: for the three channels of the input image, construct a three-channel image convolutional neural network, which includes three convolution kernels respectively corresponding to the three channels of the image; the three convolution kernels separately learn the weight parameters of their corresponding image channels and output three different sets of weight parameters. The size of each convolution kernel is 1×1×3, where 3 is the number of channels of the input image. The image is input into the three-channel image convolutional neural network, where it is weighted and combined along the convolution depth direction; after the three 1×1×3 convolution kernels, three local features are output, which contain the weight parameters between the three channels. The calculation formula is as follows:
O(i, j) = Σ_m Σ_n I(i+m, j+n) · K(m, n);
where O(i, j) is the output matrix, I is the input matrix, and K is the convolution kernel matrix of shape m×n; I(i+m, j+n)K(m, n) means that the element I(i+m, j+n) of the input matrix is multiplied by the element K(m, n) of the kernel matrix, and the products are accumulated over the horizontal and vertical directions of the matrix respectively.
Step 2: the three convolution kernels are calculated independently, learning the difference parameter weights between the three channels, to obtain three channel feature space maps; the three channel feature space maps are interactively calculated to obtain the pedestrian image channel features.
Further, the fifth stage is specifically: construct an enhanced spatial feature neural network and scan the spatial difference features of pedestrian images; referring to Figure 4, this is divided into the following sub-steps:
Step 1: split the 3×3 two-dimensional convolution spatially into two sub-convolution kernels, the first of size 3×1 and the second of size 1×3;
Step 2: use the two sub-convolution kernels to scan the image spatial features respectively, obtaining two spatial feature maps, and multiply the two sub-convolution matrices to obtain the image spatial features.
Further, the sixth stage is specifically: the output of the multi-head attention neural network, the output of the channel convolutional neural network, and the output of the enhanced spatial feature neural network are interactively spliced and jointly modeled; referring to Figure 5, this is divided into the following sub-steps:
Step 1: interactive splicing of convolution to multi-head attention: the output of the convolutional network first passes through a global average pooling layer; it then passes through a first layer of three-channel convolution that uses a 1×1 convolution kernel to extract inter-channel weight features with the activation function (GELU); next, a second layer of 1×1 three-channel convolution transforms the dimension to reduce parameters; finally, the activation function (Softmax) converts the feature values into a probability distribution, which is input as the multi-head self-attention Value and calculated.
Step 2: interactive splicing of multi-head attention to the convolution branch: the output of the multi-head attention calculation passes through a first layer of three-channel 1×1 convolution to capture local features with the activation function (GELU); it then passes through a second layer of 1×1 three-channel convolution that transforms the dimension to reduce parameters and converts the number of image channels to one; after the activation function (Softmax) it becomes a probability distribution in the spatial dimension, serving as the output of the convolution branch.
Step 3: interactive splicing of the enhanced spatial feature neural network to the enhanced channel feature neural network: the two sub-convolutions of the two-dimensional convolution of the enhanced spatial feature neural network output a pedestrian multi-dimensional convolutional spatial feature matrix, which is converted into a two-dimensional spatial feature matrix and, through matrix multiplication followed by the activation function (Softmax), serves as the output of the enhanced channel feature neural network.
Step 4: the output of the multi-head attention, the output of the channel convolution, and the output of the spatial convolution are input into a multi-layer perceptron; the pedestrian local features are mapped to parallel branches through a linear layer for feature fusion calculation, obtaining the enhanced pedestrian local features. The calculation formulas are as follows:
X = Concat(LN(x), W-Loss, ConV) + x;
X′ = MLP(LN(x′)) + x′;
where X is the multi-head attention output, X′ is the convolution output, Concat denotes concatenation, W is the weight, Loss is the loss, ConV is the convolution, x and x′ are feature vectors, LN is the linear layer, and MLP is the multi-layer perceptron.
Further, the seventh stage is specifically: identifying pedestrians in the image, divided into the following sub-steps:
Step 1: using a feedforward neural network with the activation function (Softmax), the obtained enhanced local features of pedestrians are input into the feedforward neural network, transformed by a linear layer, and the activation function (Softmax) maps the pedestrian probability distribution into classes, identifying pedestrians;
Step 2: based on the identified pedestrians and the image annotation sample coordinates in the original surveillance video image data set, the intersection-over-union of the two sets of coordinates is calculated; the precision and recall rates are calculated, where the precision rate concerns the identified pedestrians and indicates how many of the samples predicted as positive are real pedestrians, and the recall rate concerns the image annotation samples in the original surveillance video image data set and indicates how many pedestrians among the positive examples in the samples are correctly identified.
Further, the eighth stage is specifically: the joint pedestrian re-identification model and pedestrian recognition, divided into the following sub-steps: Step 1: to prevent gradient explosion and gradient vanishing of the joint pedestrian re-identification model during training, residual connections are used to accelerate model convergence; the model is trained iteratively and the training parameters are adjusted to obtain the joint pedestrian re-identification model;
Step 2: according to the joint pedestrian re-identification model trained in Step 1, the original surveillance video image test set is input for prediction and pedestrians are box-selected in the image, achieving pedestrian re-identification.
An embodiment of the present invention further provides a joint modeling apparatus for enhancing local features of pedestrians, including the following modules: an original surveillance video image sample set acquisition module for obtaining the original data set; an image segmentation module that segments the image according to its channels to obtain image blocks; a pedestrian local feature module that constructs a multi-head attention neural network to extract local features of pedestrians; an inter-channel difference feature module that constructs an enhanced channel feature neural network to capture difference features between pedestrian image channels; a pedestrian image spatial difference feature module that constructs an enhanced spatial feature neural network to scan the spatial difference features of pedestrian images; an enhanced pedestrian local feature module that interactively splices the pedestrian local features of the multi-head attention neural network, the inter-channel difference features of the enhanced channel feature neural network, and the spatial difference features of pedestrian images of the enhanced spatial feature neural network for joint modeling; a pedestrian recognition module that constructs a feedforward neural network and maps the enhanced local features of pedestrians into a pedestrian probability output through linear transformation; a model training module that iteratively trains the convolutional neural networks and the multi-head attention neural network, updating the model parameters until training converges, to obtain the pedestrian recognition model; and an image pedestrian recognition module that uses the pedestrian recognition model to identify pedestrians in the test set.
Referring to Figure 6, an embodiment of the present invention further provides a joint modeling apparatus for enhancing local features of pedestrians, which further includes a memory and one or more processors; the memory stores executable code, and when the one or more processors execute the executable code, they implement the joint modeling method for enhancing local features of pedestrians of the above embodiment.
The embodiment of the joint modeling apparatus for enhancing local features of pedestrians of the present invention can be applied to any device with data processing capabilities, such as a computer. The apparatus embodiments may be implemented by software, or by hardware or a combination of software and hardware. Taking software implementation as an example, as a logical apparatus it is formed by the processor of the device with data processing capabilities reading the corresponding computer program instructions from non-volatile memory into memory and running them. At the hardware level, Figure 6 shows a hardware structure diagram of a device with data processing capabilities on which the joint modeling apparatus for enhancing local features of pedestrians is located; in addition to the processor, memory, network interface, and non-volatile memory shown in Figure 6, the device with data processing capabilities in the embodiment may also include other hardware according to its actual functions, which will not be described again. For the implementation process of the functions and effects of each unit in the above apparatus, refer to the implementation process of the corresponding steps in the above method, which will not be repeated here.
As for the apparatus embodiment, since it basically corresponds to the method embodiment, refer to the partial description of the method embodiment for relevant details. The apparatus embodiments described above are only illustrative; the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the present invention. Persons of ordinary skill in the art can understand and implement this without creative effort.
An embodiment of the present invention further provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, the joint modeling method for enhancing local features of pedestrians in the above embodiments is implemented.
The computer-readable storage medium may be an internal storage unit of any device with data processing capabilities described in any of the foregoing embodiments, such as a hard disk or a memory. The computer-readable storage medium may also be an external storage device of any device with data processing capabilities, such as a plug-in hard disk, smart media card (Smart Media Card, SMC), SD card, or flash card (Flash Card) provided on the device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of any device with data processing capabilities. The computer-readable storage medium is used to store the computer program and other programs and data required by the device with data processing capabilities, and can also be used to temporarily store data that has been or will be output.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (12)

  1. A joint modeling method for enhancing local features of pedestrians, characterized by comprising the following steps:
    S1: obtaining an original surveillance video image data set, and dividing the original surveillance video image data set into a training set and a test set in proportion;
    S2: cutting the surveillance video image training set to obtain an image block vector sequence;
    S3: constructing a multi-head attention neural network, inputting the image block vector sequence into the multi-head attention neural network, and extracting local features of pedestrians;
    S4: constructing an enhanced channel feature neural network, inputting the image into the enhanced channel feature neural network, and using three-channel image convolution to capture difference features between pedestrian image channels;
    S5: constructing an enhanced spatial feature neural network, inputting the image into the enhanced spatial feature neural network, and using spatial convolution to scan for spatial difference features of the pedestrian image;
    S6: interactively splicing the pedestrian local features of the multi-head attention neural network, the difference features between pedestrian image channels of the enhanced channel feature neural network, and the spatial difference features of pedestrian images of the enhanced spatial feature neural network, and performing joint modeling to enhance the local features of pedestrians;
    S7: inputting the enhanced local features of pedestrians into a feedforward neural network to identify pedestrians in the image;
    S8: iteratively training the neural network obtained by joint modeling to obtain a joint pedestrian re-identification model and identify pedestrians.
  2. The joint modeling method for enhancing local features of pedestrians according to claim 1, characterized in that the original surveillance video image data set in step S1 includes image annotation samples, image annotation sample coordinate files, and unlabeled samples.
  3. The joint modeling method for enhancing local features of pedestrians according to claim 1, characterized in that step S2 includes the following sub-steps:
    S21: segmenting the surveillance video images according to the number of image channels to obtain image blocks;
    S22: converting the height and width of the image blocks into the fixed input size of the multi-head attention neural network;
    S23: tiling the image blocks into a sequence to obtain the image block vector sequence.
  4. The joint modeling method for enhancing local features of pedestrians according to claim 1, characterized in that the length of the image block vector sequence equals the image size multiplied by the image height multiplied by the image width, the image block vector sequence contains the image block position coordinates, the sequence is converted into a matrix, and the matrix serves as the input of the multi-head attention neural network.
  5. The joint modeling method for enhancing local features of pedestrians according to claim 1, characterized in that step S3 includes the following sub-steps:
    S31: calculating a single attention: for the query matrix, key matrix, and value matrix of the image block vector sequence in step S3, the query matrix is multiplied by the key matrix to obtain the attention score matrix; the attention score matrix is applied to the value matrix, and the product of the two matrices is passed through an activation function to obtain a single attention;
    S32: constructing multi-head attention: for the image block vector sequence, a single attention is calculated for each image block vector sequence, and the single attentions calculated for each image block vector sequence are interactively combined to obtain the multi-head attention;
    S33: extracting pedestrian local features with multi-head attention: the image block vector sequence is input into the constructed multi-head attention neural network, a local multi-head self-attention mechanism computes the local self-attention between the pixels of each image and the pixels of adjacent images, and pedestrian local features are extracted through parallel matrix multiplication.
  6. The joint modeling method for enhancing local features of pedestrians according to claim 1, characterized in that step S4 includes the following sub-steps:
    S41: for the three channels of the input image, constructing a three-channel enhanced image convolutional neural network, which includes three convolution kernels respectively corresponding to the three channels of the image;
    S42: the three convolution kernels separately learn the weight parameters of their corresponding image channels and output three different sets of weight parameters;
    S43: the three convolution kernels are calculated independently, learning the difference parameter weights between the three channels, to obtain three channel feature space maps, which are interactively calculated to obtain the pedestrian image channel features.
  7. The joint modeling method for enhancing local features of pedestrians according to claim 1, characterized in that step S5 includes the following sub-steps:
    S51: defining a two-dimensional convolution and splitting it spatially into two sub-convolution kernels;
    S52: using the two sub-convolution kernels to scan the image spatial features respectively to obtain two spatial features, and matrix-multiplying the two spatial features to obtain the spatial difference features of the pedestrian image.
  8. The joint modeling method for enhancing local features of pedestrians according to claim 1, characterized in that step S6 includes the following sub-steps:
    S61: interactive splicing of the enhanced channel feature neural network to the multi-head attention neural network: first, the output of the convolutional network passes through a global average pooling layer; a first layer of three-channel convolution learns the weight parameters between image channels; after the first-layer activation function, a second layer of three-channel convolution transforms the dimension; finally, an activation function converts the feature values into a probability distribution, which is input into the multi-head self-attention branch for calculation;
    S62: interactive splicing of the multi-head attention neural network to the enhanced channel feature neural network: the output of the multi-head attention calculation passes through a first layer of three-channel convolution, which learns the different weight parameters between the three channels and converts the number of image channels to one; after the first-layer activation function, a second layer of three-channel convolution reduces the learned weight parameters; after the second-layer activation function, the result becomes a probability distribution in the spatial dimension, serving as the output of the enhanced channel feature convolutional network branch;
    S63: interactive splicing of the enhanced spatial feature neural network to the enhanced channel feature neural network: the two sub-convolutions of the two-dimensional convolution of the enhanced spatial feature neural network output a pedestrian multi-dimensional convolutional spatial feature matrix, which is converted into a two-dimensional spatial feature matrix and, through matrix multiplication followed by an activation function, serves as the output of the enhanced channel feature neural network;
    S64: the output of the multi-head attention, the output of the enhanced channel feature convolution, and the output of the spatial convolution are input into a multi-layer perceptron; the pedestrian local features are mapped to parallel branches through a linear layer for feature fusion calculation, obtaining the enhanced pedestrian local features.
  9. The joint modeling method for enhancing local features of pedestrians according to claim 1, characterized in that step S7 includes the following sub-steps:
    S71: using a feedforward neural network with an activation function, the obtained enhanced local features of pedestrians are input into the feedforward neural network, transformed by a linear layer, and the activation function maps the pedestrian probability distribution into classes, identifying pedestrians;
    S72: based on the identified pedestrians and the image annotation sample coordinates in the original surveillance video image data set, the intersection-over-union of the two sets of coordinates is calculated, and the precision and recall rates are calculated, where the precision rate concerns the identified pedestrians and represents the proportion of real pedestrians among the samples predicted as positive, and the recall rate concerns the image annotation samples in the original surveillance video image data set and represents the proportion of correctly identified pedestrians among the positive examples in the samples.
  10. The joint modeling method for enhancing local features of pedestrians according to claim 1, characterized in that step S8 includes the following sub-steps:
    S81: applying residual connections to the neural network obtained by joint modeling to accelerate model convergence, training iteratively, and adjusting the training parameters to obtain the joint pedestrian re-identification model;
    S82: according to the joint pedestrian re-identification model trained in step S81, inputting the original surveillance video image test set for prediction, and box-selecting pedestrians in the image to achieve pedestrian re-identification.
  11. A joint modeling apparatus for enhancing local features of pedestrians, characterized by comprising the following modules:
    an original surveillance video image sample set acquisition module, used to obtain the original data set;
    an image segmentation module, which segments the image according to its channels to obtain image blocks;
    a pedestrian local feature module, used to construct a multi-head attention neural network and extract local features of pedestrians;
    an inter-channel difference feature module, used to construct an enhanced channel feature neural network that uses a convolutional neural network to capture difference features between pedestrian image channels;
    a pedestrian image spatial difference feature module, used to construct an enhanced spatial feature neural network and scan the spatial difference features of pedestrian images;
    an enhanced pedestrian local feature module, used to interactively splice the pedestrian local features of the multi-head attention neural network, the difference features between pedestrian image channels of the enhanced channel feature neural network, and the spatial difference features of pedestrian images of the enhanced spatial feature neural network, for joint modeling;
    a pedestrian recognition module, used to construct a feedforward neural network that maps the enhanced local features of pedestrians into a pedestrian probability output through linear transformation;
    a model training module, used to iteratively train the neural network obtained by joint modeling, updating the model parameters until training converges, to obtain the pedestrian recognition model;
    an image pedestrian recognition module, which uses the pedestrian recognition model to identify pedestrians in the test set.
  12. A joint modeling apparatus for enhancing local features of pedestrians, characterized in that the apparatus includes a memory and one or more processors, the memory stores executable code, and when the one or more processors execute the executable code, they implement the joint modeling method for enhancing local features of pedestrians according to any one of claims 1-10.
PCT/CN2022/124009 2022-09-22 2022-10-09 Joint modeling method and apparatus for enhancing local features of pedestrians WO2024060321A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/072,002 US11810366B1 (en) 2022-09-22 2022-11-30 Joint modeling method and apparatus for enhancing local features of pedestrians

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211155651.9A CN115240121B (zh) 2022-09-22 2022-09-22 Joint modeling method and apparatus for enhancing local features of pedestrians
CN202211155651.9 2022-09-22

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/072,002 Continuation US11810366B1 (en) 2022-09-22 2022-11-30 Joint modeling method and apparatus for enhancing local features of pedestrians

Publications (1)

Publication Number Publication Date
WO2024060321A1 true WO2024060321A1 (zh) 2024-03-28

Family

ID=83667112

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/124009 WO2024060321A1 (zh) 2022-09-22 Joint modeling method and apparatus for enhancing local features of pedestrians

Country Status (2)

Country Link
CN (1) CN115240121B (zh)
WO (1) WO2024060321A1 (zh)


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160297B (zh) * 2019-12-31 2022-05-13 武汉大学 Pedestrian re-identification method and device based on a spatio-temporal joint model with residual attention mechanism
CN111507217A (zh) * 2020-04-08 2020-08-07 南京邮电大学 Pedestrian re-identification method based on local discriminative feature fusion
CN111539370B (zh) * 2020-04-30 2022-03-15 华中科技大学 Image pedestrian re-identification method and system based on multi-attention joint learning
CN112818931A (zh) * 2021-02-26 2021-05-18 中国矿业大学 Multi-scale pedestrian re-identification method based on multi-granularity deep feature fusion
CN113516012B (zh) * 2021-04-09 2022-04-15 湖北工业大学 Pedestrian re-identification method and system based on multi-level feature fusion
CN113723366B (zh) * 2021-10-25 2022-03-25 山东力聚机器人科技股份有限公司 Pedestrian re-identification method, device, and computer equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180150704A1 (en) * 2016-11-28 2018-05-31 Kwangwoon University Industry-Academic Collaboration Foundation Method of detecting pedestrian and vehicle based on convolutional neural network by using stereo camera
CN111368815A (zh) * 2020-05-28 2020-07-03 之江实验室 Pedestrian re-identification method based on multi-part self-attention mechanism
CN112836646A (zh) * 2021-02-05 2021-05-25 华南理工大学 Video pedestrian re-identification method based on channel attention mechanism and application thereof
CN113221625A (zh) * 2021-03-02 2021-08-06 西安建筑科技大学 Pedestrian re-identification method with local feature alignment using deep learning
CN114783003A (zh) * 2022-06-23 2022-07-22 之江实验室 Pedestrian re-identification method and device based on local feature attention

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118015662A (zh) * 2024-04-09 2024-05-10 沈阳二一三电子科技有限公司 Cross-camera pedestrian re-identification method based on Transformer multi-head self-attention mechanism

Also Published As

Publication number Publication date
CN115240121A (zh) 2022-10-25
CN115240121B (zh) 2023-01-03

Similar Documents

Publication Publication Date Title
CN111709409B (zh) Face liveness detection method, apparatus, device, and medium
US11810366B1 (en) Joint modeling method and apparatus for enhancing local features of pedestrians
US10592780B2 (en) Neural network training system
Zeng et al. 3dmatch: Learning local geometric descriptors from rgb-d reconstructions
US20200226781A1 (en) Image processing method and apparatus
WO2022000420A1 (zh) Human action recognition method, human action recognition system, and device
WO2020228525A1 (zh) Place recognition method and apparatus, model training method and apparatus therefor, and electronic device
CN114202672A (zh) Small object detection method based on attention mechanism
CN110717411A (zh) Pedestrian re-identification method based on deep feature fusion
CN108648224B (zh) Real-time scene layout recognition and reconstruction method based on artificial neural network
CN110909651A (zh) Method, apparatus, and device for identifying the main person in a video, and readable storage medium
WO2024021394A1 (zh) Pedestrian re-identification method and device fusing global features and stepped local features
WO2023082784A1 (zh) Pedestrian re-identification method and device based on local feature attention
Zou et al. 3d manhattan room layout reconstruction from a single 360 image
CN111709313B (zh) Pedestrian re-identification method based on combined local and channel features
KR20100098641A (ko) Invariant visual scene and object recognition
CN113408343B (zh) Classroom action recognition method based on dual-scale spatio-temporal block mutual attention
CN114283351A (zh) Video scene segmentation method, apparatus, device, and computer-readable storage medium
WO2024060321A1 (zh) Joint modeling method and apparatus for enhancing local features of pedestrians
CN117456136A (zh) Intelligent digital twin scene generation method based on multimodal visual recognition
Zhu et al. Simple, effective and general: A new backbone for cross-view image geo-localization
Wan et al. Drone image stitching using local mesh-based bundle adjustment and shape-preserving transform
CN114882537A (zh) Novel-view finger image generation method based on neural radiance fields
CN111368733A (zh) 3D hand pose estimation method based on label distribution learning, storage medium, and terminal
CN114037046A (zh) Neural network model distillation method, apparatus, and electronic system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22959323

Country of ref document: EP

Kind code of ref document: A1