CN118155275A - Training method of pedestrian re-recognition model, pedestrian re-recognition method and related device - Google Patents

Training method of pedestrian re-recognition model, pedestrian re-recognition method and related device


Publication number
CN118155275A
Authority
CN
China
Prior art keywords
image
pedestrian
text
recognition
human body
Prior art date
Legal status
Pending
Application number
CN202410155047.9A
Other languages
Chinese (zh)
Inventor
林垠
陈叶瀚森
沙文
殷兵
吴子扬
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202410155047.9A
Publication of CN118155275A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a training method for a pedestrian re-recognition model, a pedestrian re-recognition method, and a related device. The training method extracts image features of a sample human body image through an image encoder and extracts text features of a pedestrian re-recognition prompt through a text encoder. The prompt contains a learnable feature, which is determined by aligning image features and text features. Parameters of the image encoder are then optimized so that the pedestrian recognition result determined from the image features and text features is consistent with the pedestrian recognition label corresponding to the sample human body image. The model is thus trained from both the text and image modalities, improving its feature extraction capability; introducing the learnable feature, obtained by aligning image features and text features, into the pedestrian re-recognition prompt makes the corresponding text features better suited to the pedestrian re-recognition task and improves the recognition accuracy of the pedestrian re-recognition technology.

Description

Training method of pedestrian re-recognition model, pedestrian re-recognition method and related device
Technical Field
The present application relates to the field of pedestrian re-recognition technologies, and in particular, to a training method for a pedestrian re-recognition model, a pedestrian re-recognition method, and a related device.
Background
Pedestrian re-identification (ReID) refers to the task of retrieving a specific pedestrian across images, typically captured by different cameras. At present, pedestrian re-identification technology is widely applied in fields such as traffic management and public safety. How to improve the recognition accuracy of pedestrian re-identification, and thereby its application effect, is therefore a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the application provides a training method of a pedestrian re-recognition model, a pedestrian re-recognition method and a related device, and the method can improve the recognition accuracy of the pedestrian re-recognition technology, thereby improving the application effect of the pedestrian re-recognition technology.
The technical scheme provided by the application is as follows:
In a first aspect, an embodiment of the present application provides a training method for a pedestrian re-recognition model, including:
Extracting image features corresponding to the sample human body image through an image encoder, and extracting text features corresponding to the pedestrian re-recognition prompt words through a text encoder; the pedestrian re-recognition prompt comprises a learnable feature and a descriptive text of the sample human body image, wherein the learnable feature is determined by aligning an image feature output by the image encoder and a text feature output by the text encoder;
And optimizing parameters of the image encoder by taking the consistency of the pedestrian recognition result determined based on the image features and the text features and the pedestrian recognition label corresponding to the sample human body image as targets, wherein the image encoder is used for constructing and obtaining a pedestrian re-recognition model.
In a second aspect, an embodiment of the present application provides a pedestrian re-recognition method, including:
Inputting an image to be detected into a pre-trained pedestrian re-recognition model to obtain a detection result output by the pedestrian re-recognition model; the image encoder of the pedestrian re-recognition model is obtained by training the training method of the pedestrian re-recognition model.
In a third aspect, an embodiment of the present application provides a training apparatus for a pedestrian re-recognition model, including:
The extraction module is used for extracting image features corresponding to the sample human body images through the image encoder and extracting text features corresponding to the pedestrian re-recognition prompt words through the text encoder; the pedestrian re-recognition prompt comprises a learnable feature and a descriptive text of the sample human body image, wherein the learnable feature is determined by aligning an image feature output by the image encoder and a text feature output by the text encoder;
And the optimization module is used for optimizing parameters of the image encoder by taking the consistency of the pedestrian recognition result determined based on the image features and the text features and the pedestrian recognition label corresponding to the sample human body image as a target, and the image encoder is used for constructing and obtaining a pedestrian re-recognition model.
In a fourth aspect, an embodiment of the present application provides a pedestrian re-recognition apparatus including:
The pedestrian re-recognition module is used for inputting the image to be detected into a pre-trained pedestrian re-recognition model to obtain a detection result output by the pedestrian re-recognition model; the image encoder of the pedestrian re-recognition model is obtained by training the training method of the pedestrian re-recognition model.
In a fifth aspect, an embodiment of the present application provides an electronic device, including:
a memory and a processor; wherein the memory is used for storing programs; the processor is configured to implement the training method of the pedestrian re-recognition model according to any one of the above and/or implement the pedestrian re-recognition method according to any one of the above by running the program in the memory.
In a sixth aspect, an embodiment of the present application provides a storage medium having stored thereon a computer program that, when executed by a processor, implements the training method of the pedestrian re-recognition model described in any one of the above, and/or implements the pedestrian re-recognition method described in any one of the above.
In a seventh aspect, embodiments of the present application provide a computer program product comprising a computer program or computer instructions which, when executed by a processor, implement a training method of a pedestrian re-recognition model as described in any one of the above, and/or implement a pedestrian re-recognition method as described in any one of the above.
According to the training method of the pedestrian re-recognition model, image features corresponding to the sample human body image are extracted through the image encoder, and text features corresponding to the pedestrian re-recognition prompt are extracted through the text encoder; the pedestrian re-recognition prompt comprises a learnable feature and descriptive text of the sample human body image, wherein the learnable feature is determined by aligning the image features output by the image encoder and the text features output by the text encoder; and parameters of the image encoder are optimized with the aim that the pedestrian recognition result determined based on the image features and the text features is consistent with the pedestrian recognition label corresponding to the sample human body image, the image encoder being used to construct the pedestrian re-recognition model. With this arrangement, the pedestrian re-recognition model is trained from both the text and image modalities, improving its feature extraction capability; introducing the learnable feature, obtained by aligning the image features output by the image encoder and the text features output by the text encoder, into the pedestrian re-recognition prompt makes the text features of the prompt, after encoding by the text encoder, better suited to the pedestrian re-recognition task, thereby improving the recognition accuracy and application effect of the pedestrian re-recognition technology.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only embodiments of the present application, and other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a training method of a pedestrian re-recognition model according to an embodiment of the present application.
FIG. 2 is a schematic diagram of optimizing a learnable feature provided by an embodiment of the present application.
Fig. 3 is a schematic diagram of parameter adjustment of an image encoder according to an embodiment of the present application.
Fig. 4 is a flowchart of a pedestrian re-recognition method according to an embodiment of the present application.
Fig. 5 is a schematic structural diagram of a training device for a pedestrian re-recognition model according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of a pedestrian re-recognition device according to an embodiment of the present application.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The person re-identification (ReID) task refers to retrieving a specific pedestrian across images, typically captured by different cameras. Pedestrian re-recognition has received increasing attention in both academia and industry in recent years due to its practical importance in intelligent video surveillance.
At present, classical pedestrian re-identification schemes are mainly based on deep learning. The usual training paradigm constructs a single-branch image appearance extraction model and adopts structural designs and loss function designs to enhance the model's characterization capability and feature discriminability. This yields a more generalizable and robust pedestrian appearance representation for judging feature similarity, thereby completing the pedestrian re-identification task.
Through years of development, pedestrian re-recognition systems which rely only on single-mode information have approached the effect bottleneck. Therefore, how to improve the recognition accuracy of the pedestrian re-recognition technology, and further improve the application effect of the pedestrian re-recognition technology, is a technical problem to be solved by those skilled in the art.
Based on the above, the application provides a training method of a pedestrian re-recognition model, a pedestrian re-recognition method and a related device.
The embodiment of the application provides a training method of a pedestrian re-identification model, which can be executed by electronic equipment, wherein the electronic equipment can be any equipment with data and instruction processing functions, such as a computer, an intelligent terminal, a server and the like. Referring to fig. 1, the method includes:
s101, extracting image features corresponding to the sample human body images through an image encoder, and extracting text features corresponding to pedestrian re-recognition prompt words through a text encoder.
The image encoder and the text encoder are obtained after pre-training. In some embodiments, the image encoder and the text encoder may be pre-trained by language-image contrastive learning on large-scale image-text pair data. In other embodiments, the pre-trained image encoder and text encoder of the open-source model CLIP (Contrastive Language-Image Pretraining) can be adopted directly, which reduces the cost of pre-training the text encoder and the image encoder and better exploits the gains from large-scale image-text pair data.
The sample human body image refers to a training sample of the pedestrian re-recognition model. A large number of pedestrian images may be acquired as sample human body images, and the embodiment is not limited. The pedestrian image may be a picture of a pedestrian or a video frame extracted from a video including a pedestrian, and the embodiment is not limited.
The pedestrian re-recognition prompt comprises a learnable feature and descriptive text of the sample human body image. The descriptive text of the sample human body image refers to text describing the overall content of the sample human body image. For example, if the content of the sample human body image is a pedestrian, the corresponding descriptive text may be "a photo of a person". The learnable feature may be composed of sets of feature vectors and has learning capability, so that it can be aligned with text features and image features in a high-dimensional space. For example, in some specific embodiments, the learnable feature may be represented as [x1][x2][x3][x4], where [x1], [x2], [x3] and [x4] each represent a set of vectors.
The pedestrian re-recognition prompt can be obtained by fusing the learnable features with the descriptive text of the sample human body image. For example, in some specific embodiments, the descriptive text is "a photo of a person", the learnable feature is [x1][x2][x3][x4], and the pedestrian re-recognition prompt may be "a photo of a [x1][x2][x3][x4] person".
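As an illustration only, the structure of this template can be sketched in Python. In the actual model each [xi] is a learnable embedding vector rather than a literal string, and the helper function below is hypothetical, not part of the patent's method:

```python
# Hypothetical helper: shows only how the learnable-context placeholders
# [x1]..[x4] slot into the descriptive text "a photo of a ... person".
def build_reid_prompt(num_learnable: int = 4) -> str:
    ctx = "".join(f"[x{i}]" for i in range(1, num_learnable + 1))
    return f"a photo of a {ctx} person"

print(build_reid_prompt())  # a photo of a [x1][x2][x3][x4] person
```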
Further, the learnable features are determined by aligning image features output by the image encoder with text features output by the text encoder.
In some embodiments, the learnable feature may be obtained by:
(1) Extracting first image features corresponding to the sample human body image through an image encoder, and extracting first text features corresponding to the first pedestrian re-recognition prompt words through a text encoder;
(2) Optimizing the first learnable feature with the aim of aligning the first image feature with the first text feature, to obtain the learnable feature.
Here, the sample human body image refers to the training sample used when optimizing the learnable features. Sample human body images are also required as training samples when training the pedestrian re-recognition model. Therefore, a large number of sample human body images can be acquired in advance, with one part used as sample human body images for optimizing the learnable features and another part used as sample human body images for training the pedestrian re-recognition model. Within the pre-acquired set, the images used for the two purposes may overlap or be disjoint; this embodiment is not limited.
The first pedestrian re-recognition prompt includes a first learnable feature and descriptive text of the sample human body image. The first learnable feature refers to the learnable feature before optimization; the descriptive text of the sample human body image has the same meaning as in the above embodiment, namely text describing the overall content of the sample human body image.
The sample human body image can be input into the image encoder, so that the image encoder encodes the sample human body image to obtain a first image characteristic extracted from the sample human body image by the image encoder; and inputting the first pedestrian re-recognition prompt word into a text encoder so that the text encoder encodes the first pedestrian re-recognition prompt word to obtain a first text feature extracted from the first pedestrian re-recognition prompt word by the text encoder.
After the first image feature and the first text feature are obtained, the first learnable feature is optimized with the aim of aligning them, yielding the learnable feature. Specifically, the difference between the first image feature and the first text feature may be determined and characterized by a loss function, such as an L1 norm loss (L1-Loss), an L2 norm loss (L2-Loss), or a contrastive learning loss (InfoNCE-Loss); this embodiment is not limited. The first learnable feature is then optimized with the aim of reducing the difference between the first image feature and the first text feature, obtaining the learnable feature.
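Of the loss functions named above, the contrastive learning loss (InfoNCE) is the least standard to write down, so a minimal NumPy sketch may help. The batch size, feature dimension and temperature below are illustrative assumptions, not values from the patent:

```python
import numpy as np

def info_nce_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE loss; the i-th image and i-th text form a positive pair."""
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (B, B) cosine-similarity logits
    labels = np.arange(len(img))

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()       # -log p(correct pair)

    # Average the image-to-text and text-to-image directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 64))
aligned = info_nce_loss(feats, feats)                        # perfectly aligned pairs
mismatched = info_nce_loss(feats, rng.normal(size=(8, 64)))  # unrelated pairs
```

Minimizing this loss pulls each first image feature toward the text feature of its own prompt while pushing it away from the text features of the other samples in the batch.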
In some embodiments, a gradient update strategy is used when optimizing the loss function, to avoid possible problems such as gradient explosion and gradient vanishing.
In a specific embodiment, the first image feature corresponding to the sample human body image may be extracted by the image encoder, and the first text feature corresponding to the first pedestrian re-recognition prompt may be extracted by the text encoder. The first learnable feature in the current first pedestrian re-recognition prompt is the initial learnable feature that has not yet been optimized; for convenience of distinction, it is defined in this embodiment as first learnable feature A. A loss value is determined according to the difference between the first image feature and the first text feature, and the parameters of first learnable feature A are adjusted with the aim of reducing the loss value, obtaining the first learnable feature after the first optimization, defined in this embodiment as first learnable feature B for convenience of distinction.
First image features corresponding to the sample human body image are then extracted again through the image encoder, and first text features corresponding to the first pedestrian re-recognition prompt are extracted through the text encoder; the first learnable feature in the current first pedestrian re-recognition prompt is now first learnable feature B. A loss value is determined according to the difference between the first image feature and the first text feature, and the parameters of first learnable feature B are adjusted with the aim of reducing the loss value, obtaining the first learnable feature after optimization again.
This process is then repeated: each time, the optimized first learnable feature obtained in the previous iteration is placed into the first pedestrian re-recognition prompt, the first text feature corresponding to the prompt is extracted through the text encoder, the first image feature corresponding to the sample human body image is extracted through the image encoder, and a loss value is determined based on the difference between the first image feature and the first text feature. The iteration stops once the loss value is detected to be smaller than a set value, and the first learnable feature at that point is taken as the learnable feature. The set value may be chosen according to actual conditions; this embodiment is not limited.
In the process of optimizing the first learnable feature, only the first learnable feature is adjusted; the weights of the image encoder and the text encoder are not adjusted and remain the weights obtained when pre-training was completed.
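The iterative procedure above can be sketched as follows. This is a toy NumPy illustration in which the frozen text encoder is replaced by a fixed, well-conditioned linear map and the L2 loss is used; all dimensions, the learning rate and the stand-in encoder are assumptions for illustration, not the patent's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(42)
D = 32
img_feat = rng.normal(size=D)                        # first image feature (image encoder frozen)
W_text = np.eye(D) + 0.01 * rng.normal(size=(D, D))  # stand-in for the frozen text encoder

ctx = rng.normal(size=D)                             # first learnable feature A
lr, set_value = 0.05, 1e-3

for step in range(2000):
    txt_feat = W_text @ ctx                          # first text feature for the current prompt
    diff = txt_feat - img_feat
    loss = 0.5 * (diff ** 2).sum()                   # L2 loss between the two features
    if loss < set_value:                             # stop once the loss falls below the set value
        break
    ctx -= lr * (W_text.T @ diff)                    # adjust only the learnable feature;
                                                     # encoder weights are never updated
```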
By means of the arrangement, the learnable features obtained by aligning the image features output by the image encoder and the text features output by the text encoder are introduced into the pedestrian re-recognition prompt word, so that the text features of the pedestrian re-recognition prompt word after being encoded by the text encoder can be more suitable for a pedestrian re-recognition task.
In the embodiment of the application, a sample human body image is input into an image encoder to obtain the image characteristics output by the image encoder, and a pedestrian re-recognition prompt word is input into a text encoder to obtain the text characteristics output by the text encoder.
S102, optimizing parameters of an image encoder by taking the consistency of a pedestrian recognition result determined based on the image features and the text features and a pedestrian recognition label corresponding to a sample human body image as a target, wherein the image encoder is used for constructing and obtaining a pedestrian re-recognition model.
The pedestrian recognition result can be determined based on the image feature and the text feature. Specifically, the image features and the text features can be fused, and then the pedestrian recognition result is obtained after the processes of activation, normalization and the like.
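As one possible reading of "fused, then activated and normalized", the sketch below scores an image feature against one text feature per pedestrian identity and normalizes the scores with a softmax. This interpretation and all shapes are assumptions for illustration; the patent does not fix the exact fusion operator:

```python
import numpy as np

def recognition_probs(img_feat, id_text_feats):
    """Cosine-similarity scores against per-identity text features, softmax-normalized."""
    img = img_feat / np.linalg.norm(img_feat)
    txt = id_text_feats / np.linalg.norm(id_text_feats, axis=1, keepdims=True)
    logits = txt @ img                 # one similarity score per identity
    logits = logits - logits.max()     # numerical stability
    p = np.exp(logits)
    return p / p.sum()                 # probability over identities

rng = np.random.default_rng(1)
id_texts = rng.normal(size=(5, 16))                     # text features of 5 identities
probs = recognition_probs(id_texts[2] * 3.0, id_texts)  # image aligned with identity 2
```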
The pedestrian recognition result comprises pedestrian information in a sample human body image, which is determined based on the image characteristics and the text characteristics; the pedestrian identification label corresponding to the sample human body image refers to pedestrian information in the sample human body image marked in advance based on the real content of the sample human body image. In some embodiments, pedestrian information in the sample human body image may be labeled by adopting a manual labeling mode, and the embodiment is not limited.
The parameters of the image encoder can be optimized with the aim that the pedestrian recognition result is consistent with the pedestrian recognition tag. The process of optimizing the parameters of the image encoder with the pedestrian recognition result being consistent with the pedestrian recognition tag is actually a supervised training process for the image encoder based on the pedestrian recognition result and the pedestrian recognition tag.
After the pedestrian recognition result is obtained, a difference between the pedestrian recognition result and the pedestrian recognition tag may be determined. The difference between the pedestrian recognition result and the pedestrian recognition tag can be characterized by a Loss function, for example, L1-Loss, L2-Loss, infoNCE-Loss, etc., and the present embodiment is not limited. And optimizing parameters of the image encoder with the aim of reducing the difference between the pedestrian recognition result and the pedestrian recognition label.
The specific training process is as follows:
And extracting image features corresponding to the sample human body image through an image encoder, extracting text features corresponding to the pedestrian re-recognition prompt words through a text encoder, and determining a pedestrian recognition result according to the image features and the text features. And taking the difference between the pedestrian recognition result and the pedestrian recognition tag as the loss of the image encoder, aiming at reducing the loss of the image encoder, and adjusting the parameters of the image encoder. And repeating the training process until the loss value of the image encoder is smaller than the set value, and finishing the training of the image encoder. It should be noted that the set value may be set according to actual situations, and the present embodiment is not limited.
Further, a pedestrian re-recognition model can be constructed based on the trained image encoder. In some embodiments, when the pedestrian re-recognition is performed, the image to be detected may be input into a pre-trained pedestrian re-recognition model, so as to obtain a detection result output by the pedestrian re-recognition model.
In the above embodiment, when the parameters of the image encoder are adjusted, the parameters of the text encoder are frozen, i.e., the parameters of the text encoder are not adjusted.
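A toy NumPy version of this supervised fine-tuning loop is sketched below: a linear map stands in for the trainable image encoder, the per-identity text features are frozen, and cross-entropy between the recognition result and the identity label drives the updates. Everything here (dimensions, synthetic data, learning rate) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(7)
D, C, N = 16, 4, 200
txt_feats = rng.normal(size=(C, D))        # frozen: one text feature per identity
labels = rng.integers(0, C, size=N)        # pedestrian recognition labels
images = txt_feats[labels] + 0.1 * rng.normal(size=(N, D))  # toy "images"

W_img = 0.01 * rng.normal(size=(D, D))     # trainable stand-in for the image encoder
lr = 0.1

def forward(x):
    logits = (x @ W_img.T) @ txt_feats.T   # image features scored vs. frozen text features
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

for _ in range(500):
    p = forward(images)
    grad_logits = (p - np.eye(C)[labels]) / N     # cross-entropy gradient
    grad_feats = grad_logits @ txt_feats
    W_img -= lr * (grad_feats.T @ images)         # only the image encoder is updated

accuracy = (forward(images).argmax(axis=1) == labels).mean()
```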
The text features extracted from the learnable features and the descriptive text of the sample human body images are better suited to the pedestrian re-identification task. Fine-tuning the image encoder using these text features together with the image features gives the image features output by the image encoder identity discriminability and makes them better suited to the pedestrian re-identification task, thereby completing that task. Moreover, training the image encoder from both the text and image modalities effectively improves its feature extraction capability.
In the above embodiment, the image encoder is used to extract the image features corresponding to the sample human body image, and the text encoder is used to extract the text features corresponding to the pedestrian re-recognition prompt; the pedestrian re-recognition prompt comprises a learnable feature and descriptive text of the sample human body image, wherein the learnable feature is determined by aligning the image features output by the image encoder and the text features output by the text encoder; and parameters of the image encoder are optimized with the aim that the pedestrian recognition result determined based on the image features and the text features is consistent with the pedestrian recognition label corresponding to the sample human body image, the image encoder being used to construct the pedestrian re-recognition model. With this arrangement, the pedestrian re-recognition model is trained from both the text and image modalities, improving its feature extraction capability; introducing the learnable feature into the pedestrian re-recognition prompt makes the text features of the prompt, after encoding by the text encoder, better suited to the pedestrian re-recognition task, thereby improving the recognition accuracy and application effect of the pedestrian re-recognition technology.
As an optional implementation manner, the training method of the pedestrian re-recognition model in the above embodiment may specifically include the following steps:
Extracting region image features corresponding to a specific part of the human body from the first image features, and extracting second text features corresponding to a second pedestrian re-recognition prompt through the text encoder; the learnable feature is optimized with the aim of aligning the region image features and the second text features.
In the embodiments of the present application, in addition to determining a learnable feature by aligning an image feature output from an image encoder and a text feature output from a text encoder according to the descriptions of the above embodiments, the learnable feature may be further optimized based on a fine-grained feature.
Specifically, a region image feature corresponding to a specific part of the human body is extracted based on the first image feature. The specific parts of the human body can comprise the head, the upper body, the lower body, a backpack, shoes and the like. For example, if the specific part of the human body is the head, the region image feature corresponding to that part is the image feature corresponding to the head; if the specific part is a backpack, the region image feature is the image feature corresponding to the backpack.
The second pedestrian re-recognition prompt word is input into the text encoder to obtain the second text feature output by the text encoder. The second pedestrian re-recognition prompt word comprises a learnable feature, a descriptive text of the sample human body image, and a prompt text identifying a specific part of the human body.
The learnable feature included in the second pedestrian re-recognition prompt word refers to the learnable feature obtained by optimizing the first learnable feature through the steps of the above embodiment. The prompt text identifying the specific part of the human body refers to the prompt word corresponding to that part. For example, if the specific part of the human body is the head, the prompt text identifying it is "head"; if the specific part is a shoe, the prompt text identifying it is "shoes".
The learnable feature, the descriptive text of the sample human body image, and the prompt text identifying the specific part of the human body are fused together to obtain the second pedestrian re-recognition prompt word. The second pedestrian re-recognition prompt word may be "a photo of a [x1] [x2] [x3] [x4] person's {attribute}", where "{attribute}" is the prompt text identifying the specific part of the human body and is generally non-learnable information, "[x1] [x2] [x3] [x4]" is the learnable feature, and "a photo of a person" is the descriptive text of the sample human body image.
For example, if the specific part of the human body is the head, the second pedestrian re-recognition prompt word may be "a photo of a [x1] [x2] [x3] [x4] person's head", and the region image feature corresponding to that part is the image feature corresponding to the head region; if the specific part is a shoe, the second pedestrian re-recognition prompt word may be "a photo of a [x1] [x2] [x3] [x4] person's shoes", and the region image feature is the image feature corresponding to the shoe region.
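The prompt assembly described above can be sketched as follows. This is an illustrative stand-in only: the token strings, the function name, and its signature are assumptions for the sketch, not part of the embodiment.

```python
# Hypothetical sketch of fusing the learnable tokens, the descriptive text,
# and the body-part prompt text into a second pedestrian re-recognition
# prompt word. Token names "[x1]".."[x4]" are illustrative placeholders.

LEARNABLE_TOKENS = ["[x1]", "[x2]", "[x3]", "[x4]"]  # stand-ins for the learnable feature

def build_second_prompt(attribute: str) -> str:
    """Fuse the learnable tokens, the descriptive text of the sample human
    body image, and the prompt text identifying a specific body part."""
    tokens = " ".join(LEARNABLE_TOKENS)
    return f"a photo of a {tokens} person's {attribute}"

head_prompt = build_second_prompt("head")    # prompt for the head region
shoes_prompt = build_second_prompt("shoes")  # prompt for the shoe region
```

In a real system the bracketed tokens would be continuous embedding vectors inserted at the tokenizer level rather than literal strings; the string form above only illustrates the prompt structure.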
The learnable feature is optimized with the goal of aligning the region image feature and the second text feature. In particular, the learnable feature may be optimized with the goal of pixel-level alignment between the region image feature and the second text feature.
After the region image feature and the second text feature are obtained, the difference between them can be determined. The difference may be characterized by a loss function, such as L1-Loss, L2-Loss, or InfoNCE-Loss, which is not limited in this embodiment. The learnable feature is optimized with the goal of reducing the difference between the region image feature and the second text feature.
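A minimal numpy sketch of this optimization step follows. The "text encoder" here is a frozen stand-in linear map and the loss is a simple L2 difference; both are assumptions for illustration, not the actual encoder or loss of the embodiment.

```python
import numpy as np

# Sketch: only the learnable feature vector is updated so that the
# (stand-in) text encoding of the prompt moves toward the fixed region
# image feature; the encoder weights W_text stay frozen throughout.

rng = np.random.default_rng(0)
W_text = rng.standard_normal((8, 8)) * 0.1   # frozen stand-in text encoder
region_image_feat = rng.standard_normal(8)   # fixed region image feature
learnable = rng.standard_normal(8)           # learnable feature to optimize

def l2_loss(x):
    text_feat = W_text @ x                   # "encode" the learnable part of the prompt
    return float(np.sum((text_feat - region_image_feat) ** 2))

lr = 0.05
loss_before = l2_loss(learnable)
for _ in range(200):
    grad = 2 * W_text.T @ (W_text @ learnable - region_image_feat)
    learnable -= lr * grad                   # gradient step on the learnable feature only
loss_after = l2_loss(learnable)
```

An InfoNCE-style contrastive loss over a batch of (region, prompt) pairs would replace `l2_loss` in practice, but the frozen-encoder / trainable-prompt structure is the same.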
It should be noted that the specific manner of optimizing the learnable feature with the goal of aligning the region image feature and the second text feature is the same as the manner, described in the above embodiment, of optimizing the first learnable feature with the goal of aligning the first image feature and the first text feature; those skilled in the art may refer to the description of the above embodiment, which will not be repeated here.
In some embodiments, if the learnable feature obtained by optimization with the goal of reducing the difference between the region image feature and the second text feature is defined as an optimized learnable feature, the learnable feature included in the pedestrian re-recognition prompt word may be this optimized learnable feature.
With this arrangement, the learnable feature can be further optimized using fine-grained visual features such as those of specific parts of the human body, so that the text features obtained after the pedestrian re-recognition prompt word is encoded by the text encoder are further adapted to the pedestrian re-recognition task. Fine-tuning the image encoder with these text features and image features gives the image features output by the image encoder identity discriminability while making them better suited to the pedestrian re-recognition task.
As an optional implementation manner, in another embodiment of the present application, the step of extracting, from the first image feature, an area image feature corresponding to a specific part of a human body in the above embodiment may specifically include the following steps:
segmenting images of different human body parts from a sample human body image; the region image features are determined based on the first image features and images corresponding to the specific part of the human body in the images of the different human body parts.
Specifically, image segmentation processing can be performed on the sample human body image to obtain images of different human body parts. For example, if the sample human body image includes the head, the upper body, the lower body, a backpack, and shoes, performing image segmentation on it yields an image of the head, an image of the upper body, an image of the lower body, an image of the backpack, and an image of the shoes.
The first image feature is a feature of the sample human body image as a whole. In the embodiment of the present application, the image feature corresponding to the image of the specific part of the human body is extracted from the first image feature as the region image feature. For example, if the specific part of the human body is the upper body, the image feature corresponding to the image of the upper body is extracted from the first image feature as the region image feature.
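One common way to realize this extraction is masked average pooling over a spatial feature map, sketched below; the feature-map shape and the mask layout are assumptions for illustration.

```python
import numpy as np

# Illustrative sketch: pool the first image feature map over the pixels
# that the segmentation assigns to a specific body part, yielding the
# region image feature for that part.

H, W, C = 4, 4, 3
rng = np.random.default_rng(1)
first_image_feature = rng.standard_normal((H, W, C))  # per-pixel features
upper_body_mask = np.zeros((H, W), dtype=bool)
upper_body_mask[0:2, :] = True                        # assume top half = upper body

def region_feature(feat_map, mask):
    """Masked average pooling: average only the features inside the region."""
    return feat_map[mask].mean(axis=0)

upper_body_feat = region_feature(first_image_feature, upper_body_mask)  # shape (C,)
```

The same helper would be called once per body part (head, backpack, shoes, ...), with the corresponding segmentation mask, to produce each region image feature.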
In the above embodiment, fine-grained visual features can be extracted by performing image segmentation, so that the learnable feature can be further optimized based on these fine-grained visual features.
As an optional implementation manner, in another embodiment of the present application, the step of segmenting images of different human body parts from the sample human body image in the above embodiment may specifically include the following steps:
and segmenting images of different human body parts in the sample human body image through the human body region segmentation model.
Specifically, the sample human body image may be input into a human body region segmentation model trained in advance, so as to obtain images of different human body parts in the sample human body image output by the human body region segmentation model.
The human body region segmentation model is trained by taking human body images as training samples and the images of different human body parts in those images as training labels. Specifically, a large number of human body images can be collected as training samples, and the different human body parts in each human body image are annotated as training labels.
The specific training process is as follows:
The training sample is input into the human body region segmentation model to obtain the prediction result output by the model. The loss value of the model is determined by comparing its prediction result with the training label, and the parameters of the model are adjusted with the goal of reducing this loss value. The training process is repeated until the loss value is smaller than a set value, which may be set according to actual conditions and is not limited in this embodiment.
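The loop above (forward pass, compare with label, update, repeat until the loss is below the set value) can be sketched with a toy one-parameter model; the model, learning rate, and set value are stand-ins chosen only to make the loop runnable.

```python
# Schematic training loop with a stand-in model: predict, compute the loss
# against the training label, take a gradient step, and repeat until the
# loss value falls below the set value.

class ToyModel:
    def __init__(self):
        self.w = 0.0                         # single stand-in parameter

    def update(self, x, y, lr=0.1):
        pred = self.w * x                    # forward pass (prediction)
        loss = (pred - y) ** 2               # compare prediction with label
        self.w -= lr * 2 * (pred - y) * x    # adjust parameters to reduce loss
        return loss

model = ToyModel()
set_value = 1e-4                             # stop once loss < set value
loss = float("inf")
while loss >= set_value:
    loss = model.update(1.0, 2.0)            # one (training sample, label) pair
```

A real segmentation model would use a pixel-wise loss (e.g. cross-entropy over part labels) and mini-batches, but the stopping criterion and update structure are the same.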
The human body region segmentation model can be trained based on any neural network model, or based on a pre-trained model, such as a large pre-trained model similar to ChatGPT.
After the human body region segmentation model is trained, image segmentation of different human body parts can be performed based on the human body region segmentation model.
In the above embodiment, the human body region segmentation model is utilized to quickly and accurately segment images corresponding to different human body parts, so that fine-grained visual features are quickly extracted, and the learnable features are further optimized based on the fine-grained visual features.
As an optional implementation manner, another embodiment of the present application discloses that, before extracting the image features corresponding to the sample human body image through the image encoder and extracting the text features corresponding to the pedestrian re-recognition prompt word through the text encoder, the method in the above embodiment may specifically include the following steps:
Extracting second image features corresponding to the sample human body image through the image encoder, and extracting a second text feature corresponding to the second pedestrian re-recognition prompt word through the text encoder; extracting a region image feature corresponding to a specific part of the human body from the second image features; and optimizing the parameters of the image encoder with the goal of aligning the region image feature and the second text feature, to obtain an optimized image encoder.
Before the image encoder is subjected to supervised training based on the pedestrian recognition result and the pedestrian recognition tag, parameters of the image encoder can be optimized based on fine-grained visual features.
Specifically, the sample human body image is input into the image encoder to obtain the second image feature output by the image encoder. The sample human body images used when optimizing the parameters of the image encoder may be drawn from sample human body images obtained in advance; the samples used for optimizing the learnable feature, for training the pedestrian re-recognition model, and for optimizing the parameters of the image encoder may be the same portion or different portions of these pre-obtained sample human body images, which is not limited in this embodiment.
The second pedestrian re-recognition prompt word is input into the text encoder to obtain the second text feature output by the text encoder. The second pedestrian re-recognition prompt word comprises a learnable feature, a descriptive text of the sample human body image, and a prompt text identifying a specific part of the human body.
The learnable feature included in the second pedestrian re-recognition prompt word may be the learnable feature obtained by optimization with the goal of aligning the region image feature and the second text feature, or the learnable feature obtained by optimizing the first learnable feature with the goal of aligning the first image feature and the first text feature, which is not limited in this embodiment.
The descriptive text of the sample human body image in this embodiment has the same meaning as in the above embodiment, and likewise the prompt text identifying the specific part of the human body; those skilled in the art may refer to the description of the above embodiment, which will not be repeated here.
The region image feature corresponding to the specific part of the human body is extracted from the second image feature. This step is the same as extracting the region image feature corresponding to the specific part of the human body from the first image feature in the above embodiment; those skilled in the art may refer to the description of the above embodiment, which will not be repeated here.
The parameters of the image encoder are optimized with the goal of aligning the region image feature and the second text feature, to obtain an optimized image encoder.
After the region image feature and the second text feature are obtained, the difference between them can be determined. The difference may be characterized by a loss function, such as L1-Loss, L2-Loss, or InfoNCE-Loss, which is not limited in this embodiment. The parameters of the image encoder are optimized with the goal of reducing the difference between the region image feature and the second text feature.
The specific optimization process is as follows:
The second image feature corresponding to the sample human body image is extracted through the image encoder, and the second text feature corresponding to the second pedestrian re-recognition prompt word is extracted through the text encoder; the region image feature corresponding to the specific part of the human body is extracted from the second image feature; a loss value is determined based on the difference between the region image feature and the second text feature, and the parameters of the image encoder are adjusted with the goal of reducing the loss value. These steps are repeated until the loss value is smaller than a set value, at which point the optimization of the image encoder is complete. It should be noted that the set value may be set according to actual conditions and is not limited in this embodiment.
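A numpy stand-in for this encoder-side loop follows. Here the "image encoder" is a single matrix updated by gradient descent while the second text feature stays fixed (the text side is frozen, as noted below); the shapes, learning rate, and L2 loss are assumptions for illustration.

```python
import numpy as np

# Sketch: only the image-encoder parameters W_img are adjusted so that the
# region image feature W_img @ region_pixels aligns with the frozen second
# text feature; the loop stops once the loss drops below the set value.

rng = np.random.default_rng(2)
W_img = rng.standard_normal((8, 8)) * 0.1     # stand-in image encoder parameters
region_pixels = rng.standard_normal(8)        # input for the body-part region
second_text_feat = rng.standard_normal(8)     # frozen second text feature (target)

def align_loss(W):
    return float(np.sum((W @ region_pixels - second_text_feat) ** 2))

lr = 0.1 / float(region_pixels @ region_pixels)  # step size scaled for stability
set_value = 1e-6
loss = align_loss(W_img)
while loss >= set_value:
    resid = W_img @ region_pixels - second_text_feat
    W_img -= lr * 2 * np.outer(resid, region_pixels)  # update encoder only
    loss = align_loss(W_img)
```

In a real implementation the text encoder's parameters would simply be excluded from the optimizer (e.g. marked non-trainable), which is the direct analogue of updating `W_img` alone here.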
In some embodiments, if the image encoder obtained by optimizing its parameters with the goal of aligning the region image feature and the second text feature is defined as an optimized image encoder, then when the image encoder is subjected to supervised training based on the pedestrian recognition result and the pedestrian recognition label, the image features corresponding to the sample human body image are extracted through the optimized image encoder, the text features corresponding to the pedestrian re-recognition prompt word are extracted through the text encoder, and the parameters of the image encoder are optimized with the goal that the pedestrian recognition result determined based on the image features and the text features is consistent with the pedestrian recognition label corresponding to the sample human body image, the image encoder being used to construct the pedestrian re-recognition model.
In the above embodiment, the parameters of the image encoder are optimized based on the fine-grained visual characteristics, so that the characteristic characterization capability of the image encoder can be improved.
In the above embodiment, when the parameters of the image encoder are optimized, the parameters of the text encoder are frozen, i.e., the parameters of the text encoder are not adjusted.
As an optional implementation manner, another embodiment of the present application discloses a training method for a pedestrian re-recognition model in the above embodiment, which specifically may include the following steps:
Calculating the similarity between each image feature extraction result and each text feature extraction result; and determining a pedestrian recognition result according to the image feature extraction result and the text feature extraction result with the highest similarity.
Specifically, the sample human body image is input into the image encoder, and the image features output by the image encoder generally comprise a plurality of image feature extraction results; the pedestrian re-recognition prompt word is input into the text encoder, and the text features output by the text encoder generally comprise a plurality of text feature extraction results.
In the embodiment of the present application, the similarity between each image feature extraction result and each text feature extraction result is calculated; it may be calculated as an inner product or as a cosine similarity. The image feature extraction result and the text feature extraction result with the highest similarity are then determined, and the pedestrian recognition result is determined according to this highest-similarity pair.
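The similarity step can be sketched as follows; the feature vectors are illustrative, and cosine similarity is computed as the inner product of unit-normalized vectors, so the same matrix covers both options mentioned above.

```python
import numpy as np

# Sketch: compare every image feature extraction result with every text
# feature extraction result and pick the pair with the highest similarity.

def cosine_sim_matrix(img_feats, txt_feats):
    """Cosine similarity = inner product after unit normalization."""
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    return img @ txt.T

img_feats = np.array([[1.0, 0.0], [0.0, 1.0]])              # 2 image results
txt_feats = np.array([[0.9, 0.1], [0.2, 0.8], [-1.0, 0.0]]) # 3 text results
sims = cosine_sim_matrix(img_feats, txt_feats)               # shape (2, 3)
best_img, best_txt = np.unravel_index(np.argmax(sims), sims.shape)
# the (image, text) pair with the highest similarity drives the recognition result
```

With raw (unnormalized) features, dropping the two normalization lines yields the plain inner-product variant.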
In the above embodiment, the pedestrian recognition result may be determined by calculating the similarity, so that the image encoder may be trained in a supervised manner based on the pedestrian recognition result and the pedestrian recognition tag, so that the image encoder has the capability of recognizing pedestrian information.
Based on the above embodiments, the training of the pedestrian re-recognition model can be divided into two parts, one part is the optimization of the learnable features, and the other part is the adjustment of the image encoder parameters.
FIG. 2 is a schematic diagram of the optimization of the learnable feature. As shown in FIG. 2, a sample human body image may be input into an image encoder (Image Encoder) to obtain the first image feature (Image Embeddings 1) output by the image encoder, and the first pedestrian re-recognition prompt word is input into a text encoder (Text Encoder) to obtain the first text feature (Text Embeddings) output by the text encoder; the first learnable feature is optimized with the goal of aligning the first image feature with the first text feature, yielding the learnable feature (loss L_stage1).
Human parsing (Human Parsing) is performed on the sample human body image to obtain the image corresponding to the specific part of the human body, and the region image feature (Visual Feature Map) is obtained based on this image and the first image feature; the second pedestrian re-recognition prompt word is input into the text encoder to obtain the second text feature (Text Feature Map) output by the text encoder; and the learnable feature is optimized with the goal of aligning the region image feature and the second text feature, yielding the optimized learnable feature (loss L_part).
FIG. 3 is a schematic diagram of the parameter adjustment of the image encoder. As shown in FIG. 3, a sample human body image may be input into the image encoder (Image Encoder) to obtain the image features (Image Embeddings) output by the image encoder; human parsing (Human Parsing) is performed on the sample human body image to obtain the image corresponding to the specific part of the human body, and the region image feature (Visual Feature Map) is obtained based on this image and the image features. The second pedestrian re-recognition prompt word is input into the text encoder (Text Encoder) to obtain the second text feature (Text Feature Map) output by the text encoder, and the parameters of the image encoder are optimized with the goal of aligning the region image feature and the second text feature, yielding the optimized image encoder (loss L_align).
The pedestrian re-recognition prompt word is input into the text encoder to obtain the text features (Text Embeddings) output by the text encoder, the pedestrian recognition result is determined by calculating inner products, and the parameters of the image encoder are optimized with the goal that the pedestrian recognition result is consistent with the pedestrian recognition label corresponding to the sample human body image, yielding the optimized image encoder (loss L_stage2).
The embodiment of the application provides a pedestrian re-identification method which can be executed by electronic equipment, wherein the electronic equipment can be any equipment with data and instruction processing functions, such as a computer, an intelligent terminal, a server and the like. Referring to fig. 4, the method includes:
s201, inputting the image to be detected into a pre-trained pedestrian re-recognition model to obtain a detection result output by the pedestrian re-recognition model.
Specifically, the image to be detected may be input into a pre-trained pedestrian re-recognition model, and the pedestrian re-recognition model processes the image to be detected and outputs a detection result.
The image encoder of the pedestrian re-recognition model is obtained by training the training method of the pedestrian re-recognition model according to any one of the embodiments.
Corresponding to the training method of the pedestrian re-recognition model, the embodiment of the application also discloses a training device of the pedestrian re-recognition model, and referring to fig. 5, the device comprises:
The extraction module 100 is configured to extract, through an image encoder, the image features corresponding to the sample human body image, and extract, through a text encoder, the text features corresponding to the pedestrian re-recognition prompt word; the pedestrian re-recognition prompt word comprises a learnable feature and a descriptive text of the sample human body image, wherein the learnable feature is determined by aligning the image features output by the image encoder with the text features output by the text encoder;
The optimizing module 110 is configured to optimize the parameters of the image encoder with the goal that the pedestrian recognition result determined based on the image features and the text features is consistent with the pedestrian recognition label corresponding to the sample human body image, the image encoder being used to construct the pedestrian re-recognition model.
Further, in the training device of the pedestrian re-recognition model, the optimizing module 110 is further configured to:
Extracting a first image feature corresponding to the sample human body image through the image encoder, and extracting a first text feature corresponding to the first pedestrian re-recognition prompt word through the text encoder; the first pedestrian re-recognition prompt word comprises a first learnable feature and a descriptive text of the sample human body image; and optimizing the first learnable feature with the goal of aligning the first image feature with the first text feature, to obtain the learnable feature.
Further, in the training device of the pedestrian re-recognition model, the optimizing module 110 is further configured to:
Extracting a region image feature corresponding to a specific part of the human body from the first image feature, and extracting a second text feature corresponding to a second pedestrian re-recognition prompt word through the text encoder; the second pedestrian re-recognition prompt word comprises a learnable feature, a descriptive text of the sample human body image, and a prompt text identifying the specific part of the human body; and optimizing the learnable feature with the goal of aligning the region image feature and the second text feature.
Further, in the training device of the pedestrian re-recognition model, the optimization module 110 is specifically configured to:
segmenting images of different human body parts from a sample human body image; the region image features are determined based on the first image features and images corresponding to the specific part of the human body in the images of the different human body parts.
Further, in the training device of the pedestrian re-recognition model, the optimization module 110 is specifically configured to:
Segmenting images of different human body parts in the sample human body image through a human body region segmentation model; the human body region segmentation model is trained by taking human body images as training samples and the images of different human body parts in those images as training labels.
Further, in the training device of the pedestrian re-recognition model, the optimizing module 110 is further configured to:
Before extracting the image features corresponding to the sample human body image through the image encoder and the text features corresponding to the pedestrian re-recognition prompt word through the text encoder: extracting second image features corresponding to the sample human body image through the image encoder, and extracting a second text feature corresponding to a second pedestrian re-recognition prompt word through the text encoder, wherein the second pedestrian re-recognition prompt word comprises a learnable feature, a descriptive text of the sample human body image, and a prompt text identifying a specific part of the human body; extracting a region image feature corresponding to the specific part of the human body from the second image features; and optimizing the parameters of the image encoder with the goal of aligning the region image feature and the second text feature, to obtain an optimized image encoder.
Further, in the training device of the pedestrian re-recognition model, the image features include a plurality of image feature extraction results, the text features include a plurality of text feature extraction results, and the optimization module 110 is specifically configured to:
Calculating the similarity between each image feature extraction result and each text feature extraction result; and determining a pedestrian recognition result according to the image feature extraction result and the text feature extraction result with the highest similarity.
Specifically, the training device for the pedestrian re-recognition model provided in this embodiment belongs to the same application concept as the training method for the pedestrian re-recognition model provided in the foregoing embodiment of the present application, and the training method for the pedestrian re-recognition model provided in any of the foregoing embodiments of the present application may be executed, and has functional units and beneficial effects corresponding to the training method for the pedestrian re-recognition model. Technical details not described in detail in the present embodiment may refer to specific processing content of the training method of the pedestrian re-recognition model provided in the foregoing embodiment of the present application, and will not be described herein.
The functions implemented by the above modules may be implemented by the same or different processors, respectively, and embodiments of the present application are not limited.
Corresponding to the pedestrian re-recognition method, the embodiment of the application also discloses a pedestrian re-recognition device, as shown in fig. 6, which comprises:
The pedestrian re-recognition module 200 is configured to input an image to be detected into a pre-trained pedestrian re-recognition model, and obtain a detection result output by the pedestrian re-recognition model; the image encoder of the pedestrian re-recognition model is trained by the training method of the pedestrian re-recognition model of any one of the above embodiments.
Specifically, the pedestrian re-recognition device provided in this embodiment belongs to the same application concept as the pedestrian re-recognition method provided in the foregoing embodiment of the present application, and the pedestrian re-recognition method provided in any of the foregoing embodiments of the present application may be executed, which has a functional unit and beneficial effects corresponding to the execution of the pedestrian re-recognition method. Technical details not described in detail in the present embodiment may refer to specific processing content of the pedestrian re-recognition method provided in the foregoing embodiment of the present application, and will not be described herein.
It will be appreciated that the elements of the above apparatus may be implemented in the form of processor-invoked software. For example, the device includes a processor, where the processor is connected to a memory, and the memory stores instructions, and the processor invokes the instructions stored in the memory to implement any of the methods above or to implement functions of each unit of the device, where the processor may be a general-purpose processor, such as a CPU or a microprocessor, and the memory may be a memory within the device or a memory outside the device. Or the units in the device may be implemented in the form of hardware circuits, and the functions of some or all of the units may be implemented by designing a hardware circuit, where the hardware circuit may be understood as one or more processors; for example, in one implementation, the hardware circuit is an ASIC, and the functions of some or all of the above units are implemented by designing the logic relationships of the elements in the circuit; for another example, in another implementation, the hardware circuit may be implemented by a PLD, for example, an FPGA may include a large number of logic gates, and the connection relationship between the logic gates is configured by a configuration file, so as to implement the functions of some or all of the above units. All units of the above device may be realized in the form of processor calling software, or in the form of hardware circuits, or in part in the form of processor calling software, and in the rest in the form of hardware circuits.
In an embodiment of the present application, the processor is a circuit with signal processing capability, and in an implementation, the processor may be a circuit with instruction reading and running capability, such as a CPU, a microprocessor, a GPU, or a DSP, etc.; in another implementation, the processor may implement a function through a logical relationship of hardware circuitry that is fixed or reconfigurable, e.g., a hardware circuit implemented by the processor as an ASIC or PLD, such as an FPGA, or the like. In the reconfigurable hardware circuit, the processor loads the configuration document, and the process of implementing the configuration of the hardware circuit may be understood as a process of loading instructions by the processor to implement the functions of some or all of the above units. Furthermore, a hardware circuit designed for artificial intelligence may be provided, which may be understood as an ASIC, such as NPU, TPU, DPU, etc.
It will be seen that each of the units in the above apparatus may be one or more processors (or processing circuits) configured to implement the above method, for example: CPU, GPU, NPU, TPU, DPU, microprocessors, DSP, ASIC, FPGA, or a combination of at least two of these processor forms.
Furthermore, the units in the above apparatus may be integrated together in whole or in part, or may be implemented independently. In one implementation, these units are integrated together and implemented in the form of an SOC. The SOC may include at least one processor for implementing any of the methods above or the functions of the units of the apparatus, and the processors may be of different types, for example a CPU and an FPGA, a CPU and an artificial intelligence processor, or a CPU and a GPU.
The embodiment of the application also provides a control device comprising a processor and an interface circuit, wherein the processor in the control device is connected to an input/output component through the interface circuit of the control device.
The input/output component is a hardware module through which a user inputs information and to which information is output to the user, and may be, for example, a microphone, a keyboard, a handwriting pad, a touch screen, a display, a speaker, or a printer.
The interface circuit may be any interface circuit capable of implementing a data communication function, for example a USB interface circuit, a Type-C interface circuit, a serial interface circuit, or a PCIe interface circuit.
The processor in the control device is a circuit with signal processing capability, capable of executing the training method of the pedestrian re-recognition model and/or the pedestrian re-recognition method described in any of the above embodiments. For specific implementations of the processor, reference may be made to the description above; embodiments of the present application are not strictly limited in this respect.
When the control device is applied to a device with a human-computer interaction function, the input/output components of the control device may be the input and output components of that device, such as a microphone, a keyboard, a handwriting pad, a touch screen, a display, or an audio player. The processor of the control device may be the CPU or GPU of the device, and the interface circuit of the control device may be the interface circuit between the information input components of the device and the processor such as the CPU or GPU.
The embodiment of the application also discloses an electronic device corresponding to the training method of the pedestrian re-recognition model and/or the pedestrian re-recognition method. As shown in fig. 7, the electronic device includes:
A memory 300 and a processor 310;
Wherein the memory 300 is connected to the processor 310 for storing a program;
the processor 310 is configured to implement the training method of the pedestrian re-recognition model and/or the pedestrian re-recognition method disclosed in any of the above embodiments by running the program stored in the memory 300.
Specifically, the electronic device may further include: a bus, a communication interface 320, an input device 330, and an output device 340.
The processor 310, the memory 300, the communication interface 320, the input device 330 and the output device 340 are interconnected by a bus. Wherein:
a bus may comprise a path that communicates information between components of a computer system.
The processor 310 may be a general-purpose processor, such as a general-purpose central processing unit (CPU) or a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the program of the present application. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
Processor 310 may include a host processor, and may also include a baseband chip, modem, and the like.
The memory 300 stores a program for implementing the technical solution of the present application, and may also store an operating system and other key services. In particular, the program may include program code, and the program code may include computer operation instructions. More specifically, the memory 300 may include read-only memory (ROM) or other types of static storage devices capable of storing static information and instructions, random access memory (RAM) or other types of dynamic storage devices capable of storing information and instructions, disk storage, flash memory, and the like.
The input device 330 may include means for receiving data and information entered by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor, among others.
Output device 340 may include means, such as a display screen, printer, speakers, etc., that allow information to be output to a user.
The communication interface 320 may include a device that uses any type of transceiver to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
The processor 310 executes the program stored in the memory 300 and invokes other devices, which may be used to implement the steps of the training method of the pedestrian re-recognition model and/or the pedestrian re-recognition method provided by the above embodiments of the present application.
In addition to the methods and apparatus described above, embodiments of the present application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the training method of the pedestrian re-recognition model provided by the above embodiments, and/or the steps of the pedestrian re-recognition method.
The computer program product may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium, on which computer program instructions are stored, which, when executed by a processor, cause the processor to perform the training method of the pedestrian re-recognition model provided by the above embodiments, and/or the respective steps of the pedestrian re-recognition method.
The computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
For the foregoing method embodiments, for simplicity of description, the methods are presented as a series of actions; however, those of ordinary skill in the art will appreciate that the present application is not limited by the order of the actions described, as some steps may, in accordance with the present application, be performed in other orders or concurrently. Furthermore, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the actions and units involved are not necessarily required by the present application.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts, the embodiments may be referred to one another. Since the apparatus embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, reference may be made to the description of the method embodiments.
The steps in the methods of the embodiments of the application may be reordered, combined, and deleted according to actual needs, and the technical features described in the embodiments may be replaced or combined.
In the embodiments of the present application, the units and sub-units in the terminal may be combined, divided, and deleted according to actual needs.
In the embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus, and method may be implemented in other manners. For example, the terminal embodiments described above are merely illustrative; the division of units or sub-units is merely a logical functional division, and in actual implementation there may be other divisions, for example multiple sub-units or units may be combined or integrated into another unit, or some features may be omitted or not performed. In addition, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections via interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units or sub-units described as separate components may or may not be physically separate, and components shown as units or sub-units may or may not be physical units or sub-units; that is, they may be located in one place or distributed over a plurality of network units or sub-units. Some or all of the units or sub-units may be selected according to actual needs to achieve the purpose of the embodiments.
In addition, each functional unit or sub-unit in the embodiments of the present application may be integrated in one processing unit, or each unit or sub-unit may exist alone physically, or two or more units or sub-units may be integrated in one unit. The integrated units or sub-units described above may be implemented either in hardware or in software functional units or sub-units.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A training method of a pedestrian re-recognition model, comprising:
Extracting image features corresponding to the sample human body image through an image encoder, and extracting text features corresponding to the pedestrian re-recognition prompt word through a text encoder; wherein the pedestrian re-recognition prompt word comprises a learnable feature and a descriptive text of the sample human body image, and the learnable feature is determined by aligning image features output by the image encoder and text features output by the text encoder;
And optimizing parameters of the image encoder by taking the consistency of the pedestrian recognition result determined based on the image features and the text features and the pedestrian recognition label corresponding to the sample human body image as targets, wherein the image encoder is used for constructing and obtaining a pedestrian re-recognition model.
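As an illustrative sketch only (the claim does not prescribe a particular loss function), the alignment between image features and text features described above can be expressed as a CLIP-style symmetric contrastive objective. The function names, the temperature value, and the assumption that row i of each feature matrix describes the same identity are all illustrative assumptions, not part of the claimed method:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize feature vectors to unit length."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_alignment_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss aligning each image feature with the
    text feature of its matching prompt; row i of both matrices is assumed
    to correspond to the same pedestrian identity."""
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_feats)
    logits = img @ txt.T / temperature           # pairwise cosine similarities
    labels = np.arange(len(logits))              # matching pairs on the diagonal

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # cross-entropy in both directions (image-to-text and text-to-image)
    return 0.5 * (xent(logits) + xent(logits.T))
```

In practice the loss would be minimized with respect to both the learnable prompt feature and the image encoder parameters; the sketch only shows the objective being evaluated.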
2. The method according to claim 1, wherein the method further comprises:
Extracting first image features corresponding to the sample human body image through the image encoder, and extracting first text features corresponding to the first pedestrian re-recognition prompt word through the text encoder; wherein the first pedestrian re-recognition prompt word comprises a first learnable feature and the descriptive text of the sample human body image;
and optimizing the first learnable feature with the goal of aligning the first image features with the first text features, to obtain the learnable feature.
3. The method according to claim 2, wherein the method further comprises:
Extracting region image features corresponding to a specific part of the human body from the first image features, and extracting second text features corresponding to a second pedestrian re-recognition prompt word through the text encoder; wherein the second pedestrian re-recognition prompt word comprises the learnable feature, the descriptive text of the sample human body image, and a prompt text identifying the specific part of the human body;
and optimizing the learnable feature with the goal of aligning the region image features with the second text features.
4. A method according to claim 3, wherein the extracting the region image feature corresponding to the specific part of the human body from the first image feature comprises:
segmenting images of different human body parts from the sample human body image;
And determining the regional image characteristics based on the first image characteristics and the images corresponding to the specific part of the human body in the images of the different human body parts.
5. The method of claim 4, wherein segmenting the image of the different body parts from the sample body image comprises:
dividing images of different human body parts in the sample human body image through a human body region division model;
The human body region segmentation model is trained by taking human body images as training samples and taking the segmentation of the images of different human body parts from the human body images as the training target.
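As a hedged sketch of one way to realize the region feature determination of claims 4 and 5 (the claims do not fix a pooling scheme), the region image feature can be computed by masked average pooling of the image feature map over the pixels that a part-segmentation model assigns to the target part. The shapes and names below are assumptions for illustration:

```python
import numpy as np

def region_feature(feature_map, part_mask, part_id):
    """Masked average pooling of an image feature map over one body part.

    feature_map: (H, W, C) per-pixel image features
    part_mask:   (H, W) integer part labels from a segmentation model
    part_id:     label of the specific human body part of interest
    """
    mask = part_mask == part_id
    if not mask.any():                       # part not visible in this image
        return np.zeros(feature_map.shape[-1])
    return feature_map[mask].mean(axis=0)    # average features of part pixels
```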
6. The method according to claim 1, wherein before extracting, by the image encoder, the image features corresponding to the sample human body image, and extracting, by the text encoder, the text features corresponding to the pedestrian re-recognition prompt, further comprises:
Extracting second image features corresponding to the sample human body image through the image encoder, and extracting second text features corresponding to a second pedestrian re-recognition prompt word through the text encoder; wherein the second pedestrian re-recognition prompt word comprises the learnable feature, the descriptive text of the sample human body image, and a prompt text identifying a specific part of the human body;
Extracting region image features corresponding to the specific part of the human body from the second image features;
And optimizing parameters of the image encoder by aiming at aligning the regional image features with the second text features to obtain an optimized image encoder.
7. The method of claim 1, wherein the image features comprise a plurality of image feature extraction results and the text features comprise a plurality of text feature extraction results;
the method further comprises the steps of:
Calculating the similarity between each image feature extraction result and each text feature extraction result;
And determining the pedestrian recognition result according to the image feature extraction result and the text feature extraction result with the highest similarity.
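A minimal sketch of the similarity computation and selection described in claim 7, assuming cosine similarity (the claim does not specify the similarity measure) and one feature vector per extraction result; the function name and return format are illustrative:

```python
import numpy as np

def best_match(image_results, text_results):
    """Compute the similarity between every image feature extraction result
    and every text feature extraction result, and return the indices (and
    similarity score) of the highest-similarity pair, from which the
    pedestrian recognition result would be determined."""
    img = image_results / np.linalg.norm(image_results, axis=1, keepdims=True)
    txt = text_results / np.linalg.norm(text_results, axis=1, keepdims=True)
    sim = img @ txt.T                               # cosine similarity matrix
    i, j = np.unravel_index(np.argmax(sim), sim.shape)
    return int(i), int(j), float(sim[i, j])
```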
8. A pedestrian re-recognition method, characterized by comprising:
Inputting an image to be detected into a pre-trained pedestrian re-recognition model to obtain a detection result output by the pedestrian re-recognition model; the image encoder of the pedestrian re-recognition model is trained by the training method of the pedestrian re-recognition model according to any one of claims 1 to 7.
9. A training device for a pedestrian re-recognition model, comprising:
The extraction module is used for extracting image features corresponding to the sample human body images through the image encoder and extracting text features corresponding to the pedestrian re-recognition prompt words through the text encoder; the pedestrian re-recognition prompt comprises a learnable feature and a descriptive text of the sample human body image, wherein the learnable feature is determined by aligning an image feature output by the image encoder and a text feature output by the text encoder;
And the optimization module is used for optimizing parameters of the image encoder by taking the consistency of the pedestrian recognition result determined based on the image features and the text features and the pedestrian recognition label corresponding to the sample human body image as a target, and the image encoder is used for constructing and obtaining a pedestrian re-recognition model.
10. A pedestrian re-recognition device, characterized by comprising:
the pedestrian re-recognition module is used for inputting the image to be detected into a pre-trained pedestrian re-recognition model to obtain a detection result output by the pedestrian re-recognition model; the image encoder of the pedestrian re-recognition model is trained by the training method of the pedestrian re-recognition model according to any one of claims 1 to 7.
11. An electronic device, comprising:
A memory and a processor;
Wherein the memory is used for storing programs;
The processor is configured to implement the method according to any one of claims 1 to 7 and/or the method according to claim 8 by running a program in the memory.
12. A computer program product, characterized in that it comprises a computer program or computer instructions which, when executed by a processor, implement the method according to any one of claims 1 to 7 and/or implement the method according to claim 8.
CN202410155047.9A 2024-02-02 2024-02-02 Training method of pedestrian re-recognition model, pedestrian re-recognition method and related device Pending CN118155275A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410155047.9A CN118155275A (en) 2024-02-02 2024-02-02 Training method of pedestrian re-recognition model, pedestrian re-recognition method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410155047.9A CN118155275A (en) 2024-02-02 2024-02-02 Training method of pedestrian re-recognition model, pedestrian re-recognition method and related device

Publications (1)

Publication Number Publication Date
CN118155275A true CN118155275A (en) 2024-06-07

Family

ID=91293850

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410155047.9A Pending CN118155275A (en) 2024-02-02 2024-02-02 Training method of pedestrian re-recognition model, pedestrian re-recognition method and related device

Country Status (1)

Country Link
CN (1) CN118155275A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination