WO2019153830A1 - Pedestrian re-identification method and apparatus, electronic device and storage medium - Google Patents

Pedestrian re-identification method and apparatus, electronic device and storage medium

Info

Publication number
WO2019153830A1
Authority
WO
WIPO (PCT)
Prior art keywords
candidate
target
feature vector
video
target video
Prior art date
Application number
PCT/CN2018/116600
Other languages
English (en)
French (fr)
Inventor
陈大鹏
李鸿升
肖桐
伊帅
王晓刚
Original Assignee
北京市商汤科技开发有限公司
Priority date
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司
Priority to KR1020197038764A (KR102348002B1)
Priority to SG11201913733QA
Priority to JP2019570048A (JP6905601B2)
Publication of WO2019153830A1
Priority to US16/726,878 (US11301687B2)
Priority to PH12020500050A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/469 Contour-based spatial representations, e.g. vector-coding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/48 Matching video sequences
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding, the unit being an image region, e.g. an object
    • H04N19/172 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding, the unit being an image region, e.g. an object, the region being a picture, frame or field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/2343 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs, involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs, involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Definitions

  • The embodiments of the present application relate to the field of image processing technology, and in particular to a pedestrian re-identification method, apparatus, electronic device, and storage medium.
  • Pedestrian re-identification is a key technology in intelligent video surveillance systems. It aims to measure the similarity between a given target video and candidate videos, so as to find, among a large number of candidate videos, those that contain the same pedestrian as the target video.
  • Current pedestrian re-identification methods mainly encode a complete video and use the encoding result to measure the similarity between the entire target video and the entire candidate video; the resulting re-identification performance is poor.
  • The embodiments of the present application provide a technical solution for pedestrian re-identification.
  • According to a first aspect of the embodiments of the present application, a pedestrian re-identification method is provided, including: acquiring a target video containing a target pedestrian and at least one candidate video; separately encoding each target video segment in the target video and each candidate video segment in the at least one candidate video; determining, according to the encoding results, a similarity score between each target video segment and each candidate video segment, where the similarity score characterizes the degree of similarity between pedestrian features in the target video segment and in the candidate video segment; and performing pedestrian re-identification on the at least one candidate video according to the similarity scores.
  • In one embodiment, separately encoding each target video segment and each candidate video segment includes: acquiring a first target feature vector and a second target feature vector of each target video frame in each target video segment and an index feature vector of each target video segment, and acquiring a first candidate feature vector and a second candidate feature vector of each candidate video frame in each candidate video segment; generating attention weight vectors according to the index feature vector, the first target feature vectors, and the first candidate feature vectors; and obtaining the encoding result of each target video segment and the encoding result of each candidate video segment according to the attention weight vectors, the second target feature vectors, and the second candidate feature vectors.
  • In one embodiment, acquiring these feature vectors includes: separately extracting an image feature vector of each target video frame and an image feature vector of each candidate video frame; generating the first target feature vector and the second target feature vector of each target video frame and the index feature vector of each target video segment from the image feature vectors of the target video frames; and generating the first candidate feature vector and the second candidate feature vector of each candidate video frame from the image feature vectors of the candidate video frames.
  • In one embodiment, generating the attention weight vectors according to the index feature vector, the first target feature vectors, and the first candidate feature vectors includes: generating a target attention weight vector of each target video frame according to the index feature vector and the first target feature vector, and generating a candidate attention weight vector of each candidate video frame according to the index feature vector and the first candidate feature vector.
  • In one embodiment, generating the target attention weight vector of each target video frame includes: generating a target heat map of each target video frame according to the index feature vector and the first target feature vector of each target video frame, and normalizing the target heat map to obtain the target attention weight vector of each target video frame; and/or generating the candidate attention weight vector of each candidate video frame includes: generating a candidate heat map of each candidate video frame according to the index feature vector and the first candidate feature vector of each candidate video frame, and normalizing the candidate heat map to obtain the candidate attention weight vector of each candidate video frame.
  • In one embodiment, obtaining the encoding results according to the attention weight vectors, the second target feature vectors, and the second candidate feature vectors includes: obtaining the encoding result of each target video segment according to the target attention weight vector and the second target feature vector of each target video frame, and obtaining the encoding result of each candidate video segment according to the candidate attention weight vector and the second candidate feature vector of each candidate video frame.
  • In one embodiment, obtaining the encoding result of each target video segment includes: multiplying the target attention weight vector of each target video frame by the second target feature vector of the respective target video frame, and adding the multiplication results of the target video frames in the time dimension to obtain the encoding result of each target video segment; and/or obtaining the encoding result of each candidate video segment includes: multiplying the candidate attention weight vector of each candidate video frame by the second candidate feature vector of the respective candidate video frame, and adding the multiplication results of the candidate video frames in the time dimension to obtain the encoding result of each candidate video segment.
  • In one embodiment, determining the similarity score between each target video segment and each candidate video segment according to the encoding results includes: successively performing a subtraction operation between the encoding result of each target video segment and the encoding result of each candidate video segment; squaring the result of the subtraction in every dimension; performing a fully connected operation on the feature vector obtained by the square operation to obtain a two-dimensional feature vector; and normalizing the two-dimensional feature vector to obtain the similarity score between each target video segment and each candidate video segment.
  • In one embodiment, performing pedestrian re-identification on the at least one candidate video according to the similarity scores includes: for the candidate video segments of each of the at least one candidate video, adding the highest-scoring similarity scores within a preset proportion threshold as the similarity score of that candidate video; arranging the similarity scores of the candidate videos in descending order; and determining the top one or several candidate videos as videos containing the same target pedestrian as the target video.
  • According to a second aspect of the embodiments of the present application, a pedestrian re-identification apparatus is provided, including: an acquisition module configured to acquire a target video containing a target pedestrian and at least one candidate video; an encoding module configured to separately encode each target video segment in the target video and each candidate video segment in the at least one candidate video; a determining module configured to determine, according to the encoding results, a similarity score between each target video segment and each candidate video segment, where the similarity score characterizes the degree of similarity between pedestrian features in the target video segment and in the candidate video segment; and an identification module configured to perform pedestrian re-identification on the at least one candidate video according to the similarity scores.
  • In one embodiment, the encoding module includes: a feature vector acquisition module configured to acquire a first target feature vector and a second target feature vector of each target video frame in each target video segment and an index feature vector of each target video segment, and to acquire a first candidate feature vector and a second candidate feature vector of each candidate video frame in each candidate video segment; a weight vector generation module configured to generate attention weight vectors according to the index feature vector, the first target feature vectors, and the first candidate feature vectors; and an encoding result acquisition module configured to obtain the encoding result of each target video segment and the encoding result of each candidate video segment according to the attention weight vectors, the second target feature vectors, and the second candidate feature vectors.
  • In one embodiment, the feature vector acquisition module is configured to separately extract an image feature vector of each target video frame and an image feature vector of each candidate video frame; to generate the first target feature vector and the second target feature vector of each target video frame and the index feature vector of each target video segment from the image feature vectors of the target video frames; and to generate the first candidate feature vector and the second candidate feature vector of each candidate video frame from the image feature vectors of the candidate video frames.
  • In one embodiment, the weight vector generation module is configured to generate the target attention weight vector of each target video frame according to the index feature vector and the first target feature vector, and to generate the candidate attention weight vector of each candidate video frame according to the index feature vector and the first candidate feature vector.
  • In one embodiment, the weight vector generation module is configured to generate a target heat map of each target video frame according to the index feature vector and the first target feature vector of each target video frame, and to normalize the target heat map to obtain the target attention weight vector of each target video frame; and/or to generate a candidate heat map of each candidate video frame according to the index feature vector and the first candidate feature vector of each candidate video frame, and to normalize the candidate heat map to obtain the candidate attention weight vector of each candidate video frame.
  • In one embodiment, the encoding result acquisition module is configured to obtain the encoding result of each target video segment according to the target attention weight vector and the second target feature vector of each target video frame, and to obtain the encoding result of each candidate video segment according to the candidate attention weight vector and the second candidate feature vector of each candidate video frame.
  • In one embodiment, the encoding result acquisition module is configured to multiply the target attention weight vector of each target video frame by the second target feature vector of the respective target video frame and add the multiplication results of the target video frames in the time dimension to obtain the encoding result of each target video segment; and/or to multiply the candidate attention weight vector of each candidate video frame by the second candidate feature vector of the respective candidate video frame and add the multiplication results of the candidate video frames in the time dimension to obtain the encoding result of each candidate video segment.
  • In one embodiment, the determining module is configured to successively perform a subtraction operation between the encoding result of each target video segment and the encoding result of each candidate video segment; square the result of the subtraction in every dimension; perform a fully connected operation on the feature vector obtained by the square operation to obtain a two-dimensional feature vector; and normalize the two-dimensional feature vector to obtain the similarity score between each target video segment and each candidate video segment.
  • In one embodiment, the identification module is configured to, for the candidate video segments of each of the at least one candidate video, add the highest-scoring similarity scores within a preset proportion threshold as the similarity score of that candidate video; sort the similarity scores of the candidate videos in descending order; and determine the top one or several candidate videos as videos containing the same target pedestrian as the target video.
  • According to another aspect of the embodiments of the present application, an electronic device is provided, including a processor and a memory; the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to execute the pedestrian re-identification method described in the first aspect.
  • a computer readable storage medium having stored thereon a computer program, the computer program being executed by a processor to implement the pedestrian re-identification method as described in the first aspect.
  • A computer program product is provided, including at least one executable instruction that, when executed by a processor, implements the pedestrian re-identification method according to the first aspect.
  • The embodiments of the present application acquire a target video containing a target pedestrian and at least one candidate video, separately encode each target video segment in the target video and each candidate video segment in the at least one candidate video, determine a similarity score between each target video segment and each candidate video segment according to the encoding results, and perform pedestrian re-identification on the at least one candidate video according to the similarity scores. Since a video segment contains far fewer frames than the entire video, pedestrian appearance information varies much less within a video segment than across the entire video. Encoding each target video segment and each candidate video segment therefore effectively reduces the variation of pedestrian appearance information, and exploits both the diversity of pedestrian appearance information across different video frames and the dynamic correlation between video frames. This improves the utilization of pedestrian appearance information, increases the accuracy of the similarity scores computed between each target video segment and each candidate video segment, and in turn improves the accuracy of pedestrian re-identification.
  • FIG. 1 is a schematic flow chart of an embodiment of a pedestrian re-identification method according to an embodiment of the present application
  • FIG. 2 is a schematic diagram of a computing framework of an embodiment of a pedestrian re-identification method according to an embodiment of the present application
  • FIG. 3 is a schematic flow chart of another embodiment of a pedestrian re-identification method according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of an attention encoding mechanism in a pedestrian re-identification method according to an embodiment of the present application
  • FIG. 5 is a schematic structural diagram of an embodiment of a pedestrian re-identification device according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of another embodiment of a pedestrian re-identification device according to an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an embodiment of an electronic device according to an embodiment of the present application.
  • Referring to FIG. 1, a flow diagram of one embodiment of a pedestrian re-identification method in accordance with an embodiment of the present application is shown.
  • The pedestrian re-identification method of the embodiment of the present application performs the following steps through a processor of an electronic device calling relevant instructions stored in a memory.
  • Step S100 Acquire a target video including a target pedestrian and at least one candidate video.
  • the target video in the embodiment of the present application may include one or more target pedestrians, and the candidate video may include one or more candidate pedestrians or no candidate pedestrians.
  • The target video and the at least one candidate video in the embodiment of the present application may be video images from a video capture device, or may originate from other devices.
  • One of the purposes of the embodiments of the present application is to find, among the at least one candidate video, the candidate videos in which a candidate pedestrian is the same pedestrian as the target pedestrian.
  • the step S100 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by an acquisition module 50 executed by the processor.
  • Step S102 Separately encode each target video segment in the target video and each candidate video segment in the at least one candidate video.
  • Optionally, video segmentation is performed on the target video and the candidate videos to generate the target video segments in the target video and the candidate video segments in each candidate video, where each target video segment has a fixed duration and each candidate video segment has a fixed duration; the duration of the target video segments may or may not be the same as that of the candidate video segments.
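  • As a minimal illustration of this segmentation step, the sketch below cuts a video into fixed-length segments; the tensor layout, the helper name split_into_segments, and the segment length of 8 frames are illustrative assumptions, not values prescribed by the patent.

```python
import torch

def split_into_segments(video: torch.Tensor, segment_len: int = 8):
    """Cut a video of shape (T, C, H, W) into fixed-length segments.

    Trailing frames that do not fill a whole segment are dropped here;
    padding them would be an equally valid choice.
    """
    num_segments = video.shape[0] // segment_len
    return [video[i * segment_len:(i + 1) * segment_len]
            for i in range(num_segments)]  # list of (segment_len, C, H, W)
```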
  • each target video segment and each candidate video segment are respectively subjected to an encoding operation to obtain an encoding result of each target video segment and an encoding result of each candidate video segment.
  • step S102 may be performed by a processor invoking a corresponding instruction stored in the memory or by an encoding module 52 executed by the processor.
  • Step S104 Determine a similarity score between each target video segment and each candidate video segment according to the encoding result.
  • Optionally, the encoding result of each target video segment may be regarded as a representation of the pedestrian feature vector in that target video segment, and the encoding result of each candidate video segment may likewise be regarded as a representation of the pedestrian feature vector in that candidate video segment.
  • If the pedestrian feature vectors of a target video segment and a candidate video segment are the same or similar, the two segments are more likely to contain the same target pedestrian, that is, the similarity score between them is higher; if the pedestrian feature vectors differ, the two segments are less likely to contain the same target pedestrian, that is, the similarity score between them is lower.
  • step S104 may be performed by the processor invoking a corresponding instruction stored in the memory, or may be performed by the determining module 54 executed by the processor.
  • Step S106 Perform pedestrian re-identification on the at least one candidate video according to the similarity scores.
  • Optionally, the similarity score of each of the at least one candidate video may be obtained from the segment-level similarity scores, and a candidate video with a higher similarity score is determined to be a candidate video containing the same target pedestrian as the target video.
  • step S106 may be performed by the processor invoking a corresponding instruction stored in the memory or by the identification module 56 being executed by the processor.
  • the pedestrian re-identification method proposed in the embodiment of the present application can be executed under the calculation framework shown in FIG. 2.
  • First, the videos (including the target video and the at least one candidate video) are cut to generate video segments of fixed length.
  • p denotes the target video;
  • g denotes one of the at least one candidate video;
  • p_n is a target video segment in the target video p;
  • g_k is a candidate video segment in the candidate video g.
  • a deep network with a cooperative attention mechanism is utilized.
  • The deep network takes the target video segment p_n and the candidate video segment g_k as inputs, and outputs m(p_n, g_k), the similarity score between the target video segment p_n and the candidate video segment g_k.
  • In this way, a similarity score can be obtained for every pair of target and candidate video segments.
  • A competitive mechanism can then be used to select the segment-level similarity scores with higher similarity, and the similarity between the target video p and the candidate video g can be obtained by adding these selected scores, as sketched below.
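  • As a hedged sketch of this framework, the loop below enumerates every pair of a target segment p_n and a candidate segment g_k and collects the scores m(p_n, g_k); here `similarity` stands for the whole segment-pair scorer (co-attentive encoding plus the metric of step S304 below), and all names are illustrative.

```python
import torch

def score_matrix(target_segments, candidate_segments, similarity):
    """Return an (N, K) tensor of segment-pair scores m(p_n, g_k)."""
    scores = torch.zeros(len(target_segments), len(candidate_segments))
    for n, p_n in enumerate(target_segments):
        for k, g_k in enumerate(candidate_segments):
            scores[n, k] = similarity(p_n, g_k)  # deep-network similarity score
    return scores
```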
  • The embodiments of the present application acquire a target video containing a target pedestrian and at least one candidate video, separately encode each target video segment in the target video and each candidate video segment in the at least one candidate video, determine a similarity score between each target video segment and each candidate video segment according to the encoding results, and perform pedestrian re-identification on the at least one candidate video according to the similarity scores. Since a video segment contains far fewer frames than the entire video, pedestrian appearance information varies much less within a video segment than across the entire video. Encoding each target video segment and each candidate video segment therefore effectively reduces the variation of pedestrian appearance information, and exploits both the diversity of pedestrian appearance information across different video frames and the dynamic correlation between video frames. This improves the utilization of pedestrian appearance information, increases the accuracy of the similarity scores computed between each target video segment and each candidate video segment, and in turn improves the accuracy of pedestrian re-identification.
  • Referring to FIG. 3, a flow diagram of another embodiment of a pedestrian re-identification method in accordance with an embodiment of the present application is shown.
  • Step S300 Acquire a target video including a target pedestrian and at least one candidate video.
  • Step S302 Separately encode each target video segment in the target video and each candidate video segment in the at least one candidate video.
  • this step S302 may include the following steps:
  • Step S3020 Acquire a first target feature vector and a second target feature vector of each target video frame in each target video segment and an index feature vector of each target video segment, and acquire a first candidate feature vector and a second candidate feature vector of each candidate video frame in each candidate video segment.
  • Optionally, the image feature vector of each target video frame and the image feature vector of each candidate video frame may be extracted using a neural network; the image feature vector reflects image features in the video frame, such as pedestrian features, background features, and so on.
  • The first target feature vector and the second target feature vector of each target video frame and the index feature vector of each target video segment are generated from the image feature vectors of the target video frames; the index feature vector carries information of the target video segment and can effectively distinguish useful information from noise.
  • Similarly, the first candidate feature vector and the second candidate feature vector of each candidate video frame are generated from the image feature vector of that candidate video frame.
  • The first target feature vectors and the first candidate feature vectors (the "key" feature vectors) may be generated by one linear transformation of each frame feature, and the second target feature vectors and the second candidate feature vectors (the "value" feature vectors) may be generated by another linear transformation of each frame feature. A Long Short-Term Memory (LSTM) network may be used to generate the index feature vector of each target video segment from the image feature vectors of its target video frames; the index feature vector is generated by the target video segment and acts on both the target video segment itself and all candidate video segments.
  • Step S3022 Generate an attention weight vector according to the index feature vector, the first target feature vector, and the first candidate feature vector.
  • The index feature vector, the first target feature vectors, and the first candidate feature vectors are used to generate the attention weight vectors.
  • Optionally, the target attention weight vector of each target video frame may be generated according to the index feature vector and the first target feature vector. Specifically, an inner product operation is performed between the index feature vector and the first target feature vector of each target video frame to obtain the target heat map of each target video frame, and the target heat map is normalized with the softmax function in the time dimension to obtain the target attention weight vector of each target video frame.
  • Likewise, the candidate attention weight vector of each candidate video frame may be generated according to the index feature vector and the first candidate feature vector. Specifically, an inner product operation is performed between the index feature vector and the first candidate feature vector of each candidate video frame to obtain the candidate heat map of each candidate video frame, and the candidate heat map is normalized with the softmax function in the time dimension to obtain the candidate attention weight vector of each candidate video frame.
  • The attention weight vector is used to enhance effective pedestrian features during encoding; it is a weight vector carrying discriminative information, which can reduce the influence of noise.
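  • The sketch below computes attention weights under the shapes assumed in the previous sketch; correlating the index vector with each frame's "key" vector per feature dimension is only one plausible reading of the inner-product step, so treat the exact form as an assumption.

```python
import torch

def attention_weights(keys: torch.Tensor, index: torch.Tensor) -> torch.Tensor:
    """keys: (T, proj_dim) "key" vectors of one segment; index: (proj_dim,)."""
    heat_map = keys * index                  # correlation with global information
    return torch.softmax(heat_map, dim=0)    # normalize over the time dimension
```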
  • Step S3024 Obtain the encoding result of each target video segment and the encoding result of each candidate video segment according to the attention weight vectors, the second target feature vectors, and the second candidate feature vectors.
  • The second target feature vector reflects the image features of each frame in the target video segment, and the second candidate feature vector reflects the image features of each frame in the candidate video segment.
  • Optionally, the encoding result of each target video segment is obtained according to the target attention weight vector and the second target feature vector of each target video frame. Specifically, the target attention weight vector of each target video frame is multiplied by the second target feature vector of the respective target video frame, and the multiplication results of the target video frames are added in the time dimension to obtain the encoding result of each target video segment.
  • Optionally, the encoding result of each candidate video segment is obtained according to the candidate attention weight vector and the second candidate feature vector of each candidate video frame. Specifically, the candidate attention weight vector of each candidate video frame is multiplied by the second candidate feature vector of the respective candidate video frame, and the multiplication results of the candidate video frames are added in the time dimension to obtain the encoding result of each candidate video segment.
  • Step S302 of the embodiment of the present application can be implemented by an attention encoding mechanism, that is, the encoding result of a video segment (a target video segment or a candidate video segment) is obtained by refining the features of its different frames; the process is shown in FIG. 4.
  • First, a convolutional neural network feature is extracted for each target video frame of the target video segment and each candidate video frame of the candidate video segment, and the corresponding "key" feature vector and "value" feature vector of each frame are generated from the convolutional neural network feature. The "key" feature vector of each target video frame or candidate video frame is combined with the index feature vector of the target video segment by an inner product to form a heat map, which reflects the correlation of each feature within the frame with the global information. The heat map is then normalized by the softmax function in the time dimension to form an attention weight vector; the attention weight vector is multiplied by the "value" feature vector of each video frame in each dimension, and the results of the different video frames are summed in the time dimension to obtain the encoding result of each video segment.
  • Step S304 Determine a similarity score between each target video segment and each candidate video segment according to the encoding result.
  • Optionally, the encoding result of each target video segment and the encoding result of each candidate video segment are successively subjected to a subtraction operation, a square operation, a fully connected operation, and a normalization operation to obtain the similarity score between each target video segment and each candidate video segment. The encoding result of each target video segment is subtracted from the encoding result of each candidate video segment in turn, and the result is then squared in every image dimension, including but not limited to pedestrian image dimensions and background image dimensions, where the pedestrian image dimensions include a head image dimension, an upper-body image dimension, a lower-body image dimension, and the like, and the background image dimensions include a building image dimension, a street image dimension, and the like. The feature vector obtained after the square operation is passed through a fully connected layer to obtain a two-dimensional feature vector, and the similarity score between each target video segment and each candidate video segment is finally obtained by nonlinear normalization with the Sigmoid function.
  • Step S306 Perform pedestrian re-identification on the at least one candidate video according to the similarity scores.
  • Optionally, for each candidate video, the segment similarity scores that are greater than or equal to a preset threshold, or the higher-scoring similarity scores, are added together as the similarity score of that candidate video; the similarity scores of the candidate videos are arranged in descending order, and the top one or several candidate videos are determined to be videos containing the same target pedestrian as the target video.
  • The preset threshold can be set according to actual conditions, and "higher-scoring" is relative.
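  • The sketch below aggregates segment scores per candidate video and ranks the videos; the 20% top-ratio is an illustrative placeholder for the preset proportion threshold, not a value from the patent.

```python
import torch

def video_score(segment_scores: torch.Tensor, top_ratio: float = 0.2) -> float:
    """Sum the highest-scoring fraction of segment-pair similarity scores."""
    k = max(1, int(segment_scores.numel() * top_ratio))
    top_scores, _ = torch.topk(segment_scores.flatten(), k)  # competitive selection
    return top_scores.sum().item()

def rank_candidates(scores_per_video: dict) -> list:
    """Candidate video ids sorted by aggregated similarity, highest first."""
    return sorted(scores_per_video, key=scores_per_video.get, reverse=True)
```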
  • The embodiments of the present application acquire a target video containing a target pedestrian and at least one candidate video, separately encode each target video segment in the target video and each candidate video segment in the at least one candidate video, determine a similarity score between each target video segment and each candidate video segment according to the encoding results, and perform pedestrian re-identification on the at least one candidate video according to the similarity scores. Since a video segment contains far fewer frames than the entire video, pedestrian appearance information varies much less within a video segment than across the entire video. Encoding each target video segment and each candidate video segment therefore effectively reduces the variation of pedestrian appearance information, and exploits both the diversity of pedestrian appearance information across different video frames and the dynamic correlation between video frames. This improves the utilization of pedestrian appearance information, increases the accuracy of the similarity scores computed between each target video segment and each candidate video segment, and in turn improves the accuracy of pedestrian re-identification.
  • In addition, in the embodiment of the present application, the encoding result of a candidate video segment is obtained from the index feature vector of the target video segment and the "key" feature vectors of the candidate video segment; the index feature vector of the target video segment serves as guiding information, which improves the accuracy of the candidate video encoding results and hence of the similarity scores determined from them.
  • The attention weight vector of each candidate video frame is estimated using the index feature vector of the target video segment, which reduces the influence of abnormal candidate video frames on the encoding results of the candidate video segments and makes pedestrian re-identification in the candidate videos more targeted.
  • Moreover, the target video and the candidate videos are sliced, and the target video segments and candidate video segments are encoded; the candidate video segments with higher similarity scores are selected as the valid candidate video segments of a candidate video, while candidate video segments with lower similarity scores are ignored.
  • Referring to FIG. 5, a block diagram of one embodiment of a pedestrian re-identification apparatus in accordance with an embodiment of the present application is shown.
  • The pedestrian re-identification apparatus includes: an acquisition module 50 configured to acquire a target video containing a target pedestrian and at least one candidate video; an encoding module 52 configured to separately encode each target video segment in the target video and each candidate video segment in the at least one candidate video; a determining module 54 configured to determine, according to the encoding results, a similarity score between each target video segment and each candidate video segment, where the similarity score characterizes the degree of similarity between pedestrian features in the target video segment and in the candidate video segment; and an identification module 56 configured to perform pedestrian re-identification on the at least one candidate video according to the similarity scores.
  • The pedestrian re-identification apparatus of the embodiment of the present application is used to implement the corresponding pedestrian re-identification method of the above embodiments and has the beneficial effects of the corresponding method embodiments; details are not repeated here.
  • Referring to FIG. 6, a block diagram of another embodiment of a pedestrian re-identification apparatus in accordance with an embodiment of the present application is shown.
  • The pedestrian re-identification apparatus includes: an acquisition module 60 configured to acquire a target video containing a target pedestrian and at least one candidate video; an encoding module 62 configured to separately encode each target video segment in the target video and each candidate video segment in the at least one candidate video; a determining module 64 configured to determine, according to the encoding results, a similarity score between each target video segment and each candidate video segment, where the similarity score characterizes the degree of similarity between pedestrian features in the target video segment and in the candidate video segment; and an identification module 66 configured to perform pedestrian re-identification on the at least one candidate video according to the similarity scores.
  • In one embodiment, the encoding module 62 includes: a feature vector acquisition module 620 configured to acquire a first target feature vector and a second target feature vector of each target video frame in each target video segment and an index feature vector of each target video segment, and to acquire a first candidate feature vector and a second candidate feature vector of each candidate video frame in each candidate video segment; a weight vector generation module 622 configured to generate attention weight vectors according to the index feature vector, the first target feature vectors, and the first candidate feature vectors; and an encoding result acquisition module 624 configured to obtain the encoding result of each target video segment and the encoding result of each candidate video segment according to the attention weight vectors, the second target feature vectors, and the second candidate feature vectors.
  • In one embodiment, the feature vector acquisition module 620 is configured to separately extract an image feature vector of each target video frame and an image feature vector of each candidate video frame; to generate the first target feature vector and the second target feature vector of each target video frame and the index feature vector of each target video segment from the image feature vectors of the target video frames; and to generate the first candidate feature vector and the second candidate feature vector of each candidate video frame from the image feature vectors of the candidate video frames.
  • In one embodiment, the weight vector generation module 622 is configured to generate the target attention weight vector of each target video frame according to the index feature vector and the first target feature vector, and to generate the candidate attention weight vector of each candidate video frame according to the index feature vector and the first candidate feature vector.
  • In one embodiment, the weight vector generation module 622 is configured to generate a target heat map of each target video frame according to the index feature vector and the first target feature vector of each target video frame, and to normalize the target heat map to obtain the target attention weight vector of each target video frame; and/or to generate a candidate heat map of each candidate video frame according to the index feature vector and the first candidate feature vector of each candidate video frame, and to normalize the candidate heat map to obtain the candidate attention weight vector of each candidate video frame.
  • In one embodiment, the encoding result acquisition module 624 is configured to obtain the encoding result of each target video segment according to the target attention weight vector and the second target feature vector of each target video frame, and to obtain the encoding result of each candidate video segment according to the candidate attention weight vector and the second candidate feature vector of each candidate video frame.
  • In one embodiment, the encoding result acquisition module 624 is configured to multiply the target attention weight vector of each target video frame by the second target feature vector of the respective target video frame and add the multiplication results of the target video frames in the time dimension to obtain the encoding result of each target video segment; and/or to multiply the candidate attention weight vector of each candidate video frame by the second candidate feature vector of the respective candidate video frame and add the multiplication results of the candidate video frames in the time dimension to obtain the encoding result of each candidate video segment.
  • In one embodiment, the determining module 64 is configured to successively perform a subtraction operation between the encoding result of each target video segment and the encoding result of each candidate video segment; square the result of the subtraction in every dimension; perform a fully connected operation on the feature vector obtained by the square operation to obtain a two-dimensional feature vector; and normalize the two-dimensional feature vector to obtain the similarity score between each target video segment and each candidate video segment.
  • In one embodiment, the identification module 66 is configured to, for the candidate video segments of each of the at least one candidate video, add the highest-scoring similarity scores within a preset proportion threshold as the similarity score of that candidate video; sort the similarity scores of the candidate videos in descending order; and determine the top one or several candidate videos as videos containing the same target pedestrian as the target video.
  • The pedestrian re-identification apparatus of the embodiment of the present application is used to implement the corresponding pedestrian re-identification method of the above embodiments and has the beneficial effects of the corresponding method embodiments; details are not repeated here.
  • The embodiment of the present application further provides an electronic device, such as a mobile terminal, a personal computer (PC), a tablet computer, or a server.
  • Referring to FIG. 7, a schematic structural diagram of an electronic device 700 suitable for implementing the pedestrian re-identification apparatus of the embodiment of the present application is shown.
  • the electronic device 700 may include a memory and a processor.
  • The electronic device 700 includes one or more processors, communication elements, and the like, for example one or more central processing units (CPUs) 701 and/or one or more graphics processors (GPUs) 713; the processors may perform various appropriate actions and processing in accordance with executable instructions stored in a read-only memory (ROM) 702 or executable instructions loaded into a random access memory (RAM) 703 from a storage portion 708.
  • the communication component includes a communication component 712 and/or a communication interface 709.
  • the communication component 712 can include, but is not limited to, a network card, and the network card can include, but is not limited to, an IB (Infiniband) network card, and the communication interface 709 includes a communication interface of a network interface card such as a local area network (LAN) card, a modem, or the like.
  • the communication interface 709 performs communication processing via a network such as the Internet.
  • The processor can communicate with the read-only memory 702 and/or the random access memory 703 to execute executable instructions, connect to the communication component 712 via the communication bus 704, and communicate with other target devices via the communication component 712, thereby completing operations corresponding to any pedestrian re-identification method provided by the embodiments of the present application, for example: acquiring a target video containing a target pedestrian and at least one candidate video; separately encoding each target video segment in the target video and each candidate video segment in the at least one candidate video; determining, according to the encoding results, a similarity score between each target video segment and each candidate video segment, the similarity score characterizing the degree of similarity between pedestrian features in the target video segment and in the candidate video segment; and performing pedestrian re-identification on the at least one candidate video according to the similarity scores.
  • In addition, the RAM 703 can store various programs and data required for the operation of the device.
  • the CPU 701 or the GPU 713, the ROM 702, and the RAM 703 are connected to each other through a communication bus 704.
  • ROM 702 is an optional module.
  • the RAM 703 stores executable instructions or writes executable instructions to the ROM 702 at runtime, the executable instructions causing the processor to perform operations corresponding to the above-described communication methods.
  • An input/output (I/O) interface 705 is also coupled to communication bus 704.
  • The communication component 712 can be integrated, or can be configured to have multiple sub-modules (e.g., multiple IB network cards) and be linked to the communication bus.
  • the following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, etc.; an output portion 707 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, and a speaker; a storage portion 708 including a hard disk or the like And a communication interface 709 including a network interface card such as a LAN card, modem, or the like.
  • Driver 710 is also connected to I/O interface 705 as needed.
  • a removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory or the like, is mounted on the drive 710 as needed so that a computer program read therefrom is installed into the storage portion 708 as needed.
  • It should be noted that the architecture shown in FIG. 7 is only an optional implementation.
  • In practice, the number and types of the components in FIG. 7 may be selected, reduced, added, or replaced according to actual needs; different functional components may be arranged separately or in an integrated manner, for example, the GPU and the CPU may be arranged separately, or the GPU may be integrated on the CPU, and the communication component may be arranged separately or integrated on the CPU or the GPU, and so on.
  • The electronic device of the embodiment of the present application may be used to implement the corresponding pedestrian re-identification method in the foregoing embodiments, and each device in the electronic device may be used to perform the steps of the foregoing method embodiments; for example, the pedestrian re-identification method described above can be implemented by the processor of the electronic device calling the relevant instructions stored in the memory. For brevity, no further details are provided here.
  • An embodiment of the present application includes a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for executing the method illustrated in the flowchart; the program code may include instructions corresponding to the method steps provided by the embodiments of the present application, for example: acquiring a target video containing a target pedestrian and at least one candidate video; separately encoding each target video segment in the target video and each candidate video segment in the at least one candidate video; determining, according to the encoding results, a similarity score between each target video segment and each candidate video segment, the similarity score characterizing the degree of similarity between pedestrian features in the target video segment and in the candidate video segment; and performing pedestrian re-identification on the at least one candidate video according to the similarity scores.
  • In such an embodiment, the computer program can be downloaded and installed from a network through the communication element, and/or installed from the removable medium 711; when the computer program is executed by the processor, the corresponding operations of the method of the embodiments of the present application are performed.
  • the methods and apparatus, electronic devices, and storage media of the embodiments of the present application may be implemented in many ways.
  • the methods and apparatus, electronic devices, and storage media of the embodiments of the present application can be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware.
  • the above-described sequence of steps for the method is for illustrative purposes only, and the steps of the method of the embodiments of the present application are not limited to the order specifically described above unless otherwise specifically stated.
  • the present application may also be embodied as a program recorded in a recording medium, the programs including machine readable instructions for implementing a method in accordance with embodiments of the present application.
  • the present application also covers a recording medium storing a program for executing the method according to an embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Analysis (AREA)

Abstract

A pedestrian re-identification method and apparatus, an electronic device, and a storage medium. The pedestrian re-identification method includes: acquiring a target video containing a target pedestrian and at least one candidate video (S100); separately encoding each target video segment in the target video and each candidate video segment in the at least one candidate video (S102); determining, according to the encoding results, a similarity score between each target video segment and each candidate video segment (S104), the similarity score characterizing the degree of similarity between pedestrian features in the target video segment and in the candidate video segment; and performing pedestrian re-identification on the at least one candidate video according to the similarity scores (S106). The method improves the accuracy with which the encoding results yield the similarity score between each target video segment and each candidate video segment, and thus improves the accuracy of pedestrian re-identification.

Description

行人再识别方法、装置、电子设备和存储介质
相关申请的交叉引用
本申请基于申请号为201810145717.3、申请日为2018年2月12日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此以引入方式并入本申请。
Technical Field
The embodiments of the present application relate to the field of image processing technologies, and in particular to a pedestrian re-identification method and apparatus, an electronic device, and a storage medium.
Background
Pedestrian re-identification is a key technology in intelligent video surveillance systems. It aims to measure the similarity between a given target video and candidate videos, so as to find, among a large number of candidate videos, the candidate videos that contain the same pedestrian as the target video.
Current pedestrian re-identification methods mainly encode an entire video and use the encoding result to measure the similarity between the entire target video and an entire candidate video, which yields poor re-identification performance.
Summary
The embodiments of the present application provide a technical solution for pedestrian re-identification.
According to a first aspect of the embodiments of the present application, a pedestrian re-identification method is provided, including: acquiring a target video containing a target pedestrian and at least one candidate video; separately encoding each target video segment in the target video and each candidate video segment in the at least one candidate video; determining a similarity score between each target video segment and each candidate video segment according to the encoding results, the similarity score being used to characterize the degree of similarity between the pedestrian features in the target video segment and in the candidate video segment; and performing pedestrian re-identification on the at least one candidate video according to the similarity scores.
In an embodiment, separately encoding each target video segment in the target video and each candidate video segment in the at least one candidate video includes: acquiring a first target feature vector and a second target feature vector of each target video frame in each target video segment and an index feature vector of each target video segment, and acquiring a first candidate feature vector and a second candidate feature vector of each candidate video frame in each candidate video segment; generating attention weight vectors according to the index feature vector, the first target feature vectors, and the first candidate feature vectors; and obtaining an encoding result of each target video segment and an encoding result of each candidate video segment according to the attention weight vectors, the second target feature vectors, and the second candidate feature vectors.
In an embodiment, acquiring the first target feature vector and the second target feature vector of each target video frame in each target video segment and the index feature vector of each target video segment, and acquiring the first candidate feature vector and the second candidate feature vector of each candidate video frame in each candidate video segment, includes: separately extracting an image feature vector of each target video frame and an image feature vector of each candidate video frame; generating the first target feature vector and the second target feature vector of each target video frame and the index feature vector of each target video segment according to the image feature vector of each target video frame, and generating the first candidate feature vector and the second candidate feature vector of each candidate video frame according to the image feature vector of each candidate video frame.
In an embodiment, generating the attention weight vectors according to the index feature vector, the first target feature vectors, and the first candidate feature vectors includes: generating a target attention weight vector of each target video frame according to the index feature vector and the first target feature vectors, and generating a candidate attention weight vector of each candidate video frame according to the index feature vector and the first candidate feature vectors.
In an embodiment, generating the target attention weight vector of each target video frame according to the index feature vector and the first target feature vectors includes: generating a target heat map of each target video frame according to the index feature vector and the first target feature vector of each target video frame; and normalizing the target heat maps to obtain the target attention weight vector of each target video frame; and/or, generating the candidate attention weight vector of each candidate video frame according to the index feature vector and the first candidate feature vectors includes: generating a candidate heat map of each candidate video frame according to the index feature vector and the first candidate feature vector of each candidate video frame; and normalizing the candidate heat maps to obtain the candidate attention weight vector of each candidate video frame.
In an embodiment, obtaining the encoding result of each target video segment and the encoding result of each candidate video segment according to the attention weight vectors, the second target feature vectors, and the second candidate feature vectors includes: obtaining the encoding result of each target video segment according to the target attention weight vector and the second target feature vector of each target video frame, and obtaining the encoding result of each candidate video segment according to the candidate attention weight vector and the second candidate feature vector of each candidate video frame.
In an embodiment, obtaining the encoding result of each target video segment according to the target attention weight vector and the second target feature vector of each target video frame includes: multiplying the target attention weight vector of each target video frame by the second target feature vector of the respective target video frame; and summing the multiplication results of the target video frames along the temporal dimension to obtain the encoding result of each target video segment; and/or, obtaining the encoding result of each candidate video segment according to the candidate attention weight vector and the second candidate feature vector of each candidate video frame includes: multiplying the candidate attention weight vector of each candidate video frame by the second candidate feature vector of the respective candidate video frame; and summing the multiplication results of the candidate video frames along the temporal dimension to obtain the encoding result of each candidate video segment.
In an embodiment, determining the similarity score between each target video segment and each candidate video segment according to the encoding results includes: performing a subtraction operation between the encoding result of each target video segment and the encoding result of each candidate video segment in turn; squaring the result of the subtraction operation in each dimension; performing a fully connected operation on the feature vector obtained by the squaring operation to obtain a two-dimensional feature vector; and normalizing the two-dimensional feature vector to obtain the similarity score between each target video segment and each candidate video segment.
In an embodiment, performing pedestrian re-identification on the at least one candidate video according to the similarity scores includes: for the candidate video segments of each of the at least one candidate video, summing the highest similarity scores within a preset proportion threshold as the similarity score of each candidate video; sorting the similarity scores of the candidate videos in descending order; and determining the one or more top-ranked candidate videos as videos containing the same target pedestrian as the target video.
According to a second aspect of the embodiments of the present application, a pedestrian re-identification apparatus is provided, including: an acquisition module configured to acquire a target video containing a target pedestrian and at least one candidate video; an encoding module configured to separately encode each target video segment in the target video and each candidate video segment in the at least one candidate video; a determination module configured to determine a similarity score between each target video segment and each candidate video segment according to the encoding results, the similarity score being used to characterize the degree of similarity between the pedestrian features in the target video segment and in the candidate video segment; and an identification module configured to perform pedestrian re-identification on the at least one candidate video according to the similarity scores.
In an embodiment, the encoding module includes: a feature vector acquisition module configured to acquire a first target feature vector and a second target feature vector of each target video frame in each target video segment and an index feature vector of each target video segment, and to acquire a first candidate feature vector and a second candidate feature vector of each candidate video frame in each candidate video segment; a weight vector generation module configured to generate attention weight vectors according to the index feature vector, the first target feature vectors, and the first candidate feature vectors; and an encoding result acquisition module configured to obtain an encoding result of each target video segment and an encoding result of each candidate video segment according to the attention weight vectors, the second target feature vectors, and the second candidate feature vectors.
In an embodiment, the feature vector acquisition module is configured to separately extract an image feature vector of each target video frame and an image feature vector of each candidate video frame; to generate the first target feature vector and the second target feature vector of each target video frame and the index feature vector of each target video segment according to the image feature vector of each target video frame; and to generate the first candidate feature vector and the second candidate feature vector of each candidate video frame according to the image feature vector of each candidate video frame.
In an embodiment, the weight vector generation module is configured to generate a target attention weight vector of each target video frame according to the index feature vector and the first target feature vectors, and to generate a candidate attention weight vector of each candidate video frame according to the index feature vector and the first candidate feature vectors.
In an embodiment, the weight vector generation module is configured to generate a target heat map of each target video frame according to the index feature vector and the first target feature vector of each target video frame, and to normalize the target heat maps to obtain the target attention weight vector of each target video frame; and/or to generate a candidate heat map of each candidate video frame according to the index feature vector and the first candidate feature vector of each candidate video frame, and to normalize the candidate heat maps to obtain the candidate attention weight vector of each candidate video frame.
In an embodiment, the encoding result acquisition module is configured to obtain the encoding result of each target video segment according to the target attention weight vector and the second target feature vector of each target video frame, and to obtain the encoding result of each candidate video segment according to the candidate attention weight vector and the second candidate feature vector of each candidate video frame.
In an embodiment, the encoding result acquisition module is configured to multiply the target attention weight vector of each target video frame by the second target feature vector of the respective target video frame, and to sum the multiplication results of the target video frames along the temporal dimension to obtain the encoding result of each target video segment; and/or to multiply the candidate attention weight vector of each candidate video frame by the second candidate feature vector of the respective candidate video frame, and to sum the multiplication results of the candidate video frames along the temporal dimension to obtain the encoding result of each candidate video segment.
In an embodiment, the determination module is configured to perform a subtraction operation between the encoding result of each target video segment and the encoding result of each candidate video segment in turn; to square the result of the subtraction operation in each dimension; to perform a fully connected operation on the feature vector obtained by the squaring operation to obtain a two-dimensional feature vector; and to normalize the two-dimensional feature vector to obtain the similarity score between each target video segment and each candidate video segment.
In an embodiment, the identification module is configured to, for the candidate video segments of each of the at least one candidate video, sum the highest similarity scores within a preset proportion threshold as the similarity score of each candidate video; to sort the similarity scores of the candidate videos in descending order; and to determine the one or more top-ranked candidate videos as videos containing the same target pedestrian as the target video.
According to a third aspect of the embodiments of the present application, an electronic device is provided, including a processor and a memory, the memory being configured to store at least one executable instruction that causes the processor to execute the pedestrian re-identification method according to the first aspect.
According to a fourth aspect of the embodiments of the present application, a computer-readable storage medium is provided, on which a computer program is stored, the computer program implementing the pedestrian re-identification method according to the first aspect when executed by a processor.
According to a fifth aspect of the embodiments of the present application, a computer program product is provided, including at least one executable instruction that, when executed by a processor, implements the pedestrian re-identification method according to the first aspect.
When performing pedestrian re-identification, the embodiments of the present application acquire a target video containing a target pedestrian and at least one candidate video, separately encode each target video segment in the target video and each candidate video segment in the at least one candidate video, determine a similarity score between each target video segment and each candidate video segment according to the encoding results, and perform pedestrian re-identification on the at least one candidate video according to the similarity scores. Since a video segment contains far fewer frames than the entire video, the variation of pedestrian appearance information within a video segment is far smaller than that within the entire video. Compared with encoding the entire target video and the entire candidate video, encoding each target video segment and each candidate video segment effectively reduces the variation of pedestrian appearance information, while exploiting the diversity of pedestrian appearance information across different video frames and the dynamic correlation between video frames. This improves the utilization of pedestrian appearance information and the accuracy with which the encoding results support the computation of the similarity score between each target video segment and each candidate video segment, and thus can improve the accuracy of pedestrian re-identification.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of an embodiment of a pedestrian re-identification method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a computing framework of an embodiment of a pedestrian re-identification method according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of another embodiment of a pedestrian re-identification method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the attention encoding mechanism in a pedestrian re-identification method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an embodiment of a pedestrian re-identification apparatus according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of another embodiment of a pedestrian re-identification apparatus according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an embodiment of an electronic device according to an embodiment of the present application.
Detailed Description
Specific implementations of the embodiments of the present invention are described in further detail below with reference to the accompanying drawings (in which the same reference numerals denote the same elements) and the embodiments. The following embodiments are intended to illustrate the present invention, not to limit its scope.
Those skilled in the art will understand that terms such as "first" and "second" in the embodiments of the present invention are only used to distinguish different steps, devices, modules, and the like; they neither carry any specific technical meaning nor indicate a necessary logical order among them.
Referring to FIG. 1, a schematic flowchart of an embodiment of a pedestrian re-identification method according to an embodiment of the present application is shown.
The pedestrian re-identification method of this embodiment of the present application performs the following steps by the processor of an electronic device invoking relevant instructions stored in a memory.
Step S100: acquire a target video containing a target pedestrian and at least one candidate video.
The target video in the embodiments of the present application may contain one or more target pedestrians, and a candidate video may contain one or more candidate pedestrians or none. The target video and the at least one candidate video in the embodiments of the present application may be video images from a video capture device, or may come from other devices. One of the purposes of the embodiments of the present application is to find, among the at least one candidate video, the candidate videos in which a candidate pedestrian is the same person as the target pedestrian.
In an optional example, step S100 may be performed by the processor invoking corresponding instructions stored in the memory, or by the acquisition module 50 run by the processor.
Step S102: separately encode each target video segment in the target video and each candidate video segment in the at least one candidate video.
First, the target video and the candidate videos are cut into video segments, generating each target video segment of the target video and each candidate video segment of each candidate video. Each target video segment has a fixed time length, and so does each candidate video segment; the time length of each target video segment may be the same as or different from that of each candidate video segment.
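For illustration only (this sketch is not part of the patent text), the segmentation step might look as follows in Python; the segment length in frames (seg_len) and the handling of a trailing partial segment are assumptions:

    def split_into_segments(frames, seg_len=8):
        # frames: a sequence of video frames in temporal order.
        # Returns consecutive, non-overlapping segments of seg_len frames;
        # a trailing remainder shorter than seg_len is simply dropped here.
        return [frames[i:i + seg_len]
                for i in range(0, len(frames) - seg_len + 1, seg_len)]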
Then, an encoding operation is performed on each target video segment and each candidate video segment, respectively, to obtain the encoding result of each target video segment and the encoding result of each candidate video segment.
In an optional example, step S102 may be performed by the processor invoking corresponding instructions stored in the memory, or by the encoding module 52 run by the processor.
Step S104: determine a similarity score between each target video segment and each candidate video segment according to the encoding results.
In the embodiments of the present application, the encoding result of each target video segment can be regarded as one representation of the pedestrian feature vector in that target video segment, and the encoding result of each candidate video segment can be regarded as one representation of the pedestrian feature vector in that candidate video segment. Alternatively, the encoding result is the pedestrian feature vector itself. If the pedestrian feature vectors of a target video segment and a candidate video segment are identical or close, the two segments are highly likely to contain the same target pedestrian, that is, the similarity score between them is high; if the pedestrian feature vectors of a target video segment and a candidate video segment differ, the two segments are unlikely to contain the same target pedestrian, that is, the similarity score between them is low.
In an optional example, step S104 may be performed by the processor invoking corresponding instructions stored in the memory, or by the determination module 54 run by the processor.
Step S106: perform pedestrian re-identification on the at least one candidate video according to the similarity scores.
After the similarity score between each target video segment and each candidate video segment is obtained, the similarity score of the at least one candidate video can be derived from these scores. A candidate video with a high similarity score is determined as a candidate video containing the same target pedestrian as the target video.
In an optional example, step S106 may be performed by the processor invoking corresponding instructions stored in the memory, or by the identification module 56 run by the processor.
The pedestrian re-identification method proposed in the embodiments of the present application can be executed under the computing framework shown in FIG. 2. First, the videos (including the target video and the at least one candidate video) are cut into video segments of fixed length, where p denotes the target video, g denotes one of the at least one candidate video, p_n is a target video segment of the target video p, and g_k is a candidate video segment of the candidate video g. To measure the similarity between any two video segments of the target video p and the candidate video g, a deep network with a co-attention mechanism is used. The deep network takes the target video segment p_n and the candidate video segment g_k as inputs, and its output m(p_n, g_k) is the similarity score between the target video segment p_n and the candidate video segment g_k. For every pair of video segments (a target video segment and a candidate video segment) from the target video p and the candidate video g, a similarity score can be obtained, yielding a set of segment-level similarity scores. To effectively estimate the similarity between the target video p and the candidate video g, a competitive mechanism can be used to select the portion of the similarity scores with higher similarity; summing these similarity scores yields a reliable estimate c(p, g) of the similarity between the target video p and the candidate video g.
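For illustration only, a minimal sketch of this competitive aggregation in Python; the fraction of segment-pair scores that is kept (top_ratio) is an assumption, not a value fixed by the patent:

    import numpy as np

    def aggregate_similarity(segment_scores, top_ratio=0.2):
        # segment_scores: all m(p_n, g_k) values for one (target, candidate) pair.
        scores = np.sort(np.asarray(segment_scores, dtype=float).ravel())[::-1]
        keep = max(1, int(len(scores) * top_ratio))  # competitive selection
        return float(scores[:keep].sum())            # estimate c(p, g)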
When performing pedestrian re-identification, the embodiments of the present application acquire a target video containing a target pedestrian and at least one candidate video, separately encode each target video segment in the target video and each candidate video segment in the at least one candidate video, determine a similarity score between each target video segment and each candidate video segment according to the encoding results, and perform pedestrian re-identification on the at least one candidate video according to the similarity scores. Since a video segment contains far fewer frames than the entire video, the variation of pedestrian appearance information within a video segment is far smaller than that within the entire video. Compared with encoding the entire target video and the entire candidate video, encoding each target video segment and each candidate video segment effectively reduces the variation of pedestrian appearance information, while exploiting the diversity of pedestrian appearance information across different video frames and the dynamic correlation between video frames. This improves the utilization of pedestrian appearance information and the accuracy with which the encoding results support the computation of the similarity score between each target video segment and each candidate video segment, and thus can improve the accuracy of pedestrian re-identification.
Referring to FIG. 3, a schematic flowchart of another embodiment of a pedestrian re-identification method according to an embodiment of the present application is shown.
It should be noted that each embodiment of the present application is described with its own emphasis; for parts not described in detail in one embodiment, reference may be made to the descriptions in other embodiments of the present application, which are not repeated.
Step S300: acquire a target video containing a target pedestrian and at least one candidate video.
Step S302: separately encode each target video segment in the target video and each candidate video segment in the at least one candidate video.
Optionally, step S302 may include the following steps:
Step S3020: acquire a first target feature vector and a second target feature vector of each target video frame in each target video segment and an index feature vector of each target video segment, and acquire a first candidate feature vector and a second candidate feature vector of each candidate video frame in each candidate video segment.
In an optional implementation, a neural network may be used to extract the image feature vector of each target video frame and of each candidate video frame; the image feature vector reflects the image features in a video frame, such as pedestrian features and background features. For the target video frames, the first target feature vector and the second target feature vector of each target video frame and the index feature vector of each target video segment are generated according to the image feature vector of each target video frame; the index feature vector contains the information of the target video segment and can effectively distinguish useful information from noise. For the candidate video frames, the first candidate feature vector and the second candidate feature vector of each candidate video frame are generated according to the image feature vector of each candidate video frame. Specifically, the first target feature vector (the "key" feature vector) and the first candidate feature vector (the "key" feature vector) can be generated by a linear transformation of each frame's features, and the second target feature vector (the "value" feature vector) and the second candidate feature vector (the "value" feature vector) can be generated by another linear transformation of each frame's features. A Long Short-Term Memory (LSTM) network and the image feature vectors of the target video frames of each target video segment can be used to generate the index feature vector of each target video segment; the index feature vector is generated from the target video segment and acts on the target video segment itself as well as on all candidate video segments.
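For illustration only, a minimal PyTorch-style sketch of generating the "key" and "value" feature vectors and the segment index feature vector; the module name FrameProjections, the feature dimensions, and the use of the final LSTM hidden state as the index feature vector are assumptions, not details fixed by the patent:

    import torch.nn as nn

    class FrameProjections(nn.Module):
        def __init__(self, feat_dim=2048, key_dim=256, val_dim=256):
            super().__init__()
            self.key_proj = nn.Linear(feat_dim, key_dim)  # linear transform -> "key"
            self.val_proj = nn.Linear(feat_dim, val_dim)  # another linear transform -> "value"
            self.lstm = nn.LSTM(feat_dim, key_dim, batch_first=True)

        def forward(self, frame_feats):
            # frame_feats: (batch, T, feat_dim) image feature vectors of one segment
            keys = self.key_proj(frame_feats)     # (batch, T, key_dim)
            values = self.val_proj(frame_feats)   # (batch, T, val_dim)
            _, (h_n, _) = self.lstm(frame_feats)  # summarize the segment over time
            index = h_n[-1]                       # (batch, key_dim) index feature vector
            return keys, values, index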
Step S3022: generate attention weight vectors according to the index feature vector, the first target feature vectors, and the first candidate feature vectors.
In the embodiments of the present application, the first target feature vectors and the first candidate feature vectors are used to generate the attention weight vectors. In an optional implementation, for the target video frames, the target attention weight vector of each target video frame may be generated according to the index feature vector and the first target feature vectors. Optionally, a target heat map of each target video frame is generated according to the index feature vector and the first target feature vector of each target video frame; specifically, an inner-product operation between the index feature vector and the first target feature vector of each target video frame yields the target heat map of each target video frame, and the target heat maps are normalized along the temporal dimension with a softmax function to obtain the target attention weight vector of each target video frame. For the candidate video frames, the candidate attention weight vector of each candidate video frame may be generated according to the index feature vector and the first candidate feature vectors. Optionally, a candidate heat map of each candidate video frame is generated according to the index feature vector and the first candidate feature vector of each candidate video frame; specifically, an inner-product operation between the index feature vector and the first candidate feature vector of each candidate video frame yields the candidate heat map of each candidate video frame, and the candidate heat maps are normalized along the temporal dimension with a softmax function to obtain the candidate attention weight vector of each candidate video frame.
The attention weight vectors are used to enhance effective pedestrian features during encoding; they are weight vectors carrying discriminative information and can weaken the influence of noise.
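For illustration only, a sketch of the inner-product heat map and the temporal softmax, continuing the assumed shapes above; collapsing each frame's heat map to a single score per frame is a simplification (the patent forms a heat map over the features within each frame):

    import torch
    import torch.nn.functional as F

    def attention_weights(index, keys):
        # index: (batch, key_dim) index feature vector of the target segment.
        # keys:  (batch, T, key_dim) per-frame "key" feature vectors.
        heat = torch.einsum('bd,btd->bt', index, keys)  # inner product per frame
        return F.softmax(heat, dim=1)                   # normalize over the temporal dimension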
Step S3024: obtain the encoding result of each target video segment and the encoding result of each candidate video segment according to the attention weight vectors, the second target feature vectors, and the second candidate feature vectors.
In the embodiments of the present application, the second target feature vector reflects the image features of each frame in a target video segment, and the second candidate feature vector reflects the image features of each frame in a candidate video segment. In an optional implementation, for the target video frames, the encoding result of each target video segment is obtained according to the target attention weight vector and the second target feature vector of each target video frame. Specifically, the target attention weight vector of each target video frame is multiplied by the second target feature vector of the respective target video frame, and the multiplication results of the target video frames are summed along the temporal dimension to obtain the encoding result of each target video segment. For the candidate video frames, the encoding result of each candidate video segment is obtained according to the candidate attention weight vector and the second candidate feature vector of each candidate video frame. Optionally, the candidate attention weight vector of each candidate video frame is multiplied by the second candidate feature vector of the respective candidate video frame, and the multiplication results of the candidate video frames are summed along the temporal dimension to obtain the encoding result of each candidate video segment.
Step S302 of the embodiments of the present application can be implemented through an attention encoding mechanism; that is, the encoding result of a video segment is obtained by distilling the features of different frames in the video segment (the target video segment or the candidate video segment), as shown in FIG. 4. First, convolutional neural network features are extracted for each target video frame in the target video segment and each candidate video frame in the candidate video segment, and the "key" feature vector and "value" feature vector corresponding to each target video frame or candidate video frame are generated from the convolutional neural network features. An inner-product operation between the "key" feature vector of each target video frame or candidate video frame and the index feature vector of each target video segment forms a heat map, which reflects the correlation between each feature within the target video frame or candidate video frame and the global information. The heat maps are normalized along the temporal dimension with a softmax function to form attention weight vectors; each attention weight vector is multiplied, dimension by dimension, with the "value" feature vector of the corresponding video frame, and the results obtained from the different video frames are summed along the temporal dimension to obtain the encoding result of each video segment.
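Continuing the sketch above (illustrative only), the encoding result of a segment would then be the attention-weighted temporal sum of the "value" feature vectors:

    def encode_segment(weights, values):
        # weights: (batch, T) attention weights from attention_weights(...).
        # values:  (batch, T, val_dim) per-frame "value" feature vectors.
        # Multiply each frame's value vector by its weight (broadcast over the
        # feature dimension) and sum the results along the temporal dimension.
        return (weights.unsqueeze(-1) * values).sum(dim=1)  # (batch, val_dim)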
Step S304: determine a similarity score between each target video segment and each candidate video segment according to the encoding results.
In an optional implementation, the encoding result of each target video segment and the encoding result of each candidate video segment are subjected, in turn, to a subtraction operation, a squaring operation, a fully connected operation, and a normalization operation to obtain the similarity score between each target video segment and each candidate video segment. Specifically, a subtraction operation is performed between the encoding result of each target video segment and the encoding result of each candidate video segment in turn, and a squaring operation is then performed in each image dimension. The image dimensions include but are not limited to pedestrian image dimensions and background image dimensions, where the pedestrian image dimensions include head, upper-body, and lower-body image dimensions, and the background image dimensions include building and street image dimensions. The feature vector obtained after the squaring operation passes through a fully connected layer to obtain a two-dimensional feature vector, and finally the similarity score between each target video segment and each candidate video segment is obtained through the nonlinear normalization of a Sigmoid function.
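For illustration only, a minimal PyTorch-style sketch of this similarity head; taking one component of the Sigmoid-normalized two-dimensional vector as the score m(p_n, g_k) is an assumption:

    import torch
    import torch.nn as nn

    class SegmentSimilarity(nn.Module):
        def __init__(self, enc_dim=256):
            super().__init__()
            self.fc = nn.Linear(enc_dim, 2)  # fully connected layer -> 2-dim vector

        def forward(self, enc_p, enc_g):
            # enc_p, enc_g: (batch, enc_dim) encodings of a target and a candidate segment.
            diff = enc_p - enc_g            # subtraction operation
            sq = diff * diff                # squaring in each dimension
            two_dim = self.fc(sq)           # two-dimensional feature vector
            probs = torch.sigmoid(two_dim)  # nonlinear normalization
            return probs[:, 1]              # similarity score m(p_n, g_k)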
Step S306: perform pedestrian re-identification on the at least one candidate video according to the similarity scores.
In an optional implementation, for each of the at least one candidate video, the similarity scores that are greater than or equal to a preset threshold, or the relatively high similarity scores (for example, the top 20% of the similarity scores), are summed as the similarity score of that candidate video; the similarity scores of the candidate videos are sorted in descending order; and the one or more top-ranked candidate videos are determined as videos containing the same target pedestrian as the target video. The preset threshold can be set according to the actual situation, and "relatively high" is meant in a comparative sense.
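For illustration only, and reusing aggregate_similarity from the sketch following the FIG. 2 description above, the final ranking step could be written as:

    def rank_candidates(score_table, top_ratio=0.2):
        # score_table: {candidate_video_id: list of segment-pair similarity scores}.
        video_scores = {g: aggregate_similarity(s, top_ratio)
                        for g, s in score_table.items()}
        # Sort candidate videos by similarity score in descending order; the
        # top-ranked entries are taken to contain the same target pedestrian.
        return sorted(video_scores.items(), key=lambda kv: kv[1], reverse=True)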
When performing pedestrian re-identification, the embodiments of the present application acquire a target video containing a target pedestrian and at least one candidate video, separately encode each target video segment in the target video and each candidate video segment in the at least one candidate video, determine a similarity score between each target video segment and each candidate video segment according to the encoding results, and perform pedestrian re-identification on the at least one candidate video according to the similarity scores. Since a video segment contains far fewer frames than the entire video, the variation of pedestrian appearance information within a video segment is far smaller than that within the entire video. Compared with encoding the entire target video and the entire candidate video, encoding each target video segment and each candidate video segment effectively reduces the variation of pedestrian appearance information, while exploiting the diversity of pedestrian appearance information across different video frames and the dynamic correlation between video frames. This improves the utilization of pedestrian appearance information and the accuracy with which the encoding results support the computation of the similarity score between each target video segment and each candidate video segment, and thus can improve the accuracy of pedestrian re-identification.
In the embodiments of the present application, the encoding result of a candidate video is obtained from the index feature vector of the target video segment and the "key" feature vectors of the candidate video segment. Using the index feature vector of the target video segment as guidance information during encoding improves the accuracy with which the candidate video's encoding result supports the determination of the similarity score. Using the index feature vector of the target video segment to estimate the attention weight vector of each candidate video frame reduces the influence of abnormal candidate video frames on the encoding results of the candidate video segments, making pedestrian re-identification in candidate videos more targeted.
In the embodiments of the present application, the target video and the candidate videos are cut into segments, and the target video segments and candidate video segments are encoded. When the pedestrian in a candidate video is occluded in some candidate video frames, the candidate video segments with higher similarity scores are selected as the effective candidate video segments of that candidate video, and the candidate video segments with lower similarity scores are ignored.
Referring to FIG. 5, a schematic structural diagram of an embodiment of a pedestrian re-identification apparatus according to an embodiment of the present application is shown.
The pedestrian re-identification apparatus provided by this embodiment of the present application includes: an acquisition module 50 configured to acquire a target video containing a target pedestrian and at least one candidate video; an encoding module 52 configured to separately encode each target video segment in the target video and each candidate video segment in the at least one candidate video; a determination module 54 configured to determine a similarity score between each target video segment and each candidate video segment according to the encoding results, the similarity score being used to characterize the degree of similarity between the pedestrian features in the target video segment and in the candidate video segment; and an identification module 56 configured to perform pedestrian re-identification on the at least one candidate video according to the similarity scores.
The pedestrian re-identification apparatus of this embodiment of the present application is used to implement the corresponding pedestrian re-identification method in the foregoing embodiments and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Referring to FIG. 6, a schematic structural diagram of another embodiment of a pedestrian re-identification apparatus according to an embodiment of the present application is shown.
The pedestrian re-identification apparatus provided by this embodiment of the present application includes: an acquisition module 60 configured to acquire a target video containing a target pedestrian and at least one candidate video; an encoding module 62 configured to separately encode each target video segment in the target video and each candidate video segment in the at least one candidate video; a determination module 64 configured to determine a similarity score between each target video segment and each candidate video segment according to the encoding results, the similarity score being used to characterize the degree of similarity between the pedestrian features in the target video segment and in the candidate video segment; and an identification module 66 configured to perform pedestrian re-identification on the at least one candidate video according to the similarity scores.
Optionally, the encoding module 62 includes: a feature vector acquisition module 620 configured to acquire a first target feature vector and a second target feature vector of each target video frame in each target video segment and an index feature vector of each target video segment, and to acquire a first candidate feature vector and a second candidate feature vector of each candidate video frame in each candidate video segment; a weight vector generation module 622 configured to generate attention weight vectors according to the index feature vector, the first target feature vectors, and the first candidate feature vectors; and an encoding result acquisition module 624 configured to obtain an encoding result of each target video segment and an encoding result of each candidate video segment according to the attention weight vectors, the second target feature vectors, and the second candidate feature vectors.
Optionally, the feature vector acquisition module 620 is configured to separately extract an image feature vector of each target video frame and an image feature vector of each candidate video frame; to generate the first target feature vector and the second target feature vector of each target video frame and the index feature vector of each target video segment according to the image feature vector of each target video frame; and to generate the first candidate feature vector and the second candidate feature vector of each candidate video frame according to the image feature vector of each candidate video frame.
Optionally, the weight vector generation module 622 is configured to generate a target attention weight vector of each target video frame according to the index feature vector and the first target feature vectors, and to generate a candidate attention weight vector of each candidate video frame according to the index feature vector and the first candidate feature vectors.
Optionally, the weight vector generation module 622 is configured to generate a target heat map of each target video frame according to the index feature vector and the first target feature vector of each target video frame, and to normalize the target heat maps to obtain the target attention weight vector of each target video frame; and/or to generate a candidate heat map of each candidate video frame according to the index feature vector and the first candidate feature vector of each candidate video frame, and to normalize the candidate heat maps to obtain the candidate attention weight vector of each candidate video frame.
Optionally, the encoding result acquisition module 624 is configured to obtain the encoding result of each target video segment according to the target attention weight vector and the second target feature vector of each target video frame, and to obtain the encoding result of each candidate video segment according to the candidate attention weight vector and the second candidate feature vector of each candidate video frame.
Optionally, the encoding result acquisition module 624 is configured to multiply the target attention weight vector of each target video frame by the second target feature vector of the respective target video frame, and to sum the multiplication results of the target video frames along the temporal dimension to obtain the encoding result of each target video segment; and/or to multiply the candidate attention weight vector of each candidate video frame by the second candidate feature vector of the respective candidate video frame, and to sum the multiplication results of the candidate video frames along the temporal dimension to obtain the encoding result of each candidate video segment.
Optionally, the determination module 64 is configured to perform a subtraction operation between the encoding result of each target video segment and the encoding result of each candidate video segment in turn; to square the result of the subtraction operation in each dimension; to perform a fully connected operation on the feature vector obtained by the squaring operation to obtain a two-dimensional feature vector; and to normalize the two-dimensional feature vector to obtain the similarity score between each target video segment and each candidate video segment.
Optionally, the identification module 66 is configured to, for each candidate video segment of the at least one candidate video, sum the highest similarity scores within a preset proportion threshold as the similarity score of each candidate video; to sort the similarity scores of the candidate videos in descending order; and to determine the one or more top-ranked candidate videos as videos containing the same target pedestrian as the target video.
The pedestrian re-identification apparatus of this embodiment of the present application is used to implement the corresponding pedestrian re-identification method in the foregoing embodiments and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
The embodiments of the present application further provide an electronic device, which may be, for example, a mobile terminal, a personal computer (PC), a tablet, or a server. Referring to FIG. 7, a schematic structural diagram of an electronic device 700 suitable for implementing the pedestrian re-identification apparatus of the embodiments of the present application is shown. As shown in FIG. 7, the electronic device 700 may include a memory and a processor. Specifically, the electronic device 700 includes one or more processors, communication elements, and the like; the one or more processors are, for example, one or more central processing units (CPUs) 701 and/or one or more graphics processing units (GPUs) 713. The processors may perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) 702 or loaded from a storage section 708 into a random access memory (RAM) 703. The communication elements include a communication component 712 and/or a communication interface 709. The communication component 712 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (Infiniband) network card; the communication interface 709 includes a communication interface of a network interface card such as a local area network (LAN) card or a modem, and performs communication processing via a network such as the Internet.
The processor may communicate with the read-only memory 702 and/or the random access memory 703 to execute executable instructions, connect to the communication component 712 via a communication bus 704, and communicate with other target devices via the communication component 712, thereby completing the operations corresponding to any pedestrian re-identification method provided by the embodiments of the present application, for example: acquiring a target video containing a target pedestrian and at least one candidate video; separately encoding each target video segment in the target video and each candidate video segment in the at least one candidate video; determining a similarity score between each target video segment and each candidate video segment according to the encoding results, the similarity score being used to characterize the degree of similarity between the pedestrian features in the target video segment and in the candidate video segment; and performing pedestrian re-identification on the at least one candidate video according to the similarity scores.
In addition, the RAM 703 may also store various programs and data required for the operation of the apparatus. The CPU 701 or GPU 713, the ROM 702, and the RAM 703 are connected to one another through the communication bus 704. Where the RAM 703 is present, the ROM 702 is an optional module. The RAM 703 stores executable instructions, or writes executable instructions into the ROM 702 at runtime, and the executable instructions cause the processor to perform the operations corresponding to the above method. An input/output (I/O) interface 705 is also connected to the communication bus 704. The communication component 712 may be integrated, or may be configured with multiple sub-modules (for example, multiple IB network cards) linked on the communication bus.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a cathode-ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 708 including a hard disk and the like; and a communication interface 709 of a network interface card such as a LAN card or a modem. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read therefrom is installed into the storage section 708 as needed.
It should be noted that the architecture shown in FIG. 7 is only an optional implementation. In practice, the number and types of the components in FIG. 7 may be selected, reduced, increased, or replaced according to actual needs. Different functional components may also be arranged separately or in an integrated manner; for example, the GPU and the CPU may be arranged separately, or the GPU may be integrated into the CPU, and the communication elements may be arranged separately or integrated into the CPU or the GPU. All of these alternative implementations fall within the protection scope of the present invention.
The electronic device of the embodiments of the present application may be used to implement the corresponding pedestrian re-identification method in the above embodiments, and each component in the electronic device may be used to perform each step in the above method embodiments; for example, the pedestrian re-identification method described above may be implemented by the processor of the electronic device invoking relevant instructions stored in the memory, and details are not repeated here for brevity.
According to the embodiments of the present application, the process described above with reference to the flowchart may be implemented as a computer program product. For example, the embodiments of the present application include a computer program product comprising a computer program tangibly embodied on a machine-readable medium; the computer program includes program code for executing the method shown in the flowchart, and the program code may include instructions corresponding to the method steps provided by the embodiments of the present application, for example: acquiring a target video containing a target pedestrian and at least one candidate video; separately encoding each target video segment in the target video and each candidate video segment in the at least one candidate video; determining a similarity score between each target video segment and each candidate video segment according to the encoding results, the similarity score being used to characterize the degree of similarity between the pedestrian features in the target video segment and in the candidate video segment; and performing pedestrian re-identification on the at least one candidate video according to the similarity scores. In such an embodiment, the computer program may be downloaded and installed from a network via the communication element, and/or installed from the removable medium 711. When executed by the processor, the computer program performs the functions disclosed in the methods of the embodiments of the present application.
The methods and apparatuses, electronic devices, and storage media of the embodiments of the present application may be implemented in many ways, for example by software, hardware, firmware, or any combination of software, hardware, and firmware. The above order of the method steps is for illustration only; the steps of the methods of the embodiments of the present application are not limited to the order specifically described above unless otherwise specifically stated. In addition, in some embodiments, the present application may also be embodied as programs recorded in a recording medium, the programs comprising machine-readable instructions for implementing the methods according to the embodiments of the present application. Accordingly, the present application also covers recording media storing programs for executing the methods according to the embodiments of the present application.
The description of the embodiments of the present application is given for the sake of example and description; it is not exhaustive and does not limit the present invention to the disclosed forms, and many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described to better explain the principles and practical applications of the present application, and to enable those of ordinary skill in the art to understand the present application and thereby design various embodiments, with various modifications, suited to particular uses.

Claims (21)

  1. A pedestrian re-identification method, comprising:
    acquiring a target video containing a target pedestrian and at least one candidate video;
    separately encoding each target video segment in the target video and each candidate video segment in the at least one candidate video;
    determining a similarity score between each target video segment and each candidate video segment according to encoding results, the similarity score being used to characterize a degree of similarity between pedestrian features in the target video segment and in the candidate video segment; and
    performing pedestrian re-identification on the at least one candidate video according to the similarity scores.
  2. The method according to claim 1, wherein separately encoding each target video segment in the target video and each candidate video segment in the at least one candidate video comprises:
    acquiring a first target feature vector and a second target feature vector of each target video frame in each target video segment and an index feature vector of each target video segment, and acquiring a first candidate feature vector and a second candidate feature vector of each candidate video frame in each candidate video segment;
    generating attention weight vectors according to the index feature vector, the first target feature vectors, and the first candidate feature vectors; and
    obtaining an encoding result of each target video segment and an encoding result of each candidate video segment according to the attention weight vectors, the second target feature vectors, and the second candidate feature vectors.
  3. The method according to claim 2, wherein acquiring the first target feature vector and the second target feature vector of each target video frame in each target video segment and the index feature vector of each target video segment, and acquiring the first candidate feature vector and the second candidate feature vector of each candidate video frame in each candidate video segment, comprises:
    separately extracting an image feature vector of each target video frame and an image feature vector of each candidate video frame; and
    generating the first target feature vector and the second target feature vector of each target video frame and the index feature vector of each target video segment according to the image feature vector of each target video frame, and generating the first candidate feature vector and the second candidate feature vector of each candidate video frame according to the image feature vector of each candidate video frame.
  4. The method according to claim 2 or 3, wherein generating the attention weight vectors according to the index feature vector, the first target feature vectors, and the first candidate feature vectors comprises:
    generating a target attention weight vector of each target video frame according to the index feature vector and the first target feature vectors, and generating a candidate attention weight vector of each candidate video frame according to the index feature vector and the first candidate feature vectors.
  5. The method according to claim 4, wherein generating the target attention weight vector of each target video frame according to the index feature vector and the first target feature vectors comprises:
    generating a target heat map of each target video frame according to the index feature vector and the first target feature vector of each target video frame; and
    normalizing the target heat maps to obtain the target attention weight vector of each target video frame;
    and/or,
    generating the candidate attention weight vector of each candidate video frame according to the index feature vector and the first candidate feature vectors comprises:
    generating a candidate heat map of each candidate video frame according to the index feature vector and the first candidate feature vector of each candidate video frame; and
    normalizing the candidate heat maps to obtain the candidate attention weight vector of each candidate video frame.
  6. The method according to any one of claims 2-5, wherein obtaining the encoding result of each target video segment and the encoding result of each candidate video segment according to the attention weight vectors, the second target feature vectors, and the second candidate feature vectors comprises:
    obtaining the encoding result of each target video segment according to the target attention weight vector and the second target feature vector of each target video frame, and obtaining the encoding result of each candidate video segment according to the candidate attention weight vector and the second candidate feature vector of each candidate video frame.
  7. The method according to claim 6, wherein obtaining the encoding result of each target video segment according to the target attention weight vector and the second target feature vector of each target video frame comprises:
    multiplying the target attention weight vector of each target video frame by the second target feature vector of the respective target video frame; and
    summing the multiplication results of the target video frames along the temporal dimension to obtain the encoding result of each target video segment;
    and/or,
    obtaining the encoding result of each candidate video segment according to the candidate attention weight vector and the second candidate feature vector of each candidate video frame comprises:
    multiplying the candidate attention weight vector of each candidate video frame by the second candidate feature vector of the respective candidate video frame; and
    summing the multiplication results of the candidate video frames along the temporal dimension to obtain the encoding result of each candidate video segment.
  8. The method according to any one of claims 1-7, wherein determining the similarity score between each target video segment and each candidate video segment according to the encoding results comprises:
    performing a subtraction operation between the encoding result of each target video segment and the encoding result of each candidate video segment in turn;
    squaring the result of the subtraction operation in each dimension;
    performing a fully connected operation on the feature vector obtained by the squaring operation to obtain a two-dimensional feature vector; and
    normalizing the two-dimensional feature vector to obtain the similarity score between each target video segment and each candidate video segment.
  9. The method according to any one of claims 1-8, wherein performing pedestrian re-identification on the at least one candidate video according to the similarity scores comprises:
    for the candidate video segments of each of the at least one candidate video, summing the highest similarity scores within a preset proportion threshold as the similarity score of each candidate video;
    sorting the similarity scores of the candidate videos in descending order; and
    determining the one or more top-ranked candidate videos as videos containing the same target pedestrian as the target video.
  10. A pedestrian re-identification apparatus, comprising:
    an acquisition module configured to acquire a target video containing a target pedestrian and at least one candidate video;
    an encoding module configured to separately encode each target video segment in the target video and each candidate video segment in the at least one candidate video;
    a determination module configured to determine a similarity score between each target video segment and each candidate video segment according to encoding results, the similarity score being used to characterize a degree of similarity between pedestrian features in the target video segment and in the candidate video segment; and
    an identification module configured to perform pedestrian re-identification on the at least one candidate video according to the similarity scores.
  11. The apparatus according to claim 10, wherein the encoding module comprises:
    a feature vector acquisition module configured to acquire a first target feature vector and a second target feature vector of each target video frame in each target video segment and an index feature vector of each target video segment, and to acquire a first candidate feature vector and a second candidate feature vector of each candidate video frame in each candidate video segment;
    a weight vector generation module configured to generate attention weight vectors according to the index feature vector, the first target feature vectors, and the first candidate feature vectors; and
    an encoding result acquisition module configured to obtain an encoding result of each target video segment and an encoding result of each candidate video segment according to the attention weight vectors, the second target feature vectors, and the second candidate feature vectors.
  12. The apparatus according to claim 11, wherein the feature vector acquisition module is configured to separately extract an image feature vector of each target video frame and an image feature vector of each candidate video frame; to generate the first target feature vector and the second target feature vector of each target video frame and the index feature vector of each target video segment according to the image feature vector of each target video frame; and to generate the first candidate feature vector and the second candidate feature vector of each candidate video frame according to the image feature vector of each candidate video frame.
  13. The apparatus according to claim 11 or 12, wherein the weight vector generation module is configured to generate a target attention weight vector of each target video frame according to the index feature vector and the first target feature vectors, and to generate a candidate attention weight vector of each candidate video frame according to the index feature vector and the first candidate feature vectors.
  14. The apparatus according to claim 13, wherein the weight vector generation module is configured to generate a target heat map of each target video frame according to the index feature vector and the first target feature vector of each target video frame, and to normalize the target heat maps to obtain the target attention weight vector of each target video frame; and/or to generate a candidate heat map of each candidate video frame according to the index feature vector and the first candidate feature vector of each candidate video frame, and to normalize the candidate heat maps to obtain the candidate attention weight vector of each candidate video frame.
  15. The apparatus according to any one of claims 11-14, wherein the encoding result acquisition module is configured to obtain the encoding result of each target video segment according to the target attention weight vector and the second target feature vector of each target video frame, and to obtain the encoding result of each candidate video segment according to the candidate attention weight vector and the second candidate feature vector of each candidate video frame.
  16. The apparatus according to claim 15, wherein the encoding result acquisition module is configured to multiply the target attention weight vector of each target video frame by the second target feature vector of the respective target video frame, and to sum the multiplication results of the target video frames along the temporal dimension to obtain the encoding result of each target video segment; and/or to multiply the candidate attention weight vector of each candidate video frame by the second candidate feature vector of the respective candidate video frame, and to sum the multiplication results of the candidate video frames along the temporal dimension to obtain the encoding result of each candidate video segment.
  17. The apparatus according to any one of claims 10-16, wherein the determination module is configured to perform a subtraction operation between the encoding result of each target video segment and the encoding result of each candidate video segment in turn; to square the result of the subtraction operation in each dimension; to perform a fully connected operation on the feature vector obtained by the squaring operation to obtain a two-dimensional feature vector; and to normalize the two-dimensional feature vector to obtain the similarity score between each target video segment and each candidate video segment.
  18. The apparatus according to any one of claims 10-17, wherein the identification module is configured to, for the candidate video segments of each of the at least one candidate video, sum the highest similarity scores within a preset proportion threshold as the similarity score of each candidate video; to sort the similarity scores of the candidate videos in descending order; and to determine the one or more top-ranked candidate videos as videos containing the same target pedestrian as the target video.
  19. An electronic device, comprising a processor and a memory;
    the memory being configured to store at least one executable instruction that causes the processor to execute the pedestrian re-identification method according to any one of claims 1-9.
  20. A computer-readable storage medium on which a computer program is stored, the computer program implementing the pedestrian re-identification method according to any one of claims 1-9 when executed by a processor.
  21. A computer program product, comprising at least one executable instruction that, when executed by a processor, implements the pedestrian re-identification method according to any one of claims 1-9.
PCT/CN2018/116600 2018-02-12 2018-11-21 Pedestrian re-identification method and apparatus, electronic device, and storage medium WO2019153830A1 (zh)

Priority Applications (5)

Application Number Priority Date Filing Date Title
KR1020197038764A KR102348002B1 (ko) 2018-02-12 2018-11-21 Pedestrian re-identification method and apparatus, electronic device, and storage medium
SG11201913733QA SG11201913733QA (en) 2018-02-12 2018-11-21 Pedestrian re-identification method and apparatus, electronic device, and storage medium
JP2019570048A JP6905601B2 (ja) 2018-02-12 2018-11-21 Pedestrian re-identification method and apparatus, electronic device, and storage medium
US16/726,878 US11301687B2 (en) 2018-02-12 2019-12-25 Pedestrian re-identification methods and apparatuses, electronic devices, and storage media
PH12020500050A PH12020500050A1 (en) 2018-02-12 2020-01-06 Pedestrian re-identification method and apparatus, electronic device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810145717.3 2018-02-12
CN201810145717.3A CN108399381B (zh) 2018-02-12 2018-02-12 Pedestrian re-identification method and apparatus, electronic device, and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/726,878 Continuation US11301687B2 (en) 2018-02-12 2019-12-25 Pedestrian re-identification methods and apparatuses, electronic devices, and storage media

Publications (1)

Publication Number Publication Date
WO2019153830A1 true WO2019153830A1 (zh) 2019-08-15

Family

ID=63096438

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/116600 WO2019153830A1 (zh) 2018-02-12 2018-11-21 行人再识别方法、装置、电子设备和存储介质

Country Status (7)

Country Link
US (1) US11301687B2 (zh)
JP (1) JP6905601B2 (zh)
KR (1) KR102348002B1 (zh)
CN (1) CN108399381B (zh)
PH (1) PH12020500050A1 (zh)
SG (1) SG11201913733QA (zh)
WO (1) WO2019153830A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827312A (zh) * 2019-11-12 2020-02-21 北京深境智能科技有限公司 Learning method based on a collaborative visual attention neural network
CN111538861A (zh) * 2020-04-22 2020-08-14 Zhejiang Dahua Technology Co., Ltd. Method, apparatus, device, and medium for image retrieval based on surveillance video
CN111723645A (zh) * 2020-04-24 2020-09-29 Zhejiang University High-accuracy multi-camera pedestrian re-identification method for supervised intra-camera scenarios
CN115150663A (zh) * 2022-07-01 2022-10-04 Beijing QIYI Century Science & Technology Co., Ltd. Heat-curve generation method and apparatus, electronic device, and storage medium

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399381B (zh) 2018-02-12 2020-10-30 Beijing SenseTime Technology Development Co., Ltd. Pedestrian re-identification method and apparatus, electronic device, and storage medium
JP7229698B2 (ja) * 2018-08-20 2023-02-28 Canon Inc. Information processing apparatus, information processing method, and program
CN111523569B (zh) * 2018-09-04 2023-08-04 Advanced New Technologies Co., Ltd. User identity determination method and apparatus, and electronic device
CN109543537B (zh) * 2018-10-23 2021-03-23 Beijing SenseTime Technology Development Co., Ltd. Incremental training method and apparatus for a re-identification model, electronic device, and storage medium
CN110083742B (zh) * 2019-04-29 2022-12-06 Tencent Technology (Shenzhen) Co., Ltd. Video query method and apparatus
CN110175527B (zh) * 2019-04-29 2022-03-25 Beijing Baidu Netcom Science and Technology Co., Ltd. Pedestrian re-identification method and apparatus, computer device, and readable medium
US11062455B2 (en) * 2019-10-01 2021-07-13 Volvo Car Corporation Data filtering of image stacks and video streams
CN111339849A (zh) * 2020-02-14 2020-06-26 Beijing University of Technology Pedestrian re-identification method fusing pedestrian attributes
CN111339360B (zh) * 2020-02-24 2024-03-26 Beijing QIYI Century Science & Technology Co., Ltd. Video processing method and apparatus, electronic device, and computer-readable storage medium
CN111539341B (zh) * 2020-04-26 2023-09-22 The Chinese University of Hong Kong, Shenzhen Target localization method and apparatus, electronic device, and medium
CN112001243A (zh) * 2020-07-17 2020-11-27 广州紫为云科技有限公司 Pedestrian re-identification data annotation method, apparatus, and device
CN111897993A (zh) * 2020-07-20 2020-11-06 杭州叙简科技股份有限公司 Efficient target-person trajectory generation method based on pedestrian re-identification
CN112069952B (zh) 2020-08-25 2024-10-15 Beijing Xiaomi Pinecone Electronics Co., Ltd. Video segment extraction method, video segment extraction apparatus, and storage medium
CN112150514A (zh) * 2020-09-29 2020-12-29 Shanghai Eye Control Technology Co., Ltd. Pedestrian trajectory tracking method, apparatus, device, and storage medium for videos
CN112906483B (zh) * 2021-01-25 2024-01-23 China UnionPay Co., Ltd. Target re-identification method and apparatus, and computer-readable storage medium
CN113221641B (zh) * 2021-04-01 2023-07-07 Harbin Institute of Technology (Shenzhen) Video pedestrian re-identification method based on a generative adversarial network and an attention mechanism
CN113011395B (zh) * 2021-04-26 2023-09-01 Shenzhen UBTECH Technology Co., Ltd. Single-stage dynamic pose recognition method, apparatus, and terminal device
CN113255598B (zh) * 2021-06-29 2021-09-28 南京视察者智能科技有限公司 Transformer-based pedestrian re-identification method
CN113780066B (zh) * 2021-07-29 2023-07-25 Suzhou Inspur Intelligent Technology Co., Ltd. Pedestrian re-identification method and apparatus, electronic device, and readable storage medium
CN117522454B (zh) * 2024-01-05 2024-04-16 北京文安智能技术股份有限公司 Staff identification method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150131858A1 (en) * 2013-11-13 2015-05-14 Fujitsu Limited Tracking device and tracking method
CN105518744A (zh) * 2015-06-29 2016-04-20 Beijing Megvii Technology Co., Ltd. Pedestrian re-identification method and device
CN106022220A (zh) * 2016-05-09 2016-10-12 西安北升信息科技有限公司 Method for multi-face tracking of competing athletes in sports videos
CN107346409A (zh) * 2016-05-05 2017-11-14 Huawei Technologies Co., Ltd. Pedestrian re-identification method and apparatus
CN108399381A (zh) * 2018-02-12 2018-08-14 Beijing SenseTime Technology Development Co., Ltd. Pedestrian re-identification method and apparatus, electronic device, and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6567116B1 (en) * 1998-11-20 2003-05-20 James A. Aman Multiple object tracking system
JP5837484B2 (ja) * 2010-05-26 2015-12-24 Panasonic Intellectual Property Corporation of America Image information processing apparatus
KR20140090795A (ko) * 2013-01-10 2014-07-18 Electronics and Telecommunications Research Institute Method and apparatus for object tracking in a multi-camera environment
CN103810476B (zh) * 2014-02-20 2017-02-01 China Jiliang University Pedestrian re-identification method in video surveillance networks based on small-group information association
CN105095475B (zh) * 2015-08-12 2018-06-19 Wuhan University Pedestrian re-identification method and system with incomplete attribute labeling based on two-stage fusion
CN105354548B (zh) * 2015-10-30 2018-10-26 Wuhan University Surveillance-video pedestrian re-identification method based on ImageNet retrieval
JP2017167970A (ja) * 2016-03-17 2017-09-21 Ricoh Co., Ltd. Image processing apparatus, object recognition apparatus, device control system, image processing method, and program
JP6656987B2 (ja) * 2016-03-30 2020-03-04 Equos Research Co., Ltd. Image recognition apparatus, mobile body apparatus, and image recognition program
EP3549063A4 (en) * 2016-12-05 2020-06-24 Avigilon Corporation APPEARANCE SEARCH SYSTEM AND METHOD

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150131858A1 (en) * 2013-11-13 2015-05-14 Fujitsu Limited Tracking device and tracking method
CN105518744A (zh) * 2015-06-29 2016-04-20 Beijing Megvii Technology Co., Ltd. Pedestrian re-identification method and device
CN107346409A (zh) * 2016-05-05 2017-11-14 Huawei Technologies Co., Ltd. Pedestrian re-identification method and apparatus
CN106022220A (zh) * 2016-05-09 2016-10-12 西安北升信息科技有限公司 Method for multi-face tracking of competing athletes in sports videos
CN108399381A (zh) * 2018-02-12 2018-08-14 Beijing SenseTime Technology Development Co., Ltd. Pedestrian re-identification method and apparatus, electronic device, and storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827312A (zh) * 2019-11-12 2020-02-21 北京深境智能科技有限公司 Learning method based on a collaborative visual attention neural network
CN110827312B (zh) * 2019-11-12 2023-04-28 北京深境智能科技有限公司 Learning method based on a collaborative visual attention neural network
CN111538861A (zh) * 2020-04-22 2020-08-14 Zhejiang Dahua Technology Co., Ltd. Method, apparatus, device, and medium for image retrieval based on surveillance video
CN111538861B (zh) * 2020-04-22 2023-08-15 Zhejiang Dahua Technology Co., Ltd. Method, apparatus, device, and medium for image retrieval based on surveillance video
CN111723645A (zh) * 2020-04-24 2020-09-29 Zhejiang University High-accuracy multi-camera pedestrian re-identification method for supervised intra-camera scenarios
CN111723645B (zh) * 2020-04-24 2023-04-18 Zhejiang University High-accuracy multi-camera pedestrian re-identification method for supervised intra-camera scenarios
CN115150663A (zh) * 2022-07-01 2022-10-04 Beijing QIYI Century Science & Technology Co., Ltd. Heat-curve generation method and apparatus, electronic device, and storage medium
CN115150663B (zh) * 2022-07-01 2023-12-15 Beijing QIYI Century Science & Technology Co., Ltd. Heat-curve generation method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
KR102348002B1 (ko) 2022-01-06
PH12020500050A1 (en) 2020-11-09
KR20200015610A (ko) 2020-02-12
SG11201913733QA (en) 2020-01-30
US11301687B2 (en) 2022-04-12
CN108399381B (zh) 2020-10-30
US20200134321A1 (en) 2020-04-30
JP6905601B2 (ja) 2021-07-21
JP2020525901A (ja) 2020-08-27
CN108399381A (zh) 2018-08-14

Similar Documents

Publication Publication Date Title
WO2019153830A1 (zh) Pedestrian re-identification method and apparatus, electronic device, and storage medium
US10810748B2 (en) Multiple targets—tracking method and apparatus, device and storage medium
WO2018157735A1 (zh) Target tracking method and system, and electronic device
CN108154222B (zh) Deep neural network training method and system, and electronic device
US11586842B2 (en) System and method for machine learning based video quality assessment
CN107273458B (zh) Deep model training method and apparatus, and image retrieval method and apparatus
WO2022062344A1 (zh) Salient object detection method, system, device, and storage medium for compressed videos
CN109118420B (zh) Watermark recognition model establishment and recognition method, apparatus, medium, and electronic device
CN115294332B (zh) Image processing method, apparatus, device, and storage medium
US11282179B2 (en) System and method for machine learning based video quality assessment
CN108108769B (zh) Data classification method, apparatus, and storage medium
CN114429577B (zh) Flag detection method, system, and device based on a high-confidence annotation strategy
WO2020151300A1 (zh) Gender recognition method, apparatus, medium, and device based on a deep residual network
Zhang et al. A review of small target detection based on deep learning
US20230252683A1 (en) Image processing device, image processing method, and computer-readable recording medium storing image processing program
CN115035463B (zh) Behavior recognition method, apparatus, device, and storage medium
CN111523399A (zh) Sensitive video detection and apparatus
CN114724144B (zh) Text recognition method, model training method, apparatus, device, and medium
Meng et al. Structure preservation adversarial network for visual domain adaptation
Schulz et al. Identity documents image quality assessment
CN113642443A (zh) Model testing method and apparatus, electronic device, and storage medium
Guesdon et al. Multitask Metamodel for Keypoint Visibility Prediction in Human Pose Estimation
CN113780268B (zh) Trademark recognition method and apparatus, and electronic device
CN113627341B (zh) Video sample comparison method, system, device, and storage medium
CN115331062B (zh) Image recognition method and apparatus, electronic device, and computer-readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18905247

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019570048

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 20197038764

Country of ref document: KR

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 26/11/2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18905247

Country of ref document: EP

Kind code of ref document: A1