WO2023284182A1 - Training method for recognizing moving target, method and device for recognizing moving target - Google Patents

Info

Publication number
WO2023284182A1
Authority
WO
WIPO (PCT)
Prior art keywords
features
target
layer
class
consecutive images
Prior art date
Application number
PCT/CN2021/128515
Other languages
French (fr)
Inventor
Jiang Zhang
Jun Yin
Mingwei Zhou
Xingming Zhang
Original Assignee
Zhejiang Dahua Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co., Ltd.
Publication of WO2023284182A1 publication Critical patent/WO2023284182A1/en

Classifications

    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 18/24 Classification techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/09 Supervised learning
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V 10/40 Extraction of image or video features
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition

Definitions

  • the computer-readable storage medium 30 stores computer programs 300, which can be read by a computer.
  • the computer programs 300 can be executed by a processor to implement the method mentioned in any of the above embodiments.
  • the computer programs 300 may be stored in a form of a software product on the computer readable storage medium 30 as described above, and may include a number of instructions to enable a computer device (which may be a personal computer, a server, or a network device, and the like) or a processor to perform all or some of the operations of the method described in the various embodiments of the present disclosure.
  • the computer-readable storage medium 30 that has the storage function may be a universal serial bus (USB) disk, a portable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, or other media that can store program codes, or a terminal device such as a computer, a server, a mobile phone, a tablet, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The present application provides a training method for recognizing a moving target, and a method and an apparatus for recognizing a moving target. The training method includes: obtaining a plurality of images taken at different points in time; obtaining a first class of static features and a second class of static features of the target in each of the images; fusing the first class of static features and the second class of static features in each of the images to obtain fused features; and performing classification training on the fused features of at least some of the images until the entire network is converged. In this way, the richness of the target features can be effectively improved, and a moving target recognition model that has a better feature expression ability and higher robustness may be obtained.

Description

TRAINING METHOD FOR RECOGNIZING MOVING TARGET, METHOD AND DEVICE FOR RECOGNIZING MOVING TARGET
The present application claims priority to Chinese Patent Application No. 202110802833.X, filed on July 15, 2021 with the China National Intellectual Property Administration, the entire contents of which are hereby incorporated by reference.
TECHNICAL FIELD
The present disclosure relates to the field of computer vision and machine learning, and in particular to a training method for recognizing a moving target, a method and an apparatus for recognizing a moving target.
BACKGROUND
Recognizing a moving target refers to recognizing a pedestrian target in an image, wherein the image is captured while the pedestrian is walking. In the art, relatively mature methods for recognizing the pedestrian fall into two types: a person re-identification method and a gait recognition method. The former extracts static external features from the image, such as the dressing, hair style, backpack, or umbrella of the pedestrian. The latter learns dynamic features from the continuous movements of the pedestrian, such as a walking posture, an amplitude of arm swinging, head shaking and shoulder shrugging, and sensitivity of a motor nerve.
During long-term research, the applicant has discovered that methods in the art for recognizing the moving target rely on one single feature, such as a static RGB image or a contour image. The robustness of such a feature is not sufficient, so the accuracy of the recognition result may be low. In addition, some technical solutions in the art recognize the moving target based on feature fusion. For example, global features of an RGB image may be fused with local features of the same RGB image. In this case, the feature modality remains relatively unitary: the performance of an apparatus may be sacrificed, whereas the accuracy of matching may not be improved.
SUMMARY OF THE DISCLOSURE
The present disclosure provides a training method for recognizing a moving target, a method and an apparatus for recognizing a moving target. In this way, robustness and accuracy of recognizing the moving target may be improved.
According to a first aspect, a training method for recognizing a moving target includes: obtaining a plurality of consecutive images; inputting the plurality of consecutive images successively into an input end of an inner layer of a two-layer ViT feature fusion model to obtain a first class of static features and a second class of static features of the target in each of the plurality of consecutive images; fusing the first class of static features and the second class of static features in each of the plurality of consecutive images at an output end of the inner layer of the two-layer ViT feature fusion model to obtain fused features; and inputting the fused features of at least some of the plurality of consecutive images successively into an input end of an outer layer of the two-layer ViT feature fusion model for classification training until the entire network is converged.
In some embodiments, obtaining the first class of static features and the second class of static features of the target in each of the plurality of consecutive images includes: obtaining fine-grained static features and fine-grained contour features of the target in each of the plurality of consecutive images.
In some embodiments, the obtaining fine-grained static features and fine-grained contour features of the target in each of the plurality of consecutive images includes: segmenting the target into a plurality of portions, and inputting the plurality of portions successively into a first input end of the inner layer of the two-layer ViT feature fusion model to obtain the fine-grained static features; and segmenting a contour of the target into a plurality of contour portions, and inputting the plurality of contour portions successively into a second input end of the inner layer of the two-layer ViT feature fusion model to obtain the fine-grained contour features.
In some embodiments, fusing the first class of static features and the second class of static features in each of the plurality of consecutive images to obtain fused features, includes: fusing the fine-grained static features and the fine-grained contour features by weighted average at an output end of the inner layer of the two-layer ViT feature fusion model to obtain the fused features.
In some embodiments, the inputting the fused features of at least some of the plurality of consecutive images successively into an input end of an outer layer of the two-layer ViT feature fusion model for classification training includes: inputting the fused features of the at least some of the plurality of consecutive images successively into the input end of the outer layer of the two-layer ViT feature fusion model, and performing classification training based on a normalized exponential loss, wherein a dimension of an embedding layer is set to a positive integer multiple of 128, until the entire network is converged.
According to a second aspect, a method for recognizing a moving target includes: obtaining a plurality of consecutive images of a target to be recognized; inputting the plurality of consecutive images successively into an input end of an inner layer of a two-layer ViT feature  fusion model to obtain a first class of static features and a second class of static features of the target to be recognized in each of the plurality of consecutive images; fusing the first class of static features and the second class of static features in each of the plurality of consecutive images at an output end of the inner layer of the two-layer ViT feature fusion model to obtain fused features; inputting the fused features of at least some of the plurality of consecutive images successively into an input end of an outer layer of the two-layer ViT feature fusion model for fusing to obtain dynamic features; and obtaining a recognition result based on the dynamic features.
In some embodiments, the obtaining a recognition result based on the dynamic features, includes: calculating cosine similarity between the dynamic features and each of all features stored in a base library of the moving target one by one; placing the cosine similarity in an order and obtaining a maximum cosine similarity; determining whether the maximum cosine similarity is greater than a predetermined recognition threshold; and obtaining a stored feature corresponding to the maximum cosine similarity, and taking identity information corresponding to the stored feature as a recognition result of the target to be recognized, in response to the maximum cosine similarity being greater than the predetermined recognition threshold.
In some embodiments, before the obtaining a plurality of consecutive images of a target to be recognized, the method further includes: establishing the base library of the moving target, wherein the base library of the moving target is configured to store all identity information of the target to be stored and the stored features.
According to a third aspect, an apparatus for recognizing a moving target includes a memory and a processor coupled to the memory. The memory stores program instructions, and the program instructions are configured to be executed by the processor to implement the method for recognizing the moving target according to any one of the above embodiments.
According to the present disclosure, a training method for recognizing a moving target, and a method and an apparatus for recognizing a moving target are provided. The training method for recognizing the moving target includes: obtaining a plurality of images taken at various time points; obtaining a first class of static features and a second class of static features of the target in each of the plurality of images; fusing the first class of static features and the second class of static features in each of the plurality of images to obtain a fused feature; and performing classification training on the fused feature of at least some of the plurality of images until the entire network is converged. In this way, the two classes of static features in one image are extracted, spliced and fused, and a plurality of consecutive fused features are input to a classification trainer. Since both static and dynamic features of the moving target are considered at the same time, the richness of the target features may be effectively improved, and the problem in the art of the feature modality being unitary may be solved. The final trained moving target recognition model therefore has a stronger feature expression ability and better robustness, and the accuracy of recognition results may be improved when applying the present model to recognize the moving target.
BRIEF DESCRIPTION OF THE DRAWINGS
In order to illustrate the technical solutions in the embodiments of the present disclosure more clearly, the accompanying drawings used in the description of the embodiments are briefly described below. Obviously, the drawings described below only illustrate some embodiments; a person of ordinary skill in the art may obtain other drawings based on these drawings without making any creative work.
FIG. 1 is a flow chart of a training method for recognizing a moving target according to an embodiment of the present disclosure.
FIG. 2 is a flow chart of an operation S102 shown in FIG. 1 according to an embodiment of the present disclosure.
FIG. 3 is a network structural schematic view of a training method for recognizing a moving target according to an embodiment of the present disclosure.
FIG. 4 is a flow chart of a method for recognizing a moving target according to an embodiment of the present disclosure.
FIG. 5 is a flow chart of an operation S305 shown in FIG. 4 according to an embodiment of the present disclosure.
FIG. 6 is a flow chart of operations performed before the operation S401 shown in FIG. 5 according to an embodiment of the present disclosure.
FIG. 7 is a diagram of an apparatus for recognizing a moving target according to an embodiment of the present disclosure.
FIG. 8 is a structural schematic view of an apparatus for recognizing a moving target according to an embodiment of the present disclosure.
FIG. 9 is a diagram of a computer-readable storage medium according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
Technical solutions of the embodiments of the present disclosure will be clearly and completely described below by referring to the accompanying drawings. Obviously, the described embodiments are only some, but not all, of the embodiments of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by a person of ordinary skill in the art without making creative work shall fall within the scope of the present disclosure.
As shown in FIG. 1, FIG. 1 is a flow chart of a training method for recognizing a moving target according to an embodiment of the present disclosure. In detail, the method includes following operations.
In an operation S101, a plurality of consecutive images are obtained.
In an embodiment, several pieces of video data, which are required for training the network and show a moving target moving in a natural state, are obtained first. A pedestrian detection and tracking tool may be used to parse the video data into a target RGB image sequence that includes a plurality of consecutive frames of images. The plurality of RGB images, which are cropped based on a human detection frame, are normalized to obtain a standard target RGB image sequence. The standard target RGB image sequence is copied, and the foreground and the background of the target are annotated to obtain a target contour image. In the present embodiment, when normalizing the plurality of RGB images, the images may be scaled proportionally to a size of 96×64. When extracting the target contour image, a pedestrian area is labeled as 255, and a background area is labeled as 0. Finally, the RGB images and the contour images of the same person are labeled with identity information. By performing the above operations, a standard set of RGB images and a standard set of contour images are obtained from a same set of template RGB images, and the consecutive RGB images and the consecutive contour images cooperatively constitute the plurality of consecutive images.
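As a rough illustration of this pre-processing step, the sketch below scales a detected person crop to the standard 96×64 size and builds the binary contour image. The helper names, the use of OpenCV/numpy, and the external detector and segmenter that supply the crop and the person mask are assumptions for illustration only, not part of the disclosed method.

```python
# Illustrative pre-processing sketch (assumed helpers; OpenCV/numpy are not
# mandated by the disclosure).  A detector/tracker is assumed to supply the
# person crop, and a segmentation step the person mask.
import cv2
import numpy as np

STD_H, STD_W = 96, 64  # standard size used in this embodiment

def normalize_crop(crop_bgr: np.ndarray) -> np.ndarray:
    """Scale a detected pedestrian crop to the standard 96x64 image."""
    return cv2.resize(crop_bgr, (STD_W, STD_H), interpolation=cv2.INTER_LINEAR)

def make_contour_image(person_mask: np.ndarray) -> np.ndarray:
    """Build the contour (silhouette) image: pedestrian area 255, background 0."""
    mask = cv2.resize(person_mask, (STD_W, STD_H), interpolation=cv2.INTER_NEAREST)
    return np.where(mask > 0, 255, 0).astype(np.uint8)
```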
In an operation S102, the first class of static features and the second class of static features of the target in each image are obtained.
Alternatively, the first class of static features of the target is obtained based on detailed features in the RGB images obtained in the operation S101, such as a dressing feature, a hairstyle feature, a backpack feature, and the like. The second class of static features of the target is obtained based on the contour image obtained in the operation S101. In the present embodiment, the first class of static features in the operation S102 refers to fine-grained static features of the target in each image, and the second class of static features refers to the fine-grained contour features. In other embodiments, coarse-grained static features and coarse-grained contour features of the target in each image may be extracted, serving as the first class of static features and the second class of static features, respectively. Recognition of the moving target may also be achieved in this way.
Alternatively, as shown in FIG. 2, FIG. 2 is a flow chart of an operation S102 shown in  FIG. 1 according to an embodiment of the present disclosure. The operation S102 may include following operations.
In the operation S201, the moving target is segmented into a plurality of portions, and the plurality of portions are successively input into a first input end of an inner layer of a two-layer Vision Transformer (ViT) feature fusion model to obtain the fine-grained static features.
Alternatively, the ViT-based two-layer feature fusion model may process image sequence data in which the target is continuously shown. Compared to a traditional convolutional neural network (CNN) algorithm, with comparable computational accuracy, the ViT algorithm requires a smaller amount of computation for training and inference and is more lightweight. In other embodiments, the static features corresponding to the target may also be obtained by applying a feature fusion model based on the convolutional neural network algorithm to perform inference and computation on the images.
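The overall structure of the two-layer model can be pictured with the following minimal PyTorch sketch. It is only a schematic under stated assumptions (a plain linear token embedding, mean pooling, six part tokens per frame, a shared inner encoder for both modalities, and arbitrary layer counts); the actual internals of the patented model are not specified here.

```python
import torch
import torch.nn as nn

class TwoLayerViTFusion(nn.Module):
    """Schematic two-layer ViT feature fusion model (illustrative only)."""

    def __init__(self, rgb_token_dim: int, contour_token_dim: int,
                 embed_dim: int = 1024, num_classes: int = 1000):
        super().__init__()
        # Two input ends of the inner layer: RGB part tokens and contour part tokens.
        self.rgb_embed = nn.Linear(rgb_token_dim, embed_dim)
        self.contour_embed = nn.Linear(contour_token_dim, embed_dim)
        inner = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        outer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.inner_encoder = nn.TransformerEncoder(inner, num_layers=4)  # per frame
        self.outer_encoder = nn.TransformerEncoder(outer, num_layers=4)  # across frames
        self.classifier = nn.Linear(embed_dim, num_classes)              # training head

    def fuse_frame(self, rgb_parts, contour_parts):
        # Inner-layer output end: fuse the two classes of static features
        # of one frame by weighted average (0.5 / 0.5).
        rgb_feat = self.inner_encoder(self.rgb_embed(rgb_parts)).mean(dim=1)
        contour_feat = self.inner_encoder(self.contour_embed(contour_parts)).mean(dim=1)
        return 0.5 * rgb_feat + 0.5 * contour_feat

    def forward(self, rgb_seq, contour_seq):
        # rgb_seq: (batch, frames, 6, rgb_token_dim); contour_seq likewise.
        frames = rgb_seq.shape[1]
        fused = torch.stack([self.fuse_frame(rgb_seq[:, t], contour_seq[:, t])
                             for t in range(frames)], dim=1)
        dynamic = self.outer_encoder(fused).mean(dim=1)  # outer layer: dynamic feature
        return dynamic, self.classifier(dynamic)
```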
In the present embodiment, as shown in FIG. 3, FIG. 3 is a network structural schematic view of a training method for recognizing a moving target according to an embodiment of the present disclosure. The target may be segmented first. The RGB image may be segmented into 6 equally sized portions in an order from a head of the target, through a middle half of the target, to a lower half of the target. Subsequently, the 6 portions are successively input into the first input end of the inner layer of the two-layer ViT feature fusion model, i.e., into an RGB image input end, such that the fine-grained static features of the target are obtained.
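A minimal sketch of this part-based segmentation is given below, assuming the standard 96×64 images from the operation S101 and a simple flatten-to-token scheme; the exact tokenization of the portions is not spelled out in the embodiment.

```python
import numpy as np

def split_into_part_tokens(image: np.ndarray, num_parts: int = 6) -> np.ndarray:
    """Split a standard 96x64 image (RGB or contour) into `num_parts` equally
    sized horizontal strips, head to feet, and flatten each strip into a token."""
    height = image.shape[0]
    assert height % num_parts == 0, "image height must be divisible by num_parts"
    strips = np.split(image, num_parts, axis=0)                 # 6 strips of 16x64(xC)
    return np.stack([strip.reshape(-1) for strip in strips])    # shape (6, 16*64*C)
```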
In an operation S202, a contour of the target is segmented into a plurality of portions by the means mentioned above, and the plurality of portions are input into a second input end of the inner layer of the two-layer ViT feature fusion model to obtain the fine-grained contour features.
Alternatively, as shown in FIG. 3, by applying the segmentation method used for the RGB image in the operation S201, the contour of the target is segmented into 6 equally sized portions. Subsequently, the 6 portions are successively input into the second input end of the inner layer of the two-layer ViT feature fusion model, i.e., a contour image input end, to obtain the fine-grained contour features of the target.
In an operation S103, the first class of static features and the second class of static features in each image are fused to obtain the fused feature.
Alternatively, in the operation S103, the first class of static features, which are obtained based on one RGB image, and the second class of static features, which are obtained based on one contour image, are spliced and fused. Since both the static features and the contour features of the moving target are considered, the richness of the target features may be effectively improved.
In the present embodiment, the fine-grained static features and the fine-grained contour features are fused by weighted average at an output end of the inner layer of the two-layer ViT feature fusion model to obtain the fused feature. For example, when a weight factor of the fine-grained static features is set to 0.5, a weight factor of the fine-grained contour features is 0.5. In this case, the fused feature is the sum of the fine-grained static features multiplied by 0.5 and the fine-grained contour features multiplied by 0.5.
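As a small worked example of this weighted average, with toy feature vectors standing in for the inner-layer outputs:

```python
import numpy as np

# Toy 4-dimensional feature vectors standing in for the inner-layer outputs.
fine_grained_static = np.array([0.2, 0.8, 0.1, 0.5])
fine_grained_contour = np.array([0.6, 0.4, 0.3, 0.5])

# Weighted-average fusion at the inner-layer output end (weights 0.5 / 0.5).
fused_feature = 0.5 * fine_grained_static + 0.5 * fine_grained_contour
print(fused_feature)  # [0.4 0.6 0.2 0.5]
```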
In an operation S104, classification training is performed on the fused features of at least some of the images until the entire network is converged.
In the operation S104, the at least some of the images refer to some consecutive frames of images selected from all of the plurality of images obtained in the operation S101. The fused features corresponding to these consecutive frames of images may express the dynamic features of the target while the target is walking, such that the expression ability of the model may be improved. Preferably, five consecutive frames of RGB images and contour images are selected for classification training. In this way, the accuracy of the recognition result may be ensured, and the amount of computation may be reduced as much as possible.
In the present embodiment, as shown in FIG. 3, the fused features of the five frames of images are successively input to the input end of the outer layer of the two-layer ViT feature fusion model for classification training until the entire network is converged. In a specific implementation scenario, classification training based on a normalized exponential loss may be applied, wherein a dimension of an embedding layer is set to a positive integer multiple of 128, such as 128, 512, 1024, and the like, until the entire network is converged to obtain a recognition result of the moving target that meets a predefined condition.
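A hedged sketch of what this classification-training stage could look like is given below, reusing the illustrative TwoLayerViTFusion module sketched earlier. The dummy random batch, the optimizer choice, the learning rate and the step count are placeholders, not values disclosed by the embodiment; the cross-entropy criterion is used as the normalized exponential (softmax) loss.

```python
import torch
import torch.nn as nn

num_identities, embed_dim = 1000, 1024
model = TwoLayerViTFusion(rgb_token_dim=16 * 64 * 3, contour_token_dim=16 * 64,
                          embed_dim=embed_dim, num_classes=num_identities)
criterion = nn.CrossEntropyLoss()                      # normalized exponential (softmax) loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One dummy batch: 8 sequences of 5 frames, each frame split into 6 part tokens.
rgb_seq = torch.randn(8, 5, 6, 16 * 64 * 3)
contour_seq = torch.randn(8, 5, 6, 16 * 64)
labels = torch.randint(0, num_identities, (8,))

for step in range(10):                                 # in practice, iterate until converged
    optimizer.zero_grad()
    _, logits = model(rgb_seq, contour_seq)
    loss = criterion(logits, labels)                   # classification over identities
    loss.backward()
    optimizer.step()
```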
According to the training method for recognizing the moving target described in the embodiments of the present disclosure, the fine-grained static features and the fine-grained contour features are extracted from one RGB image and one contour image. The two classes of static features are fully utilized, and attention is paid to the dynamic features of the pedestrian contained in a sequence of consecutive frames of images in a video, such that the problem in the art of the feature modality being unitary may be solved. The two-layer ViT feature fusion model may be applied to fuse the three types of features. In this way, the final trained model has a stronger feature expression ability, higher robustness and a better differentiation ability, and applying the model to recognize the moving target may improve the accuracy of the recognition result.
As shown in FIG. 4, FIG. 4 is a flow chart of a method for recognizing a moving target  according to an embodiment of the present disclosure. The method for recognizing the moving target according to the embodiment of the present disclosure includes following operations.
In an operation S301, a plurality of consecutive images of the target to be recognized are obtained.
Alternatively, a video showing that the target to be recognized is moving is obtained and pre-processed first. Subsequently, a target RGB image sequence is obtained by a pedestrian detection and tracking tool. The RGB images are then normalized to obtain a standard target RGB image sequence. The standard target RGB image sequence is copied, and the foreground and the background of the target are annotated to obtain the target contour image.
In an operation S302, the first class of static features and the second class of static features of the target to be recognized in each image are obtained.
Alternatively, in the present embodiment, the RGB images and the contour images obtained in the operation S301 are segmented in the same manner and are successively input into the first input end and the second input end, respectively, of the inner layer of the two-layer ViT feature fusion model to obtain the fine-grained static features and the fine-grained contour features.
In an operation S303, the first class of static features and the second class of static features in each image are fused to obtain the fused feature.
In the present embodiment, the operation S303 is similar to the operation S103 in FIG. 1 and, for the sake of conciseness, will not be described repeatedly.
In an operation S304, the fused features of at least some of the images are fused to obtain the dynamic features.
Alternatively, the fused features corresponding to the plurality of consecutive frames of images are input to the input end of the outer layer of the two-layer ViT feature fusion model and are fused to obtain the dynamic features corresponding to the target to be recognized. The dimension of the embedding layer is set to 1024, and the output dynamic features are represented by a 1024-dimension feature vector.
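With the illustrative module and tensors from the training sketch above, this inference step would amount to taking the first output of the forward pass as the 1024-dimension dynamic feature (again, only a sketch under the same assumptions):

```python
# Reuses `model`, `rgb_seq` and `contour_seq` from the training sketch above.
model.eval()
with torch.no_grad():
    dynamic_feature, _ = model(rgb_seq, contour_seq)  # shape: (batch, 1024)
```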
In an operation S305, the recognition result is obtained based on the dynamic features.
As shown in FIG. 5, FIG. 5 is a flow chart of an operation S305 shown in FIG. 4 according to an embodiment of the present disclosure. The operation S305 may include following operations.
In an operation S401, cosine similarity between the dynamic features and each of all features stored in a base library of the moving target is calculated successively.
Alternatively, in the present embodiment, 100 features are stored in the base library of the moving target. The dynamic features of the target to be recognized are compared to each of the 100 stored features one by one, and the cosine similarity therebetween is calculated. In this way, 100 cosine similarity values are obtained.
In an operation S402, the cosine similarity values are sorted, and a maximum cosine similarity value is obtained.
In the present embodiment, the above 100 cosine similarity values are sorted, such that the maximum cosine similarity value is obtained.
In an operation S403, it may be determined whether the maximum cosine similarity value is greater than a predetermined recognition threshold.
In an operation S404, in response to the maximum cosine similarity value being greater than the predetermined recognition threshold, a stored feature corresponding to the maximum cosine similarity value is obtained, and identity information corresponding to the stored feature is taken as the recognition result of the target to be recognized.
In an operation S405, in response to the maximum cosine similarity value being not greater than the predetermined recognition threshold, the recognition is terminated.
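The matching logic of operations S401 to S405 can be illustrated with a few lines of numpy; the 0.75 threshold used here is a made-up value for the example, not a disclosed setting.

```python
import numpy as np

def recognize(query: np.ndarray, gallery: dict, threshold: float = 0.75):
    """Return the identity of the best cosine match, or None if below the threshold.

    `gallery` maps identity information to a stored 1024-dim dynamic feature;
    the threshold value is only an illustrative choice.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    similarities = {identity: cosine(query, feat) for identity, feat in gallery.items()}
    best_identity = max(similarities, key=similarities.get)  # S402: maximum similarity
    if similarities[best_identity] > threshold:              # S403/S404
        return best_identity
    return None                                              # S405: recognition terminated
```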
In the present embodiment, before performing the operation S401, the method further includes a process of establishing the base library of the moving target. As shown in FIG. 6, FIG. 6 is a flow chart of operations performed before the operation S401 shown in FIG. 5 according to an embodiment of the present disclosure. The process of establishing the base library of the moving target includes following operations.
In an operation S501, all videos showing that the target to be stored is in a walking state are provided.
In an operation S502, each of the videos is pre-processed, and a plurality of consecutive images in each video are obtained successively.
In an operation S503, the plurality of images are input into the trained two-layer ViT feature fusion model to obtain the dynamic features corresponding to each pedestrian target to be stored.
In an operation S504, a mapping relationship between each pedestrian to be stored and corresponding dynamic features is constructed, and the mapping relationship is stored into the base library of the moving target.
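By way of a non-limiting illustration, operations S501 to S504 may be sketched as follows, where preprocess_video and extract_dynamic_features are hypothetical stand-ins for the pre-processing and the trained two-layer ViT feature fusion model.

def build_base_library(videos, preprocess_video, extract_dynamic_features):
    base_library = {}
    for identity, video in videos.items():                         # one video per pedestrian to be stored
        frames, contours = preprocess_video(video)                 # S502: consecutive images
        dynamic_features = extract_dynamic_features(frames, contours)  # S503: dynamic features
        base_library[identity] = dynamic_features                  # S504: identity -> dynamic features
    return base_library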
According to the method for recognizing the moving target of the embodiments of the present disclosure, the fine-grained static features and the fine-grained contour features in one RGB image and one contour image are extracted. The two classes of static features are fully utilized, and the dynamic features of pedestrians contained in a sequence of consecutive frames of the video are taken into account, such that the problem of unitary feature modality in the related art may be solved. The two-layer ViT feature fusion model may be applied to fuse the three types of features, effectively improving the accuracy of the recognition result.
As shown in FIG. 7, FIG. 7 is a diagram of an apparatus for recognizing a moving target according to an embodiment of the present disclosure. The apparatus includes an obtaining module 10, a fusing module 12 and a training module 14. In detail, the obtaining module 10 is configured to obtain a plurality of images taken at various time points and to obtain the first class of static features and the second class of static features of the target in each of the plurality of images. The fusing module 12 is configured to fuse the first class of static features and the second class of static features in each of the plurality of images to obtain the fused feature. The training module 14 is configured to perform classification training on the fused features of at least some of the plurality of images until the entire network is converged. In this way, two classes of static features in one image are extracted, spliced and fused, and a plurality of consecutive fused features are input into the classification trainer. Richness of the features of the target may be improved effectively, while both static and dynamic features of the moving target are considered, such that the problem of unitary feature modality in the related art may be solved. The finally trained model has a greater feature expression ability and higher robustness. By applying the model to recognize the moving target, the accuracy of the recognition result may be improved.
As shown in FIG. 8, FIG. 8 is a structural schematic view of an apparatus for recognizing a moving target according to an embodiment of the present disclosure. The apparatus 20 includes a memory 100 and a processor 102 coupled to the memory 100. Program instructions are stored in the memory 100. The processor 102 is configured to execute the program instructions to implement the method according to any one of embodiments in the above.
In detail, the processor 102 may also be referred to as a Central Processing Unit (CPU). The processor 102 may be an integrated circuit chip capable of processing signals. The processor 102 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general purpose processor may be a microprocessor or any conventional processor. In addition, the processor 102 may be implemented by a plurality of integrated circuit chips together.
As shown in FIG. 9, FIG. 9 is a diagram of a computer-readable storage medium according to an embodiment of the present disclosure. The computer-readable storage medium 30 stores computer programs 300, which can be read by a computer. The computer programs 300 can be executed by a processor to implement the method mentioned in any of the above embodiments. The computer programs 300 may be stored in the form of a software product on the computer-readable storage medium 30 as described above, and may include a number of instructions to enable a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or some of the operations of the method described in the various embodiments of the present disclosure. The computer-readable storage medium 30 that has the storage function may be a Universal Serial Bus (USB) flash disc, a portable hard disc, a Read-Only Memory (ROM), a Random Access Memory (RAM), magnetic discs or optical discs, or various other media that can store program codes, or a terminal device such as a computer, a server, a mobile phone, a tablet, and the like.
The above description shows only embodiments of the present disclosure and does not limit the scope of the present disclosure. Any equivalent structure or equivalent process transformation based on the contents of the specification and accompanying drawings of the present disclosure, applied directly or indirectly in other related art, shall be included in the scope of the present disclosure.

Claims (9)

  1. A training method of recognizing a moving target, comprising:
    obtaining a plurality of consecutive images;
    inputting the plurality of consecutive images successively into an input end of an inner layer of a two-layer ViT feature fusion model to obtain a first class of static features and a second class of static features of the target in each of the plurality of consecutive images;
    fusing the first class of static features and the second class of static features in each of the plurality of consecutive images at an output end of the inner layer of the two-layer ViT feature fusion model to obtain fused features; and
    inputting the fused features of at least some of the plurality of consecutive images successively into an input end of an outer layer of the two-layer ViT feature fusion model for classification training until the entire network is converged.
  2. The training method according to claim 1, wherein obtaining the first class of static features and the second class of static features of the target in each of the plurality of consecutive images comprises:
    obtaining fine-grained static features and fine-grained contour features of the target in each of the plurality of consecutive images.
  3. The training method according to claim 2, wherein the obtaining fine-grained static features and fine-grained contour features of the target in each of the plurality of consecutive images, comprises:
    segmenting the target into a plurality of portions, and inputting the plurality of portions successively into a first input end of the inner layer of the two-layer ViT feature fusion model to obtain the fine-grained static features; and
    segmenting a contour of the target into a plurality of contour portions, and inputting the plurality of contour portions successively into a second input end of the inner layer of the two-layer ViT feature fusion model to obtain the fine-grained contour features.
  4. The training method according to claim 3, wherein fusing the first class of static features and the second class of static features in each of the plurality of consecutive images to obtain fused features, comprises:
    fusing the fine-grained static features and the fine-grained contour features by weighted average at an output end of the inner layer of the two-layer ViT feature fusion model to obtain the fused features.
  5. The training method according to claim 1, wherein the inputting the fused features of at least some of the plurality of consecutive images successively into an input end of an outer layer of the two-layer ViT feature fusion model for classification training, comprises:
    inputting the fused features of the at least some of the plurality of consecutive images successively into the input end of the outer layer of the two-layer ViT feature fusion model, and performing classification training based on a normalized exponential loss, wherein a dimension of an embedding layer is set to a positive integer multiple of 128, until the entire network is converged.
  6. A method for recognizing a moving target, comprising:
    obtaining a plurality of consecutive images of a target to be recognized;
    inputting the plurality of consecutive images successively into an input end of an inner layer of a two-layer ViT feature fusion model to obtain a first class of static features and a second class of static features of the target to be recognized in each of the plurality of consecutive images;
    fusing the first class of static features and the second class of static features in each of the plurality of consecutive images at an output end of the inner layer of the two-layer ViT feature fusion model to obtain fused features;
    inputting the fused features of at least some of the plurality of consecutive images successively into an input end of an outer layer of the two-layer ViT feature fusion model for fusing to obtain dynamic features; and
    obtaining a recognition result based on the dynamic features.
  7. The method according to claim 6, wherein the obtaining a recognition result based on the dynamic features, comprises:
    calculating cosine similarity between the dynamic features and each of all features stored in a base library of the moving target one by one;
    sorting the calculated cosine similarities, and obtaining a maximum cosine similarity;
    determining whether the maximum cosine similarity is greater than a predetermined recognition threshold; and
    obtaining a stored feature corresponding to the maximum cosine similarity, and taking identity information corresponding to the stored feature as a recognition result of the target to be recognized, in response to the maximum cosine similarity being greater than the predetermined recognition threshold.
  8. The method according to claim 7, wherein before the obtaining a plurality of consecutive images of a target to be recognized, the method further comprises:
    establishing the base library of the moving target, wherein the base library of the moving target is configured to store all identity information of the target to be stored and the stored features.
  9. An apparatus for recognizing a moving target, comprising a memory and a processor coupled to the memory, wherein the memory stores program instructions, the program instructions are configured to be executed by the processor to implement the method for recognizing the moving target according to any one of claims 6 to 8.
PCT/CN2021/128515 2021-07-15 2021-11-03 Training method for recognizing moving target, method and device for recognizing moving target WO2023284182A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110802833.X 2021-07-15
CN202110802833.XA CN113255630B (en) 2021-07-15 2021-07-15 Moving target recognition training method, moving target recognition method and device

Publications (1)

Publication Number Publication Date
WO2023284182A1 true WO2023284182A1 (en) 2023-01-19

Family

ID=77180490

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/128515 WO2023284182A1 (en) 2021-07-15 2021-11-03 Training method for recognizing moving target, method and device for recognizing moving target

Country Status (2)

Country Link
CN (1) CN113255630B (en)
WO (1) WO2023284182A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255630B (en) * 2021-07-15 2021-10-15 浙江大华技术股份有限公司 Moving target recognition training method, moving target recognition method and device
CN113688745B (en) * 2021-08-27 2024-04-05 大连海事大学 Gait recognition method based on related node automatic mining and statistical information
CN116110131B (en) * 2023-04-11 2023-06-30 深圳未来立体教育科技有限公司 Body interaction behavior recognition method and VR system
CN116844217B (en) * 2023-08-30 2023-11-14 成都睿瞳科技有限责任公司 Image processing system and method for generating face data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190095764A1 (en) * 2017-09-26 2019-03-28 Panton, Inc. Method and system for determining objects depicted in images
CN110555406A (en) * 2019-08-31 2019-12-10 武汉理工大学 Video moving target identification method based on Haar-like characteristics and CNN matching
CN112686193A (en) * 2021-01-06 2021-04-20 东北大学 Action recognition method and device based on compressed video and computer equipment
CN113096131A (en) * 2021-06-09 2021-07-09 紫东信息科技(苏州)有限公司 Gastroscope picture multi-label classification system based on VIT network
CN113255630A (en) * 2021-07-15 2021-08-13 浙江大华技术股份有限公司 Moving target recognition training method, moving target recognition method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766925B (en) * 2018-12-20 2021-05-11 深圳云天励飞技术有限公司 Feature fusion method and device, electronic equipment and storage medium
US10977525B2 (en) * 2019-03-29 2021-04-13 Fuji Xerox Co., Ltd. Indoor localization using real-time context fusion of visual information from static and dynamic cameras
CN110246518A (en) * 2019-06-10 2019-09-17 深圳航天科技创新研究院 Speech-emotion recognition method, device, system and storage medium based on more granularity sound state fusion features
CN111160194B (en) * 2019-12-23 2022-06-24 浙江理工大学 Static gesture image recognition method based on multi-feature fusion
CN111582126B (en) * 2020-04-30 2024-02-27 浙江工商大学 Pedestrian re-recognition method based on multi-scale pedestrian contour segmentation fusion
CN111814857B (en) * 2020-06-29 2021-07-06 浙江大华技术股份有限公司 Target re-identification method, network training method thereof and related device
CN111860291A (en) * 2020-07-16 2020-10-30 上海交通大学 Multi-mode pedestrian identity recognition method and system based on pedestrian appearance and gait information

Also Published As

Publication number Publication date
CN113255630B (en) 2021-10-15
CN113255630A (en) 2021-08-13

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE