CN113408448A - Method and device for extracting local features of three-dimensional space-time object and identifying object - Google Patents


Info

Publication number
CN113408448A
CN113408448A
Authority
CN
China
Prior art keywords
local
information
dimensional space
sampling
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110709917.9A
Other languages
Chinese (zh)
Inventor
朱世强
沈旭
黄镇
田新梅
顾建军
姜峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202110709917.9A priority Critical patent/CN113408448A/en
Publication of CN113408448A publication Critical patent/CN113408448A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a device for extracting local features of a three-dimensional space-time object and recognizing the object. The method comprises the following steps: acquiring a video frame sequence of the three-dimensional space-time object; analyzing the resulting pedestrian contour images and locating specific regions of the human body; sampling local-region space-time features at the located positions; and fusing the local features with the global features to obtain the output features. The invention outperforms the current best algorithms on several gait recognition data sets and resolves the technical problem in the related art that the extracted space-time three-dimensional features are strongly affected by pedestrian appearance, making recognition results inaccurate.

Description

Method and device for extracting local features of three-dimensional space-time object and identifying object
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a method and a device for extracting local features of a three-dimensional space-time object and identifying the object.
Background
Deep learning is now applied widely across many fields, and gait recognition is an emerging biometric technology that uses deep learning to identify people by their walking posture. In the field of intelligent video surveillance, it offers advantages over image-based recognition. The key to gait recognition is distinguishing targets by how they walk: different parts of the human body often have different appearances and different motion patterns, so a model should capture the features of specific local regions within specific time periods.
Existing gait recognition methods split spatial and temporal feature extraction into two steps: a spatial module independently extracts appearance features from each frame image, and a temporal module then extracts the temporal correlation between the per-frame features. The drawback of this approach is that space-time correlation is ignored, even though the motion of an object typically means change in both time and space. Moreover, previous methods use gait local features in a very crude way: the appearance features of the whole human body are divided into several horizontal bars, each bar is treated as one local feature, and after each passes through a temporal-correlation module, all local features are concatenated as the final gait feature. This has two drawbacks: first, different body parts have different sizes, and even the same part varies in size across frames; second, different body parts have different motion patterns, such as the frame in which a motion starts, the frame in which it ends, and the frequency and speed of the motion. How to better obtain three-dimensional space-time local features, so as to support subsequent applications, is an urgent problem for the industry.
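The horizontal-bar partition criticized above can be sketched as follows. This is a minimal NumPy illustration of the prior-art baseline, not the patent's method; the function name and the max-pooling choice are our assumptions:

```python
import numpy as np

def horizontal_strip_features(feature_map, num_strips=4):
    """Split a (C, H, W) appearance feature map into horizontal bars and
    pool each bar into one local descriptor (the prior-art baseline)."""
    c, h, w = feature_map.shape
    assert h % num_strips == 0, "height must divide evenly into strips"
    strip_h = h // num_strips
    descriptors = []
    for i in range(num_strips):
        strip = feature_map[:, i * strip_h:(i + 1) * strip_h, :]
        # Global max pool over the strip's spatial extent.
        descriptors.append(strip.max(axis=(1, 2)))
    return np.stack(descriptors)  # (num_strips, C)

fm = np.random.rand(8, 16, 11).astype(np.float32)
local = horizontal_strip_features(fm, num_strips=4)
print(local.shape)  # (4, 8)
```

Every bar gets the same fixed height regardless of which body part it covers, which is exactly the size mismatch the paragraph above points out.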
Disclosure of Invention
The invention aims to provide a method and a device for extracting local features of a three-dimensional space-time object and recognizing the object, addressing the deficiencies of the prior art. The invention can extract a three-dimensional local region at a specific position and extract features for different human body regions independently, improving the generalization and expressive power of the final features. It resolves the problems in the current field of three-dimensional gait recognition that, during feature extraction, the correlation of space-time information is not considered and local features are under-used, so that recognition results are strongly affected by pedestrian appearance and the final gait recognition is inaccurate; the invention thereby better supports object recognition.
The purpose of the invention is achieved by the following technical scheme: a method for extracting local features of a three-dimensional space-time object comprises the following steps:
locating the local feature regions of the three-dimensional space-time object according to its high-level feature information and prior knowledge;
sampling each located local feature region to obtain sampling information;
extracting features from each piece of sampled three-dimensional space-time information to obtain the local feature information of each local feature region.
Further, the high-level feature information is obtained by extracting features from the three-dimensional space-time object with a neural network model.
Further, the prior knowledge characterizes the different parts of the three-dimensional space-time object and comprises one or more pieces of pre-learned sub-block information used to divide the object into regions.
Further, locating the local feature regions of the three-dimensional space-time object comprises: locating one or more local feature regions in the high-level feature information according to the different parts of the object characterized by the prior knowledge.
Further, the sampling includes up-sampling or down-sampling.
Further, the size of the sampling information graph corresponding to the sampling information is smaller than or equal to the size of the high-level feature graph corresponding to the high-level feature information.
Further, obtaining the local feature information of a local feature region comprises: extracting features from the sampling information map corresponding to the sampling information with a neural network model.
A three-dimensional space-time object recognition method based on the above local feature extraction method comprises the following steps:
obtaining the local feature information of the three-dimensional space-time object: locating the local feature regions of the object according to its high-level feature information and prior knowledge; sampling each located local feature region to obtain sampling information; and extracting features from the sampling information to obtain the local feature information of each local feature region;
combining the obtained local feature information and high-level feature information, together with the local and high-level feature information of video frames at different time points, to obtain the object recognition result.
A computer-readable storage medium storing computer-executable instructions for performing the above-described three-dimensional spatiotemporal object recognition method.
An apparatus for three-dimensional space-time object recognition, comprising a memory and a processor, the memory storing instructions executable by the processor for performing the above three-dimensional space-time object recognition method.
The invention has the following beneficial effects: it analyzes the acquired space-time sequence of human gait video frames, locates and samples the local regions of the human body, and extracts the temporal and spatial features of each local region from the video frame sequence simultaneously, i.e., real-time space-time dynamic features. Because these features fuse static and dynamic information, the method attends to space-time local information simultaneously and recognizes human actions on that basis. This improves the accuracy of three-dimensional object recognition and resolves the problem in the related art that recognition results are strongly affected by pedestrian appearance, which finally makes gait recognition inaccurate.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification; they illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention without limiting it.
FIG. 1 is a flow chart of the three-dimensional space-time local feature extraction method of the present invention;
FIG. 2 is a schematic diagram of the structure of the three-dimensional space-time local feature extraction device according to the present invention;
FIG. 3 is a flow chart of a method of three-dimensional spatiotemporal object identification of the present invention;
FIG. 4 is a schematic diagram of the structure of the three-dimensional spatiotemporal object recognition apparatus of the present invention;
FIG. 5 is a schematic diagram of the process of extracting and fusing local features of three-dimensional spatiotemporal objects according to the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some terms or terms appearing in the description of the embodiments of the present invention are applicable to the following explanations:
Deep Learning: deep learning refers to the family of algorithms that apply machine learning on multi-layer neural networks to solve problems involving images, text, and the like. Broadly, deep learning falls under neural networks, though concrete implementations vary widely. Its core is feature learning: obtaining hierarchical feature information through a layered network, which removes the long-standing need to design features by hand.
Gait Recognition: gait recognition is an emerging biometric technology that identifies people by their walking posture. Compared with other biometric technologies, it is contactless, works at long range, and is hard to disguise. In the field of intelligent video surveillance, it offers advantages over image-based recognition.
Three-dimensional space-time local features are features extracted from partial temporal segments and local spatial regions of a video. For example, while a person walks, the head shakes during certain periods; extracting such local motion features in a targeted way improves the expressive power of the features and helps the model distinguish the identities of different people.
In the embodiment of the invention, the local feature regions of the object are located by analyzing the gait video frame sequence; the located regions are sampled to obtain sampling information; features are extracted from the sampling information to obtain the space-time feature information of each local feature region; and the three-dimensional space-time object is recognized based on the fusion of the extracted local space-time information with the global information.
In one exemplary configuration of the invention, the computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
The steps illustrated in the flow charts of the figures may be performed in a computer system, for example as a set of computer-executable instructions. Also, while a logical order is shown in the flow diagrams, in some cases the steps shown or described may be performed in a different order.
Example 1
As shown in fig. 1, the method for extracting local features of a three-dimensional space-time object of the present invention includes:
step 100: and positioning the local characteristic region of the three-dimensional space-time object according to the high-level characteristic information and the prior knowledge of the three-dimensional space-time object.
In one illustrative example, the high-level feature information is: and obtaining characteristic information after extracting the characteristics of the object by using the neural network model. The high-level feature information describes global features of the object. The high-layer feature information has a greater discriminative power than the low-layer feature information and the medium-layer feature information.
The high-level attribute features contain richer semantic information and have stronger robustness to illumination and view angle changes.
Therefore, the invention utilizes the high-level attribute characteristics to guide the low-level characteristics to realize the positioning of the local characteristic region, and ensures that the reasonable local characteristic region is positioned, thereby providing guarantee for effectively improving the identification performance.
Taking a gait recognition scene as an example, the feature information used may include low-level visual features, mid-level filter features, and high-level attribute features. Low-level visual features and their combinations are the feature information most commonly used in gait recognition; combining multiple low-level visual features carries more information and discriminates better than any single feature, so they are often combined. Mid-level filter features are feature information extracted from highly discriminative image-block combinations in pedestrian images; the filters reflect distinctive visual patterns of the pedestrian that correspond to different body parts and can effectively express the pedestrian's particular body structure. High-level attribute features refer to the positions, sizes, start times, end times, durations, and the like of human body parts, and discriminate more strongly than low-level visual features and mid-level filter features.
In one exemplary embodiment, prior knowledge characterizes the different parts of the subject and includes one or more pieces of pre-learned sub-block information for dividing the subject into regions. For example, when the object is an image of a person, the prior knowledge may include head position information, upper-body position information, lower-body position information, and so on; when the object is an image of a bird, it may include the beak, head, wings, tail, claws, and so on.
In an exemplary instance, locating the local feature regions of the object in this step may comprise: locating one or more local feature regions in the high-level feature information of the object according to the different parts characterized by the prior knowledge. Taking a gait recognition scene as an example, assume the prior knowledge covers three regions: the head, the upper body, and the lower body. After this step, the head, upper-body, and lower-body feature regions of the object have been located, and cropping out these three regions yields three local-region feature maps: a head-region feature map, an upper-body-region feature map, and a lower-body-region feature map.
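The localization step can be sketched as follows. The vertical extents in `PRIOR_REGIONS` are hypothetical stand-ins for the patent's learned sub-block prior, which the text does not specify numerically:

```python
import numpy as np

# Hypothetical prior knowledge: vertical extents, as fractions of the
# feature-map height, for head / upper body / lower body.
PRIOR_REGIONS = {
    "head": (0.0, 0.2),
    "upper_body": (0.2, 0.55),
    "lower_body": (0.55, 1.0),
}

def locate_regions(high_level_fm, prior=PRIOR_REGIONS):
    """Crop one local feature map per prior region from a
    (T, C, H, W) space-time high-level feature map."""
    t, c, h, w = high_level_fm.shape
    crops = {}
    for name, (top, bottom) in prior.items():
        r0, r1 = int(round(top * h)), int(round(bottom * h))
        crops[name] = high_level_fm[:, :, r0:r1, :]
    return crops

fm = np.zeros((30, 64, 20, 10), dtype=np.float32)  # 30-frame sequence
regions = locate_regions(fm)
print({k: v.shape for k, v in regions.items()})
```

Unlike fixed horizontal bars, each region here can have a different height, matching the body part it covers.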
In an exemplary embodiment, the local feature regions may also be defined directly from manually specified information.
The invention requires no additional training data set for auxiliary training and no additional model to assist learning or localization; local feature regions that express the object's characteristics can be located using only simple task-related prior knowledge.
Step 101: sampling each located local feature region of the three-dimensional space-time object to obtain sampling information.
In an exemplary embodiment, each located local feature region is up-sampled to enlarge its local feature map. This step lays the groundwork for obtaining more detailed local feature information later.
Note that, depending on actual requirements, the sampling in this step may instead be down-sampling; the specific sampling method is determined by the actual situation and does not limit the protection scope of the invention.
In an exemplary embodiment, the size of the sampling information map corresponding to the sampling information is smaller than or equal to the size of the high-level feature map corresponding to the high-level feature information.
In an exemplary embodiment, the located local feature region may also be sampled by conventional interpolation: interpolation can enlarge the image, while a conventional down-sampling method is used to shrink it.
In the invention, the local feature region is enlarged or reduced through sampling, so the desired local feature information is obtained more flexibly.
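A minimal sketch of this resampling step, using nearest-neighbour index mapping so the same routine both enlarges and shrinks a cropped region (the patent leaves the concrete interpolation method open, so this choice is an assumption):

```python
import numpy as np

def resample_region(region, target_h, target_w):
    """Nearest-neighbour resampling of a (C, H, W) local region to a
    target spatial size; enlarges (up-samples) or shrinks (down-samples)."""
    c, h, w = region.shape
    # Map each target row/column back to its nearest source index.
    rows = np.arange(target_h) * h // target_h
    cols = np.arange(target_w) * w // target_w
    return region[:, rows][:, :, cols]

head = np.random.rand(64, 4, 10).astype(np.float32)  # a cropped head region
up = resample_region(head, 20, 10)    # enlarge back to the full map height
down = resample_region(head, 2, 5)    # coarser summary
print(up.shape, down.shape)  # (64, 20, 10) (64, 2, 5)
```

Usage note: enlarging every region to a common size lets the per-region feature extractors share one architecture despite the parts' different native sizes.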
Step 102: extracting features from the sampling information to obtain the local feature information of each local feature region.
In one illustrative example, this step may comprise: extracting features from the sampling information map corresponding to the sampling information with the neural network model to obtain the local feature information of the local feature region.
The local feature extraction method can be inserted after any sub-layer of any neural network model, such as a convolution layer or a deconvolution layer. It obtains better local features simply and effectively, thereby better supporting subsequent object recognition.
The invention also provides a computer-readable storage medium storing computer-executable instructions for performing the local feature extraction method of any one of the above.
The invention further provides a computer device comprising a memory and a processor, the memory storing instructions executable by the processor for performing the steps of any of the above local feature extraction methods.
Example 2
As shown in fig. 2, the invention provides a device for extracting local features of a three-dimensional space-time object, comprising at least a positioning module, a sampling module, and an extraction module.
The positioning module locates the local feature regions of the three-dimensional space-time object according to its high-level feature information and prior knowledge.
The sampling module samples each located local feature region to obtain sampling information.
The extraction module extracts features from the sampling information to obtain the local feature information of each local feature region.
In an illustrative example, the sampling module is specifically configured to up-sample each located local feature region to enlarge its local-region feature map.
In an exemplary embodiment, the extraction module is specifically configured to extract features from the sampling information map corresponding to the sampling information with the neural network model to obtain the local feature information of the local feature region.
Example 3
As shown in fig. 3, the invention provides a three-dimensional space-time object recognition method, comprising:
Step 300: obtaining the local feature information of the three-dimensional space-time object, which comprises: locating the local feature regions of the object according to its high-level feature information and prior knowledge; sampling each located local feature region to obtain sampling information; and extracting features from the sampling information to obtain the local feature information of each local feature region.
The implementation of this step can refer to the local feature extraction method shown in fig. 1, and is not described here again.
Step 301: combining the obtained three-dimensional space-time local feature information with the high-level feature information to produce the object recognition result.
In an illustrative example, a 1 × 1 convolutional neural network model may be employed to merge the obtained local feature information and high-level feature information. The concrete implementation follows from the local feature extraction and object recognition methods provided by the invention, is easy for those skilled in the art to realize, and does not limit the protection scope of the invention.
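The 1 × 1 convolutional fusion can be sketched as a per-pixel linear map over concatenated channels; the channel counts and random weights below are illustrative assumptions, not values from the patent:

```python
import numpy as np

def fuse_1x1(global_fm, local_fms, weight):
    """Fuse a global feature map with local feature maps via a 1x1
    convolution: concatenate along channels, then mix channels per pixel."""
    stacked = np.concatenate([global_fm] + local_fms, axis=0)  # (C_in, H, W)
    # A 1x1 convolution is exactly a linear map applied at every pixel.
    return np.einsum('oc,chw->ohw', weight, stacked)

g = np.random.rand(16, 20, 10).astype(np.float32)              # global features
locals_ = [np.random.rand(16, 20, 10).astype(np.float32)       # head / upper /
           for _ in range(3)]                                  # lower regions
w = np.random.rand(32, 64).astype(np.float32)                  # C_out=32, C_in=4*16
fused = fuse_1x1(g, locals_, w)
print(fused.shape)  # (32, 20, 10)
```

Because the 1 × 1 kernel mixes only channels, the fusion keeps the spatial layout of the maps while blending local and global evidence at each position.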
The high-level feature information describes the global features of the object. This step fuses the extracted local features with the global features rather than processing each independently in subsequent operations, preparing richer feature information for recognition and thereby better supporting subsequent object recognition. Meanwhile, the local feature extraction used in object recognition needs no additional training data set for auxiliary training and no additional model to assist learning or localization: local feature regions that express the object's characteristics can be located using only simple task-related prior knowledge. Moreover, the located local feature regions are enlarged or reduced through sampling, so the desired local feature information is obtained more flexibly; better local features are thus obtained simply and effectively.
The object recognition method can be inserted after any sub-layer of any neural network model, such as a convolution layer or a deconvolution layer.
Example 4
As shown in fig. 4, the invention provides an apparatus for three-dimensional space-time object recognition, comprising a local feature extraction unit (the device for extracting local features of a three-dimensional space-time object) and a fusion unit.
The local feature extraction unit obtains the local feature information of the object and comprises: a positioning module, which locates the local feature regions of the object according to its high-level feature information and prior knowledge; a sampling module, which samples each located local feature region to obtain sampling information; and an extraction module, which extracts features from the sampling information to obtain the local feature information of each local feature region.
The fusion unit combines the obtained three-dimensional space-time local feature information with the high-level feature information to produce the object recognition result.
In an illustrative example, the sampling module is specifically configured to up-sample each located local feature region to enlarge its local-region feature map.
In an exemplary embodiment, the extraction module is specifically configured to extract features from the sampling information map corresponding to the sampling information with the neural network model to obtain the local feature information of the local feature region.
In an exemplary embodiment, the fusion unit is specifically configured to combine the obtained local feature information and high-level feature information with a 1 × 1 convolutional neural network model.
Example 5
As shown in fig. 5, taking gait recognition as an example, an embodiment of the present invention provides a three-dimensional object recognition method based on three-dimensional space-time object local feature extraction and fusion, including:
firstly, according to high-level feature information obtained by extracting features of an object, namely a video, by using a neural network model and prior knowledge of different parts representing the object obtained by pre-learning, carrying out region division on a high-level feature map according to regions of the prior knowledge, and positioning: a head feature region, an upper body feature region, and a lower body feature region; in this embodiment, three regions of the head, the upper body, and the lower body are assumed to be a priori known.
And then, respectively carrying out up-sampling on the three positioned local feature areas to obtain a head feature map, an upper body feature map and a lower body feature map which have the same scale as the high-level feature map.
Next, feature extraction is performed on the head, upper-body, and lower-body feature maps respectively to obtain head, upper-body, and lower-body feature information.
Finally, the obtained local feature information and the high-level feature information are combined using a 1 × 1 convolutional neural network model to obtain the object recognition result.
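The final fusion step can be illustrated as below. A 1 × 1 convolution is simply a learned linear map applied independently at every spatial position across channels, so a minimal sketch (with randomly initialized weights standing in for trained ones, and channel counts chosen only for illustration) concatenates the local feature maps with the high-level feature map along the channel axis and contracts with a weight matrix:

```python
import numpy as np

def fuse_1x1(feature_maps, out_channels, rng=np.random.default_rng(0)):
    """Fuse a list of (C_i, H, W) feature maps with a 1x1 convolution:
    concatenate along channels, then apply one linear map per pixel."""
    x = np.concatenate(feature_maps, axis=0)             # (C_total, H, W)
    w = rng.standard_normal((out_channels, x.shape[0]))  # 1x1 conv kernel
    return np.einsum("oc,chw->ohw", w, x)                # (out_channels, H, W)

high = np.random.rand(8, 16, 8)                          # high-level feature map
parts = [np.random.rand(8, 16, 8) for _ in range(3)]     # head / upper / lower maps
fused = fuse_1x1([high] + parts, out_channels=16)
print(fused.shape)   # spatial size unchanged; channels mixed down to 16
```

The einsum makes explicit that a 1 × 1 convolution never mixes spatial positions, only channels, which is why it is a natural choice for combining local and global feature maps of the same scale.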
In this pedestrian re-identification embodiment, sampling and enlarging the local feature information resolves a problem of the related art: when local features occupy only a small portion of the image, masking techniques cannot extract enough useful local information. The invention also obtains the desired local feature information more flexibly. Compared with related-art schemes that rely on human body structure information such as joint points for pedestrian recognition, this embodiment locates the local feature regions that express the object's characteristics using only simple, task-related prior knowledge, without extra training data sets for auxiliary training or additional models for assisted learning or localization, and thereby obtains better local features simply and effectively. In addition, because the fusion module fuses spatio-temporal information at the same time, the correlation of the spatio-temporal information is better captured, improving the accuracy and generalization ability of the model.
Although the embodiments of the present invention have been described above, the above description is only for the convenience of understanding the present invention, and is not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for extracting local features of a three-dimensional space-time object, characterized by comprising the following steps:
locating a local feature region of the three-dimensional space-time object according to high-level feature information and prior knowledge of the three-dimensional space-time object;
sampling each located local feature region respectively to obtain sampling information; and
performing feature extraction on the sampled three-dimensional space-time information respectively to obtain local feature information of the local feature region.
2. The method for extracting local features of a three-dimensional space-time object according to claim 1, wherein the high-level feature information is extracted from the three-dimensional space-time object by a neural network model.
3. The method for extracting local features of a three-dimensional space-time object according to claim 1, wherein the prior knowledge is used to characterize different parts of the three-dimensional space-time object and comprises one or more pieces of pre-learned sub-block information for dividing the three-dimensional space-time object into regions.
4. The method for extracting local features of a three-dimensional space-time object according to claim 3, wherein locating the local feature region of the three-dimensional space-time object comprises: locating one or more local feature regions in the high-level feature information of the three-dimensional space-time object according to the different parts of the three-dimensional space-time object characterized by the prior knowledge.
5. The method for extracting local features of a three-dimensional space-time object according to claim 1, wherein the sampling comprises up-sampling, down-sampling, or the like.
6. The method for extracting local features of a three-dimensional space-time object according to claim 1, wherein the size of the sampling information map corresponding to the sampling information is smaller than or equal to the size of the high-level feature map corresponding to the high-level feature information.
7. The method for extracting local features of a three-dimensional space-time object according to claim 1, wherein obtaining the local feature information of the local feature region comprises: performing feature extraction on the sampling information map corresponding to the sampling information by using a neural network model to obtain the local feature information of the local feature region.
8. A three-dimensional space-time object recognition method based on the method of any one of claims 1 to 7, characterized by comprising the following steps:
obtaining local feature information of a three-dimensional space-time object, comprising: locating a local feature region of the object according to high-level feature information and prior knowledge of the three-dimensional space-time object; sampling each located local feature region respectively to obtain sampling information; and performing feature extraction on the sampling information respectively to obtain the local feature information of the local feature region; and
combining the obtained local feature information and high-level feature information, together with the local feature information and the high-level feature information of video frames at different time points, to obtain an object recognition result.
9. A computer-readable storage medium having stored thereon computer-executable instructions for performing the three-dimensional space-time object recognition method according to claim 8.
10. An apparatus for three-dimensional space-time object recognition, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of the three-dimensional space-time object recognition method according to claim 8.
CN202110709917.9A 2021-06-25 2021-06-25 Method and device for extracting local features of three-dimensional space-time object and identifying object Pending CN113408448A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110709917.9A CN113408448A (en) 2021-06-25 2021-06-25 Method and device for extracting local features of three-dimensional space-time object and identifying object


Publications (1)

Publication Number Publication Date
CN113408448A true CN113408448A (en) 2021-09-17

Family

ID=77679379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110709917.9A Pending CN113408448A (en) 2021-06-25 2021-06-25 Method and device for extracting local features of three-dimensional space-time object and identifying object

Country Status (1)

Country Link
CN (1) CN113408448A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109872364A (en) * 2019-01-28 2019-06-11 腾讯科技(深圳)有限公司 Image-region localization method, device, storage medium and medical image processing equipment
CN111259910A (en) * 2018-11-30 2020-06-09 阿里巴巴集团控股有限公司 Object extraction method and device
WO2020113886A1 (en) * 2018-12-07 2020-06-11 中国科学院自动化研究所 Behavior feature extraction method, system and apparatus based on time-space/frequency domain hybrid learning
CN111325198A (en) * 2018-12-13 2020-06-23 北京地平线机器人技术研发有限公司 Video object feature extraction method and device and video object matching method and device
CN112836609A (en) * 2021-01-25 2021-05-25 山东师范大学 Human behavior identification method and system based on relation guide video space-time characteristics
CN112861605A (en) * 2020-12-26 2021-05-28 江苏大学 Multi-person gait recognition method based on space-time mixed characteristics


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CHAO FAN ET AL.: "GaitPart: Temporal Part-Based Model for Gait Recognition", 《2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》, 5 August 2020 (2020-08-05), pages 14213 - 14221 *
HANQING CHAO ET AL.: "Gaitset- Cross-view gait recognition through utilizing gait as a deep set", 《ARXIV.ORG》, 5 February 2021 (2021-02-05), pages 1 - 12 *
ZHAOFAN QIU ET AL.: "Learning Spatio-Temporal Representation with Local and Global Diffusion", 《2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》, 9 January 2020 (2020-01-09), pages 12056 - 12065 *
ZHANG Bingbing et al.: "Research on Action Recognition Methods Based on Multi-order Information Fusion", Acta Automatica Sinica (《自动化学报》), vol. 47, no. 3, 31 March 2021 (2021-03-31), pages 609-619 *
HU Xiaoqiang et al.: "Video-based Person Re-identification Based on Spatio-temporal Regions of Interest", Computer Engineering (《计算机工程》), vol. 47, no. 6, 15 June 2021 (2021-06-15), pages 277-283 *

Similar Documents

Publication Publication Date Title
WO2021098261A1 (en) Target detection method and apparatus
US20200364802A1 (en) Processing method, processing apparatus, user terminal and server for recognition of vehicle damage
CN109918987B (en) Video subtitle keyword identification method and device
CN108596098B (en) Human body part analysis method, system, device and storage medium
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
CN114238904B (en) Identity recognition method, and training method and device of dual-channel hyper-resolution model
CN111696110B (en) Scene segmentation method and system
CN108960076B (en) Ear recognition and tracking method based on convolutional neural network
CN112784857B (en) Model training and image processing method and device
CN110991278A (en) Human body action recognition method and device in video of computer vision system
CN111353544A (en) Improved Mixed Pooling-Yolov 3-based target detection method
Cho et al. Semantic segmentation with low light images by modified CycleGAN-based image enhancement
Wang et al. A robust visual tracking method via local feature extraction and saliency detection
WO2020135554A1 (en) Image processing method, device and apparatus, and storage medium
Tseng et al. Person retrieval in video surveillance using deep learning–based instance segmentation
CN113313098B (en) Video processing method, device, system and storage medium
CN114037839A (en) Small target identification method, system, electronic equipment and medium
Patil et al. A Review on Conversion of Image to Text as well as Speech using Edge detection and Image Segmentation
CN110119736B (en) License plate position identification method and device and electronic equipment
CN113537187A (en) Text recognition method and device, electronic equipment and readable storage medium
CN113177956A (en) Semantic segmentation method for unmanned aerial vehicle remote sensing image
CN113408448A (en) Method and device for extracting local features of three-dimensional space-time object and identifying object
CN116798041A (en) Image recognition method and device and electronic equipment
CN116188906A (en) Method, device, equipment and medium for identifying closing mark in popup window image
CN111259910A (en) Object extraction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210917