CN109919106B - Progressive target fine recognition and description method - Google Patents

Progressive target fine recognition and description method

Info

Publication number
CN109919106B
Authority
CN
China
Prior art keywords
target
video
granularity
network
identification
Prior art date
Legal status
Active
Application number
CN201910181642.9A
Other languages
Chinese (zh)
Other versions
CN109919106A (en)
Inventor
卫志华
沈雯
张彬彬
崔昊人
李倩文
Current Assignee
Tongji University
Original Assignee
Tongji University
Priority date
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN201910181642.9A priority Critical patent/CN109919106B/en
Publication of CN109919106A publication Critical patent/CN109919106A/en
Application granted granted Critical
Publication of CN109919106B publication Critical patent/CN109919106B/en

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Abstract

The invention discloses a progressive target fine recognition and description method. Against the background of video target recognition, it develops theory and methods for multi-level acquisition of video features and for progressive fine recognition and description of targets. First, the video target is detected and segmented so that its individual components can be identified; then, multi-granularity features of the video target are extracted on the basis of the component identification; finally, the multi-granularity features are fused to achieve fine recognition of the target and to generate fine descriptive text information. By emulating the way humans recognize and describe images, the invention establishes a component-based multi-level depth feature extraction method and provides an effective theory and method for extracting video target features; using natural language processing techniques, it constructs a template-matching-based refined description method for video targets and offers a new approach to multi-level video target identification and description. The invention enriches and extends machine learning theory and methods.

Description

Progressive target fine recognition and description method
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a method for finely identifying and describing video targets.
Background
With the continuing spread of video equipment and the growing maturity of video surveillance technology, video monitoring is being applied ever more widely, and the volume of surveillance video is growing explosively; it has become an important class of data in the big-data era. For example, the millions of surveillance cameras throughout Shanghai generate terabytes of video data every minute, providing valuable video resources for tracking social dynamics in real time and ensuring public safety.
However, the unstructured nature of video data makes it relatively difficult to process and analyze. At present, target identification in video data still relies mainly on manual analysis supplemented by simple intelligent analysis tools, so target identification in massive video data faces bottlenecks such as "the video is there, but the target cannot be found" and "the target can be found, but it takes too long". At the same time, existing intelligent video analysis tools suffer from inaccurate identification and non-uniform feature description methods, problems that severely constrain the further development and application of video target identification technology. Therefore, how to achieve refined representation of video target features is a key problem to be solved in the intelligent analysis of massive video data.
Converting video information into text that characterizes the detected target is an effective way to address these problems. Research on video representation of this kind mostly follows two approaches: (1) video object annotation, which uses machine learning algorithms to automatically attach category labels to objects in the video and represents video targets by those labels; and (2) video object understanding, which draws on computer vision and natural language understanding to form natural-language descriptions of a video object from the local features of objects in the video. Video object annotation yields only a single, sparse description and lacks information about object characteristics and the relationships among objects; video object understanding can carry more information, but real scenes are complex and changeable and hard to define uniformly, so it currently works only in specific scenarios and cannot yet serve practical applications.
These problems have kept the intelligent analysis of video at a relatively low level. Because existing video target recognition methods attach only a single label and struggle to define and describe the spatial relationships among a target's components accurately, a method that can finely recognize targets in complex scenes is needed.
Disclosure of Invention
The aim of the invention is to disclose a progressive target fine recognition and description method. Addressing the problems and difficulties of current video surveillance, the work centers on multi-level depth feature extraction from video targets and on fine target recognition and description. The method mainly comprises three steps:
step one: component identification
Detecting and segmenting the video object to identify individual components of the object;
step two: multi-granularity feature extraction
Further extracting multi-granularity features of the video object based on the component identification;
step three: detailed description
Fusing the multi-granularity features to realize fine recognition of the target and to generate fine descriptive text information.
For step two, the invention discloses a component-based multi-level depth feature extraction algorithm, characterized in that multi-level depth features can be extracted for the same target on the basis of its components. Multi-level here means that component information of the object is added to the category label at several granularity levels, and a deep learning method is used to extract depth features at each of these granularity levels. The algorithm aims to output multi-granularity component features, organized around the category label, to help describe the video target.
For step three, the invention discloses a template-matching-based refined description algorithm for video targets, characterized by a multi-granularity feature representation model of the video components: features of different levels correspond to different granularity layers, and an information merging mechanism between the granularity layers is designed. The algorithm aims to fuse the multi-granularity information of the components and to generate a structured, fine-grained textual description of the video target.
The invention discloses a progressive target fine recognition and description method, which comprises the following specific implementation steps:
step one: component identification
1.1 Extracting key frames from the collected video to generate a key-frame image training set;
1.2 Training on the key-frame image set with a deep learning method, and detecting all targets in the key frames with a region proposal neural network (Faster R-CNN);
1.3 Based on the target detection results, performing component detection on the targets with the region proposal neural network (Faster R-CNN) for real-time target detection, and obtaining a set of target component images.
Step two: multi-granularity feature extraction
2.1 Based on the target's head part, extracting facial visual features with a Convolutional Neural Network (CNN) to identify the age and gender of the target;
2.2 Based on the target's body parts, using coarse-grained component features extracted by the CNN to identify the clothing category of each body part;
2.3 Based on the target's body parts, extracting a fine-grained feature, the maximum color domain of the target part image, to recognize the basic clothing color of each body part.
Step three: detailed description
3.1 Fusing the multi-granularity features of the target components obtained in step two, and generating a sentence that finely describes the video target using natural language processing techniques; a minimal pipeline sketch follows.
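To make the data flow of the three steps concrete, the following minimal Python sketch chains them together. The helper names, attribute values and template wording are hypothetical illustrations, not the patented implementation; the real detectors and classifiers of steps one and two would replace the dummy stubs.

# Minimal pipeline sketch (hypothetical names and dummy outputs, for illustration only):
# step 1 detects body parts, step 2 attaches multi-granularity attributes,
# step 3 fuses them into a template-based description.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Part:
    name: str                                   # "head", "upper_body", "lower_body"
    box: tuple                                  # (x1, y1, x2, y2) in frame coordinates
    attributes: Dict[str, str] = field(default_factory=dict)

def detect_parts(frame) -> List[Part]:
    """Step one stand-in: a Faster R-CNN detector would produce these boxes."""
    return [Part("head", (120, 40, 180, 110)),
            Part("upper_body", (100, 110, 200, 260)),
            Part("lower_body", (105, 260, 195, 420))]

def extract_attributes(frame, parts: List[Part]) -> List[Part]:
    """Step two stand-in: AGI-Net, CD-Net and MCI would supply these values."""
    for part in parts:
        if part.name == "head":
            part.attributes.update(age="20-30", gender="female")
        elif part.name == "upper_body":
            part.attributes.update(category="coat", color="red")
        else:
            part.attributes.update(category="jeans", color="blue")
    return parts

def describe(parts: List[Part]) -> str:
    """Step three: fill a fixed template with the fused attributes."""
    by_name = {p.name: p.attributes for p in parts}
    return ("A {gender} pedestrian, aged {age}, wearing a {uc} {u} and {lc} {l}."
            .format(gender=by_name["head"]["gender"], age=by_name["head"]["age"],
                    uc=by_name["upper_body"]["color"], u=by_name["upper_body"]["category"],
                    lc=by_name["lower_body"]["color"], l=by_name["lower_body"]["category"]))

frame = None                                    # a decoded key frame would go here
print(describe(extract_attributes(frame, detect_parts(frame))))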
Advantageous effects
1. Addressing the problems of existing video target recognition methods, such as single-label annotation and the difficulty of describing each component accurately, the invention provides a progressive target fine recognition and description method that simultaneously learns and annotates multi-granularity component information for multiple objects in the same video.
2. The invention uses a component-based multi-level depth feature extraction algorithm. By segmenting the target's body parts and extracting multi-granularity component features, it discovers the detailed features of each body part more accurately, so that the video target description is no longer limited to an analysis of the whole view, which ensures the accuracy and richness of the refined target description.
3. The invention uses a template-matching-based refined description algorithm for video targets, fuses the multi-granularity features of the video target components, establishes a multi-granularity theory and method for video representation, and offers a new approach to representing video content. Combined with natural language processing, it produces textual information that describes the video target more completely.
4. The invention enriches and extends machine learning theory and methods, and lays a theoretical and practical foundation for advancing the intelligent analysis of video.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the invention. In the drawings:
FIG. 1 is a flow chart of the progressive target fine recognition algorithm;
FIG. 2 is a schematic diagram of the progressive target fine recognition algorithm;
FIG. 3 is a diagram of the deep learning model for target part detection and recognition;
FIG. 4 is a schematic representation of component-based multi-level depth feature extraction and representation;
FIG. 5 is an example of target fine recognition;
FIG. 6 is an overall schematic diagram of the embodiment.
Detailed Description
Embodiments of the present invention are described in detail below with reference to the accompanying drawings and examples, so that the process by which the technical means of the invention are applied to solve the technical problems and achieve the technical effects can be fully understood and reproduced.
The invention discloses a progressive target fine recognition and description method. Taking video target recognition as the background, the work develops theory and methods for multi-level depth feature extraction from video targets and for refined target recognition and description. First, the video target is detected and segmented so that each of its components can be identified; second, after component segmentation and identification, multi-granularity features of the different components are further extracted; finally, the multi-granularity features are fused to achieve fine recognition and description of the target and to generate fine descriptive text information.
The invention discloses a progressive target fine recognition and description method, which comprises the following steps:
step one: component identification
1.1 In this embodiment, the video comes from surveillance footage of a major traffic checkpoint in Shanghai, with a resolution of 1280 x 720. The surveillance scene is complex and contains target pedestrians of different poses and sizes. First, key frames are extracted from the collected surveillance video; each key frame is required to contain pedestrians, and the pedestrians should be rich in pose and size. A key-frame image set is then generated and divided into a training set, a validation set and a test set in the ratio 8:1:1, as sketched below.
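As a concrete illustration of this step, the sketch below samples key frames at a fixed rate with OpenCV and splits them 8:1:1. The sampling rate, seed and file handling are assumptions; the patent does not fix a particular key-frame selection rule.

# Key-frame sampling and 8:1:1 split (assumed one frame per second; adjust as needed).
import random
import cv2  # opencv-python

def extract_key_frames(video_path, frames_per_second=1.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0     # fall back if the container lacks FPS
    step = max(int(fps / frames_per_second), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)                # keep every step-th frame as a key frame
        idx += 1
    cap.release()
    return frames

def split_8_1_1(items, seed=0):
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train, n_val = int(len(items) * 0.8), int(len(items) * 0.1)
    return (items[:n_train],                    # training set
            items[n_train:n_train + n_val],     # validation set
            items[n_train + n_val:])            # test set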
1.2 A region proposal neural network (Faster R-CNN) is constructed and trained on the key-frame image set using a deep learning method. During training, the output category is set to the video target pedestrian. The region proposal neural network (Faster R-CNN) detects pedestrians in the video frames and generates pedestrian detection boxes.
1.3 On the basis of pedestrian detection, the different body parts of the same target, specifically the head, the upper body and the lower body, are detected and identified using the target detection network Faster R-CNN; the network model is shown in FIG. 3.
Firstly, a classical convolutional neural network is used to extract a feature map from a key frame of the video; this feature map is shared by the subsequent candidate-box-generating network layer (Region Proposal Network, RPN) and the fully connected layers. Then, the candidate-box-generating network layer produces, through 3 x 3 convolutions, the foreground candidate boxes, the background candidate boxes and the bounding-box regression offsets respectively; softmax judges whether a candidate box belongs to the foreground or the background, bounding-box regression corrects the size and position of the candidate box, and an accurate candidate box is finally obtained. The loss function of the candidate-box-generating network layer is as follows:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)

where i is the index of a candidate box in the mini-batch, p_i is the probability that candidate box i is predicted to be an object, p_i* equals 1 if the candidate box is a positive example and 0 otherwise, t_i is a vector of the 4 parameterized coordinates of the predicted bounding box, t_i* represents the exact position of the corresponding positive candidate box, L_cls is the classification loss function, L_reg is the regression loss function, λ balances the two terms, and N_cls and N_reg are normalization parameters.
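A minimal PyTorch sketch of this loss follows. The choice of binary cross-entropy for L_cls, smooth L1 for L_reg, and the values of λ, N_cls and N_reg are standard Faster R-CNN conventions assumed here, not prescriptions from the patent.

# Sketch of the RPN loss above: classification term over all sampled anchors,
# regression term counted only for positive anchors (p_i* = 1).
import torch
import torch.nn.functional as F

def rpn_loss(obj_logits, box_deltas, labels, target_deltas, lam=1.0):
    """obj_logits: (N,) objectness scores; labels: (N,) with 1 = positive, 0 = negative;
    box_deltas / target_deltas: (N, 4) parameterized box coordinates t_i / t_i*."""
    n_cls = max(labels.numel(), 1)              # N_cls: size of the sampled mini-batch
    n_reg = max(labels.numel(), 1)              # N_reg: here simply the same count
    cls_loss = F.binary_cross_entropy_with_logits(
        obj_logits, labels.float(), reduction="sum") / n_cls
    pos = labels == 1                           # regression only for positive anchors
    reg_loss = F.smooth_l1_loss(
        box_deltas[pos], target_deltas[pos], reduction="sum") / n_reg
    return cls_loss + lam * reg_loss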
Then, a region of interest (RoI) pooling layer collects the feature map and the candidate boxes, extracts, after integrating this information, the feature map of each candidate box obtained from the RPN network, and sends it to the subsequent fully connected layers to determine the target category. Finally, the category is computed from the candidate-box feature map, while bounding-box regression is used once more to obtain the final, accurate position of the detection box.
Through Faster R-CNN, detection, identification and segmentation of the target components are thus realized.
Step two: multi-granularity feature extraction
2.1 Based on the pedestrian component detection results, the age and gender of the pedestrian are identified with an age-gender identification network (Age-Gender Identification Network, AGI-Net). The detected head part of the pedestrian is taken as the input of AGI-Net, and facial features are extracted as the basis for age and gender identification, as shown in FIG. 4.
The AGI-Net provided by the invention predicts the age and gender of a target from an input face image through an improved FaceNet network structure (VGG-16 structure). In the AGI-Net model, the gender identification part is a binary classification model whose output is male or female. Let g be the true gender label and g* be the gender predicted by the model; the loss function is defined as:

Loss = -[g log g* + (1 - g) log(1 - g*)]    (4)
The age identification part gives a quantitative estimate of age. Suppose the age values are discretized into |Y| age ranges, where each age range Y_i covers the years Y_i^min to Y_i^max; a voting method over the training samples is used to predict the number y_i of samples falling in age range Y_i.
The |Y| age ranges need to satisfy: (a) uniformity, i.e., each age range covers the same number of years; and (b) balance, i.e., each age range covers approximately the same number of training samples. Training the CNN in this way to classify age groups, the softmax-normalized output probability of the |Y| age-range neurons is:

O_i = exp(o_i) / Σ_{j=1}^{|Y|} exp(o_j)

where O = {1, 2, ..., |Y|} indexes the |Y|-dimensional output layer, o_i is the raw output of neuron i, and O_i, i ∈ O, is the output probability of age-range neuron i after softmax normalization.
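The sketch below illustrates the two outputs described above in PyTorch: a binary cross-entropy gender loss of the form in equation (4) and a softmax over the |Y| age-range neurons. The feature dimension, the number of age ranges and the use of a shared backbone feature vector are assumptions for illustration; the patented AGI-Net itself is described as an improved FaceNet/VGG-16 structure.

# Two prediction heads on top of a face feature vector (backbone omitted):
# one gender logit and |Y| age-range logits normalised by softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgeGenderHeads(nn.Module):
    def __init__(self, feat_dim=512, num_age_ranges=8):
        super().__init__()
        self.gender = nn.Linear(feat_dim, 1)            # male / female
        self.age = nn.Linear(feat_dim, num_age_ranges)  # one neuron per age range Y_i

    def forward(self, face_feat):
        return self.gender(face_feat).squeeze(-1), self.age(face_feat)

def agi_losses(gender_logit, age_logits, gender_label, age_range_label):
    # Loss = -[g log g* + (1 - g) log(1 - g*)]
    gender_loss = F.binary_cross_entropy_with_logits(gender_logit, gender_label.float())
    # cross-entropy applies the softmax normalisation over the |Y| age-range neurons
    age_loss = F.cross_entropy(age_logits, age_range_label)
    return gender_loss, age_loss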
2.2 Based on the pedestrian component detection results, a clothing category network (Clothes-Detection Network, CD-Net) is used to identify the clothing categories of the pedestrian. The detected head, upper body and lower body of the pedestrian are taken as the input of CD-Net, and the clothing features of these parts are learned to identify the clothing category of each body part, as shown in FIG. 4.
The CD-Net provided by the invention extracts depth features from a segmented target body part through convolution and pooling layers, and finally maps the feature vector through a fully connected layer to a vector whose dimension equals the number of classes, thereby obtaining the clothing category.
For this multi-class problem, in which class confusion easily occurs, a label combining algorithm based on LDA (Latent Dirichlet Allocation) is used. The clothing categories are divided into several topics; for each picture d, θ_d = {pt_1, pt_2, ..., pt_k} denotes the topic distribution of picture d, where pt_k represents the probability that picture d belongs to topic k. pt_k is computed as:

pt_k = nt_k / n_d    (4)

where nt_k represents the number of pictures with topic k and n_d represents the number of pictures d; the clothing categories are thus divided into several large labels. The training set is decomposed into label subsets according to the topics obtained by the LDA algorithm, and a convolutional neural network is trained on each subset to obtain the label sub-networks. Then, following the idea of cascade classification, the first-layer base network of the cascade model (BM-ConvNet) outputs predicted probability values over the original clothing categories. The second layer of the cascade consists of the label sub-networks (LDA-ConvNet-k, where k is the number of topics obtained by the LDA algorithm). The algorithm flow of the cascade network is as follows:
Step 6.1: a sample is input into the first-layer base network BM-ConvNet to obtain the prediction result L:

L = {L_1, L_2, ..., L_N}    (5)

where N is the number of clothing categories, with prediction probabilities P:

P = {P_1, P_2, ..., P_N}    (6)

Step 6.2: for each category L_i whose predicted probability is greater than the threshold P_min, the test sample is input into the corresponding second-layer sub-network LDA-ConvNet-k to obtain the prediction result l:

l = {l_1, l_2, ..., l_M}

where M is the number of categories predicted by LDA-ConvNet-k, with prediction probabilities p^k:

p^k = {p^k_1, p^k_2, ..., p^k_M}

Then the idea of probability coverage is adopted: the second-stage prediction probability is taken as the final probability, while for categories that do not receive a second-stage prediction, the first-layer prediction probability of the model is kept as the final probability. The clothing category is thereby predicted accurately. The final probability p_i is computed as:

p_i = p^k_i if category L_i is re-predicted by LDA-ConvNet-k, and p_i = P_i otherwise.
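The following sketch illustrates the probability-coverage step in plain Python. The dictionary layout and the value of P_min are assumptions; BM-ConvNet, LDA-ConvNet-k and the class-to-topic mapping would come from the trained cascade.

# Probability coverage over the two-stage cascade: second-stage probabilities
# replace first-stage ones for classes that were re-predicted.
def cascade_predict(base_probs, topic_of_class, subnet_probs, p_min=0.2):
    """base_probs: {class: P_i} from BM-ConvNet;
    topic_of_class: {class: topic k} from the LDA grouping;
    subnet_probs: {topic k: {class: p_i^k}} from the LDA-ConvNet-k sub-networks."""
    final = dict(base_probs)
    for cls, p in base_probs.items():
        if p > p_min:                                    # candidate for the second stage
            rescored = subnet_probs.get(topic_of_class[cls], {})
            if cls in rescored:
                final[cls] = rescored[cls]               # second prediction covers the first
    best = max(final, key=final.get)
    return best, final

# Example (toy numbers):
# cascade_predict({"coat": 0.55, "shirt": 0.30, "dress": 0.15},
#                 {"coat": 0, "shirt": 0, "dress": 1},
#                 {0: {"coat": 0.8, "shirt": 0.2}})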
2.3 Based on the pedestrian component detection results, the basic clothing color of each pedestrian body part is identified with a method based on maximum color domain identification. Specifically, a clothing color identification module (Module of Color Identification, MCI) is employed to identify the clothing color details of the pedestrian. The detected head, upper body and lower body of the pedestrian are used as the input of the MCI, and the steps are as follows:
a. convert the picture color space to HSV;
b. define an HSV color dictionary with reference to an HSV color classification;
c. binarize each filtered color;
d. apply morphological erosion and dilation to the image;
e. count the white-region area for each color; the largest area is the maximum color domain of the object.
The computed maximum color domain of the component then serves as the basis for recognizing the clothing color of the target body part; a sketch with OpenCV follows.
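Steps a to e can be realised with OpenCV as in the sketch below. The small HSV dictionary is an assumed placeholder for the full color table referenced in step b, and the 5x5 kernel is an arbitrary choice.

# Maximum-color-domain sketch: HSV conversion, per-color binary mask,
# erosion + dilation, then pick the color whose white region is largest.
import cv2
import numpy as np

HSV_RANGES = {                                   # assumed (lower, upper) HSV bounds
    "red":   ((0, 80, 60),   (10, 255, 255)),
    "blue":  ((100, 80, 60), (124, 255, 255)),
    "black": ((0, 0, 0),     (180, 255, 46)),
}

def dominant_color(bgr_part_image):
    hsv = cv2.cvtColor(bgr_part_image, cv2.COLOR_BGR2HSV)               # step a
    kernel = np.ones((5, 5), np.uint8)
    best_name, best_area = None, -1
    for name, (lo, hi) in HSV_RANGES.items():                           # step b: color dictionary
        mask = cv2.inRange(hsv, np.array(lo, np.uint8), np.array(hi, np.uint8))  # step c
        mask = cv2.dilate(cv2.erode(mask, kernel), kernel)              # step d: erosion, dilation
        area = cv2.countNonZero(mask)                                   # step e: white-region area
        if area > best_area:
            best_name, best_area = name, area
    return best_name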
Step three: detailed description
3.1 in this example, the target is described using a refined description algorithm based on template matching.
The multi-granularity features of the target components detected in step two are fused using natural language processing; in this embodiment, a template-matching-based method is adopted. First, different templates are defined according to the different body parts of the target identified in step one; then, guided by the coarse-grained category information, the coarse- and fine-grained features obtained from the classifiers in step two are filled into the corresponding template, finally generating the textual description information of the video target, as in the sketch below.
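A minimal sketch of the template-matching step is given below. The template wording, the attribute keys and the choice of coarse category used to select the template are illustrative assumptions; the embodiment defines its own templates per body part and coarse category.

# Template matching sketch: pick a template from the coarse clothing category,
# then fill it with the fused coarse- and fine-grained attributes from step two.
TEMPLATES = {
    "dress":   "A {gender} pedestrian, aged {age}, wearing a {upper_color} dress.",
    "default": ("A {gender} pedestrian, aged {age}, wearing a {upper_color} "
                "{upper_category} and {lower_color} {lower_category}."),
}

def render_description(attrs):
    """attrs: fused multi-granularity attributes produced in step two."""
    template = TEMPLATES.get(attrs.get("upper_category"), TEMPLATES["default"])
    return template.format(**attrs)

# Example:
# render_description({"gender": "female", "age": "20-30",
#                     "upper_category": "coat", "upper_color": "red",
#                     "lower_category": "jeans", "lower_color": "blue"})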
An overall schematic of this embodiment is shown in fig. 6.
While the foregoing description shows and describes several embodiments of the invention, it is to be understood that the invention is not limited to the forms disclosed herein, and these should not be construed as excluding other embodiments; the invention may be used in various other combinations, modifications and environments, and may be changed or modified within the scope of the inventive concept described herein, whether in light of the above teachings or the skill and knowledge of the relevant art. All such changes and modifications that do not depart from the spirit and scope of the invention are intended to fall within the scope of the appended claims.
Innovation point
Starting from the practical demands of intelligent video analysis, the project identifies the target component level and the target level in the video step by step and progressively, thereby achieving a fine description of the target and, in turn, more accurate video target detection and identification. The method is a comprehensive path to video target recognition: while performing target detection it also forms rich descriptions of the target and the features of its parts, giving it better interpretability and descriptive power in practical applications.
One of the innovations is: multi-granularity feature extraction and representation of target components
The project breaks with the traditional target recognition paradigm: through a component-based multi-level depth feature extraction algorithm, features of the video target's components are extracted at several granularity levels, and these multi-granularity component features further support the fine recognition of the target.
And the innovation is as follows: progressive fine recognition and description of video objects
Conventional video object recognition often provides only global information about the object, such as its spatial location or a face recognition result, and ignores component-level detail features. The project instead detects video object features at four granularity levels: age, gender, body-part clothing color, and body-part clothing category. Through the template-matching-based refined description algorithm, features of different levels are mapped to different granularity layers, and an information merging mechanism between the layers is designed. The algorithm fuses the multi-granularity component information and generates a structured, fine-grained textual description of the video target, providing a feasible solution for the deep analysis of video targets and better meeting the practical requirements of video target detection and analysis.

Claims (6)

1. A progressive target fine recognition and description method, characterized by comprising the following steps:
Step one: component identification
Detecting and segmenting the video object to identify individual components of the object;
step two: multi-granularity feature extraction
Further extracting multi-granularity features of the video object based on the component identification;
step three: detailed description
Fusing multi-granularity characteristics to realize fine recognition of the target and generating fine description text information;
the first step is as follows: component identification includes:
firstly, extracting key frames of an acquired monitoring video;
training a key frame image set by using a deep learning method, and constructing a region proposal neural network; in the training process, the set output category is the video target pedestrian; the region proposal neural network detects pedestrians in the video frame and generates a pedestrian detection box;
1.3 on the basis of pedestrian detection, using a target detection network Faster R-CNN to detect and identify different body parts of a target;
firstly, extracting a feature map of a key frame in the video by using a classical convolutional neural network, wherein the feature map is shared by the subsequent candidate-box-generating network layer and the fully connected layers; then, the candidate-box-generating network layer produces, through convolution, the foreground candidate boxes, the background candidate boxes and the bounding-box regression offsets respectively; judging whether a candidate box belongs to the foreground or the background through softmax, correcting the size and position of the candidate box by using bounding-box regression, and finally obtaining an accurate candidate box; the loss function of the candidate-box-generating network layer is as follows:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)

where i represents the index of a candidate box in the mini-batch, p_i represents the probability that candidate box i is predicted as an object, p_i* equals 1 if the candidate box is a positive example and 0 otherwise, t_i is a vector representing the 4 parameterized coordinates of the predicted bounding box, t_i* represents the exact position of the positive candidate box, L_cls represents the classification loss function, L_reg represents the regression loss function, and N_cls and N_reg are normalization parameters;
then, collecting the feature map and the candidate boxes with a region of interest (RoI) pooling layer, extracting, after integrating this information, the feature map of each candidate box obtained from the RPN network, and sending it to the subsequent fully connected layers to determine the target category; and finally, computing the category from the candidate-box feature map while using bounding-box regression once more to obtain the final, accurate position of the detection box.
2. The progressive object fine identification and description method of claim 1, wherein,
aiming at the second step, a multi-level depth feature extraction algorithm based on the component is adopted to extract multi-level depth features based on the component for the same target; the method comprises the steps of adding part information of an object to a category label from a plurality of granularity layers, extracting features of the object in different granularity layers by using a deep learning method, and outputting the multi-granularity features of the part with the category label as a core to help describe a video target.
3. The progressive object fine identification and description method of claim 1, wherein the step two: multi-granularity feature extraction
2.1, based on the detection result of the pedestrian component, adopting an age and sex identification network to identify the age and sex of the pedestrian;
2.2, based on the detection result of the pedestrian component, adopting a clothing class network to identify the clothing class of the pedestrian;
2.3, based on the detection result of the pedestrian component, using a method based on the maximum color domain identification to identify the clothing basic color of the pedestrian body component;
specifically, a clothing color recognition module is adopted to recognize the clothing color details of pedestrians; the detected head, upper body and lower body of the pedestrian are used as the input of the MCI, and the steps are as follows:
a. converting the picture color into HSV;
b. defining an HSV color dictionary with reference to an HSV color classification;
c. performing binarization treatment on the filtered color;
d. performing morphological erosion and dilation of the image;
e. counting the white-region area for each color, wherein the largest area is the maximum color domain of the object;
the computed maximum color domain of the component can be used as the basis for the color recognition of the target body part.
4. The progressive target fine recognition and description method of claim 3, wherein, for the multi-class problem in which class confusion easily occurs, an LDA-based label combining algorithm is employed:
dividing the clothing categories into several topics, wherein for each picture d, θ_d = {pt_1, pt_2, ..., pt_k} represents the topic distribution of picture d, and pt_k represents the probability that picture d belongs to topic k; pt_k is computed as:

pt_k = nt_k / n_d    (4)

wherein nt_k represents the number of pictures with topic k and n_d represents the number of pictures d, so that the clothing categories are divided into several large labels; the training set is decomposed into label subsets according to the topics obtained by the LDA algorithm, and a convolutional neural network is trained on each subset to obtain the label sub-networks; then, following the idea of cascade classification, the first-layer base network of the cascade model outputs predicted probability values over the original clothing categories; the second layer of the cascade consists of the label sub-networks; the algorithm flow of the cascade network is as follows:
step 6.1, inputting a sample into a first layer base network BM-ConvNet to obtain a predicted result L:
L = {L_1, L_2, ..., L_N}    (5)
wherein N is the number of clothes categories; prediction probability P:
P = {P_1, P_2, ..., P_N}    (6)
step 6.2, for each category L_i whose predicted probability is greater than the threshold P_min, inputting the test sample into the corresponding second-layer sub-network LDA-ConvNet-k to obtain the prediction result l:

l = {l_1, l_2, ..., l_M}

wherein M is the number of categories predicted by LDA-ConvNet-k, with prediction probabilities p^k:

p^k = {p^k_1, p^k_2, ..., p^k_M}

then, the idea of probability coverage is adopted, namely the second-stage prediction probability is taken as the final probability; for a category which does not receive the second-stage prediction, the first-layer prediction probability of the model is taken as the final probability; finally, the clothing category is accurately predicted; the final probability p_i is computed as:

p_i = p^k_i if category L_i is re-predicted by LDA-ConvNet-k, and p_i = P_i otherwise.
5. the progressive object fine identification and description method of claim 1, wherein,
aiming at the third step, the features of different layers are combined corresponding to different grain layers based on a template matching method, so that multi-granularity information of all components is fused, and a structured video target fine description text is generated.
6. The progressive target fine recognition and description method of claim 5, wherein the template-matching-based method comprises: first, defining different templates according to the different body parts of the target identified in step one; then, according to the coarse-grained category information, filling the coarse- and fine-grained features obtained by the classifiers in step two into the corresponding template, and finally generating the textual description information of the video target.
CN201910181642.9A 2019-03-11 2019-03-11 Progressive target fine recognition and description method Active CN109919106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910181642.9A CN109919106B (en) 2019-03-11 2019-03-11 Progressive target fine recognition and description method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910181642.9A CN109919106B (en) 2019-03-11 2019-03-11 Progressive target fine recognition and description method

Publications (2)

Publication Number Publication Date
CN109919106A CN109919106A (en) 2019-06-21
CN109919106B true CN109919106B (en) 2023-05-12

Family

ID=66964148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910181642.9A Active CN109919106B (en) 2019-03-11 2019-03-11 Progressive target fine recognition and description method

Country Status (1)

Country Link
CN (1) CN109919106B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110533079B (en) * 2019-08-05 2022-05-24 贝壳技术有限公司 Method, apparatus, medium, and electronic device for forming image sample
CN111126373A (en) * 2019-12-23 2020-05-08 北京中科神探科技有限公司 Internet short video violation judgment device and method based on cross-modal identification technology
CN111401289B (en) * 2020-03-24 2024-01-23 国网上海市电力公司 Intelligent identification method and device for transformer component
CN111860620A (en) * 2020-07-02 2020-10-30 苏州富鑫林光电科技有限公司 Multilayer hierarchical neural network architecture system for deep learning
CN112488241B (en) * 2020-12-18 2022-04-19 贵州大学 Zero sample picture identification method based on multi-granularity fusion network
CN112926569B (en) * 2021-03-16 2022-10-18 重庆邮电大学 Method for detecting natural scene image text in social network
CN113989857B (en) * 2021-12-27 2022-03-18 四川新网银行股份有限公司 Portrait photo content analysis method and system based on deep learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982350A (en) * 2012-11-13 2013-03-20 上海交通大学 Station caption detection method based on color and gradient histograms
CN105654104A (en) * 2014-11-28 2016-06-08 无锡慧眼电子科技有限公司 Pedestrian detection method based on multi-granularity feature
CN106658169A (en) * 2016-12-18 2017-05-10 北京工业大学 Universal method for segmenting video news in multi-layered manner based on deep learning
CN107133569A (en) * 2017-04-06 2017-09-05 同济大学 The many granularity mask methods of monitor video based on extensive Multi-label learning
CN108510000A (en) * 2018-03-30 2018-09-07 北京工商大学 The detection and recognition methods of pedestrian's fine granularity attribute under complex scene
CN109344774A (en) * 2018-10-08 2019-02-15 国网经济技术研究院有限公司 Heat power station target identification method in remote sensing image


Also Published As

Publication number Publication date
CN109919106A (en) 2019-06-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant