CN109919106B - Progressive target fine recognition and description method - Google Patents

Progressive target fine recognition and description method

Info

Publication number
CN109919106B
Authority
CN
China
Prior art keywords
target
video
granularity
network
identification
Prior art date
Legal status
Active
Application number
CN201910181642.9A
Other languages
Chinese (zh)
Other versions
CN109919106A (en)
Inventor
卫志华
沈雯
张彬彬
崔昊人
李倩文
Current Assignee
Tongji University
Original Assignee
Tongji University
Priority date
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN201910181642.9A priority Critical patent/CN109919106B/en
Publication of CN109919106A publication Critical patent/CN109919106A/en
Application granted granted Critical
Publication of CN109919106B publication Critical patent/CN109919106B/en

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Abstract

The invention discloses a progressive target fine recognition and description method. Against the background of video target recognition, it develops theory and methods for multi-level acquisition of video features and for progressive fine recognition and description of targets. First, the video target is detected and segmented so that its individual components can be identified; then, multi-granularity features of the video target are extracted on the basis of the component identification; finally, the multi-granularity features are fused to achieve fine recognition of the target and to generate fine descriptive text information. By emulating the way humans recognize and describe images, the invention establishes a component-based multi-level depth feature extraction method and provides an effective theory and method for extracting video target features; using natural language processing techniques, it constructs a template-matching-based refined description method for video targets and offers a new approach to multi-level video target identification and description. The invention enriches and extends machine learning theory and methods.

Description

Progressive target fine recognition and description method
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a method for finely identifying and describing video targets.
Background
With the continuing spread of video equipment and the growing maturity of video surveillance technology, video monitoring is being applied ever more widely, and the volume of surveillance video is growing explosively; it has become an important class of data in the big-data era. For example, the millions of surveillance cameras throughout Shanghai generate terabytes of video data every minute, providing valuable video resources for tracking social dynamics in real time and ensuring public safety.
However, the unstructured nature of video data makes it relatively difficult to process and analyze. At present, target identification in video data still relies mainly on manual analysis supplemented by simple intelligent analysis tools, so target identification in massive video data faces bottlenecks such as "the video is there, but the target cannot be found" and "the target can be found, but it takes too long". At the same time, existing intelligent video analysis tools suffer from inaccurate identification and non-uniform feature description methods, problems that severely constrain the further development and application of video target identification technology. Therefore, how to achieve refined representation of video target features is a key problem to be solved in the intelligent analysis of massive video data.
Converting video information into text that characterizes the detected target is an effective way to address these problems. Research on video representation of this kind mostly follows two approaches: (1) video object annotation, which uses machine learning algorithms to automatically attach category labels to objects in the video and represents video targets by those labels; and (2) video object understanding, which draws on computer vision and natural language understanding to form natural-language descriptions of a video object from the local features of objects in the video. Video object annotation yields only a single, sparse description and lacks information about object characteristics and the relationships among objects; video object understanding can carry more information, but real scenes are complex and changeable and hard to define uniformly, so it currently works only in specific scenarios and cannot yet serve practical applications.
These problems have kept the intelligent analysis of video at a relatively low level. Because existing video target recognition methods attach only a single label and struggle to define and describe the spatial relationships among a target's components accurately, a method that can finely recognize targets in complex scenes is needed.
Disclosure of Invention
The aim of the invention is to disclose a progressive target fine recognition and description method. Addressing the problems and difficulties of current video surveillance, the work centers on multi-level depth feature extraction from video targets and on fine target recognition and description. The method mainly comprises three steps:
step one: component identification
Detecting and segmenting the video object to identify individual components of the object;
step two: multi-granularity feature extraction
Further extracting multi-granularity features of the video object based on the component identification;
step three: detailed description
Fusing the multi-granularity features to realize fine recognition of the target and to generate fine descriptive text information.
For step two, the invention discloses a component-based multi-level depth feature extraction algorithm, characterized in that multi-level depth features can be extracted for the same target on the basis of its components. Multi-level here means that component information of the object is added to the category label at several granularity levels, and a deep learning method is used to extract depth features at each of these granularity levels. The algorithm aims to output multi-granularity component features, organized around the category label, to help describe the video target.
For step three, the invention discloses a template-matching-based refined description algorithm for video targets, characterized by a multi-granularity feature representation model of the video components: features of different levels correspond to different granularity layers, and an information merging mechanism between the granularity layers is designed. The algorithm aims to fuse the multi-granularity information of the components and to generate a structured, fine-grained textual description of the video target.
The invention discloses a progressive target fine recognition and description method, which comprises the following specific implementation steps:
step one: component identification
1.1 Extracting key frames from the collected video to generate a key-frame image training set;
1.2 Training on the key-frame image set with a deep learning method, and detecting all targets in the key frames with a region proposal neural network (Faster R-CNN);
1.3 Based on the target detection results, performing component detection on the targets with the region proposal neural network (Faster R-CNN) for real-time target detection, and obtaining a set of target component images.
Step two: multi-granularity feature extraction
2.1 Based on the target's head part, extracting facial visual features with a Convolutional Neural Network (CNN) to identify the age and gender of the target;
2.2 Based on the target's body parts, using coarse-grained component features extracted by the CNN to identify the clothing category of each body part;
2.3 Based on the target's body parts, extracting a fine-grained feature, the maximum color domain of the target part image, to recognize the basic clothing color of each body part.
Step three: detailed description
3.1 Fusing the multi-granularity features of the target components obtained in step two, and generating a sentence that finely describes the video target using natural language processing techniques; a minimal pipeline sketch follows.
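To make the data flow of the three steps concrete, the following minimal Python sketch chains them together. The helper names, attribute values and template wording are hypothetical illustrations, not the patented implementation; the real detectors and classifiers of steps one and two would replace the dummy stubs.

# Minimal pipeline sketch (hypothetical names and dummy outputs, for illustration only):
# step 1 detects body parts, step 2 attaches multi-granularity attributes,
# step 3 fuses them into a template-based description.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Part:
    name: str                                   # "head", "upper_body", "lower_body"
    box: tuple                                  # (x1, y1, x2, y2) in frame coordinates
    attributes: Dict[str, str] = field(default_factory=dict)

def detect_parts(frame) -> List[Part]:
    """Step one stand-in: a Faster R-CNN detector would produce these boxes."""
    return [Part("head", (120, 40, 180, 110)),
            Part("upper_body", (100, 110, 200, 260)),
            Part("lower_body", (105, 260, 195, 420))]

def extract_attributes(frame, parts: List[Part]) -> List[Part]:
    """Step two stand-in: AGI-Net, CD-Net and MCI would supply these values."""
    for part in parts:
        if part.name == "head":
            part.attributes.update(age="20-30", gender="female")
        elif part.name == "upper_body":
            part.attributes.update(category="coat", color="red")
        else:
            part.attributes.update(category="jeans", color="blue")
    return parts

def describe(parts: List[Part]) -> str:
    """Step three: fill a fixed template with the fused attributes."""
    by_name = {p.name: p.attributes for p in parts}
    return ("A {gender} pedestrian, aged {age}, wearing a {uc} {u} and {lc} {l}."
            .format(gender=by_name["head"]["gender"], age=by_name["head"]["age"],
                    uc=by_name["upper_body"]["color"], u=by_name["upper_body"]["category"],
                    lc=by_name["lower_body"]["color"], l=by_name["lower_body"]["category"]))

frame = None                                    # a decoded key frame would go here
print(describe(extract_attributes(frame, detect_parts(frame))))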
Advantageous effects
1. Addressing the problems of existing video target recognition methods, such as single-label annotation and the difficulty of describing each component accurately, the invention provides a progressive target fine recognition and description method that simultaneously learns and annotates multi-granularity component information for multiple objects in the same video.
2. The invention uses a component-based multi-level depth feature extraction algorithm. By segmenting the target's body parts and extracting multi-granularity component features, it discovers the detailed features of each body part more accurately, so that the video target description is no longer limited to an analysis of the whole view, which ensures the accuracy and richness of the refined target description.
3. The invention uses a template-matching-based refined description algorithm for video targets, fuses the multi-granularity features of the video target components, establishes a multi-granularity theory and method for video representation, and offers a new approach to representing video content. Combined with natural language processing, it produces textual information that describes the video target more completely.
4. The invention enriches and extends machine learning theory and methods, and lays a theoretical and practical foundation for advancing the intelligent analysis of video.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the invention. In the drawings:
FIG. 1 is a flow chart of the progressive target fine recognition algorithm;
FIG. 2 is a schematic diagram of the progressive target fine recognition algorithm;
FIG. 3 is a diagram of the deep learning model for target part detection and recognition;
FIG. 4 is a schematic representation of component-based multi-level depth feature extraction and representation;
FIG. 5 is an example of target fine recognition;
FIG. 6 is an overall schematic diagram of the embodiment.
Detailed Description
Embodiments of the present invention are described in detail below with reference to the accompanying drawings and examples, so that the process by which the technical means of the invention are applied to solve the technical problems and achieve the technical effects can be fully understood and reproduced.
The invention discloses a progressive target fine recognition and description method. Taking video target recognition as the background, the work develops theory and methods for multi-level depth feature extraction from video targets and for refined target recognition and description. First, the video target is detected and segmented so that each of its components can be identified; second, after component segmentation and identification, multi-granularity features of the different components are further extracted; finally, the multi-granularity features are fused to achieve fine recognition and description of the target and to generate fine descriptive text information.
The invention discloses a progressive target fine recognition and description method, which comprises the following steps:
step one: component identification
1.1 In this embodiment, the video comes from surveillance footage of a major traffic checkpoint in Shanghai, with a resolution of 1280 x 720. The surveillance scene is complex and contains target pedestrians of different poses and sizes. First, key frames are extracted from the collected surveillance video; each key frame is required to contain pedestrians, and the pedestrians should be rich in pose and size. A key-frame image set is then generated and divided into a training set, a validation set and a test set in the ratio 8:1:1, as sketched below.
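As a concrete illustration of this step, the sketch below samples key frames at a fixed rate with OpenCV and splits them 8:1:1. The sampling rate, seed and file handling are assumptions; the patent does not fix a particular key-frame selection rule.

# Key-frame sampling and 8:1:1 split (assumed one frame per second; adjust as needed).
import random
import cv2  # opencv-python

def extract_key_frames(video_path, frames_per_second=1.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0     # fall back if the container lacks FPS
    step = max(int(fps / frames_per_second), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)                # keep every step-th frame as a key frame
        idx += 1
    cap.release()
    return frames

def split_8_1_1(items, seed=0):
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train, n_val = int(len(items) * 0.8), int(len(items) * 0.1)
    return (items[:n_train],                    # training set
            items[n_train:n_train + n_val],     # validation set
            items[n_train + n_val:])            # test set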
1.2 A region proposal neural network (Faster R-CNN) is constructed and trained on the key-frame image set using a deep learning method. During training, the output category is set to the video target pedestrian. The region proposal neural network (Faster R-CNN) detects pedestrians in the video frames and generates pedestrian detection boxes.
1.3 On the basis of pedestrian detection, the different body parts of the same target, specifically the head, the upper body and the lower body, are detected and identified using the target detection network Faster R-CNN; the network model is shown in FIG. 3.
Firstly, a classical convolutional neural network is used to extract a feature map from a key frame of the video; this feature map is shared by the subsequent candidate-box-generating network layer (Region Proposal Network, RPN) and the fully connected layers. Then, the candidate-box-generating network layer produces, through 3 x 3 convolutions, the foreground candidate boxes, the background candidate boxes and the bounding-box regression offsets respectively; softmax judges whether a candidate box belongs to the foreground or the background, bounding-box regression corrects the size and position of the candidate box, and an accurate candidate box is finally obtained. The loss function of the candidate-box-generating network layer is as follows:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)

where i is the index of a candidate box in the mini-batch, p_i is the probability that candidate box i is predicted to be an object, p_i* equals 1 if the candidate box is a positive example and 0 otherwise, t_i is a vector of the 4 parameterized coordinates of the predicted bounding box, t_i* represents the exact position of the corresponding positive candidate box, L_cls is the classification loss function, L_reg is the regression loss function, λ balances the two terms, and N_cls and N_reg are normalization parameters.
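A minimal PyTorch sketch of this loss follows. The choice of binary cross-entropy for L_cls, smooth L1 for L_reg, and the values of λ, N_cls and N_reg are standard Faster R-CNN conventions assumed here, not prescriptions from the patent.

# Sketch of the RPN loss above: classification term over all sampled anchors,
# regression term counted only for positive anchors (p_i* = 1).
import torch
import torch.nn.functional as F

def rpn_loss(obj_logits, box_deltas, labels, target_deltas, lam=1.0):
    """obj_logits: (N,) objectness scores; labels: (N,) with 1 = positive, 0 = negative;
    box_deltas / target_deltas: (N, 4) parameterized box coordinates t_i / t_i*."""
    n_cls = max(labels.numel(), 1)              # N_cls: size of the sampled mini-batch
    n_reg = max(labels.numel(), 1)              # N_reg: here simply the same count
    cls_loss = F.binary_cross_entropy_with_logits(
        obj_logits, labels.float(), reduction="sum") / n_cls
    pos = labels == 1                           # regression only for positive anchors
    reg_loss = F.smooth_l1_loss(
        box_deltas[pos], target_deltas[pos], reduction="sum") / n_reg
    return cls_loss + lam * reg_loss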
Then, a region of interest (RoI) pooling layer collects the feature map and the candidate boxes, extracts, after integrating this information, the feature map of each candidate box obtained from the RPN network, and sends it to the subsequent fully connected layers to determine the target category. Finally, the category is computed from the candidate-box feature map, while bounding-box regression is used once more to obtain the final, accurate position of the detection box.
Through Faster R-CNN, detection, identification and segmentation of the target components are thus realized.
Step two: multi-granularity feature extraction
2.1 Based on the pedestrian component detection results, the age and gender of the pedestrian are identified with an age-gender identification network (Age-Gender Identification Network, AGI-Net). The detected head part of the pedestrian is taken as the input of AGI-Net, and facial features are extracted as the basis for age and gender identification, as shown in FIG. 4.
The AGI-Net provided by the invention predicts the age and gender of a target from an input face image through an improved FaceNet network structure (VGG-16 structure). In the AGI-Net model, the gender identification part is a binary classification model whose output is male or female. Let g be the true gender label and g* be the gender predicted by the model; the loss function is defined as:

Loss = -[g log g* + (1 - g) log(1 - g*)]    (4)
The age identification part gives a quantitative estimate of age. Suppose the age values are discretized into |Y| age ranges, where each age range Y_i covers the years Y_i^min to Y_i^max; a voting method over the training samples is used to predict the number y_i of samples falling in age range Y_i.
The |Y| age ranges need to satisfy: (a) uniformity, i.e., each age range covers the same number of years; and (b) balance, i.e., each age range covers approximately the same number of training samples. Training the CNN in this way to classify age groups, the softmax-normalized output probability of the |Y| age-range neurons is:

O_i = exp(o_i) / Σ_{j=1}^{|Y|} exp(o_j)

where O = {1, 2, ..., |Y|} indexes the |Y|-dimensional output layer, o_i is the raw output of neuron i, and O_i, i ∈ O, is the output probability of age-range neuron i after softmax normalization.
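The sketch below illustrates the two outputs described above in PyTorch: a binary cross-entropy gender loss of the form in equation (4) and a softmax over the |Y| age-range neurons. The feature dimension, the number of age ranges and the use of a shared backbone feature vector are assumptions for illustration; the patented AGI-Net itself is described as an improved FaceNet/VGG-16 structure.

# Two prediction heads on top of a face feature vector (backbone omitted):
# one gender logit and |Y| age-range logits normalised by softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgeGenderHeads(nn.Module):
    def __init__(self, feat_dim=512, num_age_ranges=8):
        super().__init__()
        self.gender = nn.Linear(feat_dim, 1)            # male / female
        self.age = nn.Linear(feat_dim, num_age_ranges)  # one neuron per age range Y_i

    def forward(self, face_feat):
        return self.gender(face_feat).squeeze(-1), self.age(face_feat)

def agi_losses(gender_logit, age_logits, gender_label, age_range_label):
    # Loss = -[g log g* + (1 - g) log(1 - g*)]
    gender_loss = F.binary_cross_entropy_with_logits(gender_logit, gender_label.float())
    # cross-entropy applies the softmax normalisation over the |Y| age-range neurons
    age_loss = F.cross_entropy(age_logits, age_range_label)
    return gender_loss, age_loss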
2.2 Based on the pedestrian component detection results, a clothing category network (Clothes-Detection Network, CD-Net) is used to identify the clothing categories of the pedestrian. The detected head, upper body and lower body of the pedestrian are taken as the input of CD-Net, and the clothing features of these parts are learned to identify the clothing category of each body part, as shown in FIG. 4.
The CD-Net provided by the invention extracts depth features from a segmented target body part through convolution and pooling layers, and finally maps the feature vector through a fully connected layer to a vector whose dimension equals the number of classes, thereby obtaining the clothing category.
For this multi-class problem, in which class confusion easily occurs, a label combining algorithm based on LDA (Latent Dirichlet Allocation) is used. The clothing categories are divided into several topics; for each picture d, θ_d = {pt_1, pt_2, ..., pt_k} denotes the topic distribution of picture d, where pt_k represents the probability that picture d belongs to topic k. pt_k is computed as:

pt_k = nt_k / n_d    (4)

where nt_k represents the number of pictures with topic k and n_d represents the number of pictures d; the clothing categories are thus divided into several large labels. The training set is decomposed into label subsets according to the topics obtained by the LDA algorithm, and a convolutional neural network is trained on each subset to obtain the label sub-networks. Then, following the idea of cascade classification, the first-layer base network of the cascade model (BM-ConvNet) outputs predicted probability values over the original clothing categories. The second layer of the cascade consists of the label sub-networks (LDA-ConvNet-k, where k is the number of topics obtained by the LDA algorithm). The algorithm flow of the cascade network is as follows:
Step 6.1: a sample is input into the first-layer base network BM-ConvNet to obtain the prediction result L:

L = {L_1, L_2, ..., L_N}    (5)

where N is the number of clothing categories, with prediction probabilities P:

P = {P_1, P_2, ..., P_N}    (6)

Step 6.2: for each category L_i whose predicted probability is greater than the threshold P_min, the test sample is input into the corresponding second-layer sub-network LDA-ConvNet-k to obtain the prediction result l:

l = {l_1, l_2, ..., l_M}

where M is the number of categories predicted by LDA-ConvNet-k, with prediction probabilities p^k:

p^k = {p^k_1, p^k_2, ..., p^k_M}

Then the idea of probability coverage is adopted: the second-stage prediction probability is taken as the final probability, while for categories that do not receive a second-stage prediction, the first-layer prediction probability of the model is kept as the final probability. The clothing category is thereby predicted accurately. The final probability p_i is computed as:

p_i = p^k_i if category L_i is re-predicted by LDA-ConvNet-k, and p_i = P_i otherwise.
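The following sketch illustrates the probability-coverage step in plain Python. The dictionary layout and the value of P_min are assumptions; BM-ConvNet, LDA-ConvNet-k and the class-to-topic mapping would come from the trained cascade.

# Probability coverage over the two-stage cascade: second-stage probabilities
# replace first-stage ones for classes that were re-predicted.
def cascade_predict(base_probs, topic_of_class, subnet_probs, p_min=0.2):
    """base_probs: {class: P_i} from BM-ConvNet;
    topic_of_class: {class: topic k} from the LDA grouping;
    subnet_probs: {topic k: {class: p_i^k}} from the LDA-ConvNet-k sub-networks."""
    final = dict(base_probs)
    for cls, p in base_probs.items():
        if p > p_min:                                    # candidate for the second stage
            rescored = subnet_probs.get(topic_of_class[cls], {})
            if cls in rescored:
                final[cls] = rescored[cls]               # second prediction covers the first
    best = max(final, key=final.get)
    return best, final

# Example (toy numbers):
# cascade_predict({"coat": 0.55, "shirt": 0.30, "dress": 0.15},
#                 {"coat": 0, "shirt": 0, "dress": 1},
#                 {0: {"coat": 0.8, "shirt": 0.2}})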
2.3 Based on the pedestrian component detection results, the basic clothing color of each pedestrian body part is identified with a method based on maximum color domain identification. Specifically, a clothing color identification module (Module of Color Identification, MCI) is employed to identify the clothing color details of the pedestrian. The detected head, upper body and lower body of the pedestrian are used as the input of the MCI, and the steps are as follows:
a. convert the picture color space to HSV;
b. define an HSV color dictionary with reference to an HSV color classification;
c. binarize each filtered color;
d. apply morphological erosion and dilation to the image;
e. count the white-region area for each color; the largest area is the maximum color domain of the object.
The computed maximum color domain of the component then serves as the basis for recognizing the clothing color of the target body part; a sketch with OpenCV follows.
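Steps a to e can be realised with OpenCV as in the sketch below. The small HSV dictionary is an assumed placeholder for the full color table referenced in step b, and the 5x5 kernel is an arbitrary choice.

# Maximum-color-domain sketch: HSV conversion, per-color binary mask,
# erosion + dilation, then pick the color whose white region is largest.
import cv2
import numpy as np

HSV_RANGES = {                                   # assumed (lower, upper) HSV bounds
    "red":   ((0, 80, 60),   (10, 255, 255)),
    "blue":  ((100, 80, 60), (124, 255, 255)),
    "black": ((0, 0, 0),     (180, 255, 46)),
}

def dominant_color(bgr_part_image):
    hsv = cv2.cvtColor(bgr_part_image, cv2.COLOR_BGR2HSV)               # step a
    kernel = np.ones((5, 5), np.uint8)
    best_name, best_area = None, -1
    for name, (lo, hi) in HSV_RANGES.items():                           # step b: color dictionary
        mask = cv2.inRange(hsv, np.array(lo, np.uint8), np.array(hi, np.uint8))  # step c
        mask = cv2.dilate(cv2.erode(mask, kernel), kernel)              # step d: erosion, dilation
        area = cv2.countNonZero(mask)                                   # step e: white-region area
        if area > best_area:
            best_name, best_area = name, area
    return best_name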
Step three: detailed description
3.1 in this example, the target is described using a refined description algorithm based on template matching.
The multi-granularity features of the target components detected in step two are fused using natural language processing; in this embodiment, a template-matching-based method is adopted. First, different templates are defined according to the different body parts of the target identified in step one; then, guided by the coarse-grained category information, the coarse- and fine-grained features obtained from the classifiers in step two are filled into the corresponding template, finally generating the textual description information of the video target, as in the sketch below.
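A minimal sketch of the template-matching step is given below. The template wording, the attribute keys and the choice of coarse category used to select the template are illustrative assumptions; the embodiment defines its own templates per body part and coarse category.

# Template matching sketch: pick a template from the coarse clothing category,
# then fill it with the fused coarse- and fine-grained attributes from step two.
TEMPLATES = {
    "dress":   "A {gender} pedestrian, aged {age}, wearing a {upper_color} dress.",
    "default": ("A {gender} pedestrian, aged {age}, wearing a {upper_color} "
                "{upper_category} and {lower_color} {lower_category}."),
}

def render_description(attrs):
    """attrs: fused multi-granularity attributes produced in step two."""
    template = TEMPLATES.get(attrs.get("upper_category"), TEMPLATES["default"])
    return template.format(**attrs)

# Example:
# render_description({"gender": "female", "age": "20-30",
#                     "upper_category": "coat", "upper_color": "red",
#                     "lower_category": "jeans", "lower_color": "blue"})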
An overall schematic of this embodiment is shown in fig. 6.
While the foregoing description shows and describes several embodiments of the invention, it is to be understood that the invention is not limited to the forms disclosed herein, and these should not be construed as excluding other embodiments; the invention may be used in various other combinations, modifications and environments, and may be changed or modified within the scope of the inventive concept described herein, whether in light of the above teachings or the skill and knowledge of the relevant art. All such changes and modifications that do not depart from the spirit and scope of the invention are intended to fall within the scope of the appended claims.
Innovation point
Starting from the practical demands of intelligent video analysis, the project identifies the target component level and the target level in the video step by step and progressively, thereby achieving a fine description of the target and, in turn, more accurate video target detection and identification. The method is a comprehensive path to video target recognition: while performing target detection it also forms rich descriptions of the target and the features of its parts, giving it better interpretability and descriptive power in practical applications.
One of the innovations is: multi-granularity feature extraction and representation of target components
The project breaks with the traditional target recognition paradigm: through a component-based multi-level depth feature extraction algorithm, features of the video target's components are extracted at several granularity levels, and these multi-granularity component features further support the fine recognition of the target.
And the innovation is as follows: progressive fine recognition and description of video objects
Conventional video object recognition often provides only global information about the object, such as its spatial location or a face recognition result, and ignores component-level detail features. The project instead detects video object features at four granularity levels: age, gender, body-part clothing color, and body-part clothing category. Through the template-matching-based refined description algorithm, features of different levels are mapped to different granularity layers, and an information merging mechanism between the layers is designed. The algorithm fuses the multi-granularity component information and generates a structured, fine-grained textual description of the video target, providing a feasible solution for the deep analysis of video targets and better meeting the practical requirements of video target detection and analysis.

Claims (6)

1. A progressive target fine recognition and description method, characterized by comprising the following steps:
Step one: component identification
Detecting and segmenting the video object to identify individual components of the object;
step two: multi-granularity feature extraction
Further extracting multi-granularity features of the video object based on the component identification;
step three: detailed description
Fusing multi-granularity characteristics to realize fine recognition of the target and generating fine description text information;
the first step is as follows: component identification includes:
firstly, extracting key frames of an acquired monitoring video;
training a key frame image set by using a deep learning method, and constructing a region proposal neural network; in the training process, the set output category is the video target pedestrian; the region proposal neural network detects pedestrians in the video frame and generates a pedestrian detection box;
1.3 on the basis of pedestrian detection, using a target detection network Faster R-CNN to detect and identify different body parts of a target;
firstly, extracting a feature map of a key frame in the video by using a classical convolutional neural network, wherein the feature map is shared by the subsequent candidate-box-generating network layer and the fully connected layers; then, the candidate-box-generating network layer produces, through convolution, the foreground candidate boxes, the background candidate boxes and the bounding-box regression offsets respectively; judging whether a candidate box belongs to the foreground or the background through softmax, correcting the size and position of the candidate box by using bounding-box regression, and finally obtaining an accurate candidate box; the loss function of the candidate-box-generating network layer is as follows:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)

where i represents the index of a candidate box in the mini-batch, p_i represents the probability that candidate box i is predicted as an object, p_i* equals 1 if the candidate box is a positive example and 0 otherwise, t_i is a vector representing the 4 parameterized coordinates of the predicted bounding box, t_i* represents the exact position of the positive candidate box, L_cls represents the classification loss function, L_reg represents the regression loss function, and N_cls and N_reg are normalization parameters;
then, collecting the feature map and the candidate boxes with a region of interest (RoI) pooling layer, extracting, after integrating this information, the feature map of each candidate box obtained from the RPN network, and sending it to the subsequent fully connected layers to determine the target category; and finally, computing the category from the candidate-box feature map while using bounding-box regression once more to obtain the final, accurate position of the detection box.
2. The progressive object fine identification and description method of claim 1, wherein,
aiming at the second step, a multi-level depth feature extraction algorithm based on the component is adopted to extract multi-level depth features based on the component for the same target; the method comprises the steps of adding part information of an object to a category label from a plurality of granularity layers, extracting features of the object in different granularity layers by using a deep learning method, and outputting the multi-granularity features of the part with the category label as a core to help describe a video target.
3. The progressive object fine identification and description method of claim 1, wherein the step two: multi-granularity feature extraction
2.1, based on the detection result of the pedestrian component, adopting an age and sex identification network to identify the age and sex of the pedestrian;
2.2, based on the detection result of the pedestrian component, adopting a clothing class network to identify the clothing class of the pedestrian;
2.3, based on the detection result of the pedestrian component, using a method based on the maximum color domain identification to identify the clothing basic color of the pedestrian body component;
specifically, a clothing color recognition module is adopted to recognize the clothing color details of pedestrians; the detected head, upper body and lower body of the pedestrian are used as the input of the MCI, and the steps are as follows:
a. converting the picture color into HSV;
b. defining an HSV color dictionary with reference to an HSV color classification;
c. performing binarization treatment on the filtered color;
d. performing morphological erosion and dilation of the image;
e. counting the white-region area for each color, wherein the largest area is the maximum color domain of the object;
the computed maximum color domain of the component can be used as the basis for the color recognition of the target body part.
4. The progressive target fine recognition and description method of claim 3, wherein, for the multi-class problem in which class confusion easily occurs, an LDA-based label combining algorithm is employed:
dividing the clothing categories into several topics, wherein for each picture d, θ_d = {pt_1, pt_2, ..., pt_k} represents the topic distribution of picture d, and pt_k represents the probability that picture d belongs to topic k; pt_k is computed as:

pt_k = nt_k / n_d    (4)

wherein nt_k represents the number of pictures with topic k and n_d represents the number of pictures d, so that the clothing categories are divided into several large labels; the training set is decomposed into label subsets according to the topics obtained by the LDA algorithm, and a convolutional neural network is trained on each subset to obtain the label sub-networks; then, following the idea of cascade classification, the first-layer base network of the cascade model outputs predicted probability values over the original clothing categories; the second layer of the cascade consists of the label sub-networks; the algorithm flow of the cascade network is as follows:
step 6.1, inputting a sample into a first layer base network BM-ConvNet to obtain a predicted result L:
L = {L_1, L_2, ..., L_N}    (5)
wherein N is the number of clothes categories; prediction probability P:
P = {P_1, P_2, ..., P_N}    (6)
step 6.2, for each category L_i whose predicted probability is greater than the threshold P_min, inputting the test sample into the corresponding second-layer sub-network LDA-ConvNet-k to obtain the prediction result l:

l = {l_1, l_2, ..., l_M}

wherein M is the number of categories predicted by LDA-ConvNet-k, with prediction probabilities p^k:

p^k = {p^k_1, p^k_2, ..., p^k_M}

then, the idea of probability coverage is adopted, namely the second-stage prediction probability is taken as the final probability; for a category which does not receive the second-stage prediction, the first-layer prediction probability of the model is taken as the final probability; finally, the clothing category is accurately predicted; the final probability p_i is computed as:

p_i = p^k_i if category L_i is re-predicted by LDA-ConvNet-k, and p_i = P_i otherwise.
5. the progressive object fine identification and description method of claim 1, wherein,
aiming at the third step, the features of different layers are combined corresponding to different grain layers based on a template matching method, so that multi-granularity information of all components is fused, and a structured video target fine description text is generated.
6. The progressive target fine recognition and description method of claim 5, wherein the template-matching-based method comprises: first, defining different templates according to the different body parts of the target identified in step one; then, according to the coarse-grained category information, filling the coarse- and fine-grained features obtained by the classifiers in step two into the corresponding template, and finally generating the textual description information of the video target.
CN201910181642.9A 2019-03-11 2019-03-11 Progressive target fine recognition and description method Active CN109919106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910181642.9A CN109919106B (en) 2019-03-11 2019-03-11 Progressive target fine recognition and description method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910181642.9A CN109919106B (en) 2019-03-11 2019-03-11 Progressive target fine recognition and description method

Publications (2)

Publication Number Publication Date
CN109919106A CN109919106A (en) 2019-06-21
CN109919106B true CN109919106B (en) 2023-05-12

Family

ID=66964148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910181642.9A Active CN109919106B (en) 2019-03-11 2019-03-11 Progressive target fine recognition and description method

Country Status (1)

Country Link
CN (1) CN109919106B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110533079B (en) * 2019-08-05 2022-05-24 贝壳技术有限公司 Method, apparatus, medium, and electronic device for forming image sample
CN111126373A (en) * 2019-12-23 2020-05-08 北京中科神探科技有限公司 Internet short video violation judgment device and method based on cross-modal identification technology
CN111401289B (en) * 2020-03-24 2024-01-23 国网上海市电力公司 Intelligent identification method and device for transformer component
CN111860620A (en) * 2020-07-02 2020-10-30 苏州富鑫林光电科技有限公司 Multilayer hierarchical neural network architecture system for deep learning
CN112488241B (en) * 2020-12-18 2022-04-19 贵州大学 Zero sample picture identification method based on multi-granularity fusion network
CN112926569B (en) * 2021-03-16 2022-10-18 重庆邮电大学 Method for detecting natural scene image text in social network
CN113989857B (en) * 2021-12-27 2022-03-18 四川新网银行股份有限公司 Portrait photo content analysis method and system based on deep learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982350A (en) * 2012-11-13 2013-03-20 上海交通大学 Station caption detection method based on color and gradient histograms
CN105654104A (en) * 2014-11-28 2016-06-08 无锡慧眼电子科技有限公司 Pedestrian detection method based on multi-granularity feature
CN106658169A (en) * 2016-12-18 2017-05-10 北京工业大学 Universal method for segmenting video news in multi-layered manner based on deep learning
CN107133569A (en) * 2017-04-06 2017-09-05 同济大学 The many granularity mask methods of monitor video based on extensive Multi-label learning
CN108510000A (en) * 2018-03-30 2018-09-07 北京工商大学 The detection and recognition methods of pedestrian's fine granularity attribute under complex scene
CN109344774A (en) * 2018-10-08 2019-02-15 国网经济技术研究院有限公司 Heat power station target identification method in remote sensing image


Also Published As

Publication number Publication date
CN109919106A (en) 2019-06-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant