CN112580544A - Image recognition method, device and medium and electronic equipment thereof


Info

Publication number
CN112580544A
CN112580544A (application CN202011550542.8A)
Authority
CN
China
Prior art keywords
pedestrian
image
human body
recognized
feature vector
Prior art date
Legal status
Pending
Application number
CN202011550542.8A
Other languages
Chinese (zh)
Inventor
万晏辰
陈云鹏
Current Assignee
Shanghai Yitu Network Science and Technology Co Ltd
Original Assignee
Shanghai Yitu Network Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Yitu Network Science and Technology Co Ltd
Priority to CN202011550542.8A
Publication of CN112580544A
Legal status: Pending

Classifications

    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/24 Classification techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N3/048 Activation functions
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06V10/40 Extraction of image or video features
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion


Abstract

The present disclosure relates to the field of computer vision, and in particular to an image recognition method, an image recognition device, a medium, and an electronic device. The image recognition method of the present application includes: obtaining an image of a pedestrian to be recognized; obtaining a human body part thermodynamic diagram of the pedestrian based on that image; generating a human body part enhancement map of the pedestrian from the image and the thermodynamic diagram; determining, in the enhancement map, a plurality of semantic information items related to the pedestrian's body and a human body feature vector corresponding to each item; and matching these feature vectors against those of a target pedestrian in a reference image to determine whether the image to be recognized shows the target pedestrian. Fusing the image to be recognized with the human body thermodynamic diagram for recognition improves the accuracy of image recognition.

Description

Image recognition method, device and medium and electronic equipment thereof
Technical Field
The present disclosure relates to the field of computer vision, and in particular, to an image recognition method, an image recognition device, a medium, and an electronic device.
Background
Pedestrian re-identification (ReID) is a technique that uses computer vision to determine whether a particular pedestrian appears in an image or video sequence: given a pedestrian image captured by one camera, it retrieves images of the same pedestrian captured by other cameras. For example, in an area covered by multiple cameras, ReID can retrieve, from the other cameras, all images of a pedestrian who was first captured by one of them.
One existing technique for pedestrian re-identification horizontally slices the image containing the pedestrian to be recognized, for example into three parts (head, upper body, lower body), extracts features from each part, and finally compares the features of each part with those of the corresponding part of a reference image. For example, as shown in fig. 1, the target pedestrian in the reference image 10 is standing, while the pedestrian to be recognized in the image to be recognized 11 is sitting. The reference image 10 is horizontally sliced into three parts, head a1, upper body a2 and lower body a3, and features are extracted from each. Similarly, the image to be recognized 11 is sliced into three parts, an enlarged head b1, head plus upper body b2, and lower body plus surrounding objects b3, and features are extracted from each. Finally, corresponding parts are compared: the features of head a1 are matched against those of the enlarged head b1, those of upper body a2 against head-plus-upper-body b2, and those of lower body a3 against lower-body-plus-surroundings b3. Because the semantic information (for example, head versus head-plus-upper-body) represented by corresponding slices of the image to be recognized 11 (for example, a whole-body image) and the reference image 10 (for example, a half-body image) differs, this forced matching degrades the algorithm and yields low image recognition accuracy.
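The horizontal-slice baseline above can be sketched in a few lines of numpy. This is an illustrative sketch, not the cited technique's actual implementation: `mean_pool` is a placeholder for a real feature extractor, and the strip-by-strip comparison makes the forced-alignment weakness explicit.

```python
import numpy as np

def horizontal_slices(image, n_parts=3):
    """Split an (H, W, C) pedestrian image into n_parts equal horizontal strips."""
    h = image.shape[0]
    bounds = np.linspace(0, h, n_parts + 1).astype(int)
    return [image[bounds[i]:bounds[i + 1]] for i in range(n_parts)]

def mean_pool(strip):
    """Placeholder feature extractor: mean colour of the strip stands in for a CNN."""
    return strip.reshape(-1, strip.shape[-1]).mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def slice_match(query, reference, n_parts=3):
    """Compare strip i of the query only with strip i of the reference --
    the forced alignment that fails when the two poses differ."""
    q = [mean_pool(s) for s in horizontal_slices(query, n_parts)]
    r = [mean_pool(s) for s in horizontal_slices(reference, n_parts)]
    return [cosine(a, b) for a, b in zip(q, r)]
```

If the query is a sitting pedestrian and the reference a standing one, strip i of each image covers different body parts, so these per-strip similarities are unreliable, which is exactly the weakness the disclosure addresses.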
Disclosure of Invention
The embodiment of the application provides an image identification method, an image identification device, a medium and electronic equipment.
In a first aspect, an embodiment of the present application provides an image recognition method, including: acquiring an image of a pedestrian to be identified; obtaining a human body part thermodynamic diagram of the pedestrian to be recognized based on the image of the pedestrian to be recognized, wherein the human body part thermodynamic diagram is used for enhancing the characteristics of the body part of the pedestrian to be recognized; generating a human body part enhancement map of the pedestrian to be recognized according to the image of the pedestrian to be recognized and the human body part thermodynamic map; determining a plurality of semantic information related to the body of the pedestrian to be recognized in the human body part enhancement map and a human body feature vector corresponding to each semantic information, wherein the human body feature vector comprises an enhancement feature vector representing the body part of the pedestrian to be recognized; and matching the human body feature vector corresponding to each semantic information in the pedestrian image to be recognized with the human body feature vector corresponding to each semantic information related to the target pedestrian in the reference image, and determining whether the pedestrian image to be recognized is the image of the target pedestrian.
For example, the image of the pedestrian to be recognized may be taken from a surveillance video containing one or more frames of pedestrian images. A pedestrian image is acquired and pose detection is performed on it to generate a human body part thermodynamic diagram. The thermodynamic diagram contains semantic information of the human body and position information of the body parts: the semantic information covers parts such as the head, face, upper body, left upper arm, left lower arm, right upper arm, right lower arm, left thigh, left calf, right thigh and right calf, and the position information gives the locations of these parts in the image to be recognized. A human body part enhancement map of the pedestrian to be recognized is then generated from the image to be recognized and the human body part thermodynamic diagram.
It can be understood that the human body part enhancement map combines the semantic information of the human body, the position information of the human body and the appearance features of the human body. When feature vectors are extracted from the enhancement map, each human body feature vector therefore contains an enhanced feature vector representing a body part of the pedestrian to be recognized, i.e. it carries both the semantic information and the position information. The feature vector corresponding to each semantic information item in the image to be recognized is matched against the feature vector corresponding to the same semantic information item of the target pedestrian in the reference image, to determine whether the image to be recognized shows the target pedestrian. Fusing the image to be recognized with the human body thermodynamic diagram produces an enhancement map containing semantic, position and appearance information, and performing recognition on this map allows each body part of the pedestrian to be recognized to be compared with the corresponding body part of the target pedestrian: for example, the facial feature vector of the pedestrian to be recognized is compared with the facial feature vector of the target pedestrian, and the limb feature vectors are compared likewise. This fusion improves the precision of the recognition algorithm and therefore the accuracy of image recognition.
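The fuse-then-match pipeline described above can be sketched as follows, using numpy only. Heatmap-weighted average pooling and the similarity threshold are hypothetical stand-ins for the patent's learned feature extractor and decision rule.

```python
import numpy as np

def fuse_image_and_heatmaps(image, part_heatmaps):
    """Channel-wise fusion: an (H, W, 3) image plus (H, W, K) part thermodynamic
    maps yields an (H, W, 3 + K) body-part enhancement map."""
    assert image.shape[:2] == part_heatmaps.shape[:2]
    return np.concatenate([image, part_heatmaps], axis=-1)

def per_part_vectors(enhanced, part_heatmaps):
    """One feature vector per semantic part: heatmap-weighted average pooling
    stands in for the learned feature extractor."""
    c = enhanced.shape[-1]
    k = part_heatmaps.shape[-1]
    flat = enhanced.reshape(-1, c)             # (H*W, C)
    w = part_heatmaps.reshape(-1, k)           # (H*W, K)
    w = w / (w.sum(axis=0, keepdims=True) + 1e-12)
    return w.T @ flat                          # (K, C): one vector per part

def is_target(query_vecs, ref_vecs, threshold=0.8):
    """Part-by-part cosine matching against the target pedestrian's vectors."""
    sims = [q @ r / (np.linalg.norm(q) * np.linalg.norm(r) + 1e-12)
            for q, r in zip(query_vecs, ref_vecs)]
    return float(np.mean(sims)) >= threshold
```

Because each vector is tied to a semantic part rather than to a horizontal strip, the head of a sitting pedestrian is compared with the head of a standing one, regardless of where each head falls in the frame.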
In a possible implementation of the first aspect, matching the human body feature vector corresponding to each semantic information item in the image of the pedestrian to be recognized with the human body feature vector corresponding to each semantic information item related to the target pedestrian in the reference image, and determining whether the image of the pedestrian to be recognized is the image of the target pedestrian, includes: obtaining a human skeleton attention map of the pedestrian to be recognized based on the image of the pedestrian to be recognized, where the human skeleton attention map is used for enhancing the features of the body parts of the pedestrian to be recognized; and matching the human body feature vector fused with the feature vectors of the pixels of the human skeleton attention map against the human body feature vector corresponding to each semantic information item related to the target pedestrian in the reference image, to determine whether the image of the pedestrian to be recognized is the image of the target pedestrian.
It can be understood that the human skeleton attention map is generated by performing pose detection on the pedestrian image. It contains semantic information of the human body and position information of the body parts: the semantic information covers parts such as the head, face, torso and limbs, and the position information gives the locations of these parts in the image to be recognized. The human body feature vector fused with the feature vectors of the pixels of the skeleton attention map is matched against the feature vector corresponding to each semantic information item of the target pedestrian in the reference image, to determine whether the image of the pedestrian to be recognized is the image of the target pedestrian. Because the human body feature vector fuses the position information of the human body, pixels at body positions receive higher weight in the generated feature vector, which benefits the extraction and analysis of human body feature information. Combining a human skeleton attention map of a preset size to fuse the body part information into the feature vector again gives better computing efficiency, a simpler computation process, time savings and a better recognition result.
In a possible implementation of the first aspect, the semantic information of the image of the pedestrian to be recognized includes a face, and the human body feature vector corresponding to the face in the image to be recognized is matched with the human body feature vector corresponding to the face of the target pedestrian in the reference image, to determine whether the image to be recognized is the image of the target pedestrian.
In a possible implementation of the first aspect, the semantic information of the image of the pedestrian to be recognized includes limbs, and the human body feature vector corresponding to the limbs in the image to be recognized is matched with the human body feature vector corresponding to the limbs of the target pedestrian in the reference image, to determine whether the image to be recognized is the image of the target pedestrian.
In a possible implementation of the first aspect, the human body feature vector further includes a feature vector representing the external attachments of the pedestrian to be recognized. The feature vector representing the external attachments corresponding to each semantic information item in the image of the pedestrian to be recognized is matched with the corresponding external-attachment feature vector for each semantic information item related to the target pedestrian in the reference image, to determine whether the image of the pedestrian to be recognized is the image of the target pedestrian.
For example, the feature vector representing the external attachments of the human body includes clothing features, accessory features and carried-object features. The external-attachment feature vector enriches the features of the pedestrian to be recognized; matching it, per semantic information item, against the corresponding vector of the target pedestrian in the reference image to determine whether the image shows the target pedestrian can therefore improve the accuracy of pedestrian re-identification.
In a possible implementation of the first aspect, generating the human body part enhancement map of the pedestrian to be recognized according to the image of the pedestrian to be recognized and the human body part thermodynamic diagram includes: merging the image of the pedestrian to be recognized and the human body part thermodynamic diagram through a concat operation to generate the human body part feature enhancement map.
In a possible implementation of the first aspect, the method further includes: the concat algorithm includes at least one of: a nearest neighbor interpolation algorithm, a bilinear interpolation algorithm, a mean interpolation algorithm, a median interpolation algorithm and an S-Spline algorithm.
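As a sketch of the merge step, assuming the thermodynamic map must first be brought to the image's spatial size by one of the listed interpolation algorithms (bilinear is shown here, written out in plain numpy), the "concat" is then a channel-wise concatenation:

```python
import numpy as np

def resize_bilinear(img, out_h, out_w):
    """Bilinear interpolation of a 2-D map to (out_h, out_w)."""
    in_h, in_w = img.shape[:2]
    y = (np.arange(out_h) + 0.5) * in_h / out_h - 0.5
    x = (np.arange(out_w) + 0.5) * in_w / out_w - 0.5
    y0 = np.clip(np.floor(y).astype(int), 0, in_h - 1)
    y1 = np.clip(y0 + 1, 0, in_h - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, in_w - 1)
    x1 = np.clip(x0 + 1, 0, in_w - 1)
    wy = np.clip(y - y0, 0, 1)[:, None]
    wx = np.clip(x - x0, 0, 1)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def concat_with_heatmap(image, heatmap):
    """Resize the 2-D thermodynamic map to the image's size, then concatenate
    it as an extra channel -- the 'concat' merge of image and thermodynamic map."""
    hm = resize_bilinear(heatmap, image.shape[0], image.shape[1])
    return np.concatenate([image, hm[..., None]], axis=-1)
```

In practice one channel per body part would be appended, giving the (H, W, 3 + K) enhancement map; a single heatmap channel is shown to keep the sketch short.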
In a second aspect, an embodiment of the present application provides an image recognition apparatus, including: the acquisition module is used for acquiring an image of a pedestrian to be identified; the part detection module is used for obtaining a human body part thermodynamic diagram of the pedestrian to be recognized based on the image of the pedestrian to be recognized, wherein the human body part thermodynamic diagram is used for enhancing the characteristics of the body part of the pedestrian to be recognized; the enhancement module is used for generating a human body part enhancement map of the pedestrian to be recognized according to the image of the pedestrian to be recognized and the human body part thermodynamic diagram; the feature extraction module is used for determining a plurality of semantic information related to the body of the pedestrian to be recognized in the human body part enhancement map and a human body feature vector corresponding to each semantic information, wherein the human body feature vector comprises an enhancement feature vector representing the body part of the pedestrian to be recognized; the identification module is used for matching the human body feature vector corresponding to each semantic information in the to-be-identified pedestrian image with the human body feature vector corresponding to each semantic information related to the target pedestrian in the reference image, and determining whether the to-be-identified pedestrian image is the image where the target pedestrian is located.
In a possible implementation of the second aspect, matching the human body feature vector corresponding to each semantic information item in the image of the pedestrian to be recognized with the human body feature vector corresponding to each semantic information item related to the target pedestrian in the reference image, and determining whether the image of the pedestrian to be recognized is the image of the target pedestrian, includes: obtaining a human skeleton attention map of the pedestrian to be recognized based on the image of the pedestrian to be recognized, where the human skeleton attention map is used for enhancing the features of the body parts of the pedestrian to be recognized; and matching the human body feature vector fused with the feature vectors of the pixels of the human skeleton attention map against the human body feature vector corresponding to each semantic information item related to the target pedestrian in the reference image, to determine whether the image of the pedestrian to be recognized is the image of the target pedestrian.
In a possible implementation of the second aspect, the semantic information of the image of the pedestrian to be recognized includes a face, and the human body feature vector corresponding to the face in the image to be recognized is matched with the human body feature vector corresponding to the face of the target pedestrian in the reference image, to determine whether the image to be recognized is the image of the target pedestrian.
In a possible implementation of the second aspect, the semantic information of the image of the pedestrian to be recognized includes limbs, and the human body feature vector corresponding to the limbs in the image to be recognized is matched with the human body feature vector corresponding to the limbs of the target pedestrian in the reference image, to determine whether the image to be recognized is the image of the target pedestrian.
In a possible implementation of the second aspect, the human body feature vector further includes a feature vector representing the external attachments of the pedestrian to be recognized. The feature vector representing the external attachments corresponding to each semantic information item in the image of the pedestrian to be recognized is matched with the corresponding external-attachment feature vector for each semantic information item related to the target pedestrian in the reference image, to determine whether the image of the pedestrian to be recognized is the image of the target pedestrian.
In a possible implementation of the second aspect, generating the human body part enhancement map of the pedestrian to be recognized according to the image of the pedestrian to be recognized and the human body part thermodynamic diagram includes: merging the image of the pedestrian to be recognized and the human body part thermodynamic diagram through a concat operation to generate the human body part feature enhancement map.
In a possible implementation of the second aspect, the apparatus further includes: the concat algorithm includes at least one of: a nearest neighbor interpolation algorithm, a bilinear interpolation algorithm, a mean interpolation algorithm, a median interpolation algorithm and an S-Spline algorithm.
In a third aspect, the present application provides a machine-readable medium having instructions stored thereon which, when executed on a machine, cause the machine to perform the image recognition method of the first aspect and its possible implementations.
In a fourth aspect, an embodiment of the present application provides an electronic device, including:
a memory for storing instructions for execution by one or more processors of the electronic device; and
a processor, being one of the processors of the electronic device, configured to perform the image recognition method of the first aspect and its possible implementations.
Drawings
FIG. 1 illustrates a prior art image recognition schematic, according to some embodiments of the present application;
FIG. 2 illustrates a schematic diagram of a human body part thermodynamic diagram, according to some embodiments of the present application;
FIG. 3 illustrates a human skeletal attention map schematic, in accordance with some embodiments of the present application;
FIG. 4 illustrates an image reduction diagram, according to some embodiments of the present application;
FIG. 5 illustrates an image magnification schematic, according to some embodiments of the present application;
FIG. 6 illustrates an image-based recognition scene graph, according to some embodiments of the present application;
FIG. 7 illustrates a schematic diagram of a component architecture for image recognition, according to some embodiments of the present application;
FIG. 8 illustrates a flow diagram of a method of image recognition, according to some embodiments of the present application;
FIG. 9 illustrates a human body part enhancement pictorial illustration of a pedestrian image combined with a human body part thermodynamic diagram, in accordance with some embodiments of the present application;
FIG. 10 illustrates a block diagram of an image recognition device, according to some embodiments of the present application;
FIG. 11 illustrates a block diagram of an electronic device, in accordance with some embodiments of the present application;
fig. 12 illustrates a block diagram of a system on a chip (SoC), according to some embodiments of the present application.
Detailed Description
Illustrative embodiments of the present application include, but are not limited to, an image recognition method and an apparatus, medium, and electronic device thereof.
To solve the above problems, in the embodiments of the application the image to be recognized (for example, a whole-body image) and the reference image (for example, a half-body image) are matched using the features of slice parts carrying the same semantic information (for example, head with head), which improves the accuracy of image recognition. Specifically, the positions of the pedestrian in the image are obtained from pedestrian pose key points, and the positions of the body parts are obtained from the lines connecting those key points, so that the pose key points and the body parts are given greater attention weight; when features are extracted at these locations, richer human body features are obtained. Matching these rich features between slice parts with the same semantic information (head with head) improves the accuracy of pedestrian re-identification.
Embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
For the convenience of understanding the technical solutions provided by the embodiments of the present application, the following key terms used in the embodiments of the present application are explained:
thermodynamic diagram of human body part: and the method is used for marking the regional position characteristics of the human body part in the image. The key parts of the human body can be displayed in a special highlight form.
For example, fig. 2 shows how thermodynamic diagrams for 10 key parts of the human body are obtained from a pedestrian image. Figs. 2(c) to 2(l) are the thermodynamic diagrams of the key parts: fig. 2(c) is the head thermodynamic diagram and contains a highlighted head region c; fig. 2(d) is the thermodynamic diagram of the upper body, fig. 2(e) of the left upper arm, fig. 2(f) of the left lower arm, fig. 2(g) of the right upper arm, fig. 2(h) of the right lower arm, fig. 2(i) of the left thigh, fig. 2(j) of the left calf, fig. 2(k) of the right thigh, and fig. 2(l) of the right calf.
Specifically, in some embodiments, the acquired pedestrian image is processed to generate a key-point thermodynamic diagram of the human body: fig. 2(a) is the acquired pedestrian image, fig. 2(b) is the resulting human body key-point thermodynamic diagram. For example, human body key-point detection is run on the picture with a key-point detector to obtain the coordinates of 13 key points of the human body (head, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, right ankle); a heatmap is generated for each key point from a Gaussian distribution; and related key points are connected to generate a heatmap of each body part, specifically of 10 parts: head, upper body, left upper arm, left lower arm, right upper arm, right lower arm, left thigh, left calf, right thigh and right calf.
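The key-point-to-part-heatmap construction above can be sketched as follows. This is illustrative: Gaussians are sampled along the segment joining two key points to approximate "connecting" them, whereas the actual generation procedure in the embodiments may differ.

```python
import numpy as np

def keypoint_heatmap(h, w, center, sigma=3.0):
    """2-D Gaussian heatmap centred on one detected key point (y, x)."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = center
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

def part_heatmap(h, w, kp_a, kp_b, sigma=3.0, n_samples=10):
    """Heatmap of one body part, obtained by 'connecting' its two end key
    points: pixel-wise max over Gaussians sampled along the segment
    kp_a -> kp_b (e.g. left elbow -> left wrist gives the left lower arm)."""
    ts = np.linspace(0.0, 1.0, n_samples)
    maps = [keypoint_heatmap(h, w,
                             (kp_a[0] + t * (kp_b[0] - kp_a[0]),
                              kp_a[1] + t * (kp_b[1] - kp_a[1])), sigma)
            for t in ts]
    return np.max(maps, axis=0)
```

Applying `part_heatmap` to the 10 key-point pairs listed above yields the 10 part thermodynamic diagrams of figs. 2(c) to 2(l).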
Human skeleton attention map: the method is used for marking the regional position characteristics of the skeleton architecture of the human body distributed in the image so as to highlight the skeleton architecture of the human body.
For example, fig. 3 illustrates, according to some embodiments of the present application, how a human skeleton attention map is derived from a pedestrian image. Fig. 3(c) is the human skeleton attention map, showing the distribution of the human trunk; it contains the architecture formed by the bone joints of the human body, for example M is the skull, N the torso and Q the limbs. Specifically, in some embodiments, the acquired pedestrian image is processed to generate a key-point thermodynamic diagram of the human body, the key points in that diagram are connected in sequence according to the skeleton distribution, and Gaussian blurring is then applied to produce the human skeleton attention map.
For example, fig. 3(a) is the acquired pedestrian image and fig. 3(b) is the human body key point thermodynamic diagram. As shown in fig. 3(b), the diagram includes the key points of 13 human body parts, specifically a head key point, left and right shoulder key points, left and right elbow key points, left and right wrist key points, left and right hip key points, left and right knee key points, and left and right ankle key points. The 13 key points in fig. 3(b) are connected and processed with Gaussian blur to obtain the human skeleton attention map in fig. 3(c).
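The connect-then-blur construction can be sketched as follows; the toy skeleton (head and shoulder key points only), the canvas size, and the blur parameters are illustrative assumptions. A real implementation would draw all 13 key point connections and would typically use a library blur such as cv2.GaussianBlur.

```python
import numpy as np

def draw_limb(canvas, p, q, samples=200):
    """Rasterize the segment between key points p=(x, y) and q=(x, y)."""
    for t in np.linspace(0.0, 1.0, samples):
        x = int(round(p[0] + t * (q[0] - p[0])))
        y = int(round(p[1] + t * (q[1] - p[1])))
        canvas[y, x] = 1.0
    return canvas

def gaussian_blur(img, sigma=2.0):
    """Separable Gaussian blur implemented with two 1-D convolutions."""
    radius = int(3 * sigma)
    xs = np.arange(-radius, radius + 1)
    k = np.exp(-xs ** 2 / (2.0 * sigma ** 2))
    k /= k.sum()
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

# Toy skeleton: head connected to left/right shoulders on a 384 x 128 canvas
kps = {"head": (64, 30), "l_shoulder": (45, 60), "r_shoulder": (83, 60)}
att = np.zeros((384, 128))
for a, b in [("head", "l_shoulder"), ("head", "r_shoulder")]:
    draw_limb(att, kps[a], kps[b])
att = gaussian_blur(att)   # soft attention map concentrated along the bones
```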
Image reduction: also called image subsampling or image downsampling. In the embodiments of the present application, image reduction is mainly used for: reducing the pedestrian image to be recognized to a preset size to meet the input size requirement of the human appearance feature extraction model, which downsamples the input image to extract its appearance features; and reducing the human skeleton attention map to a preset size so that it can be combined with the human body part feature vector in the human body part enhancement module.
For example, fig. 4 illustrates an image reduction diagram according to some embodiments of the present application. As shown in fig. 4, the image in fig. 4(a) has a size of 605 × 200, where 605 is its length and 200 its width; reducing it produces the image in fig. 4(b), with a size of 354 × 117, where 354 is its length and 117 its width.
In embodiments of the present application, the image may be reduced or downsampled by an image reduction algorithm. There are many kinds of image reduction algorithms, such as nearest neighbor interpolation, bilinear interpolation, mean interpolation, median interpolation, S-Spline algorithm, etc.
Image enlargement: also called image upsampling or image interpolation. In the embodiments of the present application, image enlargement is mainly used to enlarge the pedestrian image to be recognized to a preset size so as to meet the input size requirement of the human appearance feature extraction model.
For example, fig. 5 illustrates an image magnification schematic, according to some embodiments of the present application. As shown in fig. 5, the image in fig. 5(a) has a size of 151 × 50, where 151 is its length and 50 its width; enlarging it produces the image in fig. 5(b), with a size of 354 × 117, where 354 is its length and 117 its width.
In the embodiments of the present application, the main purpose of image enlargement is to enlarge the original image to a fixed size so as to meet the input size requirement of the human appearance feature extraction model. There are many image enlargement algorithms, such as nearest neighbor interpolation, bilinear interpolation, mean interpolation, median interpolation, the S-Spline algorithm, etc.
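A minimal sketch of one of the listed algorithms, nearest neighbor interpolation, which handles both the reduction of fig. 4 and the enlargement of fig. 5; production code would usually call a library routine such as cv2.resize instead.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest neighbor interpolation: works for both reduction and enlargement."""
    in_h, in_w = img.shape[:2]
    rows = (np.arange(out_h) * in_h / out_h).astype(int)   # source row per output row
    cols = (np.arange(out_w) * in_w / out_w).astype(int)   # source col per output col
    return img[rows][:, cols]

small = resize_nearest(np.random.rand(605, 200, 3), 354, 117)   # reduction, as in fig. 4
big   = resize_nearest(np.random.rand(151, 50, 3), 354, 117)    # enlargement, as in fig. 5
print(small.shape, big.shape)   # (354, 117, 3) (354, 117, 3)
```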
Some embodiments of the present application disclose a scenario for the image recognition method. Fig. 6 shows a schematic diagram of this scenario. The scenario shown in fig. 6 includes an electronic device 600 and a plurality of image capturing devices 601-1 to 601-n; the electronic device 600 may establish a communication connection with each of the image capturing devices 601-1 to 601-n through a wired or wireless link.
The image capturing devices 601-1 to 601-n are configured to capture images and videos and send them to the electronic device 600. The electronic device 600 receives the images and videos sent by the image capturing devices 601-1 to 601-n, and recognizes, from them, the images and videos containing the image to be recognized.
It is understood that the electronic device 600 may be various devices having image or video processing capabilities, such as a Personal Computer (PC), a notebook computer, a server, or the like. The server may be an independent physical server, a server cluster formed by a plurality of physical servers, or a server providing basic cloud computing services such as a cloud database, a cloud storage, a CDN, and the like, and the scale of the server may be planned according to the number of videos to be processed, which is not limited in the embodiment of the present application.
The image capturing device 601-1 may be a device with an image and video capturing function, such as a monitoring camera or an unmanned aerial vehicle with a camera. The monitoring camera 601-1 may be disposed in a place such as a mall, a road, or a subway entrance, and may be used to capture videos of pedestrians in these places. In practical applications, the image recognition scene may include a larger number of cameras 601-1, for example, a camera 601-1 disposed on each floor of a shopping mall.
Further, it is understood that the camera 601-1 and the electronic device 600 may be communicatively coupled via one or more networks. The network may be a wired network or a wireless network; for example, the wireless network may be a mobile cellular network or a Wireless Fidelity (Wi-Fi) network, and may also be other possible networks, which is not limited in the embodiments of the present application.
It is to be understood that the image recognition scenario shown in fig. 6 is only one exemplary scenario for implementing the embodiment of the present application, and the embodiment of the present application is not limited to the scenario shown in fig. 6. In other embodiments, the scenario illustrated in FIG. 6 may include more or fewer devices or components than the illustrated embodiment, or some components may be combined, some components may be split, or a different arrangement of components.
In some embodiments of the present application, the image recognition method may be performed by a component architecture for image recognition, which is executed in the electronic device 600. The following detailed description is made with reference to the accompanying drawings.
FIG. 7 shows a schematic diagram of a component architecture for image recognition, according to an embodiment of the present application. As shown in fig. 7, the component architecture 70 includes a human body posture detection model 72, a human body part enhancement module 73, a human appearance feature extraction model 74, a human body part enhancement module 75, an image recognition module 76, and a pedestrian image 71 to be recognized. The parts of the component architecture 70 are described in detail below.
Human body posture detection model 72: used to detect the posture of the pedestrian in the input image 71 to be recognized and generate a human body key point thermodynamic diagram, with key points such as neck, elbow, wrist, shoulder, and head key points. In the embodiments of the present application, a key point identified in the image is represented by the coordinates of the corresponding joint in the image. It is understood that the positions of the human body joints in the image are known from the human body key point thermodynamic diagram, from which it can further be determined whether the pedestrian in the image 71 is occluded, the relative positions of the pedestrian's parts, and the like.
Specifically, the pedestrian image 71 to be recognized is input into the human body posture detection model 72 to generate a human body key point thermodynamic diagram; the key points in the diagram are connected and subjected to Gaussian blur processing, and finally the human body part thermodynamic diagrams and the human skeleton attention map are generated.
Human body part enhancement module 73: used to combine the pedestrian image 71 to be recognized with the 10 human body part thermodynamic diagrams to generate a human body part enhancement map. It is understood that the human body part enhancement map contains both the appearance features of the human body to be recognized in the image 71 and the position information of the human body parts in the image.
Human appearance feature extraction model 74: used to extract the appearance features of the human body parts in the human body part enhancement map and generate the human body part feature vector. The human body part feature vector comprises a plurality of human body feature vectors, where each human appearance feature is represented by one or more of them. Specifically, the human appearance features may be facial features, limb features, gender features, clothing features, height features, and the like.
Human body part enhancement module 75: used to combine the human body part feature vector with the human skeleton attention map to generate the human body feature vector. It can be understood that the human body feature vector includes both the position information of the parts in the pedestrian image and the appearance feature information of the human body. The human body part enhancement module 75 makes the appearance features of the human body parts more apparent. Based on the human body feature vector, the image recognition module can analyze the appearance features of the human body parts in the pedestrian image to be recognized in a targeted manner, improving the accuracy of human body re-identification.
Image recognition module 76: used to identify images that match the reference image. The image recognition module 76, i.e., the classifier of the human appearance feature extraction model 74, is used to generate the prediction probability that the pedestrian image to be recognized matches other pedestrian images in the video. Specifically, the classifier includes three fully connected layers and one sigmoid regression layer, where the three fully connected layers comprise two hidden layers and one output layer. The sigmoid regression layer comprises a sigmoid function and a cross-entropy loss function: passing the output vector of the output layer through the sigmoid layer yields the prediction probability that the pedestrian image to be recognized matches other pedestrian images in the video, and the cross-entropy loss function is used to train the whole network.
The following takes a pedestrian image in the surveillance video captured by the camera 601-1 as an example, and describes the image recognition method provided by the present application in detail with reference to figs. 2 to 8. FIG. 8 illustrates a flow diagram of a method of image recognition, according to some embodiments of the present application. Specifically, the method comprises the following steps:
Step 802: acquire an image of the pedestrian to be recognized.
It is understood that, in most cases, pedestrians are in motion, so their postures in images are often inconsistent, and, depending on the surrounding environment, the pedestrian's body may be partially covered by nearby obstructions. Therefore, the image of the pedestrian to be recognized may include a partial body image or a full body image of the pedestrian.
It is understood that, in the embodiments of the present application, the image 71 of the pedestrian to be recognized may be an image containing a pedestrian extracted from a frame captured by the camera 601-1, or from a frame of the surveillance video captured by the camera 601-1.
For example, take extracting the pedestrian image 71 to be recognized from the surveillance video, which includes one or more frames of pedestrian images. The video containing the image of the pedestrian to be recognized is preprocessed to obtain the image 71. The preprocessing comprises performing framing processing on the video to generate video frames, and performing target detection on the video frames to obtain images containing pedestrians. Video frames may also be referred to as image frames, images, or pictures; in this application, these terms all refer to video frames and are not specifically limited herein.
It can be understood that, in the process of image processing, the image 71 of the pedestrian to be recognized also needs to be set to a preset size, so that it can be fused with other images of the same preset size. Specifically, the image 71 may be enlarged or reduced to obtain an image of the preset size, or may already be of the preset size. The preset size is x × y, where x is the size in the length direction and y the size in the width direction of the image; the units may be pixels or centimeters. For example, the preset size is 384 × 128, in pixels.
Further, it is understood that when the pedestrian image 71 to be recognized itself is of a preset size, the image of the pedestrian image 71 to be recognized does not need to be subjected to the enlargement or reduction processing.
Further, it is understood that the pedestrian image 71 to be recognized may be a color image (RGB image) or a black-and-white image.
In the embodiment of the present application, the image 71 of the pedestrian to be recognized may be set to a preset size by an image enlargement or reduction algorithm. Specifically, the image enlarging or reducing algorithm may be implemented by, for example, an S-Spline algorithm, a bicubic interpolation algorithm, a bilinear interpolation algorithm, a nearest neighbor interpolation algorithm, a median interpolation algorithm, or the like, which is not limited herein.
In the following, the image 71 of the pedestrian to be recognized is an RGB image and the preset size is 384 × 128 pixels; on this basis, the image recognition method in the embodiments of the present application is described in detail.
The size format of the pedestrian image to be recognized is 384 × 128 × 3, where 384 is the size in the length direction of the pedestrian image 71 to be recognized, 128 is the size in the width direction of the pedestrian image 71 to be recognized, and 3 is the number of channels of the pedestrian image 71 to be recognized, and the 3 channels are the R channel, the G channel, and the B channel, respectively.
Step 804: obtain a human body part thermodynamic diagram of the pedestrian to be recognized based on the image of the pedestrian to be recognized, where the human body part thermodynamic diagram is used to enhance the features of the human body parts of the pedestrian to be recognized.
It can be understood that, in order to subsequently match the features of parts whose semantic information is the same (e.g., both are heads) between the image to be recognized (for example, a whole-body image) and the reference image (for example, a half-body image), and thereby improve the accuracy of image recognition, the body part categories in the images need to be determined. In the embodiments of the present application, the human body posture detection model 72 performs position detection and part semantic detection on the image of the pedestrian to be recognized to obtain the human body part thermodynamic diagram or human skeleton attention map, which includes the part semantic information of the human body parts and the position information of the human body parts in the image of the pedestrian to be recognized.
It can be understood that human semantic information is the classification of body parts, such as the head, upper body, lower body, etc. The position information of a human body part in the human body part thermodynamic diagram or human skeleton attention map can be its position coordinates in that diagram or map.
For example, the human body posture detection model 72 may perform position detection and part semantic detection on the image of the pedestrian to be recognized to obtain a human body part thermodynamic diagram or a human skeleton attention map, which includes the part semantic information of the human body parts and their position information in the diagram. Specifically, the image 71 of the pedestrian to be recognized is input to the human body posture detection model 72, which outputs the human body part thermodynamic diagrams and the human skeleton attention map; the part semantics includes the categories of the human body parts in the thermodynamic diagrams and in the attention map, and the position information includes the position coordinates of the human body parts in the thermodynamic diagrams and in the attention map.
For example, the categories of the human body parts for the human skeleton attention map include head, limbs, and torso, and the categories of the human body parts for the human body part thermodynamic diagrams include the upper body, left upper arm, left lower arm, right upper arm, right lower arm, left thigh, left calf, right thigh, right calf, and the like. For example, the position coordinates of a human body part in the human body part thermodynamic diagram include the position coordinates of the head's pixel points in that diagram, and the position coordinates of a human body part in the human skeleton attention map include the position coordinates of the head's pixel points in that map.
For example, the pedestrian image 71 to be recognized, of size 384 × 128 × 3, is input to the human body posture detection model 72. The length and width of the input image are not changed; only the number of channels changes, from the original 3 channels to 1 channel, i.e., each output human body part thermodynamic diagram and skeleton attention map has 1 channel. For example, each of the 10 human body part thermodynamic diagrams and the 1 skeleton attention map has a size of 384 × 128 × 1, where 384 is the size in the length direction of the image, 128 the size in the width direction, and 1 the number of channels.
In the embodiments of the present application, the function of the human body posture detection model 72 may be implemented by an algorithm such as the OpenPose algorithm, the Convolutional Pose Machines (CPM) algorithm, the PoseNet algorithm, or the AlphaPose algorithm, but is not limited thereto.
Step 806: generate a human body part enhancement map of the pedestrian to be recognized according to the image of the pedestrian to be recognized and the human body part thermodynamic diagram.
In order to make the features of the parts in the pedestrian image 71 to be recognized more prominent, the human body part thermodynamic diagrams can be combined with the pedestrian image so that the features of the human body parts become more obvious. That is, the human body part enhancement map gives greater attention weight to the pedestrian posture key points and the human body parts, so that richer human body features are obtained when features are extracted from them.
In the embodiments of the present application, the pedestrian image 71 to be recognized and the 10 human body part thermodynamic diagrams are input to the human body part enhancement module 73 to generate a human body part enhancement map, which contains the position information of the human body parts. Specifically, the image features of the three channels of the pedestrian image 71 to be recognized, of size 384 × 128 × 3, are combined through a merging algorithm with the 10 human body part thermodynamic diagrams, each of size 384 × 128 × 1, to generate a human body part enhancement map of size 384 × 128 × 13, where the 10 thermodynamic diagrams indicate the positions of the human body parts in the image.
In the embodiment of the application, the merging algorithm is used for integrating the characteristic information and merging the number of channels of the image based on the pixel points of the image. Merging algorithms are often used to combine features, to fuse features extracted by multiple convolutional feature extraction frameworks, or to fuse information from output layers.
In the embodiments of the present application, there are many algorithms for merging the image 71 of the pedestrian to be recognized with the human body part thermodynamic diagrams; for example, the concat function, which is mainly used to merge the channels of images, may be used to merge them to generate the human body part enhancement map.
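The channel merge that the concat function performs can be illustrated with numpy; the random arrays below stand in for the real pedestrian image and thermodynamic diagrams.

```python
import numpy as np

pedestrian = np.random.rand(384, 128, 3)    # RGB pedestrian image, 3 channels
part_maps  = np.random.rand(384, 128, 10)   # 10 human body part thermodynamic diagrams

# Channel-wise merge: 3 + 10 = 13 channels, the human body part enhancement map
enhanced = np.concatenate([pedestrian, part_maps], axis=-1)
print(enhanced.shape)   # (384, 128, 13)
```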
As shown in fig. 9, fig. 9(a) is a human body thermodynamic diagram obtained by combining 10 human body part thermodynamic diagrams, fig. 9(b) is a pedestrian image 71 to be recognized, and fig. 9(c) is a human body part enhancement diagram generated by combining a pedestrian image and a human body part thermodynamic diagram. It is understood that after the pedestrian image 71 to be recognized is combined with the human body part thermodynamic diagram, the generated human body enhancement map can identify the position of the part of the human body to be recognized, for example, the position of the head, the position of the upper body, the position of the left upper arm, the position of the right upper arm, the position of the left lower arm, the position of the right lower arm, the position of the left upper leg, the position of the right upper leg, the position of the left lower leg, and the position of the right lower leg in the pedestrian image 71 to be recognized.
It can be understood that combining the human body part thermodynamic diagram with the pedestrian image 71 to be recognized locates the parts in the image, which facilitates comparing the extracted appearance features of the human body to be recognized with human appearance features in an image library or captured in real time, improving the accuracy of image recognition.
Step 808: determine a plurality of pieces of semantic information related to the body of the pedestrian to be recognized in the human body part enhancement map and a human body feature vector corresponding to each piece of semantic information, where the human body feature vector comprises an enhancement feature vector representing the body parts of the pedestrian to be recognized.
The human body feature vector comprises enhancement feature vectors representing human body parts and feature vectors representing what the human body wears externally. Specifically, the enhancement feature vectors representing the human body parts include feature vectors of hair, face, limbs, gender, and the like, while the feature vectors representing external wear include clothing features, accessory features, and carried-object features. The clothing features include: coats, trousers, dresses, shoes, etc. The accessory features include: hats, sunglasses, glasses, scarves, belts, etc. The carried-object features include: shoulder bags, backpacks, handbags, trolley cases, umbrellas, etc.
In the embodiments of the present application, the human appearance feature extraction model may be used to extract the appearance features of the human body parts in the human body part enhancement map and generate the human body feature vector. The human body part feature vector comprises a plurality of human body feature vectors, where each human appearance feature is represented by one or more of them. For example, facial features, backpack features, and coat features are each represented by one or more human body feature vectors. It is to be understood that the human body feature vector may comprise only the enhancement feature vectors characterizing the body parts of the pedestrian to be recognized.
In the embodiments of the present application, the human body feature vector is generated by extracting the appearance features of the human body part enhancement map through the human appearance feature extraction model 74. The model 74 may be a deep residual network (ResNet) pre-trained on the ImageNet data set. The deep residual network may be the 50-layer ResNet50, the 101-layer ResNet101, the 152-layer ResNet152, ResNet152d, etc.; the present application does not limit the specific network, and ResNet152d is taken as an example below.
In the embodiments of the present application, the human body part enhancement map is input into a ResNet152d network pre-trained on ImageNet, with the final downsampling and fully connected layers removed; the network parameters are initialized from the pre-trained ImageNet model before image features are extracted. Specifically, the human body part enhancement map, of size 384 × 128 × 13, is input to the deep residual network and subjected to 5 stages of convolution processing, and the res5c layer of the network outputs a human body part feature vector of size 24 × 8 × 2048. The 5 convolution stages include 4 downsamplings, each of which halves the length and width of its input, so the output feature map size is 24 × 8, where 24 is the size in the length direction and 8 the size in the width direction.
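The 24 × 8 spatial size follows directly from the four stride-2 downsamplings described above, each halving the spatial dimensions, which can be checked with a few lines:

```python
# Each of the 4 downsampling steps halves the spatial size, a total factor of 16,
# so a 384 x 128 input yields a 384/16 x 128/16 = 24 x 8 feature map.
def feature_map_size(h, w, num_downsamples=4):
    for _ in range(num_downsamples):
        h, w = h // 2, w // 2
    return h, w

print(feature_map_size(384, 128))   # (24, 8)
```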
Step 810: match the human body feature vector corresponding to each piece of semantic information in the pedestrian image to be recognized with the human body feature vector corresponding to each piece of semantic information related to the target pedestrian in the reference image, and determine whether the pedestrian image to be recognized is an image of the target pedestrian.
In the embodiments of the present application, before being input to the human body part enhancement module 75, the human skeleton attention map is compressed to a preset size whose length and width are the same as those of the human body part feature vector. For example, the human skeleton attention map may be scaled by the nearest neighbor interpolation algorithm to generate a 24 × 8 × 1 attention map, where 24 is the length, 8 the width, and 1 the number of channels of the attention map of the preset size.
In the embodiments of the present application, the human body feature vector is obtained by element-wise multiplication of the 24 × 8 × 2048 human body part feature vector with the corresponding positions of the 24 × 8 × 1 human skeleton attention map. The human body feature vector thus fuses in the human skeleton attention map of the preset size, i.e., the human body part information, and the pixels of the human body parts in the generated feature vector carry higher weight, which facilitates the extraction and analysis of the human body feature information.
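The element-wise multiplication can be sketched in numpy, where the single attention channel broadcasts across all 2048 feature channels; the random arrays stand in for the real feature vector and attention map.

```python
import numpy as np

part_features = np.random.rand(24, 8, 2048)   # human body part feature vector
skeleton_att  = np.random.rand(24, 8, 1)      # resized human skeleton attention map

# Element-wise multiplication; the 1-channel attention map broadcasts across
# all 2048 channels, up-weighting positions that lie on the skeleton.
body_features = part_features * skeleton_att
print(body_features.shape)   # (24, 8, 2048)
```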
In the embodiment of the application, based on the human body feature vector, the human body feature vector of the human body part enhancement map is compared with the human body feature vectors of other pedestrian images in the monitoring video through the image recognition module 76, and a prediction probability that the to-be-recognized pedestrian image is matched with the other pedestrian images in the video is generated.
In the embodiments of the present application, the image recognition module 76 matches the human body feature vectors corresponding to the semantic information of each part of the pedestrian to be recognized with those corresponding to the semantic information of each part of the target pedestrian in the reference image, and determines, according to the matching similarity (i.e., the output prediction probability), whether the image of the pedestrian to be recognized is an image of the target pedestrian: when the prediction probability is greater than a preset value, it is determined to be so. It is to be understood that the reference image may be an image containing the target pedestrian in the surveillance video. The preset value may be set to 0.8 or 0.9; its specific value depends on the practical application and is not limited herein.
In the embodiments of the present application, the image recognition module 76 may be the classifier of the human appearance feature extraction model 74, which is mainly used to generate the prediction probability that the image of the pedestrian to be recognized matches the reference image. The reference image matched against the image of the pedestrian to be recognized may be a human body image in a database, or a pedestrian image acquired by the camera in real time.
Specifically, the classifier of the human appearance feature extraction model 74 includes three fully connected layers and one sigmoid regression layer. The three fully connected layers comprise two hidden layers and one output layer. The sigmoid regression layer comprises a sigmoid function and a cross-entropy loss function: the output vector of the output layer is passed through the sigmoid function to obtain the prediction probability that the pedestrian image to be recognized matches another pedestrian image in the video, while the cross-entropy loss function is used to train the whole network. Training here means: the samples in the sample set and the appearance features of the pedestrian images in those samples are used as input to the human body feature detection network, the network parameters are adjusted using the cross-entropy loss function combined with the back-propagation (BP) algorithm, and after multiple iterations the optimal classification probability distribution, i.e., the pedestrian category and its probability distribution, is obtained.
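The classifier head described above can be sketched in numpy as a forward pass through three fully connected layers (two hidden, one output) followed by a sigmoid, with the binary cross-entropy loss used for training. The layer widths (256, 64), the ReLU activations between hidden layers, and the random initialization are illustrative assumptions, not values given in the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def fc(x, w, b):
    # One fully connected layer: affine transform of the input features.
    return x @ w + b

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def classifier_forward(features, params):
    h1 = relu(fc(features, *params["fc1"]))  # hidden layer 1
    h2 = relu(fc(h1, *params["fc2"]))        # hidden layer 2
    logit = fc(h2, *params["out"])           # output layer
    return sigmoid(logit)                     # prediction probability in (0, 1)

def bce_loss(p, y, eps=1e-7):
    # Cross-entropy loss for binary match / no-match labels.
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

dim, h1, h2 = 128, 256, 64  # assumed feature and hidden dimensions
params = {"fc1": (rng.normal(0, 0.05, (dim, h1)), np.zeros(h1)),
          "fc2": (rng.normal(0, 0.05, (h1, h2)), np.zeros(h2)),
          "out": (rng.normal(0, 0.05, (h2, 1)), np.zeros(1))}

features = rng.normal(size=(4, dim))  # a batch of 4 appearance feature vectors
probs = classifier_forward(features, params)
loss = bce_loss(probs, np.array([[1.0], [0.0], [1.0], [0.0]]))
```

In a real implementation the gradients of this loss would be back-propagated to adjust the weights over multiple iterations, as the training description above states.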
It can be understood that, in the process of acquiring the human body feature information of the pedestrian image to be recognized, the semantic information of the human body is incorporated twice, yielding richer human body features. Matching the pedestrian image to be recognized, which contains these enriched part features, against the reference image containing the corresponding part features of the target pedestrian with the same semantic information improves the accuracy of image recognition.
In the embodiment of the application, on widely distributed data sets collected in surveillance scenarios, applying the method improves the mean average precision (mAP) and the rank-1 index by about 1 to 3 points each. mAP is a common evaluation index in image classification that measures the accuracy of an algorithm: the larger the mAP value, the more accurate the image recognition. rank-1 measures the average top-1 matching accuracy of the algorithm: the larger the rank-1 value, the better the image recognition performance.
Table 1 shows the test results under different configurations. With the human appearance feature extraction model whose baseline network is the ResNet152 network, the mAP is 90.65% and the rank-1 is 92.25%. Adding the human body part thermodynamic diagram on top of the ResNet152 network raises the mAP to 91.75% and the rank-1 to 93.50%. Adding both the human body part thermodynamic diagram and the human skeleton attention map on top of the ResNet152 network raises the mAP to 92.86% and the rank-1 to 94.84%. Because the pedestrian image to be recognized is combined with the position information of the human body parts twice, once during appearance feature extraction and once during identity recognition, the body parts of the pedestrian are highlighted, the final recognition result is more accurate, and the detection efficiency of image recognition is improved.
TABLE 1
Method mAP(%) rank-1(%)
ResNet152d 90.65 92.25
ResNet152d + human body part thermodynamic diagram 91.75 93.50
ResNet152d + human body part thermodynamic diagram + human body skeleton attention diagram 92.86 94.84
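The two indexes quoted in Table 1 can be sketched as follows. This is an illustrative computation, not the patent's evaluation code: for one query, gallery images are ranked by similarity; rank-1 checks whether the top-ranked image is a true match, and average precision (averaged over all queries to give mAP) rewards placing every true match near the top of the ranking. The similarity scores and labels below are toy values.

```python
import numpy as np

def average_precision(scores, labels):
    # Rank gallery items by descending similarity, then average the precision
    # measured at the rank of each true match.
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)
    precision_at_hit = hits[labels == 1] / (np.flatnonzero(labels) + 1)
    return float(precision_at_hit.mean())

def rank1(scores, labels):
    # True when the single most similar gallery image is a true match.
    return bool(np.asarray(labels)[np.argmax(scores)] == 1)

scores = [0.95, 0.40, 0.80, 0.10]  # similarity of 4 gallery images to one query
labels = [1, 0, 1, 0]              # 1 = same identity as the query

ap = average_precision(scores, labels)
top1 = rank1(scores, labels)
```

Here both true matches are ranked first and second, so the average precision for this query is 1.0 and the rank-1 check succeeds; mAP in Table 1 is this quantity averaged over all queries in the data set.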
Fig. 10 illustrates a block diagram of an image recognition device 1000, according to some embodiments of the present application. As shown in fig. 10, the apparatus specifically includes:
the acquiring module 1002 is used for acquiring an image of a pedestrian to be identified;
the part detection module 1004 is used for obtaining a human body part thermodynamic diagram of the pedestrian to be recognized based on the image of the pedestrian to be recognized, wherein the human body part thermodynamic diagram is used for enhancing the characteristics of the body part of the pedestrian to be recognized;
the enhancement module 1006 is used for generating a human body part enhancement map of the pedestrian to be recognized according to the image of the pedestrian to be recognized and the human body part thermodynamic diagram;
a feature extraction module 1008, configured to determine a plurality of semantic information related to the body of the pedestrian to be identified in the human body part enhancement map and a human body feature vector corresponding to each semantic information, where the human body feature vector includes an enhancement feature vector representing the body part of the pedestrian to be identified;
the identifying module 1010 is configured to match a human feature vector corresponding to each semantic information in the to-be-identified pedestrian image with a human feature vector corresponding to each semantic information related to the target pedestrian in the reference image, and determine whether the to-be-identified pedestrian image is an image where the target pedestrian is located.
Specifically, the identification module is further used for obtaining a human skeleton attention diagram of the pedestrian to be identified based on the image of the pedestrian to be identified, wherein the human skeleton attention diagram is used for enhancing the characteristics of the body part of the pedestrian to be identified; and matching the human body feature vector fused with the feature vector of the pixel point of the human body skeleton attention diagram with the human body feature vector corresponding to each semantic information related to the target pedestrian in the reference image, and determining whether the image of the pedestrian to be identified is the image of the target pedestrian.
In an embodiment of the present application, specifically, the image recognition apparatus 1000 is further configured as follows. When the semantic information of the pedestrian image to be recognized includes a face, the human body feature vector corresponding to the face in the pedestrian image to be recognized is matched with the human body feature vector corresponding to the face of the target pedestrian in the reference image, to determine whether the pedestrian image to be recognized is an image containing the target pedestrian. When the semantic information of the pedestrian image to be recognized includes the four limbs, the human body feature vector corresponding to the trunk in the pedestrian image to be recognized is matched with the human body feature vector corresponding to the four limbs of the target pedestrian in the reference image, to determine whether the pedestrian image to be recognized is an image containing the target pedestrian. The human body feature vector further includes a feature vector characterizing the external attachments of the pedestrian to be recognized; the feature vector characterizing the external attachments corresponding to each piece of semantic information in the pedestrian image to be recognized is matched with the feature vector characterizing the external attachments corresponding to each piece of semantic information related to the target pedestrian in the reference image, to determine whether the pedestrian image to be recognized is an image containing the target pedestrian.
The image recognition apparatus 1000 is further configured to merge the pedestrian image to be recognized and the human body part thermodynamic diagram through a concat algorithm to generate the human body part feature enhancement map. The concat algorithm includes at least one of: a nearest-neighbor interpolation algorithm, a bilinear interpolation algorithm, a mean interpolation algorithm, a median interpolation algorithm, and an S-Spline algorithm.
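The merging step above can be sketched as: resize the part thermodynamic diagram (heatmap) to the pedestrian image's resolution, then concatenate it with the image along the channel axis. Nearest-neighbor interpolation is used here for brevity; as the text notes, bilinear and other interpolation algorithms are equally applicable. The array shapes are illustrative assumptions.

```python
import numpy as np

def resize_nearest(heatmap, out_h, out_w):
    # Nearest-neighbor upsampling of a 2-D heatmap to (out_h, out_w).
    in_h, in_w = heatmap.shape
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return heatmap[rows[:, None], cols]

def concat_image_heatmap(image, heatmap):
    """image: (H, W, 3) array; heatmap: (h, w) array at any resolution.

    Returns the (H, W, 4) enhancement input: RGB channels plus one
    heatmap channel appended via concatenation.
    """
    h, w = image.shape[:2]
    resized = resize_nearest(heatmap, h, w)[..., None]
    return np.concatenate([image, resized], axis=-1)

image = np.zeros((8, 6, 3))               # toy pedestrian image
heatmap = np.arange(12.0).reshape(4, 3)   # coarse body part heatmap
enhanced = concat_image_heatmap(image, heatmap)
```

The extra channel lets downstream convolutions weight pixels on the body parts more heavily, which is the stated purpose of the enhancement map.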
It can be understood that the image recognition apparatus 1000 shown in fig. 10 corresponds to the image recognition method provided in the present application; the technical details in the above detailed description of the image recognition method still apply to the image recognition apparatus 1000 shown in fig. 10, so reference is made to the description above and they are not repeated herein.
Fig. 11 illustrates a block diagram of an example electronic device 1100 according to some embodiments of the present application. In some embodiments, the electronic device 1100 may include one or more processors 1104, system control logic 1108 connected to at least one of the processors 1104, system memory 1112 connected to the system control logic 1108, non-volatile memory (NVM) 1116 connected to the system control logic 1108, and a network interface 1120 connected to the system control logic 1108.
In some embodiments, processor 1104 may include one or more single-core or multi-core processors. In some embodiments, the processor 1104 may include any combination of general-purpose processors and special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.).
In some embodiments, system control logic 1108 may include any suitable interface controllers to provide any suitable interface to at least one of processors 1104 and/or to any suitable device or component in communication with system control logic 1108.
In some embodiments, system control logic 1108 may include one or more memory controllers to provide an interface to system memory 1112. System memory 1112 may be used to load and store data and/or instructions. In some embodiments, the memory 1112 of the electronic device 1100 may include any suitable volatile memory, such as a suitable dynamic random access memory (DRAM).
NVM/memory 1116 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions. In some embodiments, the NVM/memory 1116 may include any suitable non-volatile memory such as flash memory and/or any suitable non-volatile storage device such as at least one of a HDD (Hard Disk Drive), CD (Compact Disc) Drive, DVD (Digital Versatile Disc) Drive.
The NVM/memory 1116 may comprise a portion of a storage resource on the device on which the electronic device 1100 is mounted, or it may be accessible by, but not necessarily a part of, the device. The NVM/memory 1116 may be accessed over a network, for example, via a network interface 1120.
In particular, system memory 1112 and NVM/storage 1116 may each include: a temporary copy and a permanent copy of instructions 1124. The instructions 1124 may include: instructions that, when executed by at least one of the processors 1104, cause the electronic device 1100 to implement the method shown in fig. 8. In some embodiments, instructions 1124, hardware, firmware, and/or software components thereof may additionally/alternatively be located in system control logic 1108, network interface 1120, and/or processor 1104.
The network interface 1120 may include a transceiver to provide a radio interface for the electronic device 1100 to communicate with any other suitable device (e.g., front end module, antenna, etc.) over one or more networks. In some embodiments, the network interface 1120 may be integrated with other components of the electronic device 1100. For example, the network interface 1120 may be integrated with at least one of the processors 1104, the system memory 1112, the NVM/storage 1116, and a firmware device (not shown) having instructions that, when executed by at least one of the processors 1104, cause the electronic device 1100 to implement the image recognition method as shown in fig. 8.
The network interface 1120 may further include any suitable hardware and/or firmware to provide a multiple-input multiple-output radio interface. For example, network interface 1120 may be a network adapter, wireless network adapter, telephone modem, and/or wireless modem.
In one embodiment, at least one of the processors 1104 may be packaged together with logic for one or more controllers of system control logic 1108 to form a System In Package (SiP). In one embodiment, at least one of the processors 1104 may be integrated on the same die with logic for one or more controllers of system control logic 1108 to form a system on a chip (SoC).
The electronic device 1100 may further include input/output (I/O) devices 1132. The I/O devices 1132 may include a user interface to enable a user to interact with the electronic device 1100, and a peripheral component interface designed so that peripheral components can also interact with the electronic device 1100. In some embodiments, the electronic device 1100 further comprises a sensor for determining at least one of environmental conditions and location information associated with the electronic device 1100.
Fig. 12 shows a block diagram of a SoC (System on Chip) 1200 according to an embodiment of the present application. In fig. 12, like parts have the same reference numerals. In addition, the dashed boxes are optional features of more advanced SoCs. In fig. 12, the SoC 1200 includes: an interconnect unit 1250 coupled to the application processor 1210; a system agent unit 1270; a bus controller unit 1280; an integrated memory controller unit 1240; a set of one or more coprocessors 1220, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a static random access memory (SRAM) unit 1230; and a direct memory access (DMA) unit 1260. In one embodiment, the coprocessor 1220 comprises a special-purpose processor, such as, for example, a network or communication processor, a compression engine, a GPU, a high-throughput MIC processor, or an embedded processor.
Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of these implementations. Embodiments of the application may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of this application, a processing system includes any system having a processor such as, for example, a Digital Signal Processor (DSP), a microcontroller, an Application Specific Integrated Circuit (ASIC), or a microprocessor.
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. The program code can also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described in this application are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed via a network or via other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or tangible machine-readable storage used in transmitting information over the Internet in the form of electrical, optical, acoustical, or other propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Thus, a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).
In the drawings, some features of the structures or methods may be shown in a particular arrangement and/or order. However, it is to be understood that such specific arrangement and/or ordering may not be required. Rather, in some embodiments, the features may be arranged in a manner and/or order different from that shown in the illustrative figures. In addition, the inclusion of a structural or methodical feature in a particular figure is not meant to imply that such feature is required in all embodiments, and in some embodiments, may not be included or may be combined with other features.
It should be noted that, in the apparatus embodiments of the present application, each unit/module is a logical unit/module. Physically, one logical unit/module may be one physical unit/module, a part of one physical unit/module, or a combination of multiple physical units/modules; the physical implementation of a logical unit/module is not itself essential, and the combination of functions implemented by the logical units/modules is the key to solving the technical problem addressed by the present application. Furthermore, in order to highlight the innovative part of the present application, the above apparatus embodiments do not introduce units/modules that are less closely related to solving the technical problem presented in the present application, which does not mean that no other units/modules exist in the above apparatus embodiments.
It is noted that, in the examples and descriptions of this patent, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
While the present application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present application.

Claims (10)

1. An image recognition method, characterized in that the method comprises:
acquiring an image of a pedestrian to be identified;
obtaining a human body part thermodynamic diagram of the pedestrian to be recognized based on the image of the pedestrian to be recognized, wherein the human body part thermodynamic diagram is used for enhancing the characteristics of the body part of the pedestrian to be recognized;
generating a human body part enhancement map of the pedestrian to be recognized according to the image of the pedestrian to be recognized and the human body part thermodynamic map;
determining a plurality of semantic information related to the body of the pedestrian to be identified in the human body part enhancement map and a human body feature vector corresponding to each semantic information, wherein the human body feature vector comprises an enhancement feature vector representing the body part of the pedestrian to be identified;
and matching the human body feature vector corresponding to each semantic information in the to-be-identified pedestrian image with the human body feature vector corresponding to each semantic information related to the target pedestrian in the reference image, and determining whether the to-be-identified pedestrian image is the image of the target pedestrian.
2. The method according to claim 1, wherein matching the human body feature vector corresponding to each semantic information in the image of the pedestrian to be identified with the human body feature vector corresponding to each semantic information related to the target pedestrian in a reference image to determine whether the image of the pedestrian to be identified is the image of the target pedestrian comprises:
obtaining a human body skeleton attention map of the pedestrian to be recognized based on the image of the pedestrian to be recognized, wherein the human body skeleton attention map is used for enhancing the characteristics of the body part of the pedestrian to be recognized;
matching the human body feature vector fused with the feature vector of the pixel point of the human body skeleton attention diagram with the human body feature vector corresponding to each semantic information related to the target pedestrian in the reference image, and determining whether the image of the pedestrian to be identified is the image of the target pedestrian.
3. The method according to claim 1, wherein the semantic information of the pedestrian image to be recognized comprises a face, and the human feature vector corresponding to the face in the pedestrian image to be recognized is matched with the human feature vector corresponding to the face of a target pedestrian in a reference image to determine whether the pedestrian image to be recognized is the image of the target pedestrian.
4. The method according to claim 3, wherein the semantic information of the pedestrian image to be recognized comprises limbs, and the human feature vector corresponding to the trunk in the pedestrian image to be recognized is matched with the human feature vector corresponding to the limbs of the target pedestrian in the reference image to determine whether the pedestrian image to be recognized is the image of the target pedestrian.
5. The method according to claim 4, wherein the human body feature vector further comprises a feature vector characterizing the external attachments of the pedestrian to be identified, the feature vector characterizing the external attachments of the pedestrian to be identified corresponding to each semantic information in the image of the pedestrian to be identified is matched with the feature vector characterizing the external attachments corresponding to each semantic information related to the target pedestrian in a reference image, and whether the image of the pedestrian to be identified is the image where the target pedestrian is located is determined.
6. The method of claim 1, wherein generating the human body part enhancement map of the pedestrian to be identified from the image of the pedestrian to be identified and the human body part thermodynamic map comprises:
and merging the pedestrian image to be identified and the human body part thermodynamic diagram through a concat algorithm to generate the human body part feature enhancement diagram.
7. The method of claim 6, wherein the concat algorithm comprises at least one of: a nearest neighbor interpolation algorithm, a bilinear interpolation algorithm, a mean interpolation algorithm, a median interpolation algorithm and an S-Spline algorithm.
8. An image recognition apparatus, comprising:
the acquisition module is used for acquiring an image of a pedestrian to be identified;
the part detection module is used for obtaining a human body part thermodynamic diagram of the pedestrian to be recognized based on the image of the pedestrian to be recognized, wherein the human body part thermodynamic diagram is used for enhancing the characteristics of the body part of the pedestrian to be recognized;
the enhancement module is used for generating a human body part enhancement map of the pedestrian to be recognized according to the image of the pedestrian to be recognized and the human body part thermodynamic map;
the feature extraction module is used for determining a plurality of semantic information related to the body of the pedestrian to be recognized in the human body part enhancement map and a human body feature vector corresponding to each semantic information, wherein the human body feature vector comprises an enhancement feature vector representing the body part of the pedestrian to be recognized;
and the identification module is used for matching the human body feature vector corresponding to each semantic information in the to-be-identified pedestrian image with the human body feature vector corresponding to each semantic information related to the target pedestrian in the reference image, and determining whether the to-be-identified pedestrian image is the image of the target pedestrian.
9. A readable medium having stored thereon instructions that, when executed on an electronic device, cause the electronic device to perform the image recognition method of any one of claims 1 to 7.
10. An electronic device, comprising:
a memory for storing instructions for execution by one or more processors of the electronic device, an
A processor, being one of the processors of the electronic device, for performing the image recognition method of any one of claims 1 to 7.
CN202011550542.8A 2020-12-24 2020-12-24 Image recognition method, device and medium and electronic equipment thereof Pending CN112580544A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011550542.8A CN112580544A (en) 2020-12-24 2020-12-24 Image recognition method, device and medium and electronic equipment thereof


Publications (1)

Publication Number Publication Date
CN112580544A true CN112580544A (en) 2021-03-30

Family

ID=75139529

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011550542.8A Pending CN112580544A (en) 2020-12-24 2020-12-24 Image recognition method, device and medium and electronic equipment thereof

Country Status (1)

Country Link
CN (1) CN112580544A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145927A (en) * 2017-06-16 2019-01-04 杭州海康威视数字技术股份有限公司 The target identification method and device of a kind of pair of strain image
US20200134366A1 (en) * 2017-06-16 2020-04-30 Hangzhou Hikvision Digital Technology Co., Ltd. Target recognition method and apparatus for a deformed image
CN110610154A (en) * 2019-09-10 2019-12-24 北京迈格威科技有限公司 Behavior recognition method and apparatus, computer device, and storage medium
CN111291641A (en) * 2020-01-20 2020-06-16 上海依图网络科技有限公司 Image recognition method and device, computer readable medium and system
CN111783626A (en) * 2020-06-29 2020-10-16 北京字节跳动网络技术有限公司 Image recognition method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination