WO2021008252A1 - Method and apparatus for recognizing position of person in image, computer device and storage medium - Google Patents


Info

Publication number
WO2021008252A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
recognized
human body
video
video image
Prior art date
Application number
PCT/CN2020/093608
Other languages
French (fr)
Chinese (zh)
Inventor
石磊
王健宗
Original Assignee
Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Publication of WO2021008252A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/30 - Subject of image; Context of image processing
    • G06T2207/30196 - Human being; Person
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Definitions

  • This application relates to the field of computer technology, in particular to a method, device, computer equipment and storage medium for identifying the position of a person in an image.
  • A method for identifying the position of a person in an image includes:
  • when the image type of the video image to be recognized is a color image, recognizing the human body key points in the video image to be recognized through the human body posture model obtained by training, and determining the position information of the person in the video image to be recognized based on the recognized human body key points;
  • when the image type is a night vision image, recognizing the position information of the person in the video image to be recognized through the lightweight target detection model obtained by training.
  • a device for recognizing the position of a person in an image comprising:
  • the preprocessing module is used to obtain the surveillance video file to be identified and preprocess it to obtain the video image to be identified;
  • the determining module is used to determine the image type of the video image to be recognized;
  • the recognition module is used to, when the image type is a color image, recognize the human body key points in the video image to be recognized through the human body posture model obtained by training, and determine the position information of the person in the video image to be recognized based on the recognized human body key points;
  • the recognition module is also used to, when the image type is a night vision image, recognize the position information of the person in the video image to be recognized through the lightweight target detection model obtained by training.
  • a computer device includes a memory and a processor, the memory stores a computer program, and when the processor executes the computer program, the following steps are implemented:
  • when the image type of the video image to be recognized is a color image, recognizing the human body key points in the video image to be recognized through the human body posture model obtained by training, and determining the position information of the person in the video image to be recognized based on the recognized human body key points;
  • when the image type is a night vision image, recognizing the position information of the person in the video image to be recognized through the lightweight target detection model obtained by training.
  • A computer-readable storage medium has a computer program stored thereon; when the computer program is executed by a processor, the above method for recognizing the position of a person in an image is implemented.
  • The surveillance video file to be identified is preprocessed to obtain the video image to be identified, thereby facilitating the subsequent video content recognition.
  • When the image type is a color image, the human body posture model obtained through training recognizes the human body key points in the video image to be recognized, and the position information of the person in the video image to be recognized is determined based on the recognized key points.
  • When the image type is a night vision image, the lightweight target detection model obtained through training recognizes the position information of the person in the video image to be recognized. This ensures that each type of video image is recognized by the best-matching recognition model, which improves recognition accuracy. Detecting the position of the person with the recognition model suited to each image type also replaces the old manual viewing method, enabling automatic, rapid recognition of surveillance video content and improving work efficiency.
  • FIG. 1 is an application scene diagram of the method for recognizing the position of a person in an image in an embodiment
  • FIG. 2 is a schematic flowchart of a method for recognizing the position of a person in an image in an embodiment
  • FIG. 3 is a schematic flowchart of the step of determining the type of video image in an embodiment
  • FIG. 4 is a schematic flowchart of a method for recognizing the position of a person in an image in another embodiment
  • FIG. 5 is a structural block diagram of an apparatus for recognizing the position of a person in an image in an embodiment
  • FIG. 6 is an internal structure diagram of a computer device in an embodiment.
  • the method for identifying the position of a person in an image provided by this application can be applied to the application environment as shown in FIG. 1.
  • the monitoring device 102 communicates with the server 104 through the network.
  • the server 104 obtains the surveillance video file to be identified sent by the surveillance device 102, and the server 104 preprocesses the surveillance video file to obtain the video image to be identified.
  • the server 104 determines the image type of the video image to be recognized. When the image type is a color image, the server 104 recognizes the key points of the human body in the image to be recognized through the human body posture model obtained through training, and determines the position information of the person in the video image to be recognized based on the recognized key points of the human body.
  • the server 104 recognizes the position information of the person in the video image to be recognized through the lightweight target detection model obtained through training.
  • the monitoring device 102 can be, but is not limited to, various cameras, personal computers with cameras, notebook computers, smart phones, tablet computers, and portable wearable devices, etc.
  • The server 104 can be implemented as an independent server or as a server cluster composed of multiple servers.
  • a method for recognizing the position of a person in an image is provided. Taking the method applied to the server in FIG. 1 as an example for description, the method includes the following steps:
  • Step S202 Obtain a surveillance video file to be identified, and perform preprocessing on the surveillance video file to be identified to obtain a video image to be identified.
  • The surveillance video file to be identified is a file containing surveillance video collected by the monitoring equipment. It can be understood that the surveillance video file to be identified includes, but is not limited to, surveillance video collected by the monitoring equipment and sent to the server; it can also be a video file sent by other terminal devices with a transmission function that communicate with the server. That is, the surveillance video file obtained by the server can come either from the monitoring device or from other terminal devices. Preprocessing means that the surveillance video file to be identified is decoded to obtain the corresponding surveillance video, the surveillance video is segmented to obtain the video images to be identified, and the video images undergo technical processing such as grayscale adjustment, denoising and sharpening, that is, adjustments that improve image quality and reduce noise to ensure image clarity and quality.
  • the user can issue a character position recognition instruction through the monitoring device, and select the to-be-recognized monitoring video that needs to be recognized.
  • When the surveillance equipment receives the person position recognition instruction issued by the user, it obtains the surveillance video to be identified selected by the user, compresses and encapsulates it into a surveillance video file, sends the file to the corresponding server, and sends a person position recognition request to that server.
  • After the server receives the person position recognition request, it decodes and restores the surveillance video file corresponding to the request to obtain the surveillance video to be recognized, and then preprocesses that video to obtain the video image to be recognized.
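The grayscale adjustment, denoising and sharpening steps mentioned above can be sketched with plain numpy. This is a minimal illustration, not the implementation described in the application; the function names and the choices of a 3x3 mean filter for denoising and an unsharp mask for sharpening are assumptions.

```python
import numpy as np

def denoise_mean3(img):
    """Simple denoising step: 3x3 mean filter (edges padded by reflection)."""
    padded = np.pad(img.astype(np.float64), 1, mode="reflect")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += padded[1 + dy : 1 + dy + img.shape[0],
                          1 + dx : 1 + dx + img.shape[1]]
    return out / 9.0

def unsharp_mask(img, amount=1.0):
    """Sharpen by adding back the difference between the image and its blur."""
    blurred = denoise_mean3(img)
    return np.clip(img + amount * (img - blurred), 0, 255)
```

In a real pipeline these operations would more likely be done with OpenCV (e.g. blur and filter functions) on each decoded frame; the numpy version only shows the arithmetic.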
  • Step S204 Determine the image type of the video image to be recognized.
  • After the server preprocesses the surveillance video file to be recognized to obtain the corresponding video image to be recognized, it determines whether the video image is a night vision image or a color image by examining the pixel values in the video image.
  • Step S206 When the image type is a color image, recognize the human body key points in the video image to be recognized through the human body posture model obtained by training, and determine the position information of the person in the video image to be recognized based on the recognized human body key points.
  • the human pose model is an openpose model.
  • The openpose model is a pose detection framework used to detect human joints, such as key points at the neck, shoulders and elbows, and to link those key points to obtain the human body posture.
  • the openpose model includes a pre-network layer and a dual-branch multi-level CNN network (Convolutional Neural Networks, convolutional neural network).
  • The front network layer is a modified VGG-19 network (VGG: Visual Geometry Group network), comprising ten two-dimensional convolutional layers and rectified linear unit layers in series, with three pooling layers inserted in between.
  • The VGG-19 module includes four blocks: block1, block2 and block4 each contain two convolutional layers and two rectified linear units, block3 contains four convolutional layers and four rectified linear units, and three pooling layers are placed between the blocks.
  • The dual-branch multi-level CNN network includes the confidence network and the correlation vector field network (i.e., the part affinity fields).
  • The openpose model is called as the recognition model for the video image to be recognized: the video image is input into the openpose model, which recognizes it to obtain the human body key points, and the position of the person is then obtained from those key points.
  • Step S208 When the image type is a night vision image, the lightweight target detection model obtained through training is used to identify the position information of the person in the video image to be identified.
  • the lightweight target detection model is ssdlite (Single Shot Detector-Lite, a lightweight single-shot detector) model.
  • The ssdlite model is a target detection framework used to identify whether a target is present in an image.
  • In this embodiment, the original loss function of the ssdlite model is replaced with focal loss.
  • the openpose model is used to detect color images
  • the ssdlite model is used to detect night vision images.
  • After obtaining the video image to be recognized, the server determines its type; if the type is a night vision image, the ssdlite model is called as the recognition model for the video image, and the ssdlite model is subsequently used to recognize the video image and identify the position of the person.
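The focal-loss substitution mentioned above can be illustrated for the binary case. This is a hedged sketch: the alpha=0.25 and gamma=2 defaults follow the common focal-loss formulation, not values stated in this application.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss: down-weights easy examples so training focuses
    on hard ones. p: predicted probability of the positive class,
    y: ground-truth label (0 or 1). Both are numpy arrays."""
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

The (1 - p_t)^gamma factor is what distinguishes this from plain cross-entropy: a confident correct prediction contributes almost nothing, while hard misclassified examples dominate the loss.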
  • the surveillance video file to be identified is preprocessed to obtain the video image to be identified, thereby facilitating the subsequent processing of video content identification.
  • After the image type of the video image to be recognized is determined, the corresponding recognition model is called according to the image type; that is, when the image type is a color image, the human body posture model obtained through training recognizes the human body key points in the video image to be recognized, and the position information of the person in the video image to be recognized is determined based on the recognized key points.
  • When the image type is a night vision image, the lightweight target detection model obtained through training recognizes the position information of the person in the video image to be recognized.
  • step S204 determining the image type of the video image to be recognized includes the following steps:
  • Step S302 Obtain the three-channel pixel value of each pixel in the video image to be identified.
  • A pixel is the smallest unit of an image: one of the small squares that make up the image, each with a definite position and an assigned color value.
  • The colors and positions of these small squares determine the appearance of the image.
  • The pixel value is the color value corresponding to the pixel, and the image type can be determined from the pixel values.
  • Image types include night vision images and color images.
  • the three-channel pixel value is the RGB pixel value
  • The RGB pixel value is the color value that determines the displayed color of the image; R, G and B denote the red, green and blue components respectively.
  • When the server determines whether the video image is a night vision image or a color image according to the pixel values of the image, it first obtains the RGB pixel values corresponding to all pixels in the image.
  • Step S304 Perform difference calculation based on the pixel values of the three channels, and select the value with the largest difference as the pixel difference.
  • The difference calculation is performed on the R, G and B components.
  • That is, any two of the R, G and B values are subtracted from each other, and the largest of the resulting differences is selected as the pixel difference value for that pixel. For example, taking pixel 1 as an example, the RGB value corresponding to pixel 1 is obtained.
  • Each of R, G and B has a corresponding component value.
  • The specific component value depends on the image; in general, each RGB component value lies between 0 and 255. The component values corresponding to R, G and B are obtained respectively, and the pairwise differences of the three component values are then computed.
  • Step S306 Determine the image type of the video image to be recognized according to the preset value and the pixel difference value.
  • the preset value is a preset reference pixel value used to determine whether the video image is a color image or a night vision image.
  • In this embodiment, the preset value is 10. Specifically, after the pixel difference value corresponding to a pixel is obtained, it is compared with the preset value 10: if the pixel difference is greater than 10, the video image to be recognized is determined to be a color image; if it is less than or equal to 10, the video image is determined to be a night vision image.
  • In this way, the image type of the video image to be recognized is determined from its pixel values, ensuring that the recognition model that best matches the video image can be called for subsequent recognition according to its image type, which improves the recognition accuracy.
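Steps S302 to S306 can be sketched as follows. This is a minimal numpy illustration: the application describes a per-pixel comparison against the preset value 10, and aggregating by the maximum per-pixel difference here is an assumption.

```python
import numpy as np

# Threshold taken from the embodiment described above.
PRESET_VALUE = 10

def image_type(image):
    """Classify an H x W x 3 RGB image as 'color' or 'night_vision'.

    For each pixel, take the largest pairwise difference between the R, G
    and B components; in a night-vision (near-grayscale) image the three
    channels are almost equal, so these differences stay small.
    """
    img = image.astype(np.int32)  # avoid uint8 wraparound on subtraction
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    diff = np.maximum.reduce([np.abs(r - g), np.abs(g - b), np.abs(r - b)])
    # Aggregate by the largest per-pixel difference (an assumption; the
    # text describes the comparison pixel by pixel).
    return "color" if diff.max() > PRESET_VALUE else "night_vision"
```

A grayscale night-vision frame has R = G = B at every pixel, so every pairwise difference is zero and the frame falls below the threshold.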
  • In another embodiment, step S204 of determining the image type of the video image to be recognized includes: obtaining the acquisition-mode adjustment time of the monitoring device corresponding to the surveillance video file to be identified and the shooting time corresponding to the video image to be identified, and determining the image type of the video image to be recognized according to the acquisition-mode adjustment time.
  • the monitoring device has two modes, including a color collection mode and a night vision black and white collection mode.
  • When the monitoring device collects surveillance video in low light, the quality of the color video it collects degrades.
  • In that case, the monitoring device can automatically switch from the color collection mode to the night vision black-and-white mode to collect night vision black-and-white surveillance video. Therefore, when determining the content of the video image to be identified, the acquisition-mode adjustment time of the corresponding monitoring device is obtained, that is, the time at which the device switched from the color collection mode to the night vision black-and-white mode, thereby determining when the monitoring device adjusted its mode.
  • The shooting time corresponding to the video image to be recognized is then obtained; it can be read from the video information.
  • If the shooting time is before the acquisition-mode adjustment time, the video image to be recognized can be determined to be a color image; if the shooting time is after the acquisition-mode adjustment time, the video image to be recognized can be determined to be a night vision image.
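The time-based variant of step S204 reduces to a single comparison. This is a sketch; the function name and the classification of frames shot exactly at the adjustment time are assumptions.

```python
from datetime import datetime

def image_type_by_time(shooting_time, mode_adjust_time):
    """Classify a frame by when it was shot relative to the moment the
    camera switched from the color mode to the night-vision mode.
    Frames shot before the switch are color; frames shot at or after it
    are treated as night vision (the boundary case is an assumption)."""
    return "color" if shooting_time < mode_adjust_time else "night_vision"

# Example: a camera that switched modes at 19:00.
switch = datetime(2020, 1, 1, 19, 0)
```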
  • In one embodiment, recognizing the human body key points in the video image to be recognized through the human body posture model obtained by training, and determining the position information of the person based on the recognized key points, specifically includes: using the front network layer of the human posture model to perform feature extraction on the video image to be recognized to obtain the corresponding feature map; using the confidence network layer of the human posture model to extract the human body key points from the feature map and obtain the key-point confidence map corresponding to those key points; using the correlation vector network layer of the human posture model to extract from the feature map the degree of association of the human body key points; and determining the position information of the person in the video image to be recognized according to the key-point confidence map and the degrees of association of the key points.
  • When the video image to be recognized is a color image, it is first input into the front network layer of the human posture model, which performs feature-extraction operations such as convolution and pooling to obtain the feature map corresponding to the video image.
  • The feature map is then input into the dual-branch multi-level CNN network: the confidence network branch yields each human body key point and the corresponding key-point confidence map, while the correlation vector field network branch yields the degree of association of each key point. The position information of the person in the video image to be recognized is then determined according to the key-point confidence map and the degrees of association.
  • In this way, the human body key points in the video image to be recognized can be found, and the effective connections between them can be obtained from the degrees of association; that is, the position of the person can be determined through the key-point confidence map and the degrees of association.
  • In one embodiment, determining the position information of the person in the video image to be recognized according to the key-point confidence map and the degrees of association of the human body key points includes: connecting the human body key points on the key-point confidence map according to their degrees of association and calculating the key-point contour; obtaining the minimum circumscribed rectangle from the key-point contour, that is, the rectangle of smallest area that contains the contour; and determining the position information of the person in the video image to be recognized according to the minimum circumscribed rectangle.
  • The key-point contour is the irregular shape that frames the human body key points.
  • The minimum circumscribed rectangle is the smallest rectangle that frames the entire key-point contour.
  • Specifically, the opencv tool performs the calculation based on the key-point confidence map and the degrees of association.
  • The human body key points on the key-point confidence map are connected according to their degrees of association to obtain the posture of the human body.
  • The opencv tool is then used to calculate the key-point contour.
  • From the contour, the minimum circumscribed rectangle is obtained.
  • The area enclosed by the minimum circumscribed rectangle is the position of the person, and the position coordinates of that rectangle are the person position information.
  • If the obtained minimum circumscribed rectangle deviates from a regular rectangle, that is, it is an irregular rectangle, it is corrected to a regular one, so that the finally obtained minimum circumscribed rectangle is a regular rectangle.
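After correction to a regular (axis-aligned) rectangle, the bounding step amounts to taking the extremes of the key-point coordinates. With opencv, cv2.boundingRect on the key-point contour would give an equivalent result; this numpy-only version is an illustrative assumption.

```python
import numpy as np

def person_bbox(keypoints):
    """Axis-aligned bounding rectangle of detected human body key points.

    keypoints: iterable of (x, y) coordinates. Returns (xmin, ymin,
    xmax, ymax): the 'regular' minimum rectangle enclosing all key
    points, used as the person's position information.
    """
    pts = np.asarray(keypoints, dtype=np.float64)
    xmin, ymin = pts.min(axis=0)
    xmax, ymax = pts.max(axis=0)
    return xmin, ymin, xmax, ymax
```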
  • In one embodiment, another method for recognizing the position of a person in an image is provided; that is, after the position information of the person in the video image to be recognized is obtained, the method further includes the following steps:
  • Step S210 Generate video information corresponding to the person position information.
  • Step S212 Write the video information into the corresponding log.
  • the preset target includes but is not limited to the human body, and can also be other objects, which are preset according to actual needs.
  • Logs refer to files used to record video information.
  • a surveillance video is taken as an example, and the requirement for identifying the surveillance video is to identify a human body appearing in the surveillance video. Therefore, the human body is taken as the preset target in this embodiment.
  • the video information includes whether the video image includes a preset target, which surveillance video file the video image comes from, and the coordinate position of the preset target in the video image.
  • the source of the video image and the coordinate position of the person in the video image are obtained, and then packaged into a file to obtain the generated video information.
  • After the video information is generated, it is written into the corresponding log.
  • In this way, the video information recorded in the log file can be used to learn the video content of all surveillance video files.
  • the recognition model is a pre-trained network model, that is, the human pose model and the lightweight target detection model are pre-trained models used for character position recognition.
  • Training the human body posture model and the lightweight target detection model specifically includes: acquiring historical surveillance videos of the monitoring equipment; extracting color image samples and night vision image samples from the historical surveillance videos, annotating the human body key points in the color image samples and the position coordinates of the human body in the night vision image samples to obtain annotated color images and annotated night vision images; adjusting the sizes of the annotated color images and annotated night vision images to obtain training color images and training night vision images; mapping the human body key points in the training color images to the key points marked in the annotated color images and training the human body posture model with the mapped training color images; and mapping the position coordinates in the training night vision images to the coordinates marked in the annotated night vision images and training the lightweight target detection model with the mapped training night vision images.
  • the recognition model is trained based on the training images obtained from the historical surveillance video, so that the recognition model can fully learn the surveillance scene, and the subsequent identification of surveillance video content is more accurate. Since the recognition model includes two models, openpose and ssdlite, which recognize different types of images, the historical surveillance video obtained should include color video and night vision video.
  • the annotation software is used to mark the coordinates of the key points of the human body in the color image to obtain the labeled color image.
  • The labeling software includes, but is not limited to, the labelme annotation software.
  • The color images are annotated with the human body key points.
  • The number of human body key points can vary, for example 9, 14, 16, 17 or 18. To achieve more accurate recognition, this embodiment preferably marks the coordinates of 18 key points.
  • The 18 key points are: nose, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, left hip, left knee, left ankle, left eye, right eye, left ear and right ear.
  • the night vision image is directly marked with the coordinates of the position of the person, and the marked night vision image is obtained.
  • the coordinates can be expressed as (minimum x-coordinate, minimum y-coordinate, maximum x-coordinate, and maximum y-coordinate), that is, (xmin, ymin, xmax, ymax).
  • Since the openpose and ssdlite models process images differently, the input image sizes they accept are different. Therefore, after the human body key points are annotated on the color image, the annotated color image needs to be scaled to 432×368, and the annotated night vision image to 300×300; the scaled annotated color images and annotated night vision images are used as the training color images and training night vision images.
  • After scaling, the annotated coordinates are mapped back to the targets annotated before scaling, that is, the training color images and training night vision images are mapped to their corresponding annotations.
  • The training color images and training night vision images are then input into the corresponding models for training; the established mapping relationship allows the correct coordinates to be learned during model training.
  • The color images are input into the openpose model for training.
  • The night vision images are input into the ssdlite model for training.
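The coordinate mapping between annotations and scaled training images amounts to a per-axis rescaling. The helper name and the example frame size are assumptions; the 432×368 and 300×300 target sizes are from the embodiment above.

```python
def map_coords(coords, orig_size, target_size):
    """Map annotated (x, y) coordinates from the original image to the
    resized training image using the horizontal and vertical scale ratios.

    coords: list of (x, y) pairs; orig_size, target_size: (width, height).
    """
    sx = target_size[0] / orig_size[0]
    sy = target_size[1] / orig_size[1]
    return [(x * sx, y * sy) for x, y in coords]

# Key points annotated on a hypothetical 1920x1080 frame, mapped to the
# 432x368 input size used for the openpose model:
openpose_pts = map_coords([(960, 540)], (1920, 1080), (432, 368))

# A person box's corners mapped to the 300x300 ssdlite input:
ssd_box = map_coords([(100, 200), (500, 900)], (1920, 1080), (300, 300))
```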
  • a device for identifying the position of a person in an image including: a preprocessing module 502, a determination module 504, and an identification module 506, wherein:
  • the preprocessing module 502 is used to obtain the surveillance video file to be identified, and preprocess the surveillance video file to be identified to obtain the video image to be identified.
  • the determining module 504 is used to determine the image type of the video image to be recognized.
  • The recognition module 506 is used to, when the image type is a color image, recognize the human body key points in the video image to be recognized through the human body posture model obtained by training, and determine the position information of the person in the video image to be recognized based on the recognized human body key points.
  • The recognition module 506 is also used to, when the image type is a night vision image, recognize the position information of the person in the video image to be recognized through the lightweight target detection model obtained through training.
  • The determining module 504 is further configured to: obtain the three-channel pixel value of each pixel in the video image to be identified; perform difference calculation based on the three-channel pixel values and select the value with the largest difference as the pixel difference value; and determine the image type of the video image to be recognized according to the preset value and the pixel difference value.
  • The determining module 504 is further configured to: obtain the acquisition-mode adjustment time of the monitoring device corresponding to the surveillance video file to be identified and the shooting time corresponding to the video image to be identified; and determine the image type of the video image to be identified according to the acquisition-mode adjustment time.
  • The recognition module 506 is further configured to: use the front network layer of the human posture model to perform feature extraction on the video image to be recognized to obtain the corresponding feature map; use the confidence network layer of the human posture model to extract the human body key points from the feature map and obtain the corresponding key-point confidence map; use the correlation vector network layer of the human posture model to extract from the feature map the degree of association of each human body key point; and determine the position information of the person in the video image to be recognized according to the key-point confidence map and the degrees of association of the key points.
• the recognition module 508 is further configured to connect the key points of the human body on the key point confidence map according to their degrees of association and calculate a key point contour; obtain a minimum circumscribed rectangle according to the key point contour, the minimum circumscribed rectangle being the rectangle of smallest area that encloses the key point contour; and determine the position information of the person in the video image to be recognized according to the minimum circumscribed rectangle.
• the apparatus for recognizing the position of a person in an image further includes a generating module, configured to generate video information corresponding to the position information of the person and write the video information into a corresponding log.
• the apparatus for recognizing the position of a person in an image further includes a training module, configured to acquire historical surveillance videos of the monitoring device; extract color image samples and night vision image samples from the historical surveillance videos, annotate the key points of the human body in the color image samples and the position coordinates of the human body in the night vision image samples, obtaining annotated color images and annotated night vision images; adjust the sizes of the annotated color images and annotated night vision images to obtain training color images and training night vision images; map the key points of the human body in the training color images to the key points annotated in the annotated color images, and train the human body posture model using the mapped training color images; and map the position coordinates in the training night vision images to the position coordinates annotated in the annotated night vision images, and train the lightweight target detection model using the mapped training night vision images.
• the various modules in the above apparatus for recognizing the position of a person in an image can be implemented in whole or in part by software, by hardware, or by a combination thereof.
• the foregoing modules may be embedded, in hardware form, in or independent of the processor of the computer device, or may be stored in software form in the memory of the computer device, so that the processor can call and execute the operations corresponding to the foregoing modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 6.
• the computer device includes a processor, a memory, a network interface and a database connected through a system bus, where the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
• the database of the computer device is used to store data.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
• the computer program, when executed by the processor, implements a method for recognizing the position of a person in an image.
• FIG. 6 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied.
• a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • a computer device including a memory and a processor, the memory stores a computer program, and the processor implements the following steps when executing the computer program:
• when the image type is a color image, recognizing the key points of the human body in the video image to be recognized through the human body posture model obtained by training, and determining the position information of the person in the video image to be recognized based on the recognized key points of the human body;
• when the image type is a night vision image, recognizing the position information of the person in the video image to be recognized through the lightweight target detection model obtained by training.
  • the processor further implements the following steps when executing the computer program:
  • the processor further implements the following steps when executing the computer program:
  • the processor further implements the following steps when executing the computer program:
  • the processor further implements the following steps when executing the computer program:
• connecting the key points of the human body on the key point confidence map according to their degrees of association, and calculating a key point contour; obtaining a minimum circumscribed rectangle according to the key point contour, the minimum circumscribed rectangle being the rectangle of smallest area that encloses the key point contour; and determining the position information of the person in the video image to be recognized according to the minimum circumscribed rectangle.
  • the processor further implements the following steps when executing the computer program:
  • the processor further implements the following steps when executing the computer program:
  • a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the following steps are implemented:
• when the image type is a color image, recognizing the key points of the human body in the video image to be recognized through the human body posture model obtained by training, and determining the position information of the person in the video image to be recognized based on the recognized key points of the human body;
• when the image type is a night vision image, recognizing the position information of the person in the video image to be recognized through the lightweight target detection model obtained by training.
  • the computer program further implements the following steps when being executed by the processor:
  • the computer program further implements the following steps when being executed by the processor:
  • the computer program further implements the following steps when being executed by the processor:
  • the computer program further implements the following steps when being executed by the processor:
• connecting the key points of the human body on the key point confidence map according to their degrees of association, and calculating a key point contour; obtaining a minimum circumscribed rectangle according to the key point contour, the minimum circumscribed rectangle being the rectangle of smallest area that encloses the key point contour; and determining the position information of the person in the video image to be recognized according to the minimum circumscribed rectangle.
  • the computer program further implements the following steps when being executed by the processor:
  • the computer program further implements the following steps when being executed by the processor:
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
• RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), Rambus dynamic RAM (RDRAM), and so on.

Abstract

The present application relates to a method and an apparatus for recognizing the position of a person in an image on the basis of a neural network, a computer device and a storage medium. The method comprises: acquiring a surveillance video file to be recognized, and preprocessing the surveillance video file to obtain a video image to be recognized; determining the image type of the video image; when the image type is a color image, recognizing key human body points in the video image by means of a human body posture model obtained by training, and determining position information of a person in the video image on the basis of the recognized key human body points; and when the image type is a night vision image, recognizing the position information of the person in the video image by means of a lightweight object detection model obtained by training. The method improves work efficiency.

Description

Method, Apparatus, Computer Device and Storage Medium for Recognizing the Position of a Person in an Image

Under the Paris Convention, this application claims priority to the Chinese patent application No. CN201910628940.8, filed on July 12, 2019 and titled "Method, Apparatus, Computer Device and Storage Medium for Recognizing the Position of a Person in an Image", the entire content of which is incorporated herein by reference.
Technical Field

This application relates to the field of computer technology, and in particular to a method, apparatus, computer device and storage medium for recognizing the position of a person in an image.
Background

Driven by the needs of the social economy and production safety, video surveillance equipment has been deployed ever more widely in fields such as safe cities, smart transportation and security engineering, and in recent years video surveillance has been developing toward high definition, networking and intelligence. However, with the widespread use of surveillance video, the massive number of cameras produces an ever-growing volume of video data, and locating a target requires querying this data. The inventors realized that existing query methods rely mainly on manual viewing and manual retrieval, resulting in a low degree of automation in video content monitoring and slow query efficiency.
Technical Problem

In view of the above technical problems, it is necessary to provide a method, apparatus, computer device and storage medium for recognizing the position of a person in an image that can improve efficiency.
Technical Solution

A method for recognizing the position of a person in an image, the method comprising:

acquiring a surveillance video file to be recognized, and preprocessing the surveillance video file to be recognized to obtain a video image to be recognized;

determining the image type of the video image to be recognized;

when the image type is a color image, recognizing key points of the human body in the video image to be recognized through a human body posture model obtained by training, and determining position information of a person in the video image to be recognized based on the recognized key points of the human body; and

when the image type is a night vision image, recognizing the position information of the person in the video image to be recognized through a lightweight target detection model obtained by training.
An apparatus for recognizing the position of a person in an image, the apparatus comprising:

a preprocessing module, configured to acquire a surveillance video file to be recognized and preprocess it to obtain a video image to be recognized;

a determining module, configured to determine the image type of the video image to be recognized; and

a recognition module, configured to, when the image type is a color image, recognize key points of the human body in the video image to be recognized through a human body posture model obtained by training, and determine position information of a person in the video image to be recognized based on the recognized key points of the human body;

the recognition module is further configured to, when the image type is a night vision image, recognize the position information of the person in the video image to be recognized through a lightweight target detection model obtained by training.
A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the following steps:

acquiring a surveillance video file to be recognized, and preprocessing the surveillance video file to be recognized to obtain a video image to be recognized;

determining the image type of the video image to be recognized;

when the image type is a color image, recognizing key points of the human body in the video image to be recognized through a human body posture model obtained by training, and determining position information of a person in the video image to be recognized based on the recognized key points of the human body; and

when the image type is a night vision image, recognizing the position information of the person in the video image to be recognized through a lightweight target detection model obtained by training.
A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the above method for recognizing the position of a person in an image.
Beneficial Effects

In the above method, apparatus, computer device and storage medium for recognizing the position of a person in an image, after the surveillance video file to be recognized is acquired, it is preprocessed to obtain the video image to be recognized, which facilitates the subsequent recognition of the video content. After the image type of the video image to be recognized is determined, the corresponding recognition model is called according to the image type: when the image type is a color image, the human body posture model obtained by training recognizes the key points of the human body in the video image to be recognized, and the position information of the person is determined based on the recognized key points; when the image type is a night vision image, the lightweight target detection model obtained by training recognizes the position information of the person in the video image to be recognized. This ensures that each type of video image is recognized by the best-matching recognition model, improving recognition accuracy. Moreover, detecting the positions of persons in video images with different recognition models dispenses with the old method of manual viewing, enables automatic and rapid recognition of surveillance video content, and improves work efficiency.
Brief Description of the Drawings

FIG. 1 is an application scene diagram of a method for recognizing the position of a person in an image in an embodiment;

FIG. 2 is a schematic flowchart of a method for recognizing the position of a person in an image in an embodiment;

FIG. 3 is a schematic flowchart of the step of determining the type of a video image in an embodiment;

FIG. 4 is a schematic flowchart of a method for recognizing the position of a person in an image in another embodiment;

FIG. 5 is a structural block diagram of an apparatus for recognizing the position of a person in an image in an embodiment;

FIG. 6 is a diagram of the internal structure of a computer device in an embodiment.
Embodiments of the Invention

To make the purpose, technical solutions and advantages of this application clearer, this application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the application and are not intended to limit it.
The method for recognizing the position of a person in an image provided by this application can be applied in the application environment shown in FIG. 1, in which a monitoring device 102 communicates with a server 104 through a network. The server 104 acquires the surveillance video file to be recognized sent by the monitoring device 102 and preprocesses it to obtain the video image to be recognized. The server 104 determines the image type of the video image to be recognized. When the image type is a color image, the server 104 recognizes the key points of the human body in the video image to be recognized through the human body posture model obtained by training, and determines the position information of the person based on the recognized key points. When the image type is a night vision image, the server 104 recognizes the position information of the person in the video image to be recognized through the lightweight target detection model obtained by training. The monitoring device 102 may be, but is not limited to, any of various cameras, or a personal computer, notebook computer, smartphone, tablet computer or portable wearable device equipped with a camera; the server 104 may be implemented as an independent server or as a server cluster composed of multiple servers.
In an embodiment, as shown in FIG. 2, a method for recognizing the position of a person in an image is provided. Taking its application to the server in FIG. 1 as an example, the method includes the following steps:

Step S202: acquire a surveillance video file to be recognized, and preprocess the surveillance video file to be recognized to obtain a video image to be recognized.
Here, the surveillance video file to be recognized refers to a file containing surveillance video collected by a monitoring device. It may come from the monitoring device itself, or from any other terminal device with a transmission function that communicates with the server; that is, the surveillance video file acquired by the server can originate either from the monitoring device or from a video file sent by another terminal device. Preprocessing means decoding the surveillance video file to obtain the corresponding surveillance video, segmenting that video to obtain the video images to be recognized, and applying grayscale adjustment, denoising, sharpening and similar processing to each image, i.e. improving image quality and suppressing noise to ensure the clarity and quality of the image.
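Of the preprocessing steps named above, sharpening is the easiest to illustrate in isolation. The sketch below applies a 3x3 sharpening kernel to a small grayscale image represented as nested lists; the specific kernel is an assumption for illustration (the patent does not name one), and border pixels are left unchanged for simplicity.

```python
def sharpen(image):
    """Apply a common 3x3 sharpening kernel to a 2-D grayscale image.

    image is a list of rows of integer intensities. Border pixels are
    copied through unchanged; results are clamped to [0, 255].
    The kernel choice is an illustrative assumption, not from the patent.
    """
    kernel = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]  # start from a copy so borders survive
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            acc = sum(kernel[j][i] * image[y + j - 1][x + i - 1]
                      for j in range(3) for i in range(3))
            out[y][x] = max(0, min(255, acc))
    return out
```

On a uniform region the kernel leaves values unchanged (5 - 4 = 1 total weight), while isolated bright pixels are amplified, which is the intended edge-enhancing behavior.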
Specifically, a user can issue a person position recognition instruction through the monitoring device and select the surveillance video to be recognized. Upon receiving the instruction, the monitoring device compresses and encapsulates the selected surveillance video into a corresponding surveillance video file, sends the file to the server, and sends the server a person position recognition request. After receiving the request, the server decodes the surveillance video file corresponding to the request to restore the surveillance video to be recognized, and then preprocesses it to obtain the video images to be recognized.
Step S204: determine the image type of the video image to be recognized.

Specifically, after the server preprocesses the surveillance video file to be recognized and obtains the corresponding video image, it determines whether that image is a night vision image or a color image based on the pixel values in the image.
Step S206: when the image type is a color image, recognize the key points of the human body in the video image to be recognized through the human body posture model obtained by training, and determine the position information of the person in the video image to be recognized based on the recognized key points.

Here, the human body posture model is an openpose model, a pose detection framework used to detect the joints of the human body, such as key points at the neck, shoulders and elbows, and to link the key points to obtain the human body posture. The openpose model consists of a front network layer and a dual-branch, multi-stage CNN (Convolutional Neural Network). The front network is a VGG-19 network modified from the VGG network (Visual Geometry Group Network), comprising ten two-dimensional convolutional layers and rectified linear unit layers connected in series, with three pooling layers inserted among them. That is, the VGG-19 module comprises four blocks: block1, block2 and block4 each contain two convolutional layers and two rectified linear units, block3 contains four convolutional layers and four rectified linear units, and the three pooling layers lie between the blocks. The dual-branch, multi-stage CNN comprises a confidence network and an association vector field network.

Specifically, after the server determines the type of the video image to be recognized, if the type is a color image, the openpose model is called as the recognition model for this image. The video image to be recognized is input into the openpose model, which recognizes the key points of the human body in the image, and the position of the person is then obtained from those key points.
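As a minimal illustration of turning detected key points into a person position, the sketch below derives a bounding rectangle from a set of key point coordinates. Note that this computes an axis-aligned rectangle as a simplification; the minimum circumscribed rectangle described in the claims is the rectangle of smallest area enclosing the key point contour, which may in general be rotated.

```python
def bounding_rectangle(keypoints):
    """Smallest axis-aligned rectangle (x_min, y_min, x_max, y_max)
    enclosing all (x, y) key points.

    Simplification: the patent's minimum circumscribed rectangle may be
    rotated to minimize area; an axis-aligned box is used here for clarity.
    """
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    return (min(xs), min(ys), max(xs), max(ys))
```

The four returned coordinates serve directly as the person's position information in image coordinates.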
Step S208: when the image type is a night vision image, recognize the position information of the person in the video image to be recognized through the lightweight target detection model obtained by training.

Here, the lightweight target detection model is an ssdlite (Single Shot Detector-Lite) model, a target detection framework used to identify whether a target is present. In this embodiment, in order to improve the accuracy of the model, the original loss function of the ssdlite model is replaced with the focal loss. Moreover, since it is difficult to detect the individual key points of a human body posture in night vision images, in this embodiment the openpose model is used to detect color images and the ssdlite model is used to detect night vision images.

Specifically, after the server determines the type of the video image to be recognized, if the type is a night vision image, the ssdlite model is called as the recognition model for this image, and the ssdlite model is subsequently used to recognize the position of the person in it.
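The patent states that ssdlite's original loss is replaced with the focal loss but gives no formula. For reference, a binary focal loss in its commonly published form is sketched below; the alpha and gamma defaults are taken from the focal-loss literature, not from the patent.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    p is the predicted probability of the positive class, y the true label
    (0 or 1). alpha=0.25 and gamma=2.0 are assumed defaults from the
    focal-loss literature; the patent does not specify values.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

The (1 - p_t)^gamma factor down-weights easy, well-classified examples, which is why the focal loss helps with the extreme foreground/background imbalance typical of single-shot detectors.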
In the above method for recognizing the position of a person in an image, after the surveillance video file to be recognized is acquired, it is preprocessed to obtain the video image to be recognized, which facilitates the subsequent processing for video content recognition. After the image type of the video image to be recognized is determined, the corresponding recognition model is called according to that type: when the image type is a color image, the human body posture model obtained by training recognizes the key points of the human body in the image and the position information of the person is determined based on the recognized key points; when the image type is a night vision image, the lightweight target detection model obtained by training recognizes the position information of the person in the image. This ensures that each type of video image is recognized by the best-matching recognition model, improving recognition accuracy. Detecting the positions of persons with different recognition models also dispenses with the old method of manual viewing, enabling automatic and rapid recognition of surveillance video content and improving work efficiency.
In an embodiment, as shown in FIG. 3, step S204, determining the image type of the video image to be recognized, includes the following steps:

Step S302: obtain the three-channel pixel value of each pixel in the video image to be recognized.
Here, a pixel is one of the small squares that make up an image, i.e. the smallest unit in the image. Each square has a definite position and an assigned color value, and together the colors and positions of the squares determine how the image appears. The pixel value is the color value of the pixel, and the image type can be determined from the pixel values. Image types include night vision images and color images. The three-channel pixel value is the RGB pixel value, i.e. the color value that determines the displayed color of the image, where R, G and B are red, green and blue respectively. Specifically, when the server determines from the pixel values whether the video image is a night vision image or a color image, it first obtains the RGB pixel values of all pixels in the image.
Step S304: perform difference calculations based on the three-channel pixel values, and select the largest difference as the pixel difference value.

Specifically, after the three-channel (RGB) pixel value of each pixel is obtained, differences between the channels are calculated: any two of the R, G and B values are subtracted, and the largest of the resulting differences is taken as the pixel difference value of that pixel. Taking pixel 1 as an example, its R, G and B component values are obtained (what these values are depends on the specific image; RGB component values generally lie between 0 and 255), and the three component values are subtracted from one another pairwise. This amounts to computing the absolute values of R-G, R-B and G-B; since R-B and B-R have the same magnitude and differ only in sign, and the sign is irrelevant for pixels, taking absolute values reduces the number of calculation steps. That is, the maximum of |R-G|, |R-B| and |G-B| is selected as the pixel difference value of pixel 1.
Step S306: Determine the image type of the video image to be recognized according to the preset value and the pixel difference value.
The preset value is a preset reference pixel value used to determine whether a video image is a color image or a night vision image. In this embodiment, the preset value is 10. Specifically, after the pixel difference value of a pixel is obtained, it is compared with the preset value 10. If the pixel difference value is greater than 10, the video image to be recognized is determined to be a color image; if it is less than or equal to 10, the video image to be recognized is determined to be a night vision image. In this embodiment, the image type of the video image to be recognized is determined from its pixel values, ensuring that the recognition model best matching the video image can subsequently be invoked according to the image type, which improves recognition accuracy.
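Steps S304-S306 can be sketched as follows. This is a minimal illustration assuming the frame is given as an iterable of (R, G, B) tuples; the text does not specify how per-pixel results are aggregated over the frame, so here any pixel exceeding the preset value of 10 marks the frame as color:

```python
def pixel_difference(r, g, b):
    # Largest pairwise absolute difference between the three channels.
    # On a night vision (near-grayscale) frame R, G and B are almost
    # equal, so this value stays small.
    return max(abs(r - g), abs(r - b), abs(g - b))

def classify_image(pixels, preset=10):
    # pixels: iterable of (R, G, B) tuples for the frame.
    # Aggregation rule (any pixel above the threshold => color) is an
    # illustrative assumption, not fixed by the embodiment.
    for r, g, b in pixels:
        if pixel_difference(r, g, b) > preset:
            return "color"
    return "night_vision"
```

For example, `classify_image([(100, 98, 101)])` yields `"night_vision"`, while a frame containing a pixel such as `(120, 60, 30)` yields `"color"`.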
In another embodiment, step S204 of determining the image type of the video image to be recognized includes: obtaining the acquisition-mode adjustment time of the monitoring device corresponding to the surveillance video file to be recognized, and obtaining the shooting time corresponding to the video image to be recognized; and determining the image type of the video image to be recognized according to the acquisition-mode adjustment time.
Specifically, the monitoring device has two modes: a color acquisition mode and a night vision black-and-white acquisition mode. When the monitoring device captures surveillance video, the quality of color video captured in low light suffers. To guarantee the quality of the surveillance video, the monitoring device can automatically switch from the color acquisition mode to the night vision black-and-white mode when the light is low, thereby capturing black-and-white night vision surveillance video. Therefore, when the image type of the video image to be recognized is determined, the acquisition-mode adjustment time of the monitoring device corresponding to the surveillance video file to be recognized is obtained, that is, the time at which the device switched from the color acquisition mode to the night vision black-and-white mode. The shooting time of the video image to be recognized is then obtained; it is available from the video information. By comparing the shooting time with the acquisition-mode adjustment time, the video image to be recognized is determined to be a color image when the shooting time is before the adjustment time, and a night vision image when the shooting time is after the adjustment time.

In one embodiment, when the image type is a color image, recognizing the human body key points in the image to be recognized through the trained human body posture model and determining the person position information in the video image to be recognized based on the recognized key points specifically includes: performing feature extraction on the video image to be recognized with the front network layer of the human body posture model to obtain the feature map corresponding to the video image; extracting the human body key points of the human body in the video image from the feature map with the confidence network layer of the human body posture model to obtain the key point confidence map corresponding to the human body key points; extracting the association degree of each human body key point in the video image from the feature map with the association-degree vector network layer of the human body posture model; and determining the person position information of the video image to be recognized according to the key point confidence map and the association degrees of the human body key points.
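The mode-switch-time comparison described above can be sketched as a minimal illustration; the `datetime` inputs and the helper name are assumptions, since the text only specifies the comparison itself:

```python
from datetime import datetime

def classify_by_mode_switch(shooting_time, mode_switch_time):
    # mode_switch_time: when the device switched from the color
    # acquisition mode to the night vision black-and-white mode.
    # Frames shot before the switch are color images; frames shot
    # after it are night vision images.
    if shooting_time < mode_switch_time:
        return "color"
    return "night_vision"
```

For instance, with a switch at 19:30, a frame shot at noon is classified as `"color"` and a frame shot at 22:00 as `"night_vision"`.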
Specifically, when the video image to be recognized is a color image, the video image is first input into the front network of the human body posture model, and the front network layer performs feature extraction operations such as convolution and pooling on the video image to obtain the corresponding feature map. The feature map is then input into a dual-branch, multi-stage CNN: the confidence network branch yields each human body key point and the corresponding key point confidence map, and the association-degree vector field network branch yields the association degree of each human body key point; the person position information of the video image to be recognized is then determined according to the key point confidence map and the association degrees of the human body key points. The key point confidence map locates the human body key points of the persons in the video image, and the association degrees give the valid connections between the key points, so the person positions can be determined from the confidence map and the association degrees together.
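The role of the association degrees can be illustrated with a toy pairing step over the dual-branch output. This is not the actual matching procedure of the posture model; the data layout (candidate lists and a score dictionary) is invented purely for illustration:

```python
def connect_limbs(candidates_a, candidates_b, association):
    # candidates_a / candidates_b: (x, y) key point candidates for two
    # body parts (e.g. all detected necks and all detected right
    # shoulders in a multi-person frame), taken from the confidence maps.
    # association[(i, j)]: association score between candidate i of part
    # A and candidate j of part B, from the association branch.
    # Greedily keep the highest-scoring pairs, using each candidate once,
    # so key points of different persons are not cross-connected.
    pairs = sorted(association.items(), key=lambda kv: kv[1], reverse=True)
    used_a, used_b, limbs = set(), set(), []
    for (i, j), score in pairs:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            limbs.append((candidates_a[i], candidates_b[j], score))
    return limbs
```

With two persons side by side, the high intra-person scores pair each neck with its own shoulder, while the low cross-person scores are discarded.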
In one embodiment, determining the person position information of the video image to be recognized according to the key point confidence map and the association degrees of the human body key points includes: connecting the human body key points on the key point confidence map according to their association degrees and calculating a key point contour; obtaining a circumscribed minimum rectangle according to the key point contour, the circumscribed minimum rectangle being the rectangle of smallest area that encloses the key point contour; and determining the person position information in the video image to be recognized according to the circumscribed minimum rectangle.

The key point contour is the irregular shape that frames the human body key points, and the circumscribed minimum rectangle is the smallest rectangle that frames the entire key point contour. Specifically, the OpenCV toolkit is used to perform the calculation from the key point confidence map and the association degrees. First, the human body key points on the confidence map are connected according to their association degrees to obtain the posture of the human body. The key point contour is then computed with OpenCV, and the circumscribed minimum rectangle is obtained from the contour; the region inside the rectangle is where the person is located, and the position coordinates of the rectangle constitute the person position information. If the obtained circumscribed minimum rectangle deviates from a regular rectangle, that is, it is an irregular (rotated) rectangle, it is corrected to a regular rectangle, so that the final circumscribed minimum rectangle is regular.
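The final "regular rectangle" described above is equivalent to the axis-aligned bounding box of the key points. A minimal sketch in plain Python (the embodiment itself uses OpenCV, where one would typically call `cv2.boundingRect` on the contour):

```python
def person_bounding_box(keypoints):
    # keypoints: list of (x, y) human body key points of one person,
    # obtained after connecting the confidence-map key points.
    # Returns the regular circumscribed rectangle as
    # (xmin, ymin, xmax, ymax) person position information.
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    return min(xs), min(ys), max(xs), max(ys)
```

This matches the (xmin, ymin, xmax, ymax) coordinate convention used for annotation later in the description.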
In one embodiment, as shown in FIG. 4, another method for recognizing the position of a person in an image is provided; that is, after the position information of the person to be recognized is obtained, the method further includes the following steps:

Step S210: Generate video information corresponding to the person position information.

Step S212: Write the video information into the corresponding log.
The preset target includes but is not limited to the human body; it may also be another object, preset according to actual needs. The log is a document used to record video information. Specifically, this embodiment takes surveillance video as an example, where the purpose of recognizing the surveillance video is to identify the human bodies appearing in it; the human body is therefore taken as the preset target in this embodiment. When the video content is obtained through recognition and detection, corresponding video information is generated based on the detected person position information. The video information includes whether the video image contains the preset target, which surveillance video file the video image comes from, the coordinate position of the preset target in the video image, and so on. In other words, the source of the video image and the coordinate positions of the persons in it are obtained and packaged into one file to produce the generated video information. After the video information is generated, it is written into the corresponding log; when the surveillance video content needs to be reviewed later, the log file can be called directly, and the video content of all surveillance video files can be learned from the video information recorded in it.
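Steps S210-S212 can be sketched as follows. The record fields and the JSON-lines log layout are illustrative assumptions, since the text lists the contents of the video information but does not fix a concrete format:

```python
import json

def make_video_info(source_file, boxes):
    # boxes: list of (xmin, ymin, xmax, ymax) person positions detected
    # in one video image; an empty list means the preset target did not
    # appear in the image.
    return {
        "source_file": source_file,   # which surveillance video file
        "has_target": bool(boxes),    # whether the preset target appears
        "positions": list(boxes),     # coordinates of each person
    }

def append_to_log(log_path, info):
    # One JSON record per line, appended to the corresponding log file.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(info) + "\n")
```

A reviewer can then reconstruct the content of every surveillance video file by reading the records back from the log.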
In one embodiment, the recognition models are pre-trained network models; that is, the human body posture model and the lightweight target detection model are trained in advance for person position recognition. Training the human body posture model and the lightweight target detection model specifically includes: obtaining historical surveillance video of the monitoring device; extracting color image samples and night vision image samples from the historical surveillance video, annotating the human body key points of the human bodies in the color image samples and the position coordinates of the human bodies in the night vision image samples, to obtain annotated color images and annotated night vision images; resizing the annotated color images and annotated night vision images respectively, to obtain training color images and training night vision images; mapping the human body key points in the training color images to the key points annotated in the annotated color images, and training the human body posture model with the mapped training color images; and mapping the position coordinates in the training night vision images to the coordinates annotated in the annotated night vision images, and training the lightweight target detection model with the mapped training night vision images.
Specifically, training the recognition models on images obtained from historical surveillance video allows the models to learn the surveillance scene thoroughly, making subsequent recognition of surveillance video content more accurate. Since the recognition models comprise the OpenPose and SSDLite models, which recognize different types of images, the obtained historical surveillance video should include both color video and night vision video.
After the historical surveillance video is obtained, FFmpeg is used to extract video images that meet the training requirements from it, that is, color image samples and night vision image samples that contain human bodies. After these samples are obtained, annotation software is used to annotate the coordinates of the human body key points of the persons in the color images, yielding the annotated color images. The annotation software includes but is not limited to labelme. The color images are annotated with human body key points, of which there may generally be 9, 14, 16, 17, 18 or another number; to achieve more accurate recognition, this embodiment preferably annotates the coordinates of 18 key points: nose, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, left hip, left knee, left ankle, left eye, right eye, left ear and right ear. The night vision images are annotated directly with person position coordinates, yielding the annotated night vision images. The coordinates can be expressed as (minimum x-coordinate, minimum y-coordinate, maximum x-coordinate, maximum y-coordinate), that is, (xmin, ymin, xmax, ymax).
Since the OpenPose and SSDLite models process images differently, the input image sizes they accept differ. Therefore, after the color images are annotated with human body key points, the annotated color images need to be scaled to 432×368, while the annotated night vision images are scaled to 300×300; the scaled annotated color images and annotated night vision images serve as the training color images and training night vision images. Because the annotated coordinate positions change after scaling, the scaled annotation coordinates are mapped to the annotations before scaling; that is, the training color images and training night vision images are mapped to their corresponding annotated color images and annotated night vision images before being input into the corresponding models for training, so that through this mapping relationship the models can learn the correct coordinates during training. The color images are input into the OpenPose model for training, and the night vision images are input into the SSDLite model for training.
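The coordinate mapping after resizing can be sketched as a simple linear rescaling; the helper name is an assumption, and the sizes come from the embodiment (432×368 for the color input, 300×300 for the night vision input):

```python
def scale_annotation(box, orig_size, target_size):
    # box: (xmin, ymin, xmax, ymax) annotated on the original image.
    # orig_size / target_size: (width, height) before and after
    # resizing, e.g. target_size = (300, 300) for the SSDLite night
    # vision input or (432, 368) for the OpenPose color input.
    sx = target_size[0] / orig_size[0]
    sy = target_size[1] / orig_size[1]
    xmin, ymin, xmax, ymax = box
    return (xmin * sx, ymin * sy, xmax * sx, ymax * sy)
```

A box (60, 30, 120, 90) on a 600×300 frame scaled to 300×300 becomes (30, 30, 60, 90), so the annotation still lands on the same image content after resizing.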
It should be understood that although the steps in the flowcharts of FIGS. 2-3 are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in FIGS. 2-3 may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times; their execution order is likewise not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 5, an apparatus for recognizing the position of a person in an image is provided, including a preprocessing module 502, a determining module 504 and a recognition module 506, wherein:

The preprocessing module 502 is configured to obtain the surveillance video file to be recognized and preprocess it to obtain the video image to be recognized.

The determining module 504 is configured to determine the image type of the video image to be recognized.

The recognition module 506 is configured to, when the image type is a color image, recognize the human body key points in the video image to be recognized through the trained human body posture model, and determine the person position information in the video image to be recognized based on the recognized human body key points.

The recognition module 506 is further configured to, when the image type is a night vision image, recognize the person position information in the video image to be recognized through the trained lightweight target detection model.
In one embodiment, the determining module 504 is further configured to obtain the three-channel pixel value of each pixel in the video image to be recognized; perform difference calculations based on the three-channel pixel values and select the largest difference as the pixel difference value; and determine the image type of the video image to be recognized according to the preset value and the pixel difference value.

In one embodiment, the determining module 504 is further configured to obtain the acquisition-mode adjustment time of the monitoring device corresponding to the surveillance video file to be recognized and the shooting time corresponding to the video image to be recognized, and determine the image type of the video image to be recognized according to the acquisition-mode adjustment time.

In one embodiment, the recognition module 506 is further configured to perform feature extraction on the video image to be recognized with the front network layer of the human body posture model to obtain the corresponding feature map; extract the human body key points of the human body in the video image from the feature map with the confidence network layer of the human body posture model to obtain the key point confidence map corresponding to the human body key points; extract the association degree of each human body key point in the video image from the feature map with the association-degree vector network layer of the human body posture model; and determine the person position information of the video image to be recognized according to the key point confidence map and the association degrees of the human body key points.

In one embodiment, the recognition module 506 is further configured to connect the human body key points on the key point confidence map according to their association degrees and calculate the key point contour; obtain the circumscribed minimum rectangle according to the key point contour, the circumscribed minimum rectangle being the rectangle of smallest area that encloses the key point contour; and determine the person position information in the video image to be recognized according to the circumscribed minimum rectangle.

In one embodiment, the apparatus for recognizing the position of a person in an image further includes a generating module configured to generate video information corresponding to the person position information and write the video information into the corresponding log.

In one embodiment, the apparatus for recognizing the position of a person in an image further includes a training module configured to: obtain historical surveillance video of the monitoring device; extract color image samples and night vision image samples from the historical surveillance video, annotate the human body key points of the human bodies in the color image samples and the position coordinates of the human bodies in the night vision image samples, to obtain annotated color images and annotated night vision images; resize the annotated color images and annotated night vision images respectively, to obtain training color images and training night vision images; map the human body key points in the training color images to the key points annotated in the annotated color images and train the human body posture model with the mapped training color images; and map the position coordinates in the training night vision images to the coordinates annotated in the annotated night vision images and train the lightweight target detection model with the mapped training night vision images.
For the specific limitations of the apparatus for recognizing the position of a person in an image, reference may be made to the above limitations of the method for recognizing the position of a person in an image, which are not repeated here. Each module in the above apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in or independent of the processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 6. The computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store data. The network interface of the computer device is used to communicate with an external terminal through a network connection. When the computer program is executed by the processor, a method for recognizing the position of a person in an image is implemented.

Those skilled in the art can understand that the structure shown in FIG. 6 is only a block diagram of a partial structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and a processor. The memory stores a computer program, and the processor implements the following steps when executing the computer program:

obtaining the surveillance video file to be recognized, and preprocessing the surveillance video file to be recognized to obtain the video image to be recognized;

determining the image type of the video image to be recognized;

when the image type is a color image, recognizing the human body key points in the video image to be recognized through the trained human body posture model, and determining the person position information in the video image to be recognized based on the recognized human body key points; and

when the image type is a night vision image, recognizing the person position information in the video image to be recognized through the trained lightweight target detection model.

In one embodiment, the processor further implements the following steps when executing the computer program:

obtaining the three-channel pixel value of each pixel in the video image to be recognized; performing difference calculations based on the three-channel pixel values and selecting the largest difference as the pixel difference value; and determining the image type of the video image to be recognized according to the preset value and the pixel difference value.

In one embodiment, the processor further implements the following steps when executing the computer program:

obtaining the acquisition-mode adjustment time of the monitoring device corresponding to the surveillance video file to be recognized and the shooting time corresponding to the video image to be recognized; and determining the image type of the video image to be recognized according to the acquisition-mode adjustment time.
In one embodiment, the processor further implements the following steps when executing the computer program:

performing feature extraction on the video image to be recognized with the front network layer of the human body posture model to obtain the feature map corresponding to the video image to be recognized; extracting the human body key points of the human body in the video image to be recognized from the feature map with the confidence network layer of the human body posture model to obtain the key point confidence map corresponding to the human body key points; extracting the association degree of each human body key point in the video image to be recognized from the feature map with the association-degree vector network layer of the human body posture model; and determining the person position information of the video image to be recognized according to the key point confidence map and the association degrees of the human body key points.

In one embodiment, the processor further implements the following steps when executing the computer program:

connecting the human body key points on the key point confidence map according to their association degrees and calculating the key point contour; obtaining the circumscribed minimum rectangle according to the key point contour, the circumscribed minimum rectangle being the rectangle of smallest area that encloses the key point contour; and determining the person position information in the video image to be recognized according to the circumscribed minimum rectangle.

In one embodiment, the processor further implements the following steps when executing the computer program:

generating video information corresponding to the person position information, and writing the video information into the corresponding log.

In one embodiment, the processor further implements the following steps when executing the computer program:

obtaining historical surveillance video of the monitoring device; extracting color image samples and night vision image samples from the historical surveillance video, annotating the human body key points of the human bodies in the color image samples and the position coordinates of the human bodies in the night vision image samples, to obtain annotated color images and annotated night vision images; resizing the annotated color images and annotated night vision images respectively, to obtain training color images and training night vision images; mapping the human body key points in the training color images to the key points annotated in the annotated color images, and training the human body posture model with the mapped training color images; and mapping the position coordinates in the training night vision images to the coordinates annotated in the annotated night vision images, and training the lightweight target detection model with the mapped training night vision images.
In an embodiment, a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, implements the following steps:
Acquiring a surveillance video file to be recognized, and preprocessing the surveillance video file to be recognized to obtain a video image to be recognized;
Determining the image type of the video image to be recognized;
When the image type is a color image, recognizing human body key points in the video image to be recognized through a trained human body pose model, and determining person position information in the video image to be recognized based on the recognized human body key points;
When the image type is a night-vision image, recognizing the person position information in the video image to be recognized through a trained lightweight target detection model.
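The branching described in these steps can be sketched as below. The helper names (`classify_image_type`, `pose_model_locate`, `light_detector_locate`) are illustrative placeholders standing in for the trained models, not functions named in this application.

```python
# Hypothetical sketch of the dispatch flow: each preprocessed frame is routed
# to the model matching its image type. Helper names are illustrative only.

def recognize_person_positions(frames, classify_image_type,
                               pose_model_locate, light_detector_locate):
    """Route each video image to the pose model (color) or detector (night vision)."""
    positions = []
    for frame in frames:
        image_type = classify_image_type(frame)
        if image_type == "color":
            # Color frames: keypoint-based human body pose model.
            positions.append(pose_model_locate(frame))
        else:
            # Night-vision frames: lightweight target detection model.
            positions.append(light_detector_locate(frame))
    return positions
```

The three callables are passed in rather than hard-coded, since the application describes the models abstractly.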
In an embodiment, the computer program, when executed by the processor, further implements the following steps:
Acquiring the three-channel pixel values of each pixel in the video image to be recognized; computing channel differences from the three-channel pixel values and selecting the largest difference as the pixel difference value; and determining the image type of the video image to be recognized according to a preset value and the pixel difference value.
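A minimal sketch of the channel-difference test above, under the assumption that night-vision frames are near-grayscale (the three channels of each pixel are almost equal, so the largest channel difference stays small). The threshold value is an illustrative assumption, not a figure from this application.

```python
def classify_by_channel_difference(pixels, preset=30):
    """Classify a frame as color or night-vision from its RGB channel spread.

    `pixels` is an iterable of (r, g, b) tuples. For each pixel, the largest
    pairwise channel difference is taken; the frame-wide maximum is compared
    with the preset value. `preset=30` is a hypothetical threshold.
    """
    max_diff = 0
    for r, g, b in pixels:
        diff = max(r, g, b) - min(r, g, b)  # largest channel difference
        if diff > max_diff:
            max_diff = diff
    return "color" if max_diff > preset else "night_vision"
```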
In an embodiment, the computer program, when executed by the processor, further implements the following steps:
Acquiring the acquisition mode adjustment time of the surveillance device corresponding to the surveillance video file to be recognized, and acquiring the shooting time of the video image to be recognized; and determining the image type of the video image to be recognized according to the acquisition mode adjustment time.
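The schedule-based test can be sketched as follows. The switch times and the midnight-wrapping window are hypothetical examples; in practice the acquisition mode adjustment time would come from the surveillance device's configuration.

```python
from datetime import time

def classify_by_capture_mode(shot_time,
                             night_start=time(19, 0), night_end=time(6, 0)):
    """Classify a frame by comparing its shooting time with the device's
    (hypothetical) mode-adjustment schedule; the night window wraps midnight."""
    if night_start <= shot_time or shot_time < night_end:
        return "night_vision"
    return "color"
```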
In an embodiment, the computer program, when executed by the processor, further implements the following steps:
Performing feature extraction on the video image to be recognized with the front network layer of the human body pose model to obtain a feature map corresponding to the video image to be recognized; extracting, with the confidence network layer of the human body pose model, the human body key points in the video image to be recognized from the feature map, to obtain a keypoint confidence map corresponding to the human body key points; extracting, with the association-degree vector network layer of the human body pose model, the association degree of each of the human body key points from the feature map; and determining the person position information of the video image to be recognized according to the keypoint confidence map and the association degrees of the human body key points.
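A simplified sketch of decoding keypoints from the confidence maps described above: it takes one peak per map and omits the association-degree (part-affinity) grouping that a real multi-person decoder would need. The part names and threshold are illustrative assumptions.

```python
def keypoints_from_confidence_maps(confidence_maps, threshold=0.5):
    """Take the peak of each confidence map as that body part's location.

    `confidence_maps` maps a part name (e.g. "nose") to a 2-D list of scores,
    as produced by the confidence network layer; peaks below `threshold` are
    discarded. Grouping peaks into individuals via association degrees is
    omitted in this sketch.
    """
    keypoints = {}
    for part, grid in confidence_maps.items():
        best_score, best_xy = -1.0, None
        for y, row in enumerate(grid):
            for x, score in enumerate(row):
                if score > best_score:
                    best_score, best_xy = score, (x, y)
        if best_score >= threshold:
            keypoints[part] = best_xy
    return keypoints
```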
In an embodiment, the computer program, when executed by the processor, further implements the following steps:
According to the association degrees of the human body key points, connecting the human body key points on the keypoint confidence map and computing a keypoint contour; obtaining the minimum circumscribed rectangle of the keypoint contour, the minimum circumscribed rectangle being the rectangle of smallest area that encloses the keypoint contour; and determining the person position information in the video image to be recognized according to the minimum circumscribed rectangle.
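As a simplification of the minimum-circumscribed-rectangle step, the sketch below computes an axis-aligned bounding box over the connected keypoints; a true minimum-area (possibly rotated) rectangle would come from something like OpenCV's `cv2.minAreaRect`, which this application does not name.

```python
def bounding_box(keypoints):
    """Axis-aligned bounding box around a set of (x, y) keypoints.

    A stand-in for the minimum circumscribed rectangle: the axis-aligned box
    is often sufficient for reporting a person's position, though it is not
    the minimum-area rectangle when the body is tilted.
    Returns (x_min, y_min, x_max, y_max).
    """
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    return (min(xs), min(ys), max(xs), max(ys))
```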
In an embodiment, the computer program, when executed by the processor, further implements the following steps:
Generating video information corresponding to the person position information, and writing the video information into a corresponding log.
In an embodiment, the computer program, when executed by the processor, further implements the following steps:
Acquiring historical surveillance video of a surveillance device; extracting color image samples and night-vision image samples from the historical surveillance video, annotating human body key points on the human bodies in the color image samples and annotating position coordinates of the human bodies in the night-vision image samples, to obtain annotated color images and annotated night-vision images; resizing the annotated color images and the annotated night-vision images respectively to obtain training color images and training night-vision images; mapping the human body key points in the training color images to the human body key points annotated in the annotated color images, and training the human body pose model with the mapped training color images; and mapping the position coordinates in the training night-vision images to the position coordinates annotated in the annotated night-vision images, and training the lightweight target detection model with the mapped training night-vision images.
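The resize-and-map step in this training pipeline amounts to rescaling each annotated coordinate by the same factors used to resize the frame, so labels stay aligned with the resized pixels. A sketch, with illustrative frame sizes:

```python
def rescale_keypoints(keypoints, orig_size, train_size):
    """Map annotated (x, y) coordinates from the original frame to the
    resized training frame. `orig_size` and `train_size` are (width, height);
    the example sizes used in the test are hypothetical."""
    sx = train_size[0] / orig_size[0]
    sy = train_size[1] / orig_size[1]
    return [(x * sx, y * sy) for x, y in keypoints]
```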
A person of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium or in a volatile computer-readable storage medium, and when executed may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For conciseness, not all possible combinations are described; however, any combination of these technical features that involves no contradiction should be considered within the scope of this specification.
The above embodiments express only several implementations of this application, and their description is relatively specific and detailed, but should not therefore be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art may make several modifications and improvements without departing from the concept of this application, all of which fall within its protection scope. Therefore, the protection scope of this application patent shall be subject to the appended claims.

Claims (20)

  1. A method for recognizing a position of a person in an image, the method comprising:
    acquiring a surveillance video file to be recognized, and preprocessing the surveillance video file to be recognized to obtain a video image to be recognized;
    determining an image type of the video image to be recognized;
    when the image type is a color image, recognizing human body key points in the video image to be recognized through a trained human body pose model, and determining person position information in the video image to be recognized based on the recognized human body key points; and
    when the image type is a night-vision image, recognizing the person position information in the video image to be recognized through a trained lightweight target detection model.
  2. The method according to claim 1, wherein determining the image type of the video image to be recognized comprises:
    acquiring three-channel pixel values of each pixel in the video image to be recognized;
    computing channel differences from the three-channel pixel values, and selecting the largest difference as a pixel difference value; and
    determining the image type of the video image to be recognized according to a preset value and the pixel difference value.
  3. The method according to claim 1, wherein determining the image type of the video image to be recognized comprises:
    acquiring an acquisition mode adjustment time of the surveillance device corresponding to the surveillance video file to be recognized, and acquiring a shooting time of the video image to be recognized; and
    determining the image type of the video image to be recognized according to the acquisition mode adjustment time.
  4. The method according to claim 1, wherein, when the image type is a color image, recognizing the human body key points in the video image to be recognized through the trained human body pose model, and determining the person position information in the video image to be recognized based on the recognized human body key points, comprises:
    performing feature extraction on the video image to be recognized with a front network layer of the human body pose model to obtain a feature map corresponding to the video image to be recognized;
    extracting, with a confidence network layer of the human body pose model, the human body key points in the video image to be recognized from the feature map, to obtain a keypoint confidence map corresponding to the human body key points;
    extracting, with an association-degree vector network layer of the human body pose model, an association degree of each of the human body key points from the feature map; and
    determining the person position information of the video image to be recognized according to the keypoint confidence map and the association degrees of the human body key points.
  5. The method according to claim 4, wherein determining the person position information of the video image to be recognized according to the keypoint confidence map and the association degrees of the human body key points comprises:
    connecting, according to the association degrees of the human body key points, the human body key points on the keypoint confidence map, and computing a keypoint contour;
    obtaining a minimum circumscribed rectangle of the keypoint contour, the minimum circumscribed rectangle being the rectangle of smallest area that encloses the keypoint contour; and
    determining the person position information in the video image to be recognized according to the minimum circumscribed rectangle.
  6. The method according to claim 1, further comprising, after obtaining the person position information:
    generating video information corresponding to the person position information; and
    writing the video information into a corresponding log.
  7. The method according to claim 1, further comprising, before acquiring the surveillance video file to be recognized, training the human body pose model and the lightweight target detection model, the training comprising:
    acquiring historical surveillance video of a surveillance device;
    extracting color image samples and night-vision image samples from the historical surveillance video, annotating human body key points on the human bodies in the color image samples, and annotating position coordinates of the human bodies in the night-vision image samples, to obtain annotated color images and annotated night-vision images;
    resizing the annotated color images and the annotated night-vision images respectively to obtain training color images and training night-vision images;
    mapping the human body key points in the training color images to the human body key points annotated in the annotated color images, and training the human body pose model with the mapped training color images; and
    mapping the position coordinates in the training night-vision images to the position coordinates annotated in the annotated night-vision images, and training the lightweight target detection model with the mapped training night-vision images.
  8. An apparatus for recognizing a position of a person in an image, the apparatus comprising:
    a preprocessing module, configured to acquire a surveillance video file to be recognized and preprocess the surveillance video file to be recognized to obtain a video image to be recognized;
    a determining module, configured to determine an image type of the video image to be recognized; and
    a recognition module, configured to, when the image type is a color image, recognize human body key points in the video image to be recognized through a trained human body pose model, and determine person position information in the video image to be recognized based on the recognized human body key points;
    wherein the recognition module is further configured to, when the image type is a night-vision image, recognize the person position information in the video image to be recognized through a trained lightweight target detection model.
  9. The apparatus according to claim 8, wherein determining the image type of the video image to be recognized comprises:
    acquiring three-channel pixel values of each pixel in the video image to be recognized;
    computing channel differences from the three-channel pixel values, and selecting the largest difference as a pixel difference value; and
    determining the image type of the video image to be recognized according to a preset value and the pixel difference value.
  10. The apparatus according to claim 8, wherein determining the image type of the video image to be recognized comprises:
    acquiring an acquisition mode adjustment time of the surveillance device corresponding to the surveillance video file to be recognized, and acquiring a shooting time of the video image to be recognized; and
    determining the image type of the video image to be recognized according to the acquisition mode adjustment time.
  11. The apparatus according to claim 8, wherein, when the image type is a color image, recognizing the human body key points in the video image to be recognized through the trained human body pose model, and determining the person position information in the video image to be recognized based on the recognized human body key points, comprises:
    performing feature extraction on the video image to be recognized with a front network layer of the human body pose model to obtain a feature map corresponding to the video image to be recognized;
    extracting, with a confidence network layer of the human body pose model, the human body key points in the video image to be recognized from the feature map, to obtain a keypoint confidence map corresponding to the human body key points;
    extracting, with an association-degree vector network layer of the human body pose model, an association degree of each of the human body key points from the feature map; and
    determining the person position information of the video image to be recognized according to the keypoint confidence map and the association degrees of the human body key points.
  12. The apparatus according to claim 11, wherein determining the person position information of the video image to be recognized according to the keypoint confidence map and the association degrees of the human body key points comprises:
    connecting, according to the association degrees of the human body key points, the human body key points on the keypoint confidence map, and computing a keypoint contour;
    obtaining a minimum circumscribed rectangle of the keypoint contour, the minimum circumscribed rectangle being the rectangle of smallest area that encloses the keypoint contour; and
    determining the person position information in the video image to be recognized according to the minimum circumscribed rectangle.
  13. The apparatus according to claim 8, wherein, after the person position information is obtained, the apparatus further:
    generates video information corresponding to the person position information; and
    writes the video information into a corresponding log.
  14. The apparatus according to claim 8, wherein, before the surveillance video file to be recognized is acquired, the preprocessing module is further configured to train the human body pose model and the lightweight target detection model, the training comprising:
    acquiring historical surveillance video of a surveillance device;
    extracting color image samples and night-vision image samples from the historical surveillance video, annotating human body key points on the human bodies in the color image samples, and annotating position coordinates of the human bodies in the night-vision image samples, to obtain annotated color images and annotated night-vision images;
    resizing the annotated color images and the annotated night-vision images respectively to obtain training color images and training night-vision images;
    mapping the human body key points in the training color images to the human body key points annotated in the annotated color images, and training the human body pose model with the mapped training color images; and
    mapping the position coordinates in the training night-vision images to the position coordinates annotated in the annotated night-vision images, and training the lightweight target detection model with the mapped training night-vision images.
  15. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the following steps:
    acquiring a surveillance video file to be recognized, and preprocessing the surveillance video file to be recognized to obtain a video image to be recognized;
    determining an image type of the video image to be recognized;
    when the image type is a color image, recognizing human body key points in the video image to be recognized through a trained human body pose model, and determining person position information in the video image to be recognized based on the recognized human body key points; and
    when the image type is a night-vision image, recognizing the person position information in the video image to be recognized through a trained lightweight target detection model.
  16. The computer device according to claim 15, wherein determining the image type of the video image to be recognized comprises:
    acquiring three-channel pixel values of each pixel in the video image to be recognized;
    computing channel differences from the three-channel pixel values, and selecting the largest difference as a pixel difference value; and
    determining the image type of the video image to be recognized according to a preset value and the pixel difference value.
  17. The computer device according to claim 15, wherein determining the image type of the video image to be recognized comprises:
    acquiring an acquisition mode adjustment time of the surveillance device corresponding to the surveillance video file to be recognized, and acquiring a shooting time of the video image to be recognized; and
    determining the image type of the video image to be recognized according to the acquisition mode adjustment time.
  18. The computer device according to claim 15, wherein, when the image type is a color image, recognizing the human body key points in the video image to be recognized through the trained human body pose model, and determining the person position information in the video image to be recognized based on the recognized human body key points, comprises:
    performing feature extraction on the video image to be recognized with a front network layer of the human body pose model to obtain a feature map corresponding to the video image to be recognized;
    extracting, with a confidence network layer of the human body pose model, the human body key points in the video image to be recognized from the feature map, to obtain a keypoint confidence map corresponding to the human body key points;
    extracting, with an association-degree vector network layer of the human body pose model, an association degree of each of the human body key points from the feature map; and
    determining the person position information of the video image to be recognized according to the keypoint confidence map and the association degrees of the human body key points.
  19. The computer device according to claim 15, wherein, after the person position information is obtained, the following steps are further implemented:
    generating video information corresponding to the person position information; and
    writing the video information into a corresponding log.
  20. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
PCT/CN2020/093608 2019-07-12 2020-05-30 Method and apparatus for recognizing position of person in image, computer device and storage medium WO2021008252A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910628940.8A CN110502986A (en) 2019-07-12 2019-07-12 Identify character positions method, apparatus, computer equipment and storage medium in image
CN201910628940.8 2019-07-12

Publications (1)

Publication Number Publication Date
WO2021008252A1 true WO2021008252A1 (en) 2021-01-21

Family

ID=68586137

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/093608 WO2021008252A1 (en) 2019-07-12 2020-05-30 Method and apparatus for recognizing position of person in image, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN110502986A (en)
WO (1) WO2021008252A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818908A (en) * 2021-02-22 2021-05-18 Oppo广东移动通信有限公司 Key point detection method, device, terminal and storage medium
CN112861689A (en) * 2021-02-01 2021-05-28 上海依图网络科技有限公司 Searching method and device of coordinate recognition model based on NAS technology
CN112990057A (en) * 2021-03-26 2021-06-18 北京易华录信息技术股份有限公司 Human body posture recognition method and device and electronic equipment
CN113141518A (en) * 2021-04-20 2021-07-20 北京安博盛赢教育科技有限责任公司 Control method and control device for video frame images in live classroom
CN113326773A (en) * 2021-05-28 2021-08-31 北京百度网讯科技有限公司 Recognition model training method, recognition method, device, equipment and storage medium
CN113807342A (en) * 2021-09-17 2021-12-17 广东电网有限责任公司 Method and related device for acquiring equipment information based on image
CN113873196A (en) * 2021-03-08 2021-12-31 南通市第一人民医院 Method and system for improving infection prevention and control management quality

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110502986A (en) * 2019-07-12 2019-11-26 平安科技(深圳)有限公司 Identify character positions method, apparatus, computer equipment and storage medium in image
CN111178323B (en) * 2020-01-10 2023-08-29 北京百度网讯科技有限公司 Group behavior recognition method, device, equipment and storage medium based on video
CN111222486B (en) * 2020-01-15 2022-11-04 腾讯科技(深圳)有限公司 Training method, device and equipment for hand gesture recognition model and storage medium
CN111476729B (en) * 2020-03-31 2023-06-09 北京三快在线科技有限公司 Target identification method and device
CN111753643A (en) * 2020-05-09 2020-10-09 北京迈格威科技有限公司 Character posture recognition method and device, computer equipment and storage medium
CN112418135A (en) * 2020-12-01 2021-02-26 深圳市优必选科技股份有限公司 Human behavior recognition method and device, computer equipment and readable storage medium
CN113221832B (en) * 2021-05-31 2023-07-11 常州纺织服装职业技术学院 Human body identification method and system based on three-dimensional human body data
CN117354494B (en) * 2023-12-05 2024-02-23 天津华来科技股份有限公司 Testing method for night vision switching performance of intelligent camera

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104573111A (en) * 2015-02-03 2015-04-29 中国人民解放军国防科学技术大学 Method for structured storage and pre-retrieval of pedestrian data in surveillance videos
US20150194034A1 (en) * 2014-01-03 2015-07-09 Nebulys Technologies, Inc. Systems and methods for detecting and/or responding to incapacitated person using video motion analytics
CN108829233A (en) * 2018-04-26 2018-11-16 深圳市深晓科技有限公司 A kind of exchange method and device
CN109961014A (en) * 2019-02-25 2019-07-02 中国科学院重庆绿色智能技术研究院 A kind of coal mine conveying belt danger zone monitoring method and system
CN110502986A (en) * 2019-07-12 2019-11-26 平安科技(深圳)有限公司 Identify character positions method, apparatus, computer equipment and storage medium in image

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101587622B (en) * 2009-06-18 2012-09-05 Li Qiuhua Forest fire detection and identification method and apparatus based on intelligent video image analysis
CN105005766B (en) * 2015-07-01 2018-06-01 Shenzhen Maikelong Electronics Co., Ltd. Body color recognition method
CN106027931B (en) * 2016-04-14 2018-03-16 Ping An Technology (Shenzhen) Co., Ltd. Video recording method and server
CN107844744A (en) * 2017-10-09 2018-03-27 Ping An Technology (Shenzhen) Co., Ltd. Face recognition method and device combining depth information, and storage medium
CN109740513B (en) * 2018-12-29 2020-11-27 Qingdao Pico Technology Co., Ltd. Action behavior analysis method and device
CN109886139A (en) * 2019-01-28 2019-06-14 Ping An Technology (Shenzhen) Co., Ltd. Human body detection model generation method, sewage outlet abnormality detection method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yang, Ali et al.: "Video-based 24-hour vehicle detection method for traffic security-access monitoring", Journal of Hefei University of Technology, vol. 35, no. 3, 31 March 2012, ISSN 1008-3634 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861689A (en) * 2021-02-01 2021-05-28 Shanghai Yitu Network Technology Co., Ltd. Searching method and device of coordinate recognition model based on NAS technology
CN112818908A (en) * 2021-02-22 2021-05-18 Guangdong OPPO Mobile Telecommunications Corp., Ltd. Key point detection method, device, terminal and storage medium
CN113873196A (en) * 2021-03-08 2021-12-31 Nantong First People's Hospital Method and system for improving infection prevention and control management quality
CN112990057A (en) * 2021-03-26 2021-06-18 Beijing E-hualu Information Technology Co., Ltd. Human body posture recognition method and device and electronic equipment
CN113141518A (en) * 2021-04-20 2021-07-20 Beijing Anbo Shengying Education Technology Co., Ltd. Control method and control device for video frame images in live classroom
CN113326773A (en) * 2021-05-28 2021-08-31 Beijing Baidu Netcom Science and Technology Co., Ltd. Recognition model training method, recognition method, device, equipment and storage medium
CN113807342A (en) * 2021-09-17 2021-12-17 Guangdong Power Grid Co., Ltd. Method and related device for acquiring equipment information based on image

Also Published As

Publication number Publication date
CN110502986A (en) 2019-11-26

Similar Documents

Publication Publication Date Title
WO2021008252A1 (en) Method and apparatus for recognizing position of person in image, computer device and storage medium
CN109359575B (en) Face detection method, service processing method, device, terminal and medium
Liu et al. Learning deep models for face anti-spoofing: Binary or auxiliary supervision
WO2021047232A1 (en) Interaction behavior recognition method, apparatus, computer device, and storage medium
WO2020228446A1 (en) Model training method and apparatus, and terminal and storage medium
CN106203242B (en) Similar image identification method and equipment
WO2021052375A1 (en) Target image generation method, apparatus, server and storage medium
WO2019071664A1 (en) Human face recognition method and apparatus combined with depth information, and storage medium
CN108335331B (en) Binocular vision positioning method and equipment for steel coil
WO2021017882A1 (en) Image coordinate system conversion method and apparatus, device and storage medium
US20070009159A1 (en) Image recognition system and method using holistic Harr-like feature matching
WO2021012370A1 (en) Pupil radius detection method and apparatus, computer device and storage medium
US11682231B2 (en) Living body detection method and device
JP2002216129A (en) Face area detector, its method and computer readable recording medium
WO2019033570A1 (en) Lip movement analysis method, apparatus and storage medium
US11315360B2 (en) Live facial recognition system and method
CN111461036B (en) Real-time pedestrian detection method using background modeling to enhance data
CN112232323A (en) Face verification method and device, computer equipment and storage medium
CN113298158A (en) Data detection method, device, equipment and storage medium
CN112200056A (en) Face living body detection method and device, electronic equipment and storage medium
CN111582155A (en) Living body detection method, living body detection device, computer equipment and storage medium
Potje et al. Extracting deformation-aware local features by learning to deform
CN109002776B (en) Face recognition method, system, computer device and computer-readable storage medium
CN111881841B (en) Face detection and recognition method based on binocular vision
JP5213778B2 (en) Facial recognition device and facial organ feature point identification method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20840875

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20840875

Country of ref document: EP

Kind code of ref document: A1