CN112287877A - Multi-role close-up shot tracking method - Google Patents
- Publication number
- CN112287877A (application CN202011294296.4A)
- Authority
- CN
- China
- Prior art keywords
- video
- close
- path
- video data
- deep learning
- Prior art date: 2020-11-18
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
- G06V40/165—Detection; Localisation; Normalisation using facial parts and geometric relationships
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
Abstract
The invention discloses a multi-role close-up shot tracking method, which comprises the following steps: acquiring multi-channel video data, constructing a CNN-based deep learning model, and performing human body detection and face detection on the multi-channel video data with the deep learning model; matching the identities of the persons appearing in each video channel according to the human body detection results and the face detection results; and selecting the shot with the optimal viewing angle for each identified person, and pushing the optimal-view image and/or video stream corresponding to each person. The invention can accurately identify the relevant person data in a monitoring system, improves the analysis capability for video data, provides high-quality analysis results, and delivers close-up images and/or video streams of detected persons in real time.
Description
Technical Field
The invention relates to the technical field of video image processing, and in particular to a multi-role close-up shot tracking method.
Background
Deep learning technology has continued to develop and advance, and has become one of the most active research directions today. Convolutional Neural Networks (CNNs), an important class of deep learning algorithms, are particularly well suited to image-related problems; they are widely used in the field of computer vision and play an important role in tasks such as face detection and image retrieval.
In the prior art, persons of interest often need to be monitored and identified. Most existing approaches deploy video surveillance equipment to monitor the target persons in real time, build a model on the large-scale data sets obtained from the monitoring, extract features, and output data related to the target persons. In many scenarios, however, monitoring and identification based on raw video data cannot effectively satisfy customized service requirements. For example, because child-abuse incidents at childcare institutions occur from time to time, guardians often wish to view the institution's surveillance video in real time, so as to ensure the safety of their children and keep home care consistent with institutional care. This need is difficult to meet for the following reasons: (1) directly viewing the video through a traditional monitoring system would violate the privacy of other children; (2) displaying the full surveillance picture could reveal the institution's distinctive educational content and thereby weaken its competitiveness; (3) even where an institution allows guardians to view surveillance video in real time through mobile-phone or computer software, the shooting angle of each surveillance camera is fixed while the children move about, so the cameras cannot adjust in real time to the movements of the persons to be detected; guardians therefore cannot always see a close-up of their child and must spend time locating the child in each monitoring picture, and the data viewed by all guardians is identical, with no customized video data being distributed. Based on these practical problems, there is an urgent need for a method that can track, in real time and across multiple surveillance video channels, close-up shots of specific persons appearing in a specific scene, and that realizes a highly customized, automatic video-data generation service to meet the customized data requirements of such scenarios.
Disclosure of Invention
In order to solve the above technical problem, the present invention provides a multi-role close-up shot tracking method.
In order to achieve this purpose, the technical solution of the invention is as follows:
a multi-character close-up shot tracking method, comprising the steps of:
obtaining multi-channel video data, constructing a deep learning model based on a CNN network, and passing the deep learning model
Respectively carrying out human body detection and human face detection on the multi-channel video data;
respectively carrying out identity matching on people appearing in each path of video according to the human body detection result and the human face detection result;
and respectively selecting the lens with the optimal view angle for different identity characters, and pushing the image and/or video stream with the optimal view angle corresponding to the different identity characters.
Preferably, before the step of pushing the optimal-view images and/or video streams corresponding to the different persons, the method further comprises: cropping the corresponding central region around each person in that person's optimal-view shot, and performing high-definition image restoration on the cropped region.
Preferably, the shot with the optimal viewing angle is the shot in which the largest number of face-detection key points is detected among the multiple video channels.
Preferably, the face-detection key points include the left inner and outer eye corners, the right inner and outer eye corners, the nasion point, the left nose wing, the right nose wing, the nasal septum point, the left and right lip corners, the upper lip, the lower lip, and the chin point.
Preferably, performing human body detection and face detection on the multi-channel video data specifically comprises the following steps:
constructing a CNN-based deep learning model, extracting image features from the multi-channel video data, and generating a plurality of position-box predictions and category predictions;
computing the loss of the position-box predictions and the category predictions against the label boxes, respectively, to obtain corresponding loss values;
and updating parameters of the deep learning model according to the loss value.
Preferably, the loss function adopted for position prediction is the smooth L1 loss:

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x_1^2, & -1 < x_1 < 1 \\ |x_2| - 0.5, & x_2 > 1 \ \text{or}\ x_2 < -1 \end{cases}$$

where $x_1$ and $x_2$ both denote the difference between the position prediction and the ground-truth position, over the value ranges $-1 < x_1 < 1$ and $x_2 > 1$ or $x_2 < -1$, respectively.
Preferably, the loss function for category prediction is the cross-entropy function:

$$L = -\sum_{i} y'_i \log(y_i)$$

where $y'_i$ denotes the data label and $y_i$ denotes the predicted probability value.
Preferably, the specific method for matching the identities of the persons in each video channel according to the human body detection results and the face detection results comprises the following steps:
invoking a pre-trained feature-vector extraction model, and extracting the feature vectors of the persons in each video channel from the video streams;
calculating the Euclidean distance between every two feature vectors;
obtaining similarity results for the persons in each video channel according to the calculated Euclidean distances;
and matching the identities of the persons in each video channel according to the similarity results.
Preferably, the Euclidean distance is calculated by the formula:

$$d(m, n) = \sqrt{\sum_{i} (m_i - n_i)^2}$$

where $m_i$ and $n_i$ are elements of any two feature vectors from different video streams.
Preferably, the multi-channel video data is acquired from different angles by a plurality of surveillance capture devices.
Based on the above technical solution, the invention has the following beneficial effects: with deep learning technology at its core, the invention first solves pedestrian detection and face detection using deep-learning object detection techniques, reuses the feature vectors of the detection network to match person identities across the video channels, automatically selects the best-view shot for each person appearing in the scene, and generates a close-up shot of each person in real time from that best-view shot. Even when the optimal viewing angle changes because a person moves, the method always selects the channel with the optimal viewing angle among all the video streams for processing, continuously tracks each person's close-up shot, provides close-up images and/or video streams of the detected persons in real time, and improves the user experience.
Drawings
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
FIG. 1 is a flow chart of the multi-role close-up shot tracking method of the invention;
FIG. 2 is a schematic diagram of video stream input and output in the multi-role close-up shot tracking method of the invention;
FIG. 3 is a flow chart of the algorithm functions in the multi-role close-up shot tracking method of the invention;
FIG. 4 is a schematic diagram of deep learning training in the multi-role close-up shot tracking method of the invention;
FIG. 5 is a schematic diagram of person identity matching in the multi-role close-up shot tracking method of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings of those embodiments.
Embodiment 1
As shown in FIGS. 1 to 5, the multi-role close-up shot tracking method of the invention comprises the following steps:
1. In certain scenarios (e.g., kindergartens, nurseries, nursing homes), multiple surveillance cameras capture video from multiple viewing angles.
2. The multi-channel video data acquired in step 1 is input into an AI inference edge server for integrated processing. The specific processing steps, shown in FIG. 1, are as follows: perform real-time human body detection and face detection on each video channel to obtain the human body detection results and face recognition results of the corresponding channels; match the identities of the persons appearing in each channel, so that each person of a specific identity is associated with the surveillance video from multiple viewing angles; select the optimal viewing angle for each person of a specific identity according to the number of detected face key points, where more detected key points indicate a better viewing angle; crop the central region of each person's optimal-view shot to obtain each person's exclusive "close-up shot"; and finally perform high-definition restoration on each person's close-up shot and push the output video stream.
The algorithm workflow is illustrated using FIG. 3 as an example. Assume two video streams, 1 and 2, are captured by different cameras at two viewing angles of the same scene, in which two persons appear: an adult and a child. First, a deep-learning detection algorithm performs human body detection and face detection on the videos from both viewing angles. Video streams 1 and 2 each yield two detection results, ID1 and ID2. The IDs detected in the two video streams are then matched; the result is that ID2 in video stream 1 and ID1 in video stream 2 correspond to the same person (the adult), while ID1 in video stream 1 and ID2 in video stream 2 correspond to the same person (the child). The adult and the child therefore each have shots from two viewing angles; the adult, for example, appears both as ID2 in video stream 1 and as ID1 in video stream 2. Next, the two shots are compared and the adult's optimal viewing angle is selected according to the number of face-detection key points, which include the left inner and outer eye corners, the right inner and outer eye corners, the nasion point, the left and right nose wings, the nasal septum point, the left and right lip corners, the upper and lower lips, and the chin point. More key points are detectable for the adult in shot 2, so shot 2 offers the better viewing angle for the adult. Likewise, shot 1 offers the better viewing angle for the child. Finally, the central regions of the two persons' optimal shots are cropped and the video streams are pushed.
Implementation details:
Human body detection and face detection are performed on each channel of video data: the object detection network, with deep learning at its core, is not limited to single-stage, two-stage, anchor-free, anchor-based, or other frameworks.
The training process is shown in FIG. 4. A large amount of image data annotated with person-position label boxes is required and is input into the deep learning model. The model uses a CNN (convolutional neural network) as its backbone, progressively extracts image features, and finally outputs predictions of multiple human-body/face position boxes together with predictions of the corresponding categories. Loss calculation is performed between the prediction results and the label boxes to obtain the network's loss values on a batch of data, and the parameters of the deep learning model are updated according to the loss values, so that the model's subsequent position predictions and category predictions move closer to the true values.
Here, the loss function for position prediction is the smooth L1 loss:

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x_1^2, & -1 < x_1 < 1 \\ |x_2| - 0.5, & x_2 > 1 \ \text{or}\ x_2 < -1 \end{cases}$$

where $x_1$ and $x_2$ both denote the difference between the position prediction and the ground-truth position, over the value ranges $-1 < x_1 < 1$ and $x_2 > 1$ or $x_2 < -1$, respectively.
The loss function for category prediction is the cross-entropy loss:

$$L = -\sum_{i} y'_i \log(y_i)$$

where $y'_i$ denotes the data label and $y_i$ denotes the predicted probability value. It can be seen that human body detection/face detection is trained as a multi-task learning process.
Person identity (ID) matching is performed for the persons appearing in each video channel according to the human body detection and face detection results: in order to match the identities of persons across different video streams (viewing angles), the invention computes the Euclidean distance between the features output by human body detection/face detection. The specific method is shown in FIG. 5: the detection model performs inference on the different videos to obtain the corresponding output features, and Euclidean distances are calculated between the features corresponding to the detection results of the different video streams to match person identities. In essence, pairwise Euclidean distances between multiple feature vectors are calculated, with the Euclidean distance given by:

$$d(m, n) = \sqrt{\sum_{i} (m_i - n_i)^2}$$

where $m_i$ and $n_i$ are elements of any two feature vectors from different video streams.
The above description covers only preferred embodiments of the multi-role close-up shot tracking method disclosed herein and is not intended to limit the scope of the embodiments of the present disclosure. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the embodiments of the present disclosure shall fall within their protection scope.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present specification are all described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Claims (10)
1. A multi-role close-up shot tracking method, comprising the following steps:
acquiring multi-channel video data, constructing a CNN-based deep learning model, and performing human body detection and face detection on the multi-channel video data with the deep learning model;
matching the identities of the persons appearing in each video channel according to the human body detection results and the face detection results;
and selecting the shot with the optimal viewing angle for each identified person, and pushing the optimal-view image and/or video stream corresponding to each person.
2. The multi-role close-up shot tracking method according to claim 1, wherein before the step of pushing the optimal-view images and/or video streams corresponding to the different persons, the method further comprises: cropping the corresponding central region of each person within that person's optimal-view shot, and performing high-definition image restoration on the cropped region.
3. The multi-role close-up shot tracking method according to claim 1, wherein the shot with the optimal viewing angle is the shot in which the largest number of face-detection key points is detected among the multiple video channels.
4. The multi-role close-up shot tracking method according to claim 3, wherein the face-detection key points include the left inner and outer eye corners, the right inner and outer eye corners, the nasion point, the left nose wing, the right nose wing, the nasal septum point, the left and right lip corners, the upper lip, the lower lip, and the chin point.
5. The multi-role close-up shot tracking method according to claim 1, wherein performing human body detection and face detection on the multi-channel video data comprises:
constructing a CNN-based deep learning model, extracting image features from the multi-channel video data, and generating a plurality of position-box predictions and category predictions;
computing the loss of the position-box predictions and the category predictions against the label boxes, respectively, to obtain corresponding loss values; and updating the parameters of the deep learning model according to the loss values.
6. The multi-role close-up shot tracking method according to claim 5, wherein the loss function used for position prediction is the smooth L1 loss:

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x_1^2, & -1 < x_1 < 1 \\ |x_2| - 0.5, & x_2 > 1 \ \text{or}\ x_2 < -1 \end{cases}$$

where $x_1$ and $x_2$ both denote the difference between the position prediction and the ground-truth position, over the value ranges $-1 < x_1 < 1$ and $x_2 > 1$ or $x_2 < -1$, respectively.
8. The multi-role close-up shot tracking method according to claim 1, wherein the specific method for matching the identities of the persons in each video channel according to the human body detection and face detection results comprises:
invoking a pre-trained feature-vector extraction model, and extracting the feature vectors of the persons in each video channel from the video streams;
calculating the Euclidean distance between every two feature vectors;
obtaining similarity results for the persons in each video channel according to the calculated Euclidean distances;
and matching the identities of the persons in each video channel according to the similarity results.
10. The multi-role close-up shot tracking method according to claim 1, wherein the multi-channel video data is captured from different angles by a plurality of surveillance capture devices.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011294296.4A (CN112287877B) | 2020-11-18 | 2020-11-18 | Multi-role close-up shot tracking method |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN112287877A | 2021-01-29 |
| CN112287877B | 2022-12-02 |
Family
ID=74397916
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202011294296.4A (Active, CN112287877B) | Multi-role close-up shot tracking method | 2020-11-18 | 2020-11-18 |
Country Status (1)

| Country | Link |
|---|---|
| CN (1) | CN112287877B |
Cited By (1)

| Publication number | Priority date | Publication date | Title |
|---|---|---|---|
| CN114542874A | 2022-02-23 | 2022-05-27 | Device for automatically adjusting photographing height and angle and control system thereof |
Patent Citations (13)

| Publication number | Priority date | Publication date | Title |
|---|---|---|---|
| CN108564052A | 2018-04-24 | 2018-09-21 | Multi-camera dynamic face recognition system and method based on MTCNN |
| WO2020001082A1 | 2018-06-30 | 2020-01-02 | Face attribute analysis method based on transfer learning |
| CN109117803A | 2018-08-21 | 2019-01-01 | Face image clustering method, device, server and storage medium |
| CN109543560A | 2018-10-31 | 2019-03-29 | Person segmentation method, device, equipment and computer storage medium for video |
| WO2020155873A1 | 2019-02-02 | 2020-08-06 | Multi-face tracking method based on deep appearance features and adaptive aggregation network |
| CN109829436A | 2019-02-02 | 2019-05-31 | Multi-face tracking method based on deep appearance features and adaptive aggregation network |
| CN109919977A | 2019-02-26 | 2019-06-21 | Video moving-person tracking and identity recognition method based on temporal features |
| CN109919097A | 2019-03-08 | 2019-06-21 | Face and key point joint detection system and method based on multi-task learning |
| CN110414415A | 2019-07-24 | 2019-11-05 | Human behavior recognition method oriented to classroom scenes |
| CN110852219A | 2019-10-30 | 2020-02-28 | Multi-pedestrian cross-camera online tracking system |
| CN111401238A | 2020-03-16 | 2020-07-10 | Method and device for detecting person close-up segments in video |
| CN111815675A | 2020-06-30 | 2020-10-23 | Target object tracking method and device, electronic equipment and storage medium |
| CN111783749A | 2020-08-12 | 2020-10-16 | Face detection method and device, electronic equipment and storage medium |
Non-Patent Citations (1)

Xiao Xuzhang, "Research and Implementation of a Multi-Camera Scheduling System Based on Face Tracking and Recognition", China Excellent Master's and Doctoral Dissertations Full-text Database (Master's), Information Science and Technology Series.
Also Published As

| Publication number | Publication date |
|---|---|
| CN112287877B | 2022-12-02 |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| 2021-07-14 | TA01 | Transfer of patent application right | Applicant after: Suzhou aikor Intelligent Technology Co.,Ltd., room d303, building g-1, Shazhouhu Science and Technology Innovation Park, Huachang Road, Yangshe Town, Zhangjiagang City, Suzhou City, Jiangsu Province, 215000. Applicant before: Shanghai Sike Intelligent Technology Co.,Ltd., building 6, 351 Sizhuan Road, Sijing Town, Songjiang District, Shanghai, 201601. |
| | GR01 | Patent grant | |