WO2021196409A1

WO2021196409A1 - Video figure retrieval method and retrieval system based on deep learning

Info

Publication number: WO2021196409A1
Application number: PCT/CN2020/096015
Authority: WO
Inventors: 杨唤晨; 谢恩鹏; 徐杰; 李帅
Original assignee: 山东云缦智能科技有限公司
Priority date: 2020-04-01
Filing date: 2020-06-15
Publication date: 2021-10-07
Also published as: CN111460226A

Abstract

A video figure retrieval method and retrieval system based on deep learning. A digital video is decoded according to a frame rate and is pre-processed, such that frames or fragments can be decoded from the digital video. Facial information is acquired by using a pre-trained deep neural network, and a facial picture is then converted into a feature vector by using a Facenet network; the feature value of the facial picture of a specific figure is extracted using the Facenet network; and the distance between the feature vector and the characteristic centroid of the figure is then calculated by using a formula, and whether the figure is a specific figure is determined by means of comparing the distance with a feature hemisphere r, thereby facilitating various application scenarios, such as a service provider searching a server for videos including the specific figure.

Description

A video character retrieval method and retrieval system based on deep learning

Technical field

The invention relates to the technical field of video face retrieval, in particular to a video person retrieval method and retrieval system based on deep learning.

Background technique

In recent years, video applications such as streaming media and IPTV have developed rapidly, and activities such as watching online dramas and watching digital TV have become important entertainment methods for people. Cisco's VNI predicts that by 2022, IP video traffic will account for 82% of Internet IP traffic. In this context, people have a strong demand for more diverse and more convenient video services. Therefore, how to search for a character in a video, find a segment of an interested celebrity in a movie, or search whether a certain character appears in a surveillance video, and search for a video containing a specific character in a movie library has become a problem that needs to be solved.

Summary of the invention

In order to overcome the shortcomings of the above technologies, the present invention provides a method and system that can enable streaming media service providers and smart set-top box service providers to perform character retrieval in videos.

The technical solutions adopted by the present invention to overcome its technical problems are:

A video character retrieval method based on deep learning includes the following steps:

a) Decode the digital video file according to its frame rate;

b) Preprocess the decoded digital video frames and convert them to grayscale images;

c) Input the pre-processed frame into the pre-trained deep neural network. If there is a human face in the grayscale image, the deep neural network outputs the position of all faces in the frame and intercepts the human face. If there is no human face in the frame Face, then return to step a);

d) Perform preprocessing operations on the intercepted face images;

e) Input the preprocessed face image into the Facenet network, and the Facenet network converts the face image into an N-dimensional feature vector V _unkonwn ;

f) Input the face pictures of i specific persons to be recognized into the Facenet network, and the Facenet network extracts the characteristic value V _{target,i of the} face pictures of the specific persons, according to the formula

Calculate the characteristic centroid cen _{i of} a specific person, ρ _i is the confidence factor of the face picture of the i-th specific person, 0<ρ _i ≤1;

g) through the formula

_{Calculate the distance l cen} between the characteristic vector V _unkonwn and the characteristic centroid of the character. If l _{cen is} smaller than the characteristic sphere radius r, then the character is determined to be a specific character, and if l _{cen is} greater than or equal to the characteristic sphere radius r, it is determined that the character is not a specific character;

g) The service provider finds all the videos containing the frame of the specific person in the server, and when the user needs to watch the video of the specific person, the service provider jumps to the video containing the frame of the specific person for the user.

Further, in step a), the frame step size is set to s, and s is a positive integer greater than or equal to 1, and one frame is selected from every s frames decoded in the digital video file and sent to step b) for processing.

Preferably, in step b), the decoded digital video frame is reduced to a fixed size in a length-to-width equal ratio, and then converted into a grayscale image after the reduction.

Further, the preprocessing operation in step d) includes the following steps:

d-1) If the intercepted face image is square, scale the intercepted face image to a square image of M×M pixels;

d-2) If the captured face image is not square, use black borders to fill the image into a square image and then scale it to a square image of M×M pixels.

Preferably, N is 128 in step e).

Preferably, M is 160 in step d).

A video person retrieval system based on deep learning, including: a video decoding unit, a face detection unit, and a face feature extraction unit;

The video decoding unit includes a video decoding unit and a preprocessing unit, the video decoding unit decodes the digital video file according to its frame rate, and the preprocessing unit preprocesses the decoded digital video frames;

The face detection unit includes a deep neural network and a preprocessing unit. The deep neural network outputs the position coordinates of all faces in the frame and intercepts the faces before preprocessing by the preprocessing unit;

The face feature extraction unit is composed of the Facenet network.

The beneficial effects of the present invention are: frames or segments can be decoded in the digital video by decoding the digital video according to the frame rate and preprocessing, using the pre-trained deep neural network to obtain the face information, and then using the Facenet network to convert the human The face image is converted into a feature vector, and the feature value of the face image of a specific person is extracted using the Facenet network. Then, the distance between the feature vector and the feature centroid of the person is calculated using a formula, and the distance is compared with the feature hemisphere r to determine whether it is a specific person. This facilitates service providers to search for multiple types of application scenarios such as videos containing specific characters on the server.

Description of the drawings

Figure 1 is a system structure diagram of the present invention.

Detailed ways

The present invention will be further described below in conjunction with Figure 1 and Figure 2.

a) Decode the digital video file according to its frame rate;

d) Perform preprocessing operations on the intercepted face images;

g) through the formula

After the digital video is decoded at the frame rate and preprocessed, the frames or segments can be decoded in the digital video. The pre-trained deep neural network is used to obtain the face information and then the Facenet network is used to convert the face image into a feature vector. Use the Facenet network to extract the feature value of the face picture of a specific person, and then use a formula to calculate the distance between the feature vector and the feature centroid of the person, and compare the distance with the feature hemisphere r to determine whether it is a specific person, so as to facilitate the service provider in the server Search multiple types of application scenarios, including videos containing specific characters.

Preferably, in step a), the frame step size is set to s, and s is a positive integer greater than or equal to 1, and one frame is selected from every s frames decoded in the digital video file and sent to step b) for processing. By setting the step size, only one frame in each s frames decoded is sent to the subsequent preprocessing unit, which saves system resources and improves system operating speed.

Further, if the size of the frame is too large, in step b), the decoded digital video frame is reduced to a fixed size according to the aspect ratio, and then converted to a grayscale image, which can speed up face detection. The speed of execution.

Further, the preprocessing operation in step d) includes the following steps:

Preferably, N is 128 in step e).

Preferably, M is 160 in step d).

The face feature extraction unit is composed of the Facenet network.

The above descriptions are only preferred embodiments of the present invention, which are only used to illustrate the technical solutions of the present invention, and are not used to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are all included in the protection scope of the present invention.

Claims

A method for retrieving video characters based on deep learning, which is characterized in that it comprises the following steps:

a) Decode the digital video file according to its frame rate;

b) Preprocessing the decoded digital video frames;

c) Input the pre-processed frame into the pre-trained deep neural network. If there is a human face in the grayscale image, the deep neural network outputs the position of all faces in the frame and intercepts the human face. If there is no human face in the frame Face, then return to step a);

d) Perform preprocessing operations on the intercepted face images;

e) Input the preprocessed face image into the Facenet network, and the Facenet network converts the face image into an N-dimensional feature vector V unkonwn ;

f) Input the face pictures of i specific persons to be recognized into the Facenet network, and the Facenet network extracts the characteristic value V target,i of the face pictures of the specific persons, according to the formula
Calculate the characteristic centroid cen i of a specific person, ρ i is the confidence factor of the face picture of the i-th specific person, 0<ρ i ≤1;

g) through the formula
Calculate the distance l cen between the characteristic vector V unkonwn and the characteristic centroid of the character. If l cen is smaller than the characteristic sphere radius r, then the character is determined to be a specific character, and if l cen is greater than or equal to the characteristic sphere radius r, it is determined that the character is not a specific character;

g) The service provider finds all videos containing the frame of the specific person in the server, and when the user needs to watch the video of the specific person, the service provider jumps to the video containing the frame of the specific person for the user.
The method of video character retrieval based on deep learning according to claim 1, characterized in that: in step a), the frame step size is set to s, and s is a positive integer greater than or equal to 1, and every s in the digital video file is decoded. Select one of the frames and send it to step b) for processing.
The method of video person retrieval based on deep learning according to claim 1, characterized in that: in step b), the decoded digital video frame is reduced to a fixed size according to the aspect ratio, and then converted into Grayscale image.
The method of video character retrieval based on deep learning according to claim 1, wherein the preprocessing operation in step d) comprises the following steps:

d-1) If the intercepted face image is square, scale the intercepted face image to a square image of M×M pixels;

d-2) If the captured face image is not square, use black borders to fill the image into a square image and then scale it to a square image of M×M pixels.
The method for retrieving video characters based on deep learning according to claim 1, characterized in that: in step e), N is 128.
The method for retrieving video characters based on deep learning according to claim 4, wherein M is 160 in step d).
A retrieval system implementing the method of video person retrieval based on deep learning of claim 1, characterized in that it comprises: a video decoding unit, a face detection unit, and a face feature extraction unit;

The video decoding unit includes a video decoding unit and a preprocessing unit, the video decoding unit decodes the digital video file according to its frame rate, and the preprocessing unit preprocesses the decoded digital video frames;

The face detection unit includes a deep neural network and a preprocessing unit. The deep neural network outputs the position coordinates of all faces in the frame and intercepts the faces before preprocessing by the preprocessing unit;

The face feature extraction unit is composed of the Facenet network.