WO2021196409A1 - Video figure retrieval method and retrieval system based on deep learning - Google Patents

Video figure retrieval method and retrieval system based on deep learning

Info

Publication number
WO2021196409A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
face
frame
image
deep learning
Prior art date
Application number
PCT/CN2020/096015
Other languages
French (fr)
Chinese (zh)
Inventor
杨唤晨
谢恩鹏
徐杰
李帅
Original Assignee
山东云缦智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 山东云缦智能科技有限公司
Publication of WO2021196409A1 publication Critical patent/WO2021196409A1/en

Classifications

    • G06F16/784: Information retrieval of video data; retrieval characterised by metadata automatically derived from the content, using objects detected or recognised in the video content, the detected or recognised objects being people
    • G06F18/241: Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/045: Neural networks; architectures, e.g. interconnection topology; combinations of networks
    • G06N3/08: Neural networks; learning methods
    • G06T7/66: Image analysis; analysis of geometric attributes of image moments or centre of gravity
    • G06V40/1347: Recognition of biometric patterns; human bodies or body parts; fingerprints or palmprints; preprocessing and feature extraction
    • G06V40/1365: Recognition of biometric patterns; human bodies or body parts; fingerprints or palmprints; matching and classification
    • G06T2207/10016: Indexing scheme for image analysis; image acquisition modality: video or image sequence
    • G06T2207/20081: Indexing scheme for image analysis; special algorithmic details: training or learning
    • G06T2207/20084: Indexing scheme for image analysis; special algorithmic details: artificial neural networks [ANN]
    • G06T2207/30196, G06T2207/30201: Indexing scheme for image analysis; subject of image: human being or person; face

Definitions

  • The invention relates to the technical field of video face retrieval, in particular to a video person retrieval method and retrieval system based on deep learning.
  • The present invention provides a method and system that enable streaming media service providers and smart set-top box service providers to perform person retrieval in videos.
  • Step c): the preprocessed frame is input into a pre-trained deep neural network; if a face is present in the grayscale image, the network outputs the positions of all faces in the frame and crops them, and if no face is present in the frame, the method returns to step a).
  • The service provider finds, on the server, all videos containing frames of the specific person; when a user wants to watch videos of that person, the service provider jumps the user to the videos containing those frames.
  • In step a) the frame step size is set to s, a positive integer greater than or equal to 1, and one frame out of every s decoded frames of the digital video file is selected and sent to step b) for processing.
  • In step b) the decoded video frame is shrunk to a fixed size while preserving its aspect ratio and is then converted to a grayscale image, which speeds up face detection.
  • The preprocessing in step d) consists of padding the cropped face image to a square and scaling it to M×M pixels.
  • N is 128 in step e).
  • M is 160 in step d).
  • A video person retrieval system based on deep learning includes a video decoding unit, a face detection unit, and a face feature extraction unit.
  • The video decoding unit includes a decoding unit and a preprocessing unit: the decoding unit decodes the digital video file according to its frame rate, and the preprocessing unit preprocesses the decoded frames.
  • The face detection unit includes a deep neural network and a preprocessing unit: the deep neural network outputs the position coordinates of all faces in the frame and crops them, after which the preprocessing unit preprocesses the cropped faces.
  • The face feature extraction unit is composed of the Facenet network.
  • By decoding the digital video at its frame rate and preprocessing it, frames or segments can be extracted; a pre-trained deep neural network locates the faces, and the Facenet network converts each face picture into a feature vector.
  • The Facenet network also extracts the feature values of face pictures of a specific person; the distance between a feature vector and the person's characteristic centroid is computed and compared with the feature sphere radius r to determine whether the face belongs to the specific person.
  • Figure 1 is a system structure diagram of the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Geometry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

A video figure retrieval method and retrieval system based on deep learning. A digital video is decoded at its frame rate and preprocessed so that frames or fragments can be extracted from it. Facial information is acquired with a pre-trained deep neural network, and each facial picture is then converted into a feature vector by a Facenet network. The Facenet network also extracts the feature values of facial pictures of a specific figure; the distance between a feature vector and the characteristic centroid of the figure is computed, and whether a detected face belongs to the specific figure is determined by comparing this distance with the feature sphere radius r. This facilitates application scenarios such as a service provider searching a server for videos that include the specific figure.

Description

A video person retrieval method and retrieval system based on deep learning
Technical Field
The invention relates to the technical field of video face retrieval, and in particular to a video person retrieval method and retrieval system based on deep learning.
Background Art
In recent years, video applications such as streaming media and IPTV have developed rapidly, and activities such as following online dramas and watching digital TV have become important forms of entertainment. Cisco's VNI forecast predicts that by 2022 IP video will account for 82% of Internet IP traffic. Against this background, there is strong demand for more diverse and more convenient video services. How to retrieve persons in video, for example finding the segments in which a celebrity of interest appears in a film, checking whether a given person appears in surveillance footage, or searching a media library for videos containing a specific person, has therefore become a problem that needs to be solved.
Summary of the Invention
To overcome the shortcomings of the above technologies, the present invention provides a method and system that enable streaming media service providers and smart set-top box service providers to retrieve persons in videos.
The technical solution adopted by the present invention to solve its technical problem is as follows:
A video person retrieval method based on deep learning, comprising the following steps:
a) decoding a digital video file according to its frame rate;
b) preprocessing the decoded video frames and converting them to grayscale images;
c) inputting each preprocessed frame into a pre-trained deep neural network; if a face is present in the grayscale image, the network outputs the positions of all faces in the frame and crops them, and if no face is present in the frame, returning to step a);
d) preprocessing the cropped face images;
e) inputting each preprocessed face image into the Facenet network, which converts the face picture into an N-dimensional feature vector V_unknown;
f) inputting the face pictures of the specific person to be recognized into the Facenet network, which extracts their feature values V_target,i, and computing the characteristic centroid cen of the specific person as the confidence-weighted mean

    cen = ( Σ_i ρ_i · V_target,i ) / ( Σ_i ρ_i )

where ρ_i is the confidence factor of the i-th face picture of the specific person and 0 < ρ_i ≤ 1;
g) computing the distance between the feature vector V_unknown and the characteristic centroid of the person as

    l_cen = || V_unknown - cen ||

and, if l_cen is smaller than the feature sphere radius r, determining that the person is the specific person, while if l_cen is greater than or equal to r, determining that the person is not the specific person;
h) the service provider finds, on the server, all videos containing frames of the specific person; when a user wants to watch videos of that person, the service provider jumps the user to the videos containing those frames.
Further, in step a) a frame step size s is set, where s is a positive integer greater than or equal to 1, and one frame out of every s decoded frames of the digital video file is selected and sent to step b) for processing.
Preferably, in step b) each decoded video frame is shrunk to a fixed size while preserving its aspect ratio and is then converted to a grayscale image.
Further, the preprocessing in step d) includes the following steps:
d-1) if the cropped face image is square, scaling it to a square image of M×M pixels;
d-2) if the cropped face image is not square, padding it with black borders into a square image and then scaling it to a square image of M×M pixels.
Preferably, N in step e) is 128.
Preferably, M in step d) is 160.
A video person retrieval system based on deep learning comprises a video decoding unit, a face detection unit, and a face feature extraction unit.
The video decoding unit includes a decoding unit and a preprocessing unit: the decoding unit decodes the digital video file according to its frame rate, and the preprocessing unit preprocesses the decoded frames.
The face detection unit includes a deep neural network and a preprocessing unit: the deep neural network outputs the position coordinates of all faces in a frame and crops them, after which the preprocessing unit preprocesses the cropped faces.
The face feature extraction unit is composed of the Facenet network.
The beneficial effects of the present invention are as follows: by decoding the digital video at its frame rate and preprocessing it, frames or segments can be extracted from the video; a pre-trained deep neural network locates the faces, and the Facenet network converts each face picture into a feature vector; the Facenet network also extracts the feature values of face pictures of a specific person, after which the distance between a feature vector and the person's characteristic centroid is computed and compared with the feature sphere radius r to decide whether a detected face belongs to the specific person. This makes it convenient for service providers to support application scenarios such as searching a server for videos containing a specific person.
Brief Description of the Drawings
Figure 1 is a structural diagram of the system of the present invention.
Detailed Description
The present invention is further described below in conjunction with the accompanying drawing, Figure 1.
A video person retrieval method based on deep learning comprises the following steps:
a) decoding a digital video file according to its frame rate;
b) preprocessing the decoded video frames and converting them to grayscale images;
c) inputting each preprocessed frame into a pre-trained deep neural network; if a face is present in the grayscale image, the network outputs the positions of all faces in the frame and crops them, and if no face is present in the frame, returning to step a);
d) preprocessing the cropped face images;
e) inputting each preprocessed face image into the Facenet network, which converts the face picture into an N-dimensional feature vector V_unknown;
f) inputting the face pictures of the specific person to be recognized into the Facenet network, which extracts their feature values V_target,i, and computing the characteristic centroid cen of the specific person as the confidence-weighted mean

    cen = ( Σ_i ρ_i · V_target,i ) / ( Σ_i ρ_i )

where ρ_i is the confidence factor of the i-th face picture of the specific person and 0 < ρ_i ≤ 1;
g) computing the distance between the feature vector V_unknown and the characteristic centroid of the person as

    l_cen = || V_unknown - cen ||

and, if l_cen is smaller than the feature sphere radius r, determining that the person is the specific person, while if l_cen is greater than or equal to r, determining that the person is not the specific person;
h) the service provider finds, on the server, all videos containing frames of the specific person; when a user wants to watch videos of that person, the service provider jumps the user to the videos containing those frames. (Steps e) to g) are illustrated in the code sketch below.)
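The decision logic of steps e) to g) reduces to a confidence-weighted centroid and a Euclidean distance test. The following is a minimal NumPy sketch under the formulas above; the names characteristic_centroid, is_specific_person, targets, rho, and r are illustrative and not taken from the patent:

```python
import numpy as np

def characteristic_centroid(targets: np.ndarray, rho: np.ndarray) -> np.ndarray:
    """Confidence-weighted mean of the reference embeddings V_target,i.

    targets: (num_pictures, N) array of Facenet feature vectors of the person.
    rho:     (num_pictures,) array of confidence factors, 0 < rho_i <= 1.
    """
    return (rho[:, None] * targets).sum(axis=0) / rho.sum()

def is_specific_person(v_unknown: np.ndarray, cen: np.ndarray, r: float) -> bool:
    """Step g): compare the distance l_cen with the feature sphere radius r."""
    l_cen = np.linalg.norm(v_unknown - cen)
    return l_cen < r
```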
After the digital video is decoded at its frame rate and preprocessed, frames or segments can be extracted from it. The pre-trained deep neural network locates the faces, and the Facenet network converts each face picture into a feature vector. The Facenet network also extracts the feature values of face pictures of a specific person; the distance between a feature vector and the person's characteristic centroid is then computed and compared with the feature sphere radius r to decide whether the face belongs to the specific person, making it convenient for service providers to support application scenarios such as searching a server for videos containing a specific person.
Preferably, in step a) a frame step size s is set, where s is a positive integer greater than or equal to 1, and one frame out of every s decoded frames of the digital video file is selected and sent to step b) for processing. By setting the step size, only one of every s decoded frames is passed to the subsequent preprocessing unit, which saves system resources and speeds up the system.
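A minimal sketch of this sampling step, assuming OpenCV is used for decoding; the default step size of 5 is an arbitrary illustration:

```python
import cv2

def sample_frames(path: str, s: int = 5):
    """Decode a video at its native frame rate and yield every s-th frame."""
    cap = cv2.VideoCapture(path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:              # end of stream or decode error
            break
        if index % s == 0:      # keep one frame out of every s
            yield frame
        index += 1
    cap.release()
```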
Further, if the frames are large, in step b) the decoded video frame is shrunk to a fixed size while preserving its aspect ratio and is then converted to a grayscale image, which speeds up the execution of face detection.
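A sketch of this preprocessing, again assuming OpenCV; the target width of 640 pixels is an assumption, as the patent only specifies "a fixed size":

```python
import cv2

def preprocess_frame(frame, target_width: int = 640):
    """Shrink a frame to a fixed width, preserving aspect ratio, then grayscale it."""
    h, w = frame.shape[:2]
    if w > target_width:                 # only shrink, never enlarge
        scale = target_width / w
        frame = cv2.resize(frame, (target_width, int(h * scale)))
    return cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
```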
Further, the preprocessing in step d) includes the following steps (sketched in code below):
d-1) if the cropped face image is square, scaling it to a square image of M×M pixels;
d-2) if the cropped face image is not square, padding it with black borders into a square image and then scaling it to a square image of M×M pixels.
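A sketch of steps d-1) and d-2), assuming OpenCV and the preferred value M = 160:

```python
import cv2

def to_square_face(face, m: int = 160):
    """Pad a cropped face to a square with black borders, then scale to m x m."""
    h, w = face.shape[:2]
    if h != w:
        side = max(h, w)
        top = (side - h) // 2
        left = (side - w) // 2
        face = cv2.copyMakeBorder(face, top, side - h - top, left, side - w - left,
                                  cv2.BORDER_CONSTANT, value=0)
    return cv2.resize(face, (m, m))
```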
Preferably, N in step e) is 128.
Preferably, M in step d) is 160.
A video person retrieval system based on deep learning comprises a video decoding unit, a face detection unit, and a face feature extraction unit, which can be wired together as sketched after this description.
The video decoding unit includes a decoding unit and a preprocessing unit: the decoding unit decodes the digital video file according to its frame rate, and the preprocessing unit preprocesses the decoded frames.
The face detection unit includes a deep neural network and a preprocessing unit: the deep neural network outputs the position coordinates of all faces in a frame and crops them, after which the preprocessing unit preprocesses the cropped faces.
The face feature extraction unit is composed of the Facenet network.
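For illustration only, the three units could be combined roughly as follows, reusing the sketches above; detect_faces and facenet_embed are hypothetical callables standing in for the pre-trained detection network and the Facenet network, which the patent does not further specify:

```python
def video_contains_person(video_path, detect_faces, facenet_embed, centroid, r, s=5):
    """Return True if any sampled frame of the video shows the specific person."""
    for frame in sample_frames(video_path, s):        # video decoding unit
        gray = preprocess_frame(frame)
        for (x, y, w, h) in detect_faces(gray):       # face detection unit (hypothetical)
            face = to_square_face(gray[y:y + h, x:x + w])
            v = facenet_embed(face)                   # feature extraction unit (hypothetical)
            if is_specific_person(v, centroid, r):
                return True
    return False
```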
The above are only preferred embodiments of the present invention; they are intended to illustrate its technical solution, not to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within its protection scope.

Claims (7)

  1. A video person retrieval method based on deep learning, characterized in that it comprises the following steps:
    a) decoding a digital video file according to its frame rate;
    b) preprocessing the decoded video frames;
    c) inputting each preprocessed frame into a pre-trained deep neural network; if a face is present in the grayscale image, the network outputs the positions of all faces in the frame and crops them, and if no face is present in the frame, returning to step a);
    d) preprocessing the cropped face images;
    e) inputting each preprocessed face image into the Facenet network, which converts the face picture into an N-dimensional feature vector V_unknown;
    f) inputting the face pictures of the specific person to be recognized into the Facenet network, which extracts their feature values V_target,i, and computing the characteristic centroid cen of the specific person as the confidence-weighted mean

        cen = ( Σ_i ρ_i · V_target,i ) / ( Σ_i ρ_i )

    where ρ_i is the confidence factor of the i-th face picture of the specific person and 0 < ρ_i ≤ 1;
    g) computing the distance between the feature vector V_unknown and the characteristic centroid of the person as

        l_cen = || V_unknown - cen ||

    and, if l_cen is smaller than the feature sphere radius r, determining that the person is the specific person, while if l_cen is greater than or equal to r, determining that the person is not the specific person;
    h) the service provider finding, on the server, all videos containing frames of the specific person, and, when a user needs to watch videos of the specific person, jumping the user to the videos containing those frames.
  2. The deep-learning-based video person retrieval method according to claim 1, characterized in that in step a) a frame step size s is set, where s is a positive integer greater than or equal to 1, and one frame out of every s decoded frames of the digital video file is selected and sent to step b) for processing.
  3. The deep-learning-based video person retrieval method according to claim 1, characterized in that in step b) each decoded video frame is shrunk to a fixed size while preserving its aspect ratio and is then converted to a grayscale image.
  4. The deep-learning-based video person retrieval method according to claim 1, characterized in that the preprocessing in step d) comprises the following steps:
    d-1) if the cropped face image is square, scaling it to a square image of M×M pixels;
    d-2) if the cropped face image is not square, padding it with black borders into a square image and then scaling it to a square image of M×M pixels.
  5. The deep-learning-based video person retrieval method according to claim 1, characterized in that N in step e) is 128.
  6. The deep-learning-based video person retrieval method according to claim 4, characterized in that M in step d) is 160.
  7. A retrieval system implementing the deep-learning-based video person retrieval method of claim 1, characterized in that it comprises a video decoding unit, a face detection unit, and a face feature extraction unit;
    the video decoding unit includes a decoding unit and a preprocessing unit, the decoding unit decoding the digital video file according to its frame rate and the preprocessing unit preprocessing the decoded frames;
    the face detection unit includes a deep neural network and a preprocessing unit, the deep neural network outputting the position coordinates of all faces in a frame and cropping them, after which the preprocessing unit preprocesses the cropped faces;
    the face feature extraction unit is composed of the Facenet network.
PCT/CN2020/096015 2020-04-01 2020-06-15 Video figure retrieval method and retrieval system based on deep learning WO2021196409A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010249216.7A CN111460226A (en) 2020-04-01 2020-04-01 Video character retrieval method and retrieval system based on deep learning
CN202010249216.7 2020-04-01

Publications (1)

Publication Number Publication Date
WO2021196409A1 (en)

Family

ID=71682499

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/096015 WO2021196409A1 (en) 2020-04-01 2020-06-15 Video figure retrieval method and retrieval system based on deep learning

Country Status (2)

Country Link
CN (1) CN111460226A (en)
WO (1) WO2021196409A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705422B (en) * 2021-08-25 2024-04-09 山东浪潮超高清视频产业有限公司 Method for obtaining character video clips through human faces

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106658169A (en) * 2016-12-18 2017-05-10 北京工业大学 Universal method for segmenting video news in multi-layered manner based on deep learning
CN107911748A (en) * 2017-11-24 2018-04-13 南京融升教育科技有限公司 A kind of video method of cutting out based on recognition of face
CN108337532A (en) * 2018-02-13 2018-07-27 腾讯科技(深圳)有限公司 Perform mask method, video broadcasting method, the apparatus and system of segment
CN108647621A (en) * 2017-11-16 2018-10-12 福建师范大学福清分校 A kind of video analysis processing system and method based on recognition of face
US20190065825A1 (en) * 2017-08-23 2019-02-28 National Applied Research Laboratories Method for face searching in images

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10943096B2 (en) * 2017-12-31 2021-03-09 Altumview Systems Inc. High-quality training data preparation for high-performance face recognition systems
CN108764067A (en) * 2018-05-08 2018-11-06 北京大米科技有限公司 Video intercepting method, terminal, equipment and readable medium based on recognition of face
CN110188602A (en) * 2019-04-17 2019-08-30 深圳壹账通智能科技有限公司 Face identification method and device in video
CN110543811B (en) * 2019-07-15 2024-03-08 华南理工大学 Deep learning-based non-cooperative examination personnel management method and system

Also Published As

Publication number Publication date
CN111460226A (en) 2020-07-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20929466

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20929466

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 05/04/2023)
