CN111898488A - Video image identification method and device, terminal and storage medium - Google Patents


Info

Publication number
CN111898488A
Authority
CN
China
Prior art keywords
video image
face
reference picture
picture
identification
Legal status
Pending (the status listed is an assumption and is not a legal conclusion)
Application number
CN202010679057.4A
Other languages
Chinese (zh)
Inventor
谢导
Current Assignee
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Application filed by Guangzhou Kugou Computer Technology Co Ltd filed Critical Guangzhou Kugou Computer Technology Co Ltd
Priority to CN202010679057.4A
Publication of CN111898488A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; localisation; normalisation
    • G06V40/166 Detection; localisation; normalisation using acquisition arrangements
    • G06V40/168 Feature extraction; face representation
    • G06V40/171 Local features and components; facial parts; occluding parts, e.g. glasses; geometrical relationships
    • G06V40/178 Estimating age from face image; using age information for improving recognition
    • G06T3/04
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion

Abstract

The disclosure provides a video image identification method, apparatus, terminal, and storage medium, and belongs to the field of internet technology. The method comprises the following steps: identifying face key points in each frame of video image; if no face key points are identified in a video image, identifying the face key points in the video image according to the face recognition result of a designated picture, the designated picture being a reference picture whose acquisition time is within a first preset time range of the acquisition time of the video image; and determining the face region of the video image based on the identified face key points. When the face region in a video image is identified, recognition is first performed in the video recognition mode, and any video image whose face key points cannot be identified in that mode is identified with the assistance of a reference picture.

Description

Video image identification method and device, terminal and storage medium
Technical Field
The present disclosure relates to the field of internet technology, and in particular to a video image identification method, apparatus, terminal, and storage medium.
Background
With the development of internet technology, webcast live streaming is favored by more and more users. During a live stream, the anchor usually enables functions such as beautification and stickers to enhance his or her appearance and attract more viewers. Before functions such as beautification and stickers can be applied, the anchor's face region needs to be identified in the video image.
At present, the related art mainly identifies video images as follows: video images of a user are captured in real time; face key points are identified in the key frames of the video, and the face region in each key frame is determined based on the identified key points; for non-key frames, a tracking algorithm tracks and locates the face key points of the key frames to identify the face key points in the non-key frames, and the face region in each non-key frame is determined based on those key points.
However, in a live video scene, the distance and position between the anchor and the camera constrain the size of the face region in the captured video image. When the face region is small, it cannot be identified accurately; in particular, when the face region occupies less than 20% of the video image, it may not be identified at all.
Disclosure of Invention
The embodiments of the present disclosure provide a video image identification method, apparatus, terminal, and storage medium capable of accurately identifying the face region in a video image. The technical scheme is as follows:
in one aspect, a method for identifying a video image is provided, and the method includes:
in the process of collecting video images, identifying key points of a human face in each frame of video image;
for any frame of video image, if no face key points are identified in the video image, obtaining the face recognition result of a designated picture, the designated picture being a reference picture whose acquisition time is within a first preset time range of the acquisition time of the video image, and a reference picture being a picture used to assist in identifying the video image;
identifying face key points in the video image according to the face identification result of the specified picture;
and determining a face area of the video image based on the identified face key points.
In another embodiment of the present disclosure, the obtaining a face recognition result of the specified picture includes:
acquiring the acquisition time of the video image;
acquiring, from a reference picture set according to the acquisition time of the video image, the reference picture whose acquisition time has the shortest interval to the acquisition time of the video image, and taking the acquired reference picture as the designated picture, the reference picture set comprising a plurality of reference pictures and their face recognition results;
and acquiring a face recognition result of the specified picture from the reference picture set.
In another embodiment of the present disclosure, the method further comprises:
in the acquisition process of the video image, acquiring a reference picture at intervals of a second preset time length;
extracting the face features of each reference picture;
identifying each reference picture based on the face characteristics of each reference picture to obtain a face identification result of each reference picture;
and constructing the reference picture set based on each reference picture and the face recognition result thereof.
In another embodiment of the present disclosure, the identifying key points of a face in the video image according to the face identification result of the designated picture includes:
cutting the video image according to the face recognition result of the specified picture;
and identifying the key points of the human face in the cut video image.
In another embodiment of the present disclosure, after determining a face region of the video image based on the identified face key points, the method further includes:
detecting the selected special effect option;
and adding a corresponding special effect in the face area based on the selected special effect option.
In another aspect, an apparatus for recognizing a video image is provided, the apparatus comprising:
the identification module is used for identifying the key points of the human face in each frame of video image in the acquisition process of the video image;
the acquisition module is configured to, for any frame of video image in which no face key points are recognized, obtain the face recognition result of a designated picture, the designated picture being a reference picture whose acquisition time is within a first preset time range of the acquisition time of the video image, and a reference picture being a picture used to assist in identifying the video image;
the identification module is further used for identifying the key points of the face in the video image according to the face identification result of the specified picture;
and the determining module is used for determining the face area of the video image based on the identified face key points.
In another embodiment of the present disclosure, the obtaining module is configured to obtain the acquisition time of the video image; acquire, from a reference picture set according to the acquisition time of the video image, the reference picture whose acquisition time has the shortest interval to the acquisition time of the video image, and take the acquired reference picture as the designated picture, the reference picture set comprising a plurality of reference pictures and their face recognition results; and acquire the face recognition result of the designated picture from the reference picture set.
In another embodiment of the present disclosure, the apparatus further comprises:
the acquisition module is used for acquiring a reference picture every second preset time length in the acquisition process of the video image;
the extraction module is used for extracting the face features of each reference picture;
the identification module is used for identifying each reference picture based on the face characteristics of each reference picture to obtain the face identification result of each reference picture;
and the construction module is used for constructing the reference picture set based on each reference picture and the face recognition result thereof.
In another embodiment of the present disclosure, the recognition module is further configured to crop the video image according to a face recognition result of the designated picture; and identifying the key points of the human face in the cut video image.
In another embodiment of the present disclosure, the apparatus further comprises:
the detection module is used for detecting the selected special effect options;
and the adding module is used for adding corresponding special effects in the face area based on the selected special effect options.
In another aspect, a terminal is provided, which includes a processor and a memory, where at least one program code is stored in the memory, and the at least one program code is loaded and executed by the processor to implement the method for identifying a video image according to one aspect.
In another aspect, a computer readable storage medium is provided, in which at least one program code is stored, the at least one program code being loaded and executed by a processor to implement the method for identifying video images according to one aspect.
The technical scheme provided by the embodiment of the disclosure has the following beneficial effects:
When the face region in a video image is identified, recognition is first performed in the video recognition mode, and any video image whose face key points cannot be identified in that mode is identified with the assistance of a reference picture.
Drawings
To illustrate the technical solutions in the embodiments of the present disclosure more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present disclosure, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is an implementation environment related to a video image recognition method provided by an embodiment of the present disclosure;
fig. 2 is a flowchart of a video image recognition method provided by an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a face key point provided in an embodiment of the present disclosure;
fig. 4 is a flowchart of a video image recognition process according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an apparatus for recognizing a video image according to an embodiment of the present disclosure;
fig. 6 shows a block diagram of a terminal according to an exemplary embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the present disclosure more apparent, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
At present, when a face region in a video image is recognized, there are two recognition modes:
the first is a video recognition mode, which mainly recognizes key points of a face of a collected video image, and the recognition mode has high recognition speed but low recognition accuracy, and may not recognize the video image with a small face area.
The second is a picture recognition mode, which mainly recognizes the face in a single picture. This mode can recognize various characteristics of the face, such as the facial features, face shape, age, and gender, so its accuracy is high; however, because it uses a fine recognition algorithm, it is slow.
However, scenes with strong real-time requirements, such as live video, demand both high recognition accuracy and high recognition speed, and neither existing recognition mode alone can satisfy such scenes. The embodiments of the present disclosure therefore provide a video image identification method: during video capture, one reference picture is captured every preset interval and identified in the picture recognition mode, so that when each captured video frame is identified in the video recognition mode, frames whose face key points cannot be identified can be assisted by the picture recognition results, improving accuracy while maintaining speed.
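The hybrid flow described above, fast video-mode recognition with a fall-back to reference-picture assistance, can be sketched as follows. The callables `video_mode_detect` and `picture_assist` are hypothetical stand-ins for the two recognizers; they are not APIs named in this disclosure:

```python
def identify_frame(frame, video_mode_detect, picture_assist, reference_set):
    """Try the fast video-recognition mode first; fall back to
    reference-picture-assisted recognition only when it finds nothing."""
    keypoints = video_mode_detect(frame)
    if keypoints is not None:
        return keypoints  # fast path: face key points identified directly
    return picture_assist(frame, reference_set)  # slow path: use a reference picture
```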
Fig. 1 shows the implementation environment related to the video image identification method provided by the embodiment of the present disclosure. Referring to fig. 1, the implementation environment includes: a terminal 101 and a server 102.
The terminal 101 is installed with a live video application or a photographing application, and based on the installed live video application or photographing application, a video photographing service can be provided for a user. In order to be able to photograph the user, the terminal 101 is further provided with a camera including at least one of a front camera and a rear camera. The terminal 101 may be a smart phone, a tablet computer, a digital camera, or other devices, and the product type of the terminal 101 is not specifically limited in the embodiment of the present disclosure.
The server 102 is a background server for a live video application or a photographing application, and the server 102 can receive video data uploaded by a terminal based on the live video application or the photographing application installed on the terminal and forward the video data uploaded by the terminal to other users. The server 102 may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers.
The terminal 101 and the server 102 may be connected via a wired network or a wireless network.
It should be understood that, as used in the embodiments of the present disclosure, "a plurality of" means two or more, "each" refers to every one of the corresponding plurality, and "any" refers to any one of the corresponding plurality. For example, if a plurality of words includes 10 words, "each word" refers to every one of the 10 words, and "any word" refers to any one of the 10 words.
Based on the implementation environment shown in fig. 1, an embodiment of the present disclosure provides a video image identification method. Referring to fig. 2, the flow of the method provided by the embodiment of the present disclosure includes:
201. in the process of collecting the video images, the terminal identifies the key points of the human face in each frame of video image.
In a live video or photographing scene, the terminal can automatically start the camera and call it to capture video images of the user. The terminal usually captures at a preset frame rate determined by the performance of the camera; different cameras capture at different frame rates, and the preset frame rate may be 30 frames/second, 40 frames/second, and so on, which is not specifically limited in the embodiments of the present disclosure. The preset frame rate also determines the generation speed of the video images: the higher the frame rate, the faster each frame is generated. At a preset frame rate of 30 frames/second, a frame is generated roughly every 33 milliseconds; at 40 frames/second, every 25 milliseconds.
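The relationship between frame rate and per-frame generation time stated above is simple arithmetic — the frame period is 1000/fps milliseconds:

```python
def frame_period_ms(fps):
    """Per-frame generation time in milliseconds for a given capture frame rate."""
    return 1000.0 / fps

# 40 frames/second -> 25 ms per frame; 30 frames/second -> ~33.3 ms per frame.
```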
Face key points are points that characterize the facial features and face shape, and include eyebrow key points, eye key points, nose key points, mouth key points, and face contour key points. The number of face key points is determined by the identification precision required for the video image and the computing power of the terminal, and may be 44, 88, 106, and so on. If high identification precision is required and the terminal's computing power is strong, a relatively large number of face key points can be selected; if the precision requirement is low and the computing power is weak, a relatively small number can be selected. Referring to fig. 3, 44 annotated face key points are shown.
In a live video or photographing scene, generally only simple processing such as beautification and stickers is applied to the face region, and fine recognition of every video image is unnecessary. Therefore, to improve the speed of identifying the face region, during video capture the terminal identifies the face key points in each captured frame based on the video recognition mode.
In the video recognition mode, the recognition accuracy is low; in particular, when the face region occupies a small proportion of the video image, the face key points may not be identified at all. To improve the identification precision of the video recognition mode, the method provided by the embodiment of the present disclosure captures a plurality of reference pictures during video capture, identifies them, and constructs a reference picture set from the reference pictures and their face recognition results, so that the set can assist in identifying the face key points in video images.
The reference picture set comprises a plurality of reference pictures and face recognition results thereof. For the construction process of the reference picture set, the following steps can be adopted:
2011. and the terminal collects a frame of reference picture every second preset time.
A reference picture is a picture used to assist in identifying video images. The second preset duration is longer than the per-frame capture interval and can be determined according to the required identification precision: the higher the required precision, the shorter the duration; the lower the required precision, the longer the duration. The second preset duration may be 1 second, 2 seconds, and so on.
2012. And the terminal extracts the face features of each reference picture.
The face features may include gender features, age features, facial features (eyebrows, eyes, nose, and mouth), and the like. When extracting the face features of each reference picture, the terminal can input each reference picture into a face feature extraction model and extract its face features based on that model. The face feature extraction model includes a gender feature extraction model, an age feature extraction model, a facial-feature extraction model, and the like. For example, the terminal may input each reference picture into the gender feature extraction model and output its gender features; into the age feature extraction model and output its age features; or into the facial-feature extraction model and output its facial features.
The gender, age, and facial-feature extraction models can be obtained by training initial models on reference face images annotated with gender features, age features, and facial-feature key points, respectively. Taking the gender feature extraction model as an example, its training process may include: obtaining a preset number of face images of different genders; annotating the gender feature of each face image; extracting the Gabor features of each annotated face image; reducing the dimensionality of the Gabor features of the preset number of annotated face images; and training the initial gender feature extraction model on the dimensionality-reduced Gabor features to obtain the gender feature extraction model.
2013. And identifying each reference picture by the terminal based on the face characteristics of each reference picture to obtain the face identification result of each reference picture.
Based on the face features of each reference picture, the terminal inputs them into a face recognition model and outputs the face recognition result of each reference picture. The face recognition model includes an age recognition model, a gender recognition model, a facial-feature recognition model, and the like; correspondingly, the face recognition result includes an age recognition result, a gender recognition result, and a facial-feature recognition result. The terminal inputs the age features of each reference picture into the age recognition model and outputs the corresponding age recognition result, inputs each reference picture into the gender recognition model and outputs the corresponding gender recognition result, and inputs each reference picture into the facial-feature recognition model and outputs the corresponding facial-feature recognition result. These recognition models can be obtained by training initial models on reference face images annotated with age, gender, and facial-feature recognition results.
2014. And constructing a reference picture set by the terminal based on each reference picture and the face recognition result thereof.
Based on each reference picture and the corresponding face recognition result, the terminal can construct a reference picture set by storing each reference picture and the corresponding face recognition result.
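Steps 2011 to 2014 can be sketched as a capture loop. Here `capture_picture` and `recognize_face` are hypothetical callables standing in for the camera and the fine picture-recognition model, and the per-entry dict layout is an illustrative assumption:

```python
import time

def build_reference_set(capture_picture, recognize_face, interval_s=1.0, count=3):
    """Every `interval_s` seconds (the second preset duration), capture one
    reference picture, run the picture-recognition model on it, and store the
    picture together with its capture time and face recognition result."""
    reference_set = []
    for _ in range(count):
        picture = capture_picture()
        result = recognize_face(picture)  # e.g. age / gender / facial-feature results
        reference_set.append({"time": time.monotonic(),
                              "picture": picture,
                              "result": result})
        time.sleep(interval_s)
    return reference_set
```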
202. And for any frame of video image, if the face key point in the video image is not identified, the terminal acquires the face identification result of the appointed picture.
For any captured frame of video image, when the terminal identifies the frame based on the video recognition mode, there are two possible outcomes: either the face key points in the frame are identified, or they are not. If the terminal identifies the face key points in the frame, the following step 204 can be executed, and the face region of the frame is determined from the identified key points; if the terminal does not identify them, it can obtain the face recognition result of the designated reference picture and identify the face key points in the frame based on that result.
The designated picture is a reference picture with the shortest time interval between the acquisition time and the acquisition time of the video image. When the terminal obtains the face recognition result of the appointed reference picture, the following steps can be adopted:
2021. and the terminal acquires the acquisition time of the video image.
Generally, when a terminal calls a camera to acquire a video image, the acquisition time of each frame of video image is recorded. Based on the recorded acquisition time of each frame of video image, the terminal can acquire the acquisition time of the frame of video image for any frame of video image without recognizing the key points of the face.
2022. The terminal acquires a reference picture with the time between the acquisition time and the acquisition time of the video image within a first preset time range from the reference picture set according to the acquisition time of the video image, and takes the acquired reference picture as an appointed picture.
The terminal compares the acquisition time of the frame of video image with the acquisition time of each reference picture in the reference picture set, and thereby obtains a reference picture whose acquisition time is within the first preset time range of the acquisition time of the video image, taking the acquired reference picture as the designated picture. The first preset duration may be 10 milliseconds, 20 milliseconds, and so on.
2023. And the terminal acquires the face recognition result of the specified picture from the reference picture set.
Because a plurality of reference pictures and face recognition results thereof are stored in the reference picture set, when the reference picture with the shortest time interval between the acquisition time and the acquisition time of the video image is acquired, the terminal can acquire the face recognition result of the specified picture from the reference picture set.
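Steps 2021 to 2023 amount to a nearest-time lookup in the reference picture set. The entry layout and the `max_gap_s` threshold below are illustrative assumptions standing in for the first preset duration:

```python
def lookup_designated_picture(reference_set, frame_time_s, max_gap_s=0.02):
    """Return the face recognition result of the reference picture whose capture
    time is closest to the video frame's capture time, or None when even the
    closest one falls outside the allowed time range."""
    if not reference_set:
        return None
    entry = min(reference_set, key=lambda e: abs(e["time"] - frame_time_s))
    if abs(entry["time"] - frame_time_s) > max_gap_s:
        return None  # no reference picture close enough in time
    return entry["result"]
```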
203. And the terminal identifies the key points of the face in the video image according to the face identification result of the designated picture.
When the terminal identifies the face key points in the video image according to the face identification result of the designated picture, the following steps can be adopted:
2031. and the terminal cuts the video image according to the face recognition result of the appointed picture.
Generally, in scenes with strong real-time requirements, such as live video, each video frame is captured within tens of milliseconds, and to ensure identification accuracy the reference pictures are captured at short intervals. A user rarely moves substantially within such a short interval, so the face region in the video image can be roughly determined from the face recognition result of a designated picture whose acquisition time is close to that of the frame, and from that roughly determined region its approximate position can be obtained. The terminal can then crop the video image around this approximate position, cutting away the non-face area and retaining the face region. This greatly increases the proportion of the image occupied by the face, turning a "small-face image" into a "large-face image" in which the face key points are easy to identify. For example, if the face region of the designated picture is determined to be 180 pixels by 180 pixels based on its face recognition result, the video image may be cropped to 200 pixels by 200 pixels.
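The cropping step reduces to clamped rectangle arithmetic. The `(x, y, w, h)` box layout and the expansion margin are assumptions for illustration; a real implementation would slice the frame's pixel array with the returned coordinates:

```python
def crop_box_around_face(image_w, image_h, face_box, margin=0.1):
    """Expand the roughly known face region by a margin and clamp it to the
    image bounds, so the face dominates the cropped 'large-face' image."""
    x, y, w, h = face_box
    dx, dy = int(w * margin), int(h * margin)
    left, top = max(0, x - dx), max(0, y - dy)
    right, bottom = min(image_w, x + w + dx), min(image_h, y + h + dy)
    return left, top, right, bottom

# A 180x180 face at (100, 100) in a 1280x720 frame yields roughly a 216x216 crop.
```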
2032. The terminal identifies the face key points in the cropped video image.
In the video recognition mode, the terminal identifies the face key points in the cropped video image. If the face key points in the cropped video image are identified, step 204 is executed to further determine the face region of the video image; if they are not identified, it is determined that recognition of the video image has failed.
In another embodiment of the present disclosure, if the face region of the designated picture cannot be determined from its face recognition result, it is determined that the face key points of the video image cannot be recognized, and recognition of the video image fails.
204. The terminal determines the face region of the video image based on the identified face key points.
Based on the identified face key points, the terminal can determine the face region of the frame of video image by connecting the face key points; the face region includes the user's facial features, face shape, and the like.
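A minimal stand-in for "connecting the face key points" is taking their bounding box, as sketched below. This is an assumption for illustration; the patent does not specify how the key points are connected, and the function name and `(x, y)` coordinate format are hypothetical.

```python
def face_region_from_keypoints(keypoints):
    """Bounding box (left, top, right, bottom) enclosing all the face
    key points. keypoints is a list of (x, y) pixel coordinates."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    return (min(xs), min(ys), max(xs), max(ys))
```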
Further, during live video streaming, in order to attract more viewers, users usually beautify their appearance through special-effect options such as beautification and stickers, which retouch the facial features in the face region or add stickers to it. In a specific implementation, the terminal can detect the selected special-effect option and then add the corresponding special effect to the face region. For example, when the terminal detects that the sticker option is selected, it adds a sticker to the face region.
The above video image recognition process is described below by taking fig. 4 as an example.
Referring to fig. 4, after the live broadcast starts, the terminal starts the camera and collects video images with it. For any collected frame of video image A, the terminal identifies the face key points in the frame based on the video recognition mode; meanwhile, during video image collection, the terminal collects one picture C every 1 second and recognizes it in the picture recognition mode to obtain a recognition result. For video image A, if the face key points are identified in the video recognition mode, the face region determined by the key points is beautified with the beautification prop and the beautified image is pushed to other users. If the face key points cannot be identified in the video recognition mode, the terminal obtains the face recognition result of the most recently collected designated picture. If the face region cannot be approximately determined from that result, recognition of video image A fails; if it can, video image A is cropped to obtain video image B, and the face key points in video image B are then identified in the video recognition mode. If the face key points in video image B are identified, the face region they determine is beautified with the beautification prop and the beautified image is pushed to other users.
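The fallback flow of fig. 4 can be sketched as one function with the detector and crop step injected as callables. This is an illustrative sketch under stated assumptions: the patent defines no such API, and `detect_keypoints`, `reference_result`, and `crop` are hypothetical names.

```python
def recognize_frame(frame, detect_keypoints, reference_result, crop):
    """Try direct key-point detection; if it fails, use the latest
    reference result to crop the frame and retry; return the key
    points, or None when recognition of the frame fails."""
    keypoints = detect_keypoints(frame)        # video recognition mode
    if keypoints:
        return keypoints
    if reference_result is None:               # no usable reference picture
        return None
    cropped = crop(frame, reference_result)    # "small face" -> "large face"
    return detect_keypoints(cropped) or None   # second attempt on the crop
```

Injecting the detector keeps the control flow (the part the patent actually describes) separate from any particular face-recognition implementation.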
According to the method provided by the embodiment of the present disclosure, when the face region in a video image is to be identified, recognition is first performed in the video recognition mode; for a video image whose face key points cannot be identified in that mode, auxiliary recognition is performed with the help of a reference picture.
Referring to fig. 5, an embodiment of the present disclosure provides an apparatus for identifying a video image, including:
the identification module 501 is configured to identify a face key point in each frame of video image in the process of acquiring the video image;
an obtaining module 502, configured to, for any frame of video image, if a face key point in the video image is not identified, obtain a face identification result of a specified picture, where the specified picture is a reference picture in which a time interval between acquisition time and acquisition time of the video image is within a first preset time range, and the reference picture is a picture used for performing auxiliary identification on the video image;
the recognition module 501 is further configured to recognize a face key point in the video image according to a face recognition result of the designated picture;
a determining module 503, configured to determine a face area of the video image based on the identified face key point.
In another embodiment of the present disclosure, the obtaining module 502 is configured to obtain a capture time of a video image; according to the acquisition time of the video image, acquiring a reference picture with the shortest time interval between the acquisition time and the acquisition time of the video image from a reference picture set, and taking the acquired reference picture as a designated picture, wherein the reference picture set comprises a plurality of reference pictures and face recognition results thereof; and acquiring a face recognition result of the specified picture from the reference picture set.
In another embodiment of the present disclosure, the apparatus further comprises:
the obtaining module 502 is further used for collecting a reference picture at intervals of a second preset duration during the acquisition of the video images;
the extraction module is used for extracting the face features of each reference picture;
the recognition module is used for recognizing each reference picture based on the face characteristics of each reference picture to obtain a face recognition result of each reference picture;
and the construction module is used for constructing a reference picture set based on each reference picture and the face recognition result thereof.
In another embodiment of the present disclosure, the recognition module 501 is further configured to crop the video image according to the face recognition result of the designated picture, and to identify the face key points in the cropped video image.
In another embodiment of the present disclosure, the apparatus further comprises:
the detection module is used for detecting the selected special effect options;
and the adding module is used for adding corresponding special effects in the face area based on the selected special effect options.
In summary, when the apparatus provided by the embodiment of the present disclosure identifies the face region in a video image, recognition is first performed in the video recognition mode; for a video image whose face key points cannot be identified in that mode, auxiliary recognition is performed with the help of a reference picture.
Fig. 6 shows a block diagram of a terminal 600 according to an exemplary embodiment of the present disclosure. The terminal 600 may be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. The terminal 600 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal 600 includes: a processor 601 and a memory 602.
The processor 601 may include one or more processing cores, for example a 4-core or 8-core processor. The processor 601 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 601 may also include a main processor and a coprocessor: the main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 601 may also include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 602 is used to store at least one instruction, which is executed by the processor 601 to implement the video image identification method provided by the method embodiments of the present application.
In some embodiments, the terminal 600 may further optionally include: a peripheral interface 603 and at least one peripheral. The processor 601, memory 602, and peripheral interface 603 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 603 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 604, a display 605, a camera assembly 606, an audio circuit 607, a positioning component 608, and a power supply 609.
The peripheral interface 603 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 601 and the memory 602. In some embodiments, the processor 601, memory 602, and peripheral interface 603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 601, the memory 602, and the peripheral interface 603 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 604 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 604 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 604 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 604 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 604 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 604 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display 605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 605 is a touch display screen, the display screen 605 also has the ability to capture touch signals on or over the surface of the display screen 605. The touch signal may be input to the processor 601 as a control signal for processing. At this point, the display 605 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display 605, providing the front panel of the terminal 600; in other embodiments, there may be at least two displays 605, respectively disposed on different surfaces of the terminal 600 or in a folded design; in still other embodiments, the display 605 may be a flexible display disposed on a curved or folded surface of the terminal 600. The display 605 may even be arranged in a non-rectangular irregular pattern, i.e., an irregularly shaped screen. The display 605 may be made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
The camera assembly 606 is used to capture images or video. Optionally, camera assembly 606 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 606 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
Audio circuitry 607 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 601 for processing or inputting the electric signals to the radio frequency circuit 604 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 600. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 607 may also include a headphone jack.
The positioning component 608 is used to locate the current geographic location of the terminal 600 to implement navigation or LBS (Location Based Service). The positioning component 608 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
Power supply 609 is used to provide power to the various components in terminal 600. The power supply 609 may be ac, dc, disposable or rechargeable. When the power supply 609 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the terminal 600 also includes one or more sensors 610. The one or more sensors 610 include, but are not limited to: acceleration sensor 611, gyro sensor 612, pressure sensor 613, fingerprint sensor 614, optical sensor 615, and proximity sensor 616.
The acceleration sensor 611 may detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 600. For example, the acceleration sensor 611 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 601 may control the display screen 605 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 611. The acceleration sensor 611 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 612 may detect a body direction and a rotation angle of the terminal 600, and the gyro sensor 612 and the acceleration sensor 611 may cooperate to acquire a 3D motion of the user on the terminal 600. The processor 601 may implement the following functions according to the data collected by the gyro sensor 612: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensors 613 may be disposed on the side bezel of terminal 600 and/or underneath display screen 605. When the pressure sensor 613 is disposed on the side frame of the terminal 600, a user's holding signal of the terminal 600 can be detected, and the processor 601 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 613. When the pressure sensor 613 is disposed at the lower layer of the display screen 605, the processor 601 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 605. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 614 is used for collecting a fingerprint of a user, and the processor 601 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 614, or the fingerprint sensor 614 identifies the identity of the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 601 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 614 may be disposed on the front, back, or side of the terminal 600. When a physical button or vendor Logo is provided on the terminal 600, the fingerprint sensor 614 may be integrated with the physical button or vendor Logo.
The optical sensor 615 is used to collect the ambient light intensity. In one embodiment, processor 601 may control the display brightness of display screen 605 based on the ambient light intensity collected by optical sensor 615. Specifically, when the ambient light intensity is high, the display brightness of the display screen 605 is increased; when the ambient light intensity is low, the display brightness of the display screen 605 is adjusted down. In another embodiment, the processor 601 may also dynamically adjust the shooting parameters of the camera assembly 606 according to the ambient light intensity collected by the optical sensor 615.
A proximity sensor 616, also known as a distance sensor, is typically disposed on the front panel of the terminal 600. The proximity sensor 616 is used to collect the distance between the user and the front surface of the terminal 600. In one embodiment, when the proximity sensor 616 detects that the distance between the user and the front face of the terminal 600 gradually decreases, the processor 601 controls the display 605 to switch from the screen-on state to the screen-off state; when the proximity sensor 616 detects that the distance between the user and the front face of the terminal 600 gradually increases, the processor 601 controls the display 605 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in fig. 6 is not intended to be limiting of terminal 600 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
The terminal provided by the embodiment of the present disclosure first identifies the face region in a video image in the video recognition mode; for a video image whose face key points cannot be identified in that mode, the terminal performs auxiliary recognition with the help of a reference picture.
The embodiment of the present disclosure provides a computer-readable storage medium in which at least one program code is stored, the at least one program code being loaded and executed by a processor to implement the video image identification method shown in fig. 2. The computer-readable storage medium may be non-transitory. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
With the computer-readable storage medium provided by the embodiment of the present disclosure, the face region in a video image is first identified in the video recognition mode; for a video image whose face key points cannot be identified in that mode, auxiliary recognition is performed with the help of a reference picture.
The above description is intended to be exemplary only and not to limit the present disclosure, and any modification, equivalent replacement, or improvement made without departing from the spirit and scope of the present disclosure is to be considered as the same as the present disclosure.

Claims (12)

1. A method for identifying a video image, the method comprising:
in the process of collecting video images, identifying key points of a human face in each frame of video image;
for any frame of video image, if a face key point in the video image is not identified, obtaining a face identification result of a designated picture, wherein the designated picture is a reference picture of which the time interval between the acquisition time and the acquisition time of the video image is within a first preset time range, and the reference picture is a picture for performing auxiliary identification on the video image;
identifying face key points in the video image according to the face identification result of the specified picture;
and determining a face area of the video image based on the identified face key points.
2. The method according to claim 1, wherein the obtaining of the face recognition result of the specified picture comprises:
acquiring the acquisition time of the video image;
acquiring a reference picture with the shortest time interval between acquisition time and the acquisition time of the video image from a reference picture set according to the acquisition time of the video image, and taking the acquired reference picture as the designated picture, wherein the reference picture set comprises a plurality of reference pictures and face identification results thereof;
and acquiring a face recognition result of the specified picture from the reference picture set.
3. The method of claim 2, further comprising:
in the acquisition process of the video image, acquiring a reference picture at intervals of a second preset time length;
extracting the face features of each reference picture;
identifying each reference picture based on the face characteristics of each reference picture to obtain a face identification result of each reference picture;
and constructing the reference picture set based on each reference picture and the face recognition result thereof.
4. The method according to claim 1, wherein the identifying key points of the face in the video image according to the face identification result of the designated picture comprises:
cutting the video image according to the face recognition result of the specified picture;
and identifying the key points of the human face in the cut video image.
5. The method according to any one of claims 1 to 4, wherein after determining the face region of the video image based on the identified face key points, further comprising:
detecting the selected special effect option;
and adding a corresponding special effect in the face area based on the selected special effect option.
6. An apparatus for recognizing a video image, the apparatus comprising:
the identification module is used for identifying the key points of the human face in each frame of video image in the acquisition process of the video image;
the acquisition module is used for acquiring a face recognition result of a specified picture if a face key point in a video image is not recognized for any frame of video image, wherein the specified picture is a reference picture with a time interval between acquisition time and acquisition time of the video image within a first preset time range, and the reference picture is a picture for performing auxiliary recognition on the video image;
the identification module is further used for identifying the key points of the face in the video image according to the face identification result of the specified picture;
and the determining module is used for determining the face area of the video image based on the identified face key points.
7. The apparatus of claim 6, wherein the obtaining module is configured to obtain an acquisition time of the video image; acquiring a reference picture with the shortest time interval between acquisition time and the acquisition time of the video image from a reference picture set according to the acquisition time of the video image, and taking the acquired reference picture as the designated picture, wherein the reference picture set comprises a plurality of reference pictures and face identification results thereof; and acquiring a face recognition result of the specified picture from the reference picture set.
8. The apparatus of claim 7, further comprising:
the acquisition module is used for acquiring a reference picture every second preset time length in the acquisition process of the video image;
the extraction module is used for extracting the face features of each reference picture;
the identification module is used for identifying each reference picture based on the face characteristics of each reference picture to obtain the face identification result of each reference picture;
and the construction module is used for constructing the reference picture set based on each reference picture and the face recognition result thereof.
9. The apparatus of claim 6, wherein the recognition module is further configured to crop the video image according to a face recognition result of the designated picture; and identifying the key points of the human face in the cut video image.
10. The apparatus of any one of claims 6 to 9, further comprising:
the detection module is used for detecting the selected special effect options;
and the adding module is used for adding corresponding special effects in the face area based on the selected special effect options.
11. A terminal characterized in that it comprises a processor and a memory in which at least one program code is stored, said at least one program code being loaded and executed by said processor to implement the method for identification of video images according to any one of claims 1 to 5.
12. A computer-readable storage medium, wherein at least one program code is stored in the storage medium, and the at least one program code is loaded and executed by a processor to implement the method for identifying video images according to any one of claims 1 to 5.
CN202010679057.4A 2020-07-15 2020-07-15 Video image identification method and device, terminal and storage medium Pending CN111898488A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010679057.4A CN111898488A (en) 2020-07-15 2020-07-15 Video image identification method and device, terminal and storage medium

Publications (1)

Publication Number Publication Date
CN111898488A true CN111898488A (en) 2020-11-06

Family

ID=73192723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010679057.4A Pending CN111898488A (en) 2020-07-15 2020-07-15 Video image identification method and device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN111898488A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112887702A (en) * 2021-01-11 2021-06-01 杭州灵伴科技有限公司 Near-eye display equipment and camera data transmission method thereof

Similar Documents

Publication Publication Date Title
CN110189340B (en) Image segmentation method and device, electronic equipment and storage medium
CN108833818B (en) Video recording method, device, terminal and storage medium
CN107967706B (en) Multimedia data processing method and device and computer readable storage medium
CN110992493B (en) Image processing method, device, electronic equipment and storage medium
CN110865754B (en) Information display method and device and terminal
CN108965757B (en) Video recording method, device, terminal and storage medium
CN111753784A (en) Video special effect processing method and device, terminal and storage medium
CN112581358B (en) Training method of image processing model, image processing method and device
CN112667835A (en) Work processing method and device, electronic equipment and storage medium
CN110956971A (en) Audio processing method, device, terminal and storage medium
CN111027490A (en) Face attribute recognition method and device and storage medium
CN110827195A (en) Virtual article adding method and device, electronic equipment and storage medium
CN111613213B (en) Audio classification method, device, equipment and storage medium
CN111083513B (en) Live broadcast picture processing method and device, terminal and computer readable storage medium
CN111083526B (en) Video transition method and device, computer equipment and storage medium
CN110677713B (en) Video image processing method and device and storage medium
CN110991445A (en) Method, device, equipment and medium for identifying vertically arranged characters
CN112419143A (en) Image processing method, special effect parameter setting method, device, equipment and medium
CN115147451A (en) Target tracking method and device thereof
CN109819308B (en) Virtual resource acquisition method, device, terminal, server and storage medium
CN111898488A (en) Video image identification method and device, terminal and storage medium
CN114595019A (en) Theme setting method, device and equipment of application program and storage medium
CN114360494A (en) Rhythm labeling method and device, computer equipment and storage medium
CN110263695B (en) Face position acquisition method and device, electronic equipment and storage medium
CN110942426B (en) Image processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination