WO2018033155A1 - Video image processing method, apparatus and electronic device - Google Patents

Video image processing method, apparatus and electronic device

Info

Publication number
WO2018033155A1
Authority
WO
WIPO (PCT)
Prior art keywords
face
video image
business object
information
image
Prior art date
Application number
PCT/CN2017/098201
Other languages
French (fr)
Chinese (zh)
Inventor
栾青
彭义刚
Original Assignee
北京市商汤科技开发有限公司
Priority date
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司
Publication of WO2018033155A1 publication Critical patent/WO2018033155A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/451 Execution arrangements for user interfaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition
    • G06V 40/176 Dynamic expression

Definitions

  • the present application relates to artificial intelligence technology, and in particular, to a video image processing method, apparatus, and electronic device.
  • Internet video is considered a premium resource for ad placement because it can be an important entry point for business traffic.
  • Existing video advertisements are mainly inserted as fixed-duration spots at a given point during video playback, or placed at a fixed position within the video playback area and its surroundings.
  • the embodiment of the present application provides a solution for processing a video image.
  • a method for processing a video image includes: performing facial motion detection on a currently played video image that includes face information; when the detected facial motion matches a corresponding predetermined facial motion, determining the presentation position of a business object to be presented in the video image; and drawing the business object to be presented at the presentation position by means of computer graphics.
  • a video image processing apparatus includes: a video image detecting module, configured to perform facial motion detection on a currently played video image that includes face information; a presentation position determining module, configured to determine the presentation position of a business object to be presented in the video image when the facial motion detected by the video image detecting module matches a corresponding predetermined facial motion; and a business object rendering module, configured to draw the business object to be presented at the presentation position by means of computer graphics.
  • an electronic device includes: a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with one another through the communication bus.
  • the memory is configured to store at least one executable instruction, the executable instruction causing the processor to perform an operation corresponding to the processing method of the video image according to any of the above embodiments of the present application.
  • another electronic device includes: a processor and the video image processing apparatus according to any of the above embodiments of the present application; when the processor runs the video image processing apparatus, the units in the video image processing apparatus according to any of the above embodiments of the present application are executed.
  • a computer program includes computer readable code; when the computer readable code runs on a device, a processor in the device executes instructions for each step of the video image processing method according to any of the above embodiments of the present application.
  • a computer readable storage medium is used to store computer readable instructions which, when executed, implement the operations of each step of the video image processing method described in any of the above embodiments of the present application.
  • facial motion detection is performed on a currently played video image including face information, and the detected facial motion is matched against a corresponding predetermined facial motion; when the two match, the presentation position of the business object to be presented is determined and the business object is drawn there by computer graphics.
  • FIG. 1 is a flow chart of an embodiment of a method for processing a video image of the present application.
  • FIG. 2 is a flowchart of an embodiment of a method for acquiring a first convolutional network model in an embodiment of the present application
  • FIG. 3 is a flow chart of another embodiment of a method for processing a video image of the present application.
  • FIG. 4 is a flow chart of still another embodiment of a method for processing a video image of the present application.
  • FIG. 5 is a schematic structural diagram of an embodiment of a processing apparatus for video images of the present application.
  • FIG. 6 is a schematic structural diagram of another embodiment of a processing apparatus for video images of the present application.
  • FIG. 7 is a schematic structural diagram of an embodiment of an electronic device of the present application.
  • FIG. 8 is a schematic structural diagram of another embodiment of an electronic device of the present application.
  • Embodiments of the present application can be applied to electronic devices such as terminal devices, computer systems, servers, etc., which can operate with numerous other general purpose or special purpose computing system environments or configurations.
  • Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, small computer systems, mainframe computer systems, and distributed cloud computing environments including any of the above, and the like.
  • Electronic devices such as terminal devices, computer systems, servers, etc., can be described in the general context of computer system executable instructions (such as program modules) being executed by a computer system.
  • program modules may include routines, programs, target programs, components, logic, data structures, and the like that perform particular tasks or implement particular abstract data types.
  • the computer system/server can be implemented in a distributed cloud computing environment where tasks are performed by remote processing devices that are linked through a communication network.
  • program modules may be located on a local or remote computing system storage medium including storage devices.
  • FIG. 1 is a flow chart of an embodiment of a method for processing a video image of the present application.
  • the processing method of the video image of each embodiment of the present application may be exemplarily executed by an electronic device such as a computer system, a terminal device, or a server.
  • a method for processing a video image of this embodiment includes:
  • Step S110: facial motion detection is performed on the currently played video image that includes face information.
  • facial actions may include, but are not limited to, blinking, opening the mouth, nodding, and pouting.
  • the face information may include, for example, information related to the face, eyes, mouth, nose, and/or hair, and the like.
  • the video image may be an image in a live video that is being broadcast live, or a video image that has been recorded or is in the process of being recorded.
  • a live video is taken as an example.
  • there are multiple live broadcast platforms, such as the Huajiao live broadcast platform and the YY live broadcast platform, and each live broadcast platform includes multiple live broadcast rooms.
  • Each live broadcast room includes at least one anchor, who can broadcast live video to the fans in that room; the live video comprises multiple video images.
  • the subject of such a video image is usually a single main character (i.e., the anchor) against a simple background, and the anchor often occupies a large area of the video image.
  • the video image in the current live video can be obtained and examined by a preset face detection mechanism to determine whether it includes the anchor's face information. If it does, the video image is acquired or recorded for subsequent processing; if it does not, the same processing continues with the next frame of video image, until a video image including the anchor's face information is obtained.
  • the video image may also be a video image in a short video that has been recorded.
  • the user can play the short video on an electronic device, and during playback the electronic device can detect each frame of the video. If a frame includes the anchor's face information, the video image is acquired for subsequent processing; if it does not, the video image may be discarded or left unprocessed, and the next frame is acquired and processed in the same way.
  • likewise, during recording, the anchor can use his or her electronic device to detect whether each recorded frame includes the anchor's face information. If it does, the video image is acquired for subsequent processing; if it does not, the video image may be discarded or left unprocessed, and the next frame is acquired and processed in the same way.
  • the electronic device that plays the video image, or the electronic device used by the anchor, is provided with a mechanism for detecting the facial motion of the face in a video image; this mechanism can perform facial motion detection on the currently played video image that includes face information.
  • an optional processing procedure may be: the electronic device acquires the video image currently being played and, through a preset mechanism, crops out the image containing the face region; the face-region image is then analyzed and features are extracted, yielding feature data for each part of the face region (including the eyes, the mouth, the face, and the like). By analyzing this feature data, the meaning of the facial motion of the face in the video image is determined, that is, which of the following the facial motion belongs to: blinking, closing the left eye, closing the right eye, closing both eyes, moving the eyes to the left, moving the eyes to the right, turning the head to the left, turning the head to the right, raising the head, lowering the head, nodding, laughing, crying, frowning, opening the mouth, or pouting.
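  • As a concrete illustration of this crop-then-classify flow, the following is a minimal Python sketch. It assumes OpenCV's stock Haar-cascade face detector; `action_classifier` is a hypothetical stand-in for the trained facial-action model (cf. the first convolutional network model of FIG. 2), not the patent's own code.

```python
# A minimal sketch of the crop-then-classify flow described above, assuming
# OpenCV's stock Haar-cascade face detector. `action_classifier` is a
# hypothetical stand-in for the trained facial-action model.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_facial_action(frame, action_classifier):
    """Return the facial action of the face in `frame`, or None if no face."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                       # no face info: move to the next frame
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest face = anchor
    face_crop = frame[y:y + h, x:x + w]   # cut out the face region
    return action_classifier(face_crop)   # e.g. "blink", "open_mouth", "pout"
```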
  • step S110 may be performed by the processor invoking a corresponding instruction stored in the memory, or may be performed by video image detection module 501 being executed by the processor.
  • Step S120: when the detected facial motion matches the corresponding predetermined facial motion, the presentation position of the business object to be presented in the video image is determined.
  • the business object is an object created according to a certain business requirement and may include, for example but not limited to: advertisements, entertainment, weather forecasts, traffic forecasts, pets, and the like.
  • the presentation position may be the center position of a designated area in the video image, or the coordinates of a plurality of edge positions of the designated area, or the like.
  • feature data of a plurality of different facial actions may be pre-stored, and different facial actions are marked correspondingly to distinguish the meaning represented by each facial action.
  • the facial motion of the face can be detected from the video image, and the feature data of the detected facial motion can be compared with the pre-stored feature data of each facial action.
  • if the pre-stored feature data of the plurality of different facial actions includes a facial action whose feature data matches that of the detected facial motion, the detected facial motion can be determined to match the corresponding predetermined facial motion.
  • the matching result may be determined by calculation.
  • a matching algorithm may be set to calculate the matching degree between the feature data of any two facial actions.
  • the matching algorithm may be used to match the feature data of the detected facial motion against the pre-stored feature data of each type of facial action, yielding a matching degree value for each pair. The largest matching degree value is then selected from the obtained values; if it exceeds the predetermined matching threshold, the pre-stored facial action corresponding to the largest matching degree value is determined to match the detected facial motion. If the largest matching degree value does not exceed the predetermined matching threshold, the matching fails, that is, the detected facial motion is not a predetermined facial motion, and the processing of step S110 above continues on subsequent video images. A sketch of this computation follows below.
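```python
# A sketch of the matching-degree computation described above. Cosine
# similarity and the 0.8 threshold are illustrative assumptions; the patent
# fixes neither the matching algorithm nor the threshold value.
import numpy as np

MATCH_THRESHOLD = 0.8  # assumed "predetermined matching threshold"

def matching_degree(a, b):
    """Matching degree between two facial-action feature vectors (cosine)."""
    a, b = np.asarray(a, np.float32), np.asarray(b, np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_action(detected, stored):
    """stored: dict of predetermined action name -> pre-stored feature vector.
    Returns the best-matching action, or None so step S110 continues."""
    scores = {name: matching_degree(detected, feats)
              for name, feats in stored.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > MATCH_THRESHOLD else None
```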
  • the meaning represented by the matched facial motion may be determined first, and a presentation position corresponding to that meaning may be selected from a plurality of preset presentation positions as the presentation position of the business object to be presented in the video image. For example, in a live video, when the anchor is detected performing a pouting motion, the mouth region may be selected as the corresponding presentation position.
  • step S120 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a presentation location determining module 502 being executed by the processor.
  • Step S130: the business object to be presented is drawn at the presentation position by means of computer graphics.
  • in order to enhance the visual effect of the business object and make the video image more engaging, dynamic effects can be set for the business object.
  • the business object can be presented as a video, or displayed dynamically as a sequence of multiple images, among other presentation modes.
  • the corresponding business object, such as an advertisement image carrying a predetermined product identifier, may be drawn, for example, in the area where the anchor's mouth is located in the video image.
  • a fan can click the area where the business object is located; the fan's electronic device then obtains the network link corresponding to the business object and opens the page associated with the business object, from which the fan can obtain resources related to it.
  • the business object may be drawn by computer graphics, implemented through appropriate computer graphics image drawing or rendering, for example but not limited to: drawing based on the Open Graphics Library (OpenGL).
  • OpenGL defines a professional, cross-programming-language, cross-platform graphics programming interface specification. It is hardware-independent and can conveniently draw two-dimensional (2D) or three-dimensional (3D) graphics images.
  • the application is not limited to the drawing method based on the OpenGL graphics rendering engine, and other methods may be adopted.
  • for example, drawing methods based on the Unity graphics engine or on the Open Computing Language (OpenCL) are also applicable to the embodiments of the present application.
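  • For illustration, the sketch below draws a two-dimensional sticker-type business object at a presentation position by CPU-side alpha blending with NumPy. It is a simplified stand-in for the OpenGL/Unity/OpenCL rendering paths named above, which draw into a GPU framebuffer instead; the channel-layout assumptions are noted in the comments.

```python
# Simplified stand-in for GPU rendering: draw a 2D sticker by alpha blending.
# Assumes a 4-channel sticker whose color channels already match the frame's
# channel order, and that the sticker fits entirely inside the frame.
import numpy as np

def draw_business_object(frame, sticker_rgba, top_left):
    """Alpha-blend `sticker_rgba` (H x W x 4, uint8) onto `frame` at (x, y)."""
    x, y = top_left
    h, w = sticker_rgba.shape[:2]
    roi = frame[y:y + h, x:x + w].astype(np.float32)
    color = sticker_rgba[..., :3].astype(np.float32)
    alpha = sticker_rgba[..., 3:4].astype(np.float32) / 255.0
    frame[y:y + h, x:x + w] = (alpha * color + (1 - alpha) * roi).astype(np.uint8)
    return frame
```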
  • step S130 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a business object rendering module 503 being executed by a processor.
  • the video image processing method performs facial motion detection on the currently played video image including face information and matches the detected facial motion against the corresponding predetermined facial motion. When the two match, the presentation position of the business object to be presented in the video image is determined, and the business object to be presented is drawn at that position by computer graphics. When the business object is used to display an advertisement, on the one hand, because the business object to be presented is drawn at the presentation position by computer graphics, the business object is combined with video playback itself and no additional advertisement video data unrelated to the video needs to be transmitted over the network, saving network resources and/or system resources of the client. On the other hand, the business object is closely combined with the facial motion in the video image, which helps retain the main image and motion of the video subject (such as the anchor) in the video image, adds interest to the video image, and does not disturb the user's normal viewing of the video; this helps reduce the user's aversion to business objects displayed in the video image and can attract the audience's attention.
  • the process of detecting the facial motion of the face of the video image in step S110 may be implemented by using a corresponding feature extraction algorithm or using a neural network model such as a convolutional network model.
  • the convolutional network model is taken as an example to perform facial motion detection on a video image.
  • a first convolutional network model for detecting a facial motion state in the image may be pre-trained.
  • FIG. 2 is a flow chart of an embodiment of a method for acquiring the first convolutional network model in an embodiment of the present application.
  • the method for acquiring the first convolutional network model of the embodiment may be performed by any device having the functions of data collection, processing, and transmission, including but not limited to a mobile terminal and a personal computer (PC).
  • the training samples may be obtained in a plurality of manners. A training sample may be a plurality of sample images including face information, with information on the face action state marked in each sample image; this face action state information is used to determine the facial motion of the face.
  • the method for obtaining the first convolutional network model of this embodiment includes:
  • Step S210: a plurality of sample images including face information are acquired, where the sample images are labeled with information on the face action state.
  • the face information may include local attribute information and global attribute information, etc.
  • the local attribute information may include, for example but not limited to: hair color, hair length, eyebrow length, eyebrow thickness, eye size, whether the eyes are open or closed, the height of the nose, the size of the mouth, whether the mouth is open or closed, whether glasses are worn, whether a mask is worn, and the like.
  • the global attribute information may include, for example but not limited to: race, gender, age, and the like.
  • the sample image may be an image in a video or a plurality of images continuously photographed, or may be any image, and may include an image including a human face and an image including no human face.
  • the face action state, that is, the current action state of the face, may include, but is not limited to: blinking, closing the left eye, closing the right eye, closing both eyes, moving the eyes to the left, moving the eyes to the right, turning the head to the left, turning the head to the right, raising the head, lowering the head, nodding, laughing, crying, frowning, opening the mouth, or pouting.
  • the sample image may be an image that satisfies a preset resolution condition.
  • the above preset resolution condition may be: the longest side of the image does not exceed 640 pixels, the shortest side does not exceed 480 pixels, and the like.
  • the sample image may be obtained by an image acquisition device, wherein the image acquisition device for collecting the face information of the user may be a dedicated camera or a camera integrated in other devices or the like.
  • the acquired image may not satisfy the above preset resolution condition; in order to obtain a sample image that satisfies it, in the present application the acquired image is subjected to scaling processing to obtain a sample image that meets the condition.
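  • A minimal sketch of that scaling step, assuming OpenCV and the 640/480-pixel condition quoted above:

```python
# A minimal sketch of the scaling step, assuming OpenCV and the resolution
# condition above (longest side <= 640 px, shortest side <= 480 px).
import cv2

MAX_LONG_SIDE, MAX_SHORT_SIDE = 640, 480

def scale_to_sample_condition(image):
    """Downscale an acquired image until it satisfies the preset condition."""
    h, w = image.shape[:2]
    scale = min(MAX_LONG_SIDE / max(h, w), MAX_SHORT_SIDE / min(h, w), 1.0)
    if scale < 1.0:  # only shrink; images already within the condition pass
        image = cv2.resize(image, (int(w * scale), int(h * scale)),
                           interpolation=cv2.INTER_AREA)
    return image
```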
  • the face action state, such as smiling, pouting, left eye closed, right eye closed, or both eyes closed, can be marked in each sample image, and each sample image can be stored together with its marked face action state as training data.
  • the face in the sample image can be positioned to obtain the exact position of the face in the sample image. For details, refer to the following step S220.
  • Step S220: for each sample image, the face and the face key points in the sample image are detected, and the face in the sample image is positioned using the face key points to obtain face positioning information.
  • each face has certain feature points, such as the corners of the eyes, the ends of the eyebrows, the corners of the mouth, the tip of the nose, and boundary points of the face; these feature points constitute the face key points to be obtained.
  • the face key points can be used to calculate a mapping, such as a similarity transformation, from the face in the sample image to a preset standard face, aligning the face in the sample image with the standard face; the face in the sample image is thereby positioned, and its positioning information in the sample image is obtained.
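  • The alignment step might look like the following sketch, which fits a similarity transform from detected key points to a standard-face template with OpenCV. The five template coordinates and the 112x112 crop size are illustrative assumptions; the patent does not specify the standard face.

```python
# A sketch of positioning by alignment: fit a similarity transform from the
# detected key points to an assumed standard-face template.
import cv2
import numpy as np

STANDARD_FACE = np.float32([  # eye corners, nose tip, mouth corners (assumed)
    [38, 52], [74, 52], [56, 72], [42, 92], [70, 92]])

def align_face(image, keypoints):
    """Warp the face in `image` onto the standard face, positioning it."""
    src = np.float32(keypoints)  # the same 5 landmarks, in image coordinates
    matrix, _ = cv2.estimateAffinePartial2D(src, STANDARD_FACE)  # similarity
    return cv2.warpAffine(image, matrix, (112, 112))
```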
  • Step S230: the sample images containing face positioning information are taken as the training samples.
  • steps S210-S230 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a training sample acquisition module 504 that is executed by the processor.
  • Step S240: the first convolutional network model is trained using the training samples to obtain a first convolutional network model for detecting the face action state in an image.
  • the front end of the first convolutional network model may include a combination of multiple convolutional layers, pooling layers, and non-linear layers, and the back end may be a loss layer, for example one based on softmax and/or cross entropy.
  • the structure of the first convolutional network model may include:
  • Input layer: used to read the sample image and the information of the marked face action state.
  • the input layer may preprocess the sample image, output a face image or face information including positioning information, and the like.
  • the input layer outputs the preprocessed face image to the convolution layer, and inputs the preprocessed face information to the loss layer.
  • Convolutional layer: its input is a pre-processed face image or image features, and it outputs features of the face image obtained through a predetermined linear transformation.
  • Non-linear layer: non-linearly transforms the features output by the convolutional layer through non-linear functions, so that the output features have stronger expressive power.
  • Pooling layer: maps multiple values to a single value. The pooling layer can therefore enhance the non-linearity of the learned features and reduce the spatial size of the output features, which strengthens the learned features' invariance to translation (i.e., translation of the face).
  • the output feature of the pooling layer can be used again as the input data of the convolution layer or the input data of the fully connected layer.
  • the combination of the convolutional layer, the non-linear layer, and the pooling layer may be repeated one or more times; each time, the output data of the pooling layer can serve again as input data for a convolutional layer. Stacking multiple such combinations processes the input sample image more thoroughly, so that the features extracted from the sample image are more expressive.
  • Fully connected layer: linearly transforms the output data of the pooling layer, projecting the learned features into a subspace better suited to predicting the face action state.
  • Non-linear layer: as with the non-linear layer described above, it non-linearly transforms the features output by the fully connected layer. Its output features can serve as input data for the loss layer, or again as input data for another fully connected layer.
  • the fully connected layer and the nonlinear layer may be repeated one or more times.
  • Loss layer: one or more loss layers, mainly responsible for calculating the error between the predicted face action state and the labeled face action state.
  • the network parameters in the first convolutional network model can be trained by back-propagated gradient descent, so that, given an image at the input layer, the model outputs the face action state information corresponding to the face in the input image; the first convolutional network model is thus obtained.
  • in summary, the input layer is responsible for simple pre-processing of the input sample image; the combination of the convolutional, non-linear, and pooling layers is responsible for feature extraction from the sample image; the fully connected and non-linear layers map the extracted features to the face action state information; and the loss layer is responsible for calculating the prediction error.
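  • Putting the layers together, a minimal PyTorch sketch of such a network is shown below. Channel counts, the 112x112 input size, and the 16 action-state classes are illustrative assumptions; the patent fixes only the layer types and their ordering.

```python
# A minimal sketch of the structure described above: repeated convolution +
# non-linearity + pooling, then fully connected + non-linear layers, with a
# cross-entropy loss layer. Sizes are illustrative assumptions.
import torch.nn as nn

class FaceActionNet(nn.Module):
    def __init__(self, num_action_states=16):
        super().__init__()
        self.features = nn.Sequential(    # conv / non-linear / pooling, x3
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(  # fully connected + non-linear
            nn.Flatten(),
            nn.Linear(128 * 14 * 14, 256), nn.ReLU(),  # 112 / 2 / 2 / 2 = 14
            nn.Linear(256, num_action_states),
        )
        self.loss = nn.CrossEntropyLoss()  # the "loss layer" (softmax inside)

    def forward(self, x, labels=None):
        logits = self.classifier(self.features(x))
        # training: return the error to back-propagate (gradient descent);
        # inference: return the predicted action-state scores
        return self.loss(logits, labels) if labels is not None else logits
```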
  • step S240 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a first convolutional network model determination module 505 being executed by the processor.
  • the first convolutional network model obtained by training facilitates subsequent facial motion detection on the currently played video image including face information and matching of the detected facial motion against the corresponding predetermined facial motion. When the two match, the presentation position of the business object to be presented in the video image is determined, and the business object to be presented is drawn at that position by computer graphics. When the business object is used to display an advertisement, on the one hand, because the business object to be presented is drawn at the presentation position by computer graphics, the business object is combined with video playback itself and no additional advertisement video data unrelated to the video is transmitted over the network, which helps save network resources and/or system resources of the client. On the other hand, the business object is closely combined with the facial motion in the video image, which helps retain the main image and motion of the video subject (such as the anchor) in the video image, adds interest to the video image, and does not disturb the user's normal viewing of the video; this helps reduce the user's aversion to business objects displayed in the video image and can attract the audience's attention.
  • the business object is a special effect containing semantic information.
  • the special effect containing semantic information may include at least one of, or any combination of, the following special effects containing advertisement information: a two-dimensional sticker effect, a three-dimensional effect, a particle effect, and the like.
  • the video image may be a live video image, such as a video image from an anchor's live broadcast on the Huajiao live broadcast platform.
  • the processing method of the video image of this embodiment includes:
  • Step S310: the currently played video image containing face information is acquired.
  • For the specific processing of step S310, refer to the related content of step S110 in the embodiment shown in FIG. 1; details are not repeated here.
  • Step S320: face key points are extracted from the video image, and a first convolutional network model for detecting the face action state in an image is used to determine, according to the face key points, the facial motion of the face in the video image.
  • the video image may be detected to determine whether a human face is included in the video image. If it is determined that the video image includes a human face, the face key point is extracted in the video image.
  • the acquired video image and face key points can be input into a first convolutional network model, which is trained, for example, by the embodiment shown in FIG. 2 above.
  • the video image can then be processed separately, for example by feature extraction, mapping, and transformation, to detect the motion of the face in the video image and obtain the face action state in the video image; based on the face action state, the facial motion of the face contained in the video image can be determined.
  • each type of facial motion can be divided into a plurality of face action states. For example, blinking may be divided into an open-eye state and a closed-eye state. The above processing may specifically be: extracting face key points from the video image, using the pre-trained first convolutional network model for detecting the face action state in an image to determine the face action state in the video image, and determining the facial motion of the face in the video image according to the face action state in the video image.
  • a plurality of video images currently containing the face information may be acquired, and the continuity of the plurality of video images may be determined to determine whether the plurality of video images are continuous in space and time. If it is judged to be discontinuous, the authentication fails or the user is prompted to reacquire the video image.
  • each frame of the video image can be divided into 3x3 regions, and on each region a color histogram and the mean and variance of the gray levels are established. For two adjacent face images, the distance between their histograms, the distance between their gray means, and the distance between their gray variances are treated as a feature vector, and a linear classifier is used to determine whether its output is greater than or equal to zero.
  • the parameters of the linear classifier can be trained on sample data with annotation information. If the classifier output is greater than or equal to zero, the two adjacent video images are considered continuous in time and space; in this case, the corresponding face action state may be determined based on the face key points extracted from each video image, so as to determine the facial motion exhibited across the plurality of consecutive video images. If the classifier output is less than zero, the two adjacent video images are considered discontinuous in time and space, and, taking the current video image as a starting point, the processing of step S310 above continues on subsequent video images.
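  • The per-region continuity features could be computed as in the following sketch, assuming OpenCV histograms and a chi-square histogram distance; the patent names the histogram, gray-mean, and gray-variance distances but not the exact metrics.

```python
# A sketch of the 3x3-grid continuity features described above; the distance
# metrics are assumptions.
import cv2
import numpy as np

def continuity_features(prev, curr):
    """Per-region comparison of two adjacent face images: color-histogram
    distance plus absolute differences of gray mean and gray variance
    (27 values in total)."""
    feats = []
    h, w = prev.shape[:2]
    for i in range(3):
        for j in range(3):
            a = prev[i * h // 3:(i + 1) * h // 3, j * w // 3:(j + 1) * w // 3]
            b = curr[i * h // 3:(i + 1) * h // 3, j * w // 3:(j + 1) * w // 3]
            ha = cv2.calcHist([a], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
            hb = cv2.calcHist([b], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
            ga = cv2.cvtColor(a, cv2.COLOR_BGR2GRAY).astype(np.float32)
            gb = cv2.cvtColor(b, cv2.COLOR_BGR2GRAY).astype(np.float32)
            feats += [cv2.compareHist(ha, hb, cv2.HISTCMP_CHISQR),
                      abs(float(ga.mean() - gb.mean())),
                      abs(float(ga.var() - gb.var()))]
    return np.float32(feats)

# A trained linear classifier (w, b) then decides continuity:
# continuous = (w @ continuity_features(prev, curr) + b) >= 0
```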
  • the first convolutional network model may be used to determine the face action state of the face in each frame of the video image based on the face key points extracted from that frame; for example, in the case of blinking, the probability of the open-eye state or of the closed-eye state can be calculated to determine the face action state in the video image.
  • an image block (containing face information) can be extracted around the center of the key points corresponding to the blinking action, and the face action state can be obtained from it through the first convolutional network model. The facial motion of the face in the video image can then be determined based on the face action state in each video image.
  • some facial motions can be determined from a single face action state (e.g., smiling, opening the mouth, or pouting); for a video image in which such a face action state is detected, the facial motion of the corresponding face can be determined directly according to the processing of step S320 above.
  • step S320 may be performed by the processor invoking a corresponding instruction stored in the memory, or may be performed by video image detection module 501 being executed by the processor.
  • Step S330: when it is determined that the detected facial motion matches the corresponding predetermined facial motion, the face feature points in the face region corresponding to the detected facial motion are extracted.
  • for each video image including face information, the face includes certain feature points, such as the eyes, the nose, the mouth, and the facial contour.
  • the detection of the face in the video image and the determination of the feature point can be implemented in any suitable related art, which is not limited in the embodiment of the present application.
  • for example, linear feature extraction methods such as principal component analysis (PCA), linear discriminant analysis (LDA), and independent component analysis (ICA) may be used;
  • non-linear feature extraction methods such as kernel principal component analysis (Kernel PCA) and manifold learning may also be used; alternatively, a trained neural network model, such as the convolutional network model in the embodiments of the present application, can be used. The embodiments of the present application do not limit this.
  • for example, during a live broadcast, the face is detected in the live video image and the face feature points are determined; during playback of a recorded video, the face is detected in the played video image and the face feature points are determined; and during recording of a video, the face is detected in the recorded video image and the face feature points are determined.
  • Step S340: the presentation position of the business object to be presented in the video image is determined according to the face feature points.
  • steps S330-S340 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a presentation location determining module 502 that is executed by the processor.
  • on this basis, one or more presentation positions of the business object to be presented in the video image may be determined.
  • optional implementations for determining the presentation position of the business object to be presented in the video image according to the face feature points include, but are not limited to, the following: manner one, determining the presentation position of the business object to be presented in the video image according to the face feature points, using a pre-trained second convolutional network model for determining the presentation position of a business object in a video image; manner two, determining the presentation position of the business object to be presented in the video image according to the face feature points and the type of the business object to be presented.
  • a convolutional network model, namely the second convolutional network model, can be pre-trained so that the trained model has the function of determining the presentation position of a business object in a video image; alternatively, a convolutional network model already trained by a third party to have this function can be used directly.
  • here, training on the business object is taken as an example, but those skilled in the art should understand that the second convolutional network model can also be trained on the face at the same time as the business object, achieving joint training of faces and business objects.
  • an optional training method includes the following process:
  • the feature vector includes position information and/or confidence information of the business object in the sample image of the training sample, and a face feature vector corresponding to the face feature point in the face region corresponding to the facial motion in the sample image.
  • the confidence information of the business object indicates the probability that the business object achieves the intended effect (such as being noticed, clicked, or viewed) when displayed at the current position.
  • this probability may be set according to the statistical analysis results of historical data, according to the results of simulation experiments, or according to human experience.
  • according to actual needs, only the location information of the business object may be trained, or only the confidence information of the business object may be trained, or both the location information and the confidence information may be trained.
  • the training of the location information and the confidence information enables the trained second convolutional network model to more effectively and accurately determine the location information and confidence information of the business object, so as to provide a basis for the display of the business object.
  • the second convolutional network model is trained by a large number of sample images.
  • the second convolutional network model can be trained using sample images containing the business object; those skilled in the art should understand that, in addition to the business object, the training sample images should also contain information on the face action state (i.e., information for determining the facial motion of the face).
  • the business object in the sample image in the embodiments of the present application may be pre-labeled with the location information, or the confidence information, or both. Of course, in practical applications, this information can also be obtained through other means. By labeling the business object with the corresponding information in advance, the amount of data processing and the number of interactions can be effectively reduced, improving data processing efficiency.
  • sample images labeled with the location information and/or confidence information of the business object and with a certain face attribute are used as training samples, and feature vectors are extracted from them, containing the location information and/or confidence information of the business object as well as the face feature vector corresponding to the face feature points.
  • the second convolutional network model can be used to simultaneously train the face and the business object.
  • the feature vector of the sample image also includes the features of the face.
  • the obtained feature vector convolution result includes the location information and/or confidence information of the business object, as well as the convolution result of the face feature vector corresponding to the face action state.
  • the feature vector convolution result also contains information on the state of the face action.
  • the number of times of convolution processing on the feature vector can be set according to actual needs, that is, in the second convolutional network model, the number of layers of the convolution layer is set according to actual needs, and details are not described herein again.
  • the convolution result is the result of feature extraction of the feature vector, which can effectively represent the business object corresponding to the feature of the face in the video image.
  • when the feature vector includes both the location information and the confidence information of the business object, that is, when both are trained, the feature vector convolution result is shared in the subsequent judgments of the convergence conditions, with no repeated processing and calculation required; this helps reduce the resource cost of data processing and improve the speed and efficiency of data processing.
  • the convergence condition is appropriately set by a person skilled in the art according to actual needs.
  • when the information satisfies the convergence condition, the network parameters in the second convolutional network model can be considered properly set; when the information cannot satisfy the convergence condition, the network parameters are considered improperly set and need to be adjusted. The adjustment may be an iterative process, repeated until the result of convolving the feature vector with the adjusted network parameters satisfies the convergence condition.
  • the convergence condition may be set according to a preset standard position and/or a preset standard confidence. For example, the distance between the position indicated by the location information of the business object in the feature vector convolution result and the preset standard position satisfying a certain threshold may serve as the convergence condition for the location information of the business object; the difference between the confidence indicated by the confidence information of the business object in the feature vector convolution result and the preset standard confidence satisfying a certain threshold may serve as the convergence condition for the confidence information of the business object; and so on.
  • the preset standard position may be the average position obtained by averaging the positions of the business objects in the sample images to be trained; the preset standard confidence may be the average confidence obtained by averaging the confidences of the business objects in the sample images to be trained. Since the sample images are the samples to be trained and their quantity is large, setting the standard position and/or standard confidence according to the positions and/or confidences of the business objects in the sample images to be trained makes the resulting standards more objective and precise.
  • an optional manner includes:
  • obtain the confidence information of the corresponding business object from the feature vector convolution result, and calculate the Euclidean distance between the confidence indicated by that confidence information and the preset standard confidence, thereby obtaining the convergence measure for the confidence of the corresponding business object.
  • the Euclidean distance method is adopted, and the implementation is simple and can effectively indicate whether the convergence condition is satisfied.
  • the embodiments of the present application are not limited thereto; other metrics, such as the Mahalanobis distance or the Bhattacharyya distance, may also be adopted.
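  • As a sketch of the Euclidean-distance convergence test (the threshold value is assumed, and the same test applies to scalar confidences as to positions):

```python
# A sketch of the Euclidean-distance convergence test described above;
# the threshold value is an assumption.
import numpy as np

def standard_position(sample_positions):
    """Preset standard position: the mean over the training-sample positions."""
    return np.mean(np.asarray(sample_positions, np.float32), axis=0)

def converged(predicted, standard, threshold=1e-3):
    """True when the Euclidean distance to the preset standard is small."""
    predicted = np.atleast_1d(np.asarray(predicted, np.float32))
    standard = np.atleast_1d(np.asarray(standard, np.float32))
    return float(np.linalg.norm(predicted - standard)) <= threshold
```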
  • the method for determining whether the corresponding face feature vector in the feature vector convolution result satisfies the face convergence condition can be set by a person skilled in the art according to the actual situation, which is not limited by the embodiments of the present application.
  • if all the convergence conditions are satisfied, training of the second convolutional network model is complete; otherwise, as long as any convergence condition is not satisfied (for example, the location information and/or confidence information of the business object does not satisfy the convergence condition of the business object, and/or the face feature vector does not satisfy the face convergence condition), the network parameters of the second convolutional network model are adjusted, and the second convolutional network model is iteratively trained according to the adjusted network parameters until the location information and/or confidence information of the business object and the face feature vector after iterative training all satisfy their corresponding convergence conditions.
  • through the above training, the second convolutional network model can extract features from, and classify, business-object presentation positions based on the presented face, and thus acquires the function of determining the presentation position of a business object in a video image.
  • where there are multiple presentation positions, the second convolutional network model may also determine the ranking of the presentation effects among them, thereby determining the final presentation position. In subsequent applications, when a business object needs to be displayed, a valid presentation position can be determined based on the current image in the video.
  • the sample images may be pre-processed before training, which may include, for example: acquiring a plurality of sample images, each containing annotation information for the business object; determining the position of the business object according to the annotation information, and judging whether the distance between the determined position of the business object and the preset position is less than or equal to a set threshold; and determining the sample images whose business objects fall within the set threshold as the sample images to be trained.
  • the preset position and the set threshold may be appropriately set by any suitable means by a person skilled in the art, for example, according to the statistical analysis result of the data or the related distance calculation formula or the artificial experience, etc., which is not limited by the embodiment of the present application.
  • the sample image that does not meet the conditions can be filtered out to improve the accuracy of the training result.
  • the training of the second convolutional network model is implemented by the above process, and the trained second convolutional network model can be used to determine the presentation position of the business object in the video image.
  • for example, the second convolutional network model may indicate the optimal position for displaying the business object, such as the anchor's forehead, and control the live application to display the business object at that position; or, during a live broadcast, if the anchor clicks the business object to request that it be displayed, the second convolutional network model can determine the presentation position of the business object directly from the live video image.
  • the presentation position of the business object to be presented may be determined according to the set rules.
  • the presentation position of the business object to be presented may include, for example, at least one of, or any combination of, the following: the hair region of the character in the video image, the forehead region, the cheek region, the chin region, a body region other than the head, and a preset region in the video image.
  • the preset area may be appropriately set according to an actual situation, for example, an area within a setting range centering on a face area, or an area within a setting range other than a face area, or a background area, or the like.
  • the embodiment of the present application does not limit this.
  • the presentation location of the business object to be presented in the video image may be further determined.
  • for example, the center point of the presentation area corresponding to the presentation position may be used as the center point at which the business object is displayed; or a certain coordinate position within the presentation area corresponding to the presentation position may be determined as the center point of the presentation position, and so on. The embodiments of the present application do not limit this.
  • in manner two, the presentation position of the business object to be displayed may be determined not only according to the face feature points but also according to the type of the business object to be presented.
  • the type of the business object includes at least one of, or any combination of, the following: a forehead patch type, a cheek patch type, a chin patch type, a virtual hat type, a virtual clothing type, a virtual makeup type, a virtual headwear type, a virtual hair ornament type, a virtual jewelry type, a background type, a virtual pet type, and a virtual container type.
  • the type of the business object may be other suitable types, such as a virtual cap type, a virtual cup type, a text type, and the like.
  • an appropriate presentation position can be selected for the business object with reference to the face feature point.
  • at least one presentation position may be selected from the plurality of presentation positions as the presentation position of the business object to be presented in the video image. For example, a text-type business object can be displayed in the background area, or on the person's forehead or body area.
  • in addition, the correspondence between facial motions and presentation positions may be stored in advance; when the detected facial motion matches a corresponding predetermined facial motion, the target presentation position corresponding to that predetermined facial motion may be obtained from the pre-stored correspondence between facial motions and presentation positions, and used as the presentation position of the business object to be presented in the video image.
  • it should be noted that the facial motion is not necessarily related to the presentation position; the facial motion may merely be a way to trigger the presentation of the business object, with the presentation position independent of the face. The business object can be displayed in a certain area of the face, or in an area other than the face, such as the background area of the video image.
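  • A pre-stored correspondence of this kind reduces to a simple lookup table; the sketch below uses assumed action-to-region pairs for illustration, with a face-independent fallback since the action may be only a trigger.

```python
# Assumed facial-action -> presentation-position correspondence; the patent
# pre-stores such a mapping but does not fix its contents.
ACTION_TO_POSITION = {
    "pout":       "mouth_region",
    "blink":      "eye_region",
    "nod":        "forehead_region",
    "open_mouth": "mouth_region",
}

def presentation_position(matched_action):
    """Look up the target presentation position for a matched facial action,
    falling back to a face-independent region (the action may be a trigger
    only, unrelated to where the business object is drawn)."""
    return ACTION_TO_POSITION.get(matched_action, "background_region")
```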
  • Step S350: the business object to be presented is drawn at the presentation position by means of computer graphics.
  • step S350 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a business object rendering module 503 being executed by a processor.
  • when the business object is a sticker containing semantic information, such as an advertisement sticker, the related information of the business object, such as its identifier and size, can be obtained in advance. After the presentation position is determined, the business object can be scaled, rotated, and so on according to the coordinates of the presentation position, and then drawn through a corresponding drawing method, such as drawing with the OpenGL graphics rendering engine.
  • advertisements can also be displayed as 3D special effects, for example advertisement text or logos (LOGOs) displayed through particle effects.
  • for example, the advertising effect of a certain product can be displayed by dynamically decreasing the liquid in a cup. The advertising effect may include a plurality of display images of different states (for example, multiple frames in which the amount of liquid in the cup gradually decreases); the video frames composed of these images are drawn in sequence at the presentation position by a computer drawing method, such as drawing with the OpenGL graphics rendering engine, thereby displaying the dynamic effect of the liquid in the cup gradually decreasing.
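  • Such a multi-image dynamic effect amounts to advancing through the state images as frames are rendered; the sketch below reuses the hypothetical draw_business_object helper from the earlier alpha-blending sketch.

```python
# A sketch of the multi-image dynamic effect: each rendered video frame
# advances through the effect's state images (e.g. ever-lower liquid levels).
def render_effect_on_frame(frame, effect_states, frame_index, top_left):
    state_image = effect_states[frame_index % len(effect_states)]
    return draw_business_object(frame, state_image, top_left)
```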
  • the dynamic display of the advertising effect can attract viewers, make the advertisement and its display more engaging, and improve the efficiency of advertisement display.
  • the method for processing a video image provided by the embodiment of the present application triggers the presentation of a business object (such as an advertisement) by a facial action.
  • the business object to be presented is drawn at the presentation position by using a computer drawing method, and the business object is played with the video.
  • video data of a business object such as an advertisement that is not related to the video through the network, thereby saving network resources and/or system resources of the client;
  • the business object is closely combined with the facial motion in the video image, It is beneficial to preserve the main image and motion of the video subject (such as the anchor) in the video image, and can add interest to the video image, and does not disturb the user to watch the video normally, which is beneficial to reducing the user's business object displayed in the video image. Resentment can attract the attention of the audience to a certain extent and improve the influence of business objects.
  • FIG. 4 is a flow chart of still another embodiment of a method of processing a video image.
• in this embodiment, the video image processing solution of the present application is described by taking a two-dimensional sticker special effect containing advertisement information as an example.
  • the processing method of the video image in this embodiment includes:
• In step S401, a plurality of sample images containing face information are acquired as training samples, wherein the sample images are annotated with face action state information.
  • step S401 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a training sample acquisition module 504 being executed by the processor.
• In step S402, the first convolutional network model is trained using the training samples, obtaining a first convolutional network model for detecting the facial action state in an image.
  • step S402 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a first convolutional network model determination module 505 that is executed by the processor.
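The present application does not fix an architecture or training procedure for the first convolutional network model; purely as a hedged sketch, a PyTorch-style classification loop over the annotated face action states might look as follows, where the model, data loader, optimizer choice, and label set are all assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary of annotated face action states (step S401).
ACTION_STATES = ["neutral", "blink", "open_mouth", "pout", "nod"]

def train_first_model(model, loader, epochs=10, lr=1e-3):
    """Sketch of step S402: fit a CNN that predicts the face action state
    of a sample image."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, state_labels in loader:  # labels index into ACTION_STATES
            logits = model(images)           # shape: (batch, len(ACTION_STATES))
            loss = criterion(logits, state_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```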
• In step S403, feature vectors of the sample images of the above training samples are acquired.
  • the feature vector includes position information and/or confidence information of the business object in the sample image of the training sample, and a face feature vector corresponding to the face action state in the sample image.
  • the face action state in each sample image may be determined when training the first convolutional network model.
• there may be sample images among the training samples that do not meet the training standard of the second convolutional network model; this part of the sample images needs to be filtered out by preprocessing.
  • each sample image includes a business object, and each business object is marked with location information and confidence information.
  • the location information of the central point of the business object is used as the location information of the business object.
• the sample images are filtered according to the location information of the business object: after the coordinates of the location indicated by the location information are obtained, they are compared with the preset location coordinates for a business object of that type, and the position variance of the two is calculated. If the position variance is less than or equal to a set threshold, the sample image may be used as a sample image to be trained; if the position variance is greater than the set threshold, the sample image is filtered out.
  • the preset position coordinates and the set thresholds may be appropriately set by a person skilled in the art according to actual conditions.
• the images used for training the second convolutional network model generally have the same size, so the set threshold may be 1/20 to 1/5 of the length or width of the image; optionally, it may be 1/10 of the length or width of the image.
• the locations and confidences of the business objects in the retained sample images may be averaged to obtain an average position and an average confidence, which may later serve as a basis for determining the convergence condition; see the sketch below.
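Combining the filtering rule and the averaging step above, one possible sketch is shown below; the Euclidean distance test, the data layout, and the default 1/10 threshold ratio are assumptions standing in for the position-variance comparison described in the text.

```python
import numpy as np

def filter_and_average(samples, image_size, preset_xy, ratio=0.1):
    """Keep samples whose labelled business-object position lies within
    `ratio` * (larger image dimension) of the preset position for that object
    type, then average the retained positions and confidences; the means can
    later serve as a reference for the convergence condition."""
    threshold = ratio * max(image_size)  # e.g. 1/10 of image length or width
    kept = [s for s in samples
            if np.linalg.norm(np.asarray(s["xy"]) - np.asarray(preset_xy)) <= threshold]
    if not kept:
        return kept, None, None
    mean_xy = np.mean([s["xy"] for s in kept], axis=0)
    mean_conf = float(np.mean([s["confidence"] for s in kept]))
    return kept, mean_xy, mean_conf
```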
  • the sample image used for training in this embodiment needs to be labeled with the coordinates of the optimal advertisement position and the confidence of the advertisement position.
  • the size of the confidence indicates the probability that this ad slot is the best ad slot. For example, if this ad slot is mostly occluded, the confidence is low.
• the optimal advertisement position can be labeled on the face, in the foreground and background, and so on, so that joint training over advertisement points on facial feature points, foreground, background, and the like can be realized; compared with techniques that train separately on facial motion and the like, this solution helps save computing resources.
• step S403 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a feature vector acquisition module 506 that is executed by the processor.
• In step S404, the feature vector is subjected to convolution processing to obtain a feature vector convolution result.
  • step S404 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a convolution module 507 executed by the processor.
• In step S405, it is determined whether the location information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies the business object convergence condition, and whether the corresponding face feature vector in the feature vector convolution result satisfies the face convergence condition.
  • step S405 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a convergence condition determination module 508 that is executed by the processor.
• In step S406, if the convergence conditions in step S405 are satisfied, that is, the location information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies the business object convergence condition, and the corresponding face feature vector in the feature vector convolution result satisfies the face convergence condition, the training of the second convolutional network model is completed; otherwise, the network parameters of the second convolutional network model are adjusted, and the second convolutional network model is iteratively trained according to the adjusted network parameters until the location information and/or confidence information of the business object and the face feature vector after iterative training satisfy the corresponding convergence conditions.
• the two conditions may also be handled separately: if the location information and/or confidence information of the corresponding business object in the feature vector convolution result does not satisfy the business object convergence condition, the network parameters of the second convolutional network model are adjusted according to that location information and/or confidence information, and the model is iteratively trained with the adjusted parameters until the location information and/or confidence information of the business object after iterative training satisfies the business object convergence condition; if the corresponding face feature vector in the feature vector convolution result does not satisfy the face convergence condition, the network parameters are adjusted according to that face feature vector, and the model is iteratively trained with the adjusted parameters until the face feature vector after iterative training satisfies the face convergence condition. A training-loop sketch follows.
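As a hedged sketch of steps S404 to S406 only, the following PyTorch-style loop trains a model with one head for the business-object position/confidence and another for the face feature vector, stopping when both convergence conditions hold; the model interface, losses, and thresholds are assumptions, not the method fixed by the present application.

```python
import torch
import torch.nn.functional as F

def train_second_model(model, loader, optimizer,
                       obj_eps=1e-3, face_eps=1e-3, max_epochs=100):
    """Iteratively adjust the network parameters until both the business
    object convergence condition and the face convergence condition hold."""
    for _ in range(max_epochs):
        obj_loss_sum, face_loss_sum, batches = 0.0, 0.0, 0
        for images, pos_label, conf_label, face_label in loader:
            pos_pred, conf_pred, face_pred = model(images)
            obj_loss = (F.mse_loss(pos_pred, pos_label)
                        + F.mse_loss(conf_pred, conf_label))
            face_loss = F.mse_loss(face_pred, face_label)
            optimizer.zero_grad()
            (obj_loss + face_loss).backward()
            optimizer.step()  # adjust the network parameters, then iterate
            obj_loss_sum += obj_loss.item()
            face_loss_sum += face_loss.item()
            batches += 1
        if (obj_loss_sum / batches < obj_eps
                and face_loss_sum / batches < face_eps):
            break  # both convergence conditions are satisfied
```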
  • step S406 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a model training module 509 executed by the processor.
• through the above process, the trained second convolutional network model can be obtained.
  • the first convolutional network model and the second convolutional network model obtained by the above training may perform corresponding processing on the video image, and may specifically include the following steps S407 to S411.
• In step S407, the currently played video image containing face information is acquired.
• In step S408, face key points are extracted from the video image, and the first convolutional network model for detecting the facial action state in an image is used to determine the facial action state of the face in the video image according to the face key points; the facial motion of the face in the video image is then determined according to that state.
  • steps S407-S408 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a video image detection module 501 that is executed by the processor.
• In step S409, when it is determined that the detected facial motion matches the corresponding predetermined facial motion, facial feature points in the face region corresponding to the detected facial motion are extracted.
• In step S410, according to the facial feature points, the pre-trained second convolutional network model for determining the presentation position of a business object in a video image is used to determine the presentation position of the business object to be presented in the video image.
  • steps S409-S410 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a presentation location determining module 502 that is executed by the processor.
• In step S411, the business object to be presented is drawn at the presentation position in a computer drawing manner.
• step S411 may be performed by the processor invoking a corresponding instruction stored in the memory, or may be performed by the business object drawing module 503 run by the processor. A schematic pipeline sketch of steps S407 to S411 follows.
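Putting steps S407 to S411 together, a schematic Python sketch is shown below; the model interfaces, the keypoint detector, the trigger set, and the reuse of the `draw_sticker` routine sketched earlier are all assumptions for illustration.

```python
PREDETERMINED_ACTIONS = {"open_mouth", "blink", "pout"}  # assumed trigger set

def process_frame(frame, action_model, position_model, sticker,
                  detect_face_keypoints):
    """Schematic pipeline for steps S407 to S411 under assumed interfaces."""
    keypoints = detect_face_keypoints(frame)           # S408: face key points
    if keypoints is None:
        return frame                                   # no face information
    action = action_model.classify(frame, keypoints)   # first model: action
    if action not in PREDETERMINED_ACTIONS:
        return frame                                   # S409: no match, skip
    position = position_model.predict(keypoints)       # S410: second model
    return draw_sticker(frame, sticker, position)      # S411: computer drawing
```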
• with this embodiment, video images during video playback can be detected in real time, and an advertisement placement position with a better effect can be given without affecting the user's viewing experience, so the delivery effect is better; by combining the business object with video playback, there is no need to transmit over the network additional advertisement video data unrelated to the video, which helps save network resources and/or client system resources;
• because the business object to be presented is drawn at the presentation position by computer drawing, the business object is closely combined with the facial motion in the video image, which preserves the main image and motion of the video subject (such as the anchor), adds interest to the video image without disturbing normal viewing, helps reduce the user's aversion to business objects displayed in the video image, and can to some extent attract the audience's attention and increase the influence of the business object.
• business objects can also be widely applied to other fields, such as entertainment, appreciation, education, consulting, and services.
• any video image processing method provided by the embodiments of the present application may be performed by any suitable device having data processing capabilities, including but not limited to a terminal device, a server, and the like.
• any video image processing method provided by the embodiments of the present application may be performed by a processor; for example, the processor may perform any video image processing method mentioned in the embodiments of the present application by executing corresponding instructions stored in a memory. This will not be repeated below.
• the foregoing program may be stored in a computer readable storage medium; when executed, the program performs the steps of the foregoing method embodiments. The foregoing storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
  • FIG. 5 is a schematic structural diagram of an embodiment of a processing apparatus for video images of the present application.
• the video image processing apparatus of the embodiments of the present application can be used to implement each of the foregoing video image processing method embodiments of the present application.
• the video image processing apparatus of this embodiment includes: a video image detecting module 501, a presentation position determining module 502, and a business object drawing module 503, where:
  • the video image detecting module 501 is configured to perform facial motion detection of the face on the currently played video image including the face information.
  • the presentation location determining module 502 is configured to determine a presentation location of the business object to be presented in the video image when it is determined that the detected facial motion matches the corresponding predetermined facial motion.
  • the business object drawing module 503 is configured to draw a business object to be presented by using a computer drawing manner at the presentation position.
• the video image processing apparatus performs facial motion detection on the currently played video image containing face information and matches the detected facial motion against the corresponding predetermined facial motion; when the two match, it determines the presentation position of the business object to be presented in the video image and draws the business object at that position in a computer drawing manner. When the business object is used to display an advertisement, on the one hand, because the business object is drawn at the presentation position by computer drawing and is thereby combined with video playback, no additional advertisement video data unrelated to the video needs to be transmitted over the network, which helps save network resources and/or client system resources; on the other hand, the business object is closely combined with the facial motion in the video image, which helps preserve the main image and motion of the video subject (such as the anchor), adds interest to the video image without disturbing normal viewing, helps reduce the user's aversion to business objects displayed in the video image, and can to some extent attract the audience's attention and increase the influence of the business object.
• in an optional example, the video image detecting module 501 is configured to extract face key points from the currently played video image containing face information, use the pre-trained first convolutional network model for detecting the facial action state in an image to determine the facial action state of the face in the video image according to the face key points, and determine the facial motion of the face in the video image according to that state.
  • FIG. 6 is a schematic structural diagram of another embodiment of a processing apparatus for video images of the present application.
  • the processing apparatus of the video image further includes:
• the training sample obtaining module 504 is configured to acquire a plurality of sample images containing face information as training samples, wherein the sample images are annotated with face action state information.
• the first convolutional network model determining module 505 is configured to train the first convolutional network model using the training samples, obtaining a first convolutional network model for detecting the facial action state in an image.
• the training sample obtaining module 504 includes: a sample image acquiring unit, configured to acquire a plurality of sample images containing face information; a face positioning information determining unit, configured to detect a face and face key points in each sample image, and to position the face in the sample image using the face key points to obtain face positioning information; and a training sample determining unit, configured to use the sample images containing face positioning information as training samples.
• the presentation location determining module 502 includes: a feature point extraction unit, configured to extract facial feature points in the face region corresponding to the detected facial motion; and a presentation location determining unit, configured to determine, according to the facial feature points, the presentation position of the business object to be presented in the video image.
• the presentation location determining module 502 is configured to use, according to the facial feature points, the pre-trained second convolutional network model for determining the presentation position of a business object in a video image, to determine the presentation position of the business object to be presented in the video image.
  • the video image processing apparatus of still another embodiment further includes:
• the feature vector obtaining module 506 is configured to acquire feature vectors of the sample images of the training samples, where a feature vector includes: location information and/or confidence information of the business object in the sample image, and a face feature vector corresponding to the facial feature points in the face region corresponding to the facial motion in the sample image;
  • the convolution module 507 is configured to perform convolution processing on the feature vector to obtain a feature vector convolution result
• the convergence condition determining module 508 is configured to determine whether the location information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies the business object convergence condition, and to determine whether the corresponding face feature vector in the feature vector convolution result satisfies the face convergence condition;
• the model training module 509 is configured to: when the above convergence conditions are satisfied, that is, the location information and/or confidence information of the business object satisfies the business object convergence condition and the face feature vector satisfies the face convergence condition, complete the training of the second convolutional network model; otherwise, when the above convergence conditions are not satisfied, adjust the network parameters of the second convolutional network model and iteratively train the second convolutional network model according to the adjusted network parameters until the location information and/or confidence information of the business object and the face feature vector after iterative training satisfy the corresponding convergence conditions.
  • the presentation location determining module 502 is configured to determine, according to the facial feature point and the type of the business object to be presented, a presentation location of the business object to be presented in the video image.
  • the presentation location determining module 502 includes: a presentation location obtaining unit, configured to acquire, according to the facial feature point and the type of the business object to be presented, a plurality of presentation locations of the business object to be presented in the video image; And a presentation location selection unit, configured to select at least one presentation location from the plurality of presentation locations as a presentation location of the business object to be presented in the video image.
  • the presentation location determining module 502 is configured to obtain, from a correspondence between the pre-stored facial motion and the presentation location, a target presentation location corresponding to the predetermined facial motion as a presentation of the business object to be presented in the video image. position.
• the business object includes: a special effect containing semantic information; the video image is a live video image.
• the foregoing special effect containing semantic information includes at least one of the following forms of special effect containing advertisement information: a two-dimensional sticker special effect, a three-dimensional special effect, a particle special effect, and the like.
• the display location comprises at least one of, or any combination of, the following: a hair area, a forehead area, a cheek area, a chin area, or a body area other than the head of a person in the video image; a background area in the video image; an area within a set range centered on the area where a hand is located in the video image; a predetermined area in the video image; and the like.
• the type of the business object includes at least one of, or any combination of, the following types: a forehead patch type, a cheek patch type, a chin patch type, a virtual hat type, a virtual clothing type, a virtual makeup type, a virtual headwear type, a virtual hair accessory type, a virtual jewelry type, a background type, a virtual pet type, a virtual container type, and the like.
• the facial motion of the face includes at least one of, or any combination of, the following: blinking, kissing, opening the mouth, shaking the head, nodding, laughing, crying, frowning, closing the left eye, closing the right eye, closing both eyes, moving the eyeballs to the left, moving the eyeballs to the right, turning the head to the left, turning the head to the right, raising the head, lowering the head, and pouting.
• the electronic device can include a processor 902, a communication interface 904, a memory 906, and a communication bus 908, where:
  • Processor 902, communication interface 904, and memory 906 complete communication with one another via communication bus 908.
  • the communication interface 904 is configured to communicate with network elements of other devices, such as other clients or servers.
• the processor 902 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), one or more integrated circuits configured to implement the embodiments of the present application, or a graphics processing unit (GPU).
• the one or more processors included in the terminal device may be processors of the same type, such as one or more CPUs or one or more GPUs; or they may be processors of different types, such as one or more CPUs and one or more GPUs.
• the memory 906 is configured to store at least one executable instruction that causes the processor 902 to perform operations corresponding to the method of presenting a business object in a video image according to any of the above embodiments of the present application.
  • the memory 906 may include a high speed random access memory (RAM), and may also include a non-volatile memory such as at least one disk memory.
• the embodiment of the present application further provides another electronic device, including: a processor and the video image processing apparatus according to any of the foregoing embodiments; when the processor runs the video image processing apparatus, the units in the video image processing apparatus described in any of the above embodiments are operated.
• FIG. 8 is a schematic structural diagram of another embodiment of an electronic device according to the present application.
• the electronic device includes one or more processors and a communication unit, for example, one or more central processing units (CPUs) 801 and/or one or more graphics processors (GPUs) 813; the processor may execute various appropriate actions and processes according to executable instructions stored in a read only memory (ROM) 802 or executable instructions loaded from a storage portion 808 into a random access memory (RAM) 803.
  • Communication portion 812 can include, but is not limited to, a network card, which can include, but is not limited to, an IB (Infiniband) network card, and the processor can communicate with read only memory 802 and/or random access memory 803 to execute executable instructions over bus 804.
• the processor is connected to the communication unit 812 through the bus 804 and communicates with other target devices through the communication unit 812, thereby completing operations corresponding to any video image processing method provided by the embodiments of the present application, for example: performing facial motion detection on the currently played video image containing face information; determining the presentation position of the business object to be presented in the video image when the detected facial motion matches the corresponding predetermined facial motion; and drawing the business object to be presented at the presentation position in a computer drawing manner.
• the RAM 803 can store various programs and data required for the operation of the device.
  • the CPU 801, the ROM 802, and the RAM 803 are connected to each other through a bus 804.
  • ROM 802 is an optional module.
• the RAM 803 stores executable instructions, or executable instructions are written into the ROM 802 at runtime; the executable instructions cause the processor 801 to perform operations corresponding to the video image processing method described above.
  • An input/output (I/O) interface 805 is also coupled to bus 804.
• the communication unit 812 may be provided integrally, or may be provided with a plurality of sub-modules (for example, a plurality of IB network cards) linked on the bus.
• the following components are connected to the I/O interface 805: an input portion 806 including a keyboard, a mouse, and the like; an output portion 807 including, for example, a cathode ray tube (CRT), a liquid crystal display (LCD), and the like; a storage portion 808 including a hard disk and the like; and a communication portion 809 including a network interface card such as a LAN card or a modem. The communication portion 809 performs communication processing via a network such as the Internet.
• a drive 810 is also connected to the I/O interface 805 as needed.
• a removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read therefrom is installed into the storage portion 808 as needed.
• it should be noted that the architecture shown in FIG. 8 is only an optional implementation.
• in practice, the number and types of the components in FIG. 8 may be selected, deleted, added, or replaced according to actual needs; different functional components may be implemented separately or in an integrated manner, for example, the GPU and the CPU may be provided separately, or the GPU may be integrated on the CPU; the communication portion may be provided separately, or may be integrated on the CPU or the GPU; and so on. These alternative embodiments all fall within the scope of the present disclosure.
• an embodiment of the present disclosure includes a computer program product comprising a computer program tangibly embodied on a machine readable medium; the computer program comprises program code for executing the method illustrated in the flowchart, the program code including instructions corresponding to the method steps provided by the embodiments of the present application, for example: performing facial motion detection on the currently played video image containing face information; determining the presentation position of the business object to be presented in the video image when the detected facial motion matches the corresponding predetermined facial motion; and drawing the business object to be presented at the presentation position in a computer drawing manner.
• the embodiment of the present application further provides a computer program, including computer readable code; when the computer readable code runs on a device, a processor in the device executes instructions for implementing the steps of the video image processing method of any embodiment of the present application.
• the embodiment of the present application further provides a computer readable storage medium for storing computer readable instructions that, when executed, implement the operations of the steps in the video image processing method of any embodiment of the present application.
• the above method according to the present application can be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium (such as a CD ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk), or as computer code originally stored in a remote recording medium or a non-transitory machine readable medium and downloaded through a network to be stored in a local recording medium, so that the methods described herein can be processed by software stored on a recording medium using a general purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or an FPGA.
• as can be appreciated, a computer, processor, microprocessor controller, or programmable hardware includes a storage component (for example, RAM, ROM, or flash memory) that can store or receive software or computer code; the processing methods described herein are implemented when the software or computer code is accessed and executed by the computer, processor, or hardware. Moreover, when a general purpose computer accesses code for implementing the processing shown herein, execution of the code converts the general purpose computer into a special purpose computer for performing the processing shown herein.

Abstract

Provided in the embodiments of the present application are a video image processing method, apparatus and terminal device. The method comprises: performing human face facial movement detection on a video image currently being played back which contains human face information; determining a presentation position of a service object to be presented in the video image when a detected facial movement matches a corresponding predetermined facial movement; and drawing the service object to be presented in the presentation position using a computer drawing mode. Using the embodiments of the present application may save network resources and/or system resources of a client, make a video image more interesting, and avoid bothering a user when normally watching a video, thereby reducing the user's feelings of opposition to a service object presented in the video image, while attracting the attention of an audience to a certain extent, and increasing the impact of the service object.

Description

Video image processing method, device and electronic device
This application claims priority to Chinese patent application No. CN201610697502.3, entitled "Video image processing method, device and terminal device" and filed with the Chinese Patent Office on August 19, 2016, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to artificial intelligence technology, and in particular to a video image processing method, apparatus, and electronic device.
Background
With the development of Internet technology, people increasingly use the Internet to watch video, and Internet video offers business opportunities for many new services. Because it can serve as an important entry point for business traffic, Internet video is considered a premium resource for advertisement placement.
Existing video advertisements are mainly placed by implantation: an advertisement of fixed duration is inserted at a certain time during video playback, or an advertisement is placed at a fixed position in the playback area or its surroundings.
Summary of the invention
The embodiments of the present application provide a solution for processing a video image.
According to one aspect of the embodiments of the present application, a video image processing method is provided, including: performing facial motion detection on a currently played video image containing face information; when the detected facial motion matches a corresponding predetermined facial motion, determining a presentation position of a business object to be presented in the video image; and drawing the business object to be presented at the presentation position in a computer drawing manner.
According to another aspect of the embodiments of the present application, a video image processing apparatus is provided, including: a video image detection module, configured to perform facial motion detection on a currently played video image containing face information; a presentation position determination module, configured to determine a presentation position of a business object to be presented in the video image when the facial motion detected by the video image detection module matches a corresponding predetermined facial motion; and a business object drawing module, configured to draw the business object to be presented at the presentation position in a computer drawing manner.
According to still another aspect of the embodiments of the present application, an electronic device is provided, including: a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with one another through the communication bus; the memory is configured to store at least one executable instruction that causes the processor to perform operations corresponding to the video image processing method of any of the above embodiments of the present application.
According to yet another aspect of the embodiments of the present application, another electronic device is provided, including: a processor and the video image processing apparatus of any of the above embodiments of the present application; when the processor runs the video image processing apparatus, the units in the video image processing apparatus of any of the above embodiments of the present application are operated.
According to a further aspect of the embodiments of the present application, a computer program is provided, including computer readable code; when the computer readable code runs on a device, a processor in the device executes instructions for implementing the steps of the video image processing method of any of the above embodiments of the present application.
According to a still further aspect of the embodiments of the present application, a computer readable storage medium is provided for storing computer readable instructions that, when executed, implement the operations of the steps of the video image processing method of any of the above embodiments of the present application.
According to the video image processing method, apparatus, and electronic device provided by the embodiments of the present application, facial motion detection is performed on the currently played video image containing face information, and the detected facial motion is matched against the corresponding predetermined facial motion; when the two match, the presentation position of the business object to be presented in the video image is determined, and the business object is drawn at that position in a computer drawing manner. When the business object is used to display an advertisement, on the one hand, because the business object is drawn at the presentation position by computer drawing and is thereby combined with video playback, no additional advertisement video data unrelated to the video needs to be transmitted over the network, which helps save network resources and/or client system resources; on the other hand, the business object is closely combined with the facial motion in the video image, which helps preserve the main image and motion of the video subject (such as the anchor), adds interest to the video image without disturbing normal viewing, helps reduce the user's aversion to business objects displayed in the video image, and can to some extent attract the audience's attention and increase the influence of the business object.
The technical solutions of the present application are further described in detail below with reference to the accompanying drawings and embodiments.
Brief description of the drawings
The accompanying drawings, which constitute a part of the specification, describe embodiments of the present application and, together with the description, serve to explain the principles of the present application.
With reference to the accompanying drawings, the present application can be understood more clearly from the following detailed description, in which:
FIG. 1 is a flowchart of an embodiment of the video image processing method of the present application;
FIG. 2 is a flowchart of an embodiment of a method for obtaining the first convolutional network model in an embodiment of the present application;
FIG. 3 is a flowchart of another embodiment of the video image processing method of the present application;
FIG. 4 is a flowchart of still another embodiment of the video image processing method of the present application;
FIG. 5 is a schematic structural diagram of an embodiment of the video image processing apparatus of the present application;
FIG. 6 is a schematic structural diagram of another embodiment of the video image processing apparatus of the present application;
FIG. 7 is a schematic structural diagram of an embodiment of an electronic device of the present application;
FIG. 8 is a schematic structural diagram of another embodiment of an electronic device of the present application.
Detailed description
Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present application.
Meanwhile, it should be understood that, for ease of description, the dimensions of the parts shown in the drawings are not drawn to actual scale.
The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the present application or its application or use.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and devices should be considered part of the specification.
It should be noted that similar reference numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it need not be further discussed in subsequent figures.
Those skilled in the art can understand that terms such as "first" and "second" in the embodiments of the present application are used only to distinguish different steps, devices, modules, and the like; they denote neither any specific technical meaning nor a necessary logical order among them.
The embodiments of the present application can be applied to electronic devices such as terminal devices, computer systems, and servers, which can operate with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, small computer systems, mainframe computer systems, distributed cloud computing environments including any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, and servers can be described in the general context of computer system executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, target programs, components, logic, data structures, and the like that perform particular tasks or implement particular abstract data types. The computer system/server can be implemented in a distributed cloud computing environment, where tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media including storage devices.
FIG. 1 is a flowchart of an embodiment of the video image processing method of the present application. The video image processing method of each embodiment of the present application may be executed, for example, by an electronic device such as a computer system, a terminal device, or a server. Referring to FIG. 1, the video image processing method of this embodiment includes:
In step S110, facial motion detection is performed on the currently played video image containing face information.
In the embodiments of the present application, facial motions may include, for example, but are not limited to: blinking, opening the mouth, nodding, pouting, and the like. Face information may include, for example, information related to the face, eyes, mouth, nose, and/or hair. The video image may be an image in a live video being broadcast, or a video image that has been recorded or is being recorded.
In an optional example of the embodiments of the present application, taking live video as an example: there are currently multiple live broadcast platforms, such as the Huajiao live platform and the YY live platform; each live platform includes multiple live broadcast rooms, and each room includes at least one anchor, who can broadcast video through the camera of an electronic device to the fans in that room; the live video includes multiple video images. The subject of such a video image is usually one main person (the anchor) against a simple background, and the anchor often occupies a large area of the image. When a business object (such as an advertisement) needs to be inserted during the live video, the video image in the current live video can be obtained, and face detection can be performed on it through a preset face detection mechanism to determine whether the video image contains the anchor's face information; if it does, the video image is acquired or recorded for subsequent processing; if it does not, the same processing can be performed on the next frame of video image until a video image containing the anchor's face information is obtained.
In addition, the video image may also be a video image in a short video that has already been recorded. In this case, the user can play the short video using an electronic device; during playback, the electronic device can detect whether each frame of the video image contains the anchor's face information: if it does, the video image is acquired for subsequent processing; if it does not, the video image may be discarded or left unprocessed, and the next frame of video image is acquired to continue the above processing.
Likewise, where the video image is a video image being recorded, the user can use an electronic device during recording to detect whether each recorded frame contains the anchor's face information: if it does, the video image is acquired for subsequent processing; if it does not, the video image may be discarded or left unprocessed, and the next frame of video image is acquired to continue the above processing.
The electronic device playing the video image, or the electronic device used by the anchor, is provided with a mechanism for performing facial motion detection on video images. Through this mechanism, facial motion detection can be performed on the currently played video image containing face information to obtain the facial motion of the face detected from the video image. An optional process is as follows: the electronic device acquires the frame of video image currently being played and, through a preset mechanism, crops out the image of the face region from the video image; the image of the face region is then analyzed and features are extracted to obtain feature data of the various parts of the face region (including the eyes, mouth, face, and so on); by analyzing this feature data, the meaning of the facial motion of the face in the video image is determined, that is, which one or more of the following the facial motion belongs to: blinking, closing the left eye, closing the right eye, closing both eyes, moving the eyeballs to the left, moving the eyeballs to the right, turning the head to the left, turning the head to the right, raising the head, lowering the head, nodding, laughing, crying, frowning, opening the mouth, or pouting.
In an optional example, step S110 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by the video image detection module 501 run by the processor.
In step S120, when the detected facial motion matches the corresponding predetermined facial motion, the presentation position of the business object to be presented in the video image is determined.
In the embodiments of the present application, a business object is an object created according to a certain business requirement, and may include, for example, but is not limited to, information related to advertising, entertainment, weather forecasts, traffic forecasts, pets, and the like. The presentation position may be the center position of a specified area in the video image, or the coordinates of multiple edge positions of the specified area, and so on.
In an optional example of the embodiments of the present application, feature data of multiple different facial motions may be stored in advance, and the different facial motions may be labeled accordingly to distinguish the meaning each represents. Through the processing of step S110, the facial motion of the face can be detected from the video image, and the feature data of the detected facial motion can be compared with the pre-stored feature data of each facial motion; if the pre-stored facial motions include one whose feature data matches that of the detected facial motion, it can be determined that the detected facial motion matches the corresponding predetermined facial motion.
To improve matching accuracy, the above matching result may be determined by calculation. For example, a matching algorithm may be configured to compute the degree of match between the feature data of any two facial motions: using the configured matching algorithm, the feature data of the detected facial motion is matched against the pre-stored feature data of each facial motion, yielding a matching degree value for each pair. The largest matching degree value is selected from the results; if it exceeds a predetermined matching threshold, the pre-stored facial motion corresponding to the largest value is determined to match the detected facial motion. If the largest matching degree value does not exceed the predetermined matching threshold, the match fails, that is, the detected facial motion is not a predetermined facial motion, and the processing of step S110 can continue on subsequent video images.
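As a hedged illustration of this matching-degree computation only: the present application does not specify the metric, so the cosine similarity and the threshold value below are assumptions, and the feature data are assumed to be one-dimensional vectors.

```python
import numpy as np

def match_predetermined_action(detected_vec, stored_vecs, threshold=0.8):
    """Compare the detected facial-motion feature data against each pre-stored
    predetermined facial motion; return the best-matching action name if its
    matching degree reaches the threshold, otherwise None (match failed)."""
    best_name, best_score = None, -1.0
    detected = np.asarray(detected_vec, dtype=np.float64)
    for name, vec in stored_vecs.items():
        v = np.asarray(vec, dtype=np.float64)
        score = float(detected @ v /
                      (np.linalg.norm(detected) * np.linalg.norm(v) + 1e-9))
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```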
可选地,当确定检测到的面部动作与对应的预定面部动作相匹配时,可以先确定匹配到的面部动作所代表的含义,可以在预先设定的多个展现位置中选取与其含义相关或相应的展现位置,作为待展现的业务对象在视频图像中的展现位置。例如,以直播类视频为例,当检测到主播进行嘟嘴的面部动作时,可以将嘴部区域选取为与其相关或相应的展现位置。Optionally, when it is determined that the detected facial motion matches the corresponding predetermined facial motion, the meaning represented by the matched facial motion may be determined first, and the meaning of the matched facial motion may be selected in a plurality of preset display positions or The corresponding presentation position is used as the presentation position of the business object to be presented in the video image. For example, taking a live video as an example, when detecting that the anchor performs a facial motion of the beep, the mouth region may be selected as a related or related presentation position.
在一个可选示例中,步骤S120可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的展现位置确定模块502执行。In an alternative example, step S120 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a presentation location determining module 502 being executed by the processor.
In step S130, the business object to be presented is drawn at the presentation position by means of computer graphics.
It should be noted that, in order to enhance the visual effect of the business object and make the video image more engaging, a dynamic effect may be configured for the business object; for example, the business object may be presented as a video clip, or as a dynamic presentation composed of multiple display images, and so on.
For example, taking live-streaming video as an example, when a mouth-opening facial action by the streamer is detected, a corresponding business object, such as an advertisement image bearing a predetermined product logo, may be drawn in the region of the video image where the streamer's mouth is located. If a viewer is interested in the business object, the viewer may click the region where the business object is located; the viewer's electronic device may then obtain the network link corresponding to the business object, follow that link to a page related to the business object, and obtain resources related to the business object on that page.
In an optional example of the embodiments of the present application, the business object may be drawn by means of computer graphics, which may be implemented through appropriate computer graphics drawing or rendering, including but not limited to drawing based on an Open Graphics Library (OpenGL) graphics drawing engine. OpenGL defines a professional, cross-language, cross-platform graphics programming interface specification; it is hardware-independent and allows convenient drawing of two-dimensional (2D) and three-dimensional (3D) graphics. With an OpenGL graphics drawing engine, not only 2D effects such as 2D stickers can be drawn, but also 3D effects, particle effects, and so on. However, the present application is not limited to drawing based on an OpenGL graphics drawing engine; other approaches, such as drawing based on a game engine (e.g., Unity) or on the Open Computing Language (OpenCL), are equally applicable to the embodiments of the present application.
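For the 2D-sticker case, the following minimal sketch composites a sticker with an alpha channel onto a video frame at a given presentation position. It uses plain NumPy alpha blending rather than an OpenGL engine, purely to show the drawing step; the coordinate convention and the assumption that the sticker lies fully inside the frame are illustrative.

```python
import numpy as np

def draw_sticker(frame, sticker_rgba, top_left):
    """Alpha-blend an RGBA sticker onto a 3-channel video frame.

    frame: HxWx3 uint8 video image.
    sticker_rgba: hxwx4 uint8 sticker with alpha channel.
    top_left: (x, y) of the presentation position in frame coordinates.
    Assumes the sticker region lies fully inside the frame.
    """
    x, y = top_left
    h, w = sticker_rgba.shape[:2]
    roi = frame[y:y + h, x:x + w].astype(np.float32)
    rgb = sticker_rgba[..., :3].astype(np.float32)
    alpha = sticker_rgba[..., 3:4].astype(np.float32) / 255.0
    frame[y:y + h, x:x + w] = (alpha * rgb + (1.0 - alpha) * roi).astype(np.uint8)
    return frame
```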
In an optional example, step S130 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a business object drawing module 503 run by the processor.
In the video image processing method provided by the embodiments of the present application, facial action detection is performed on the currently played video image containing face information, and the detected facial action is matched against the corresponding predetermined facial action. When the two match, the presentation position of the business object to be presented in the video image is determined, and the business object to be presented is drawn at that position by means of computer graphics. Thus, when the business object is used to display an advertisement: on the one hand, because the business object to be presented is drawn at the presentation position by computer graphics and is combined with the video playback, no additional advertisement video data unrelated to the video needs to be transmitted over the network, which helps save network resources and/or system resources of the client; on the other hand, the business object is closely combined with the facial action in the video image, which helps preserve the main image and actions of the video subject (e.g., the streamer), adds interest to the video image without disturbing the user's normal viewing, helps reduce the user's aversion to the business object presented in the video image, can attract the audience's attention to a certain extent, and improves the influence of the business object.
In the first embodiment of the present application described above, the facial action detection on the video image in step S110 may be implemented with a corresponding feature extraction algorithm, or with a neural network model such as a convolutional network model. This embodiment takes a convolutional network model as an example for performing facial action detection on the video image; to this end, a first convolutional network model for detecting face action states in images may be trained in advance. FIG. 2 is a flowchart of one embodiment of a method for obtaining the first convolutional network model in an embodiment of the present application. The method for obtaining the first convolutional network model of this embodiment may be performed by any device having data collection, processing, and transmission functions, including but not limited to a mobile terminal and a personal computer (PC), which is not limited by the embodiments of the present application. To train the first convolutional network model, training samples may be obtained in various ways; the training samples may be multiple sample images containing face information, each annotated with face action state information, where the face action state information is used to determine the facial action of the face. Referring to FIG. 2, the method for obtaining the first convolutional network model of this embodiment includes:
In step S210, multiple sample images containing face information are acquired, where the sample images are annotated with face action state information.
In the embodiments of the present application, the face information may include local attribute information and global attribute information, among others. The local attribute information may include, for example but not limited to: hair color, hair length, eyebrow length, thick or sparse eyebrows, eye size, eyes open or closed, nose bridge height, mouth size, mouth open or closed, whether glasses are worn, whether a mask is worn, and so on. The global attribute information may include, for example but not limited to: ethnicity, gender, age, and so on. A sample image may be an image from a video or one of multiple continuously captured images, or any other image, and may include images containing faces and images not containing faces. The face action state is the current action state of the face, and may include, for example but not limited to, facial actions such as blinking, closing the left eye, closing the right eye, closing both eyes, eyes moving left, eyes moving right, turning the head left, turning the head right, raising the head, lowering the head, nodding, smiling, crying, frowning, opening the mouth, or pouting.
In the embodiments of the present application, the larger an image's resolution, the larger its data volume; when detecting face action states, more computing resources are needed and detection is slower. In view of this, in an optional implementation of the present application, the above sample images may be images satisfying a preset resolution condition. For example, the preset resolution condition may be: the longest side of the image does not exceed 640 pixels, the shortest side does not exceed 480 pixels, and so on.
Sample images may be obtained by an image capture device, where the image capture device used to capture the user's face information may be a dedicated camera or a camera integrated in another device, etc. In practice, however, due to differences in the hardware parameters and settings of image capture devices, the captured images may not satisfy the above preset resolution condition. To obtain sample images satisfying this condition, in an optional implementation of the present application, the captured images may be scaled after capture to obtain sample images that meet the condition.
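A minimal sketch of this scaling step, using the example condition above (longest side at most 640 pixels, shortest side at most 480 pixels); the choice of interpolation method is an assumption, not fixed by the text.

```python
import cv2

def scale_to_resolution_condition(img, max_long=640, max_short=480):
    """Downscale an image so that its longest side <= max_long and its
    shortest side <= max_short, preserving the aspect ratio."""
    h, w = img.shape[:2]
    long_side, short_side = max(h, w), min(h, w)
    scale = min(max_long / long_side, max_short / short_side, 1.0)
    if scale < 1.0:
        # INTER_AREA is an assumed choice, suitable for downscaling.
        img = cv2.resize(img, (int(w * scale), int(h * scale)),
                         interpolation=cv2.INTER_AREA)
    return img
```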
After the sample images are obtained, the face action state, such as smiling, pouting, closing the left eye, closing the right eye, or closing both eyes, may be annotated in each sample image, and each sample image may be stored together with its annotated face action state as training data.
To make the detection of face action states in the sample images more accurate, the face in each sample image may be localized so as to obtain the exact position of the face in the sample image; see the processing of step S220 below for details.
In step S220, for each sample image, the face and face key points in the sample image are detected, and the face in the sample image is localized through the face key points to obtain face localization information.
In the embodiments of the present application, every face has certain feature points, such as the corners of the eyes, the ends of the eyebrows, the corners of the mouth, and the tip of the nose, as well as boundary points of the face, etc. After the feature points of the face are obtained, a mapping or similarity transformation from the face in the sample image to a preset standard face can be computed through the face key points, and the face in the sample image is aligned with the standard face, thereby localizing the face in the sample image and obtaining the localization information of the face in the sample image.
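A minimal sketch of this alignment step, assuming key points are given as (x, y) arrays of matching landmarks; OpenCV's partial affine estimate is used here because it is restricted to a similarity transform (rotation, scale, translation), which is one way to realize the similarity transformation the text describes.

```python
import cv2
import numpy as np

def align_to_standard_face(img, face_keypoints, standard_keypoints,
                           out_size=(128, 128)):
    """Estimate a similarity transform from detected face key points to a
    preset standard face, and warp the image so the face is aligned.

    face_keypoints / standard_keypoints: Nx2 float arrays of corresponding
    landmarks (e.g., eye corners, nose tip, mouth corners); N >= 2.
    """
    src = np.asarray(face_keypoints, dtype=np.float32)
    dst = np.asarray(standard_keypoints, dtype=np.float32)
    # estimateAffinePartial2D fits rotation + uniform scale + translation.
    M, _inliers = cv2.estimateAffinePartial2D(src, dst)
    return cv2.warpAffine(img, M, out_size)
```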
In step S230, the sample images containing the face localization information are used as training samples.
In an optional example, steps S210 to S230 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a training sample acquisition module 504 run by the processor.
In step S240, the first convolutional network model is trained using the training samples, obtaining a first convolutional network model for detecting face action states in images.
In the embodiments of the present application, the front end of the first convolutional network model may include a combination of multiple convolutional layers, pooling layers, and non-linear layers, and its back end may be a loss layer, for example a loss layer based on algorithms such as a softmax cost function and/or a cross-entropy function.
The structure of the first convolutional network model may include:
Input layer: the input layer is used to read in the sample images and the annotated face action state information, etc. The input layer may preprocess the sample images and output face images containing localization information, face information, and the like. The input layer outputs the preprocessed face images to the convolutional layer, and at the same time inputs the preprocessed face information to the loss layer.
Convolutional layer: its input is a preprocessed face image or image features, and it outputs features of the face image through a predetermined linear transformation.
Non-linear layer: the non-linear layer may apply a non-linear transformation, through a non-linear function, to the features input from the convolutional layer, so that the features it outputs have stronger expressive power.
Pooling layer: the pooling layer can map multiple values to a single value. It can therefore strengthen the non-linearity of the learned features and reduce the spatial size of the output features, thereby enhancing the translation invariance of the learned features (i.e., invariance to face translation) while keeping the extracted features unchanged. The output features of the pooling layer can in turn serve as the input data of a convolutional layer or of a fully connected layer.
Here, the convolutional, non-linear, and pooling layers may be repeated one or more times, that is, the combination of convolutional layer, non-linear layer, and pooling layer may be repeated one or more times, where each time the output data of the pooling layer may again serve as input data to a convolutional layer. Repeating the three-layer combination of convolutional, non-linear, and pooling layers multiple times allows the input sample images to be processed better, so that the features of the sample images have the best expressive power.
Fully connected layer: it applies a linear transformation to the input data from the pooling layer, projecting the learned features into a better subspace to facilitate face action state prediction.
Non-linear layer: with the same function as the aforementioned non-linear layer, it applies a non-linear transformation to the input features of the fully connected layer. Its output features may serve as the input data of the loss layer, or again as the input data of a fully connected layer.
Here, the fully connected layer and the non-linear layer may be repeated one or more times.
One or more loss layers: mainly responsible for computing the error between the predicted face action state and the input (annotated) face action state.
The network parameters of the first convolutional network model may be trained through a backward-propagating gradient descent algorithm, so that the input layer need only take an image as input for the model to output the face action state information corresponding to the face in the input image, thereby obtaining the first convolutional network model.
Through the above process, the input layer is responsible for simple processing of the input sample images; the combination of convolutional, non-linear, and pooling layers is responsible for feature extraction from the sample images; the fully connected and non-linear layers map the extracted features to the face information; and the loss layer is responsible for computing the prediction error. Through the multi-layer design of the first convolutional network model described above, the extracted features can have rich expressive power and better predict face action states. In addition, connecting multiple pieces of face information to the loss layer at the same time ensures that multiple tasks are learned simultaneously, sharing the features learned by the convolutional network.
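A minimal PyTorch sketch of the layer arrangement described above: repeated (convolution, non-linearity, pooling) blocks at the front end, fully connected plus non-linear layers, and a softmax/cross-entropy loss layer at training time. The channel counts, kernel sizes, input resolution (96x96 aligned face crops), and number of action-state classes are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class FaceActionStateNet(nn.Module):
    """Front end: repeated (conv -> non-linear -> pooling) blocks; then
    fully connected + non-linear layers mapping features to predictions."""
    def __init__(self, num_states=16):  # number of action states is assumed
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Assumes 96x96 inputs, which the pooling reduces to 12x12 maps.
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(128 * 12 * 12, 256), nn.ReLU(),
            nn.Linear(256, num_states),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# One gradient-descent training step with a softmax/cross-entropy loss layer.
model = FaceActionStateNet()
loss_layer = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
images = torch.randn(4, 3, 96, 96)    # preprocessed, aligned face images
labels = torch.randint(0, 16, (4,))   # annotated face action states
loss = loss_layer(model(images), labels)
loss.backward()
optimizer.step()
```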
In an optional example, step S240 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a first convolutional network model determining module 505 run by the processor.
In this embodiment, the first convolutional network model obtained through training facilitates subsequent facial action detection on the currently played video image containing face information, and matching of the detected facial action against the corresponding predetermined facial action. When the two match, the presentation position of the business object to be presented in the video image is determined, and the business object to be presented is drawn at that position by means of computer graphics. Thus, when the business object is used to display an advertisement: on the one hand, because the business object to be presented is drawn at the presentation position by computer graphics and is combined with the video playback, no additional advertisement video data unrelated to the video needs to be transmitted over the network, which helps save network resources and/or system resources of the client; on the other hand, the business object is closely combined with the facial action in the video image, which helps preserve the main image and actions of the video subject (e.g., the streamer), adds interest to the video image without disturbing the user's normal viewing, helps reduce the user's aversion to the business object presented in the video image, can attract the audience's attention to a certain extent, and improves the influence of the business object.
FIG. 3 is a flowchart of another embodiment of the video image processing method of the present application. In this embodiment, the business object is a special effect containing semantic information. Illustratively, the special effect containing semantic information may include a special effect containing advertisement information in at least one or any combination of the following forms: a two-dimensional sticker effect, a three-dimensional effect, a particle effect, and so on. The video image may be a live-streaming video image, for example a video image of a streamer broadcasting live on the Huajiao live-streaming platform. As shown in FIG. 3, the video image processing method of this embodiment includes:
In step S310, the currently played video image containing face information is acquired.
For the specific processing of step S310, refer to the related content of step S110 in the embodiment shown in FIG. 1 above, which is not repeated here.
In step S320, face key points are extracted from the video image, and the facial action of the face in the video image is determined according to the face key points, using the pre-trained first convolutional network model for detecting face action states in images.
In implementation, the video image may be examined to determine whether it contains a face. If it is determined that the video image contains a face, face key points are extracted from the video image. The acquired video image and face key points may be input into the first convolutional network model, which is trained, for example, through the embodiment shown in FIG. 2 above. Through the network parameters of the first convolutional network model, processing such as feature extraction, mapping, and transformation may be applied to the video image to perform facial action detection on the face in the video image and obtain the face action state in the video image; based on the face action state, the facial action of the face contained in the video image can then be determined.
It should be noted that for a facial action obtained by combining multiple face action states (for example, a blink may be composed of eyes open, eyes closed, and eyes open, or of eyes closed, eyes open, and eyes closed), this type of facial action may be divided into multiple face action states. For example, a blink may be divided into an eyes-open state and an eyes-closed state. The above processing may then specifically be: extracting face key points from the video image; determining the face action states in the video images using the pre-trained first convolutional network model for detecting face action states in images; and determining the facial action of the face in the video images according to the face action states in the video images.
In this embodiment, multiple currently played video images containing face information may be acquired, and the continuity of the multiple video images may be judged to determine whether they are continuous in space and time. If they are judged to be discontinuous, the verification fails or the user is reminded to reacquire video images. When judging video image continuity, for example, each video frame may be divided into 3x3 regions; a color histogram and the mean and variance of the gray levels are computed for each region; the distances between the histograms, between the gray-level means, and between the gray-level variances of two adjacent face images are taken as a feature vector; and a linear classifier is used to judge whether its output is greater than or equal to zero. The parameters of the linear classifier may be trained from sample data carrying annotation information. If the linear classifier output is judged to be greater than or equal to zero, the two adjacent video images are considered continuous in time and space; in this case, the corresponding face action state may be determined based on the face key points extracted from each video image, so as to determine the facial action exhibited across the consecutive video images. If the linear classifier output is judged to be less than zero, the two adjacent video images are considered discontinuous in time and space; in this case, the processing of step S310 above may continue to be performed on subsequent video images, taking the current video image as the starting point.
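A minimal sketch of the continuity features described above; the histogram bin count (16 per channel) and the trained linear-classifier parameters (w, b) are assumptions.

```python
import cv2
import numpy as np

def region_stats(img, bins=16):
    """Per-region color histogram plus gray-level mean/variance over a
    3x3 grid of one frame (img: HxWx3 uint8)."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape
    feats = []
    for i in range(3):
        for j in range(3):
            patch = img[i*h//3:(i+1)*h//3, j*w//3:(j+1)*w//3]
            gpatch = gray[i*h//3:(i+1)*h//3, j*w//3:(j+1)*w//3]
            hist = cv2.calcHist([patch], [0, 1, 2], None, [bins]*3,
                                [0, 256]*3).flatten()
            hist /= hist.sum() + 1e-12
            feats.append((hist, gpatch.mean(), gpatch.var()))
    return feats

def continuity_features(frame_a, frame_b):
    """Feature vector: distances between histograms, gray means, and gray
    variances of corresponding regions in two adjacent frames."""
    vec = []
    for (ha, ma, va), (hb, mb, vb) in zip(region_stats(frame_a),
                                          region_stats(frame_b)):
        vec += [np.linalg.norm(ha - hb), abs(ma - mb), abs(va - vb)]
    return np.array(vec)

def is_continuous(frame_a, frame_b, w, b):
    # w, b: linear-classifier parameters trained on annotated sample data.
    return float(np.dot(w, continuity_features(frame_a, frame_b)) + b) >= 0.0
```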
If the multiple video images are continuous, the first convolutional network model may be used to judge the state of the facial action of the face in a given video frame, based on the face key points extracted from each video image. For example, taking blinking as an example, the probability of the eyes-open state or of the eyes-closed state may be computed to judge the face action state in that video image. To this end, an image patch (containing face information) may be extracted near the center of the key points corresponding to the blink action, and the judgment of the face action state may be obtained through the first convolutional network model. The facial action of the face in the video images may then be determined based on the face action state in each video image.
For cases where the corresponding facial action (such as smiling, opening the mouth, or pouting) can be determined from a single face action state, the facial action of the corresponding face can be determined, through the processing of step S320 above, from a detected video image bearing a face action state such as smiling, opening the mouth, or pouting.
In an optional example, step S320 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a video image detection module 501 run by the processor.
In step S330, when it is determined that the detected facial action matches the corresponding predetermined facial action, face feature points within the face region corresponding to the detected facial action are extracted.
In the embodiments of the present application, for each video image containing face information, the face contains certain feature points, such as the eyes, nose, mouth, and facial contour. Detecting the face in the video image and determining the feature points may be implemented in any appropriate manner from the related art, which is not limited by the embodiments of the present application. Examples include linear feature extraction methods such as principal component analysis (PCA), linear discriminant analysis (LDA), and independent component analysis (ICA), and non-linear feature extraction methods such as kernel principal component analysis (Kernel PCA) and manifold learning. A trained neural network model, such as the convolutional network model in the embodiments of the present application, may also be used to extract face feature points, which is not limited by the embodiments of the present application.
Taking live-streaming video as an example: during a live video broadcast, a face is detected and face feature points are determined from the video images of the live-streaming video; as another example, during playback of an already recorded video, a face is detected and face feature points are determined from the played video images; as yet another example, during recording of a video, a face is detected and face feature points are determined from the recorded video images, and so on.
In step S340, the presentation position of the business object to be presented in the video image is determined according to the face feature points.
In an optional example, steps S330 to S340 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a presentation position determining module 502 run by the processor.
In the embodiments of the present application, after the face feature points are determined, one or more presentation positions of the business object to be presented in the video image may be determined on that basis.
In the embodiments of the present application, when determining the presentation position of the business object to be presented in the video image according to the face feature points, optional implementations include but are not limited to the following: Approach one: using a pre-trained second convolutional network model for determining presentation positions of business objects in video images, determine the presentation position of the business object to be presented in the video image according to the face feature points; Approach two: determine the presentation position of the business object to be presented in the video image according to the face feature points and the type of the business object to be presented.
The two approaches are described separately below.
Approach one
When using approach one to determine the presentation position of the business object to be presented in the video image, a convolutional network model, i.e., the second convolutional network model, may be trained in advance; the trained second convolutional network model has the function of determining presentation positions of business objects in video images. Alternatively, a convolutional network model that has already been trained by a third party and has the function of determining presentation positions of business objects in video images may be used directly.
It should be noted that in the embodiments of the present application, training on business objects is taken as an example for description, but those skilled in the art will understand that the second convolutional network model may also be trained on faces while being trained on business objects, realizing joint training of faces and business objects.
When training the second convolutional network model, an optional training approach includes the following process:
(1) Obtain the feature vectors of the sample images of the training samples.
Here, the feature vectors contain the position information and/or confidence information of the business objects in the sample images of the training samples, as well as the face feature vectors corresponding to the face feature points within the face regions corresponding to the facial actions in the sample images. The confidence information of a business object indicates the probability that, when the business object is presented at the current position, the intended effect (e.g., being noticed, clicked, or viewed) can be achieved; this probability may be set according to statistical analysis of historical data, according to the results of simulation experiments, or according to manual experience. In practical applications, depending on actual needs, only the position information of the business objects may be trained, only the confidence information may be trained, or both the position information and the confidence information may be trained. Training both the position information and the confidence information enables the trained second convolutional network model to determine the position information and confidence information of business objects more effectively and accurately, so as to provide a basis for presenting business objects.
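As an illustration of what one annotated training sample might carry under this scheme, the following sketch is purely hypothetical: the field names are assumptions, and the convention of using the business object's center point as its position follows the optional implementation described later in this document.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class BusinessObjectSample:
    """One annotated training sample for the second convolutional network
    model (field names are illustrative assumptions)."""
    image: np.ndarray           # sample image with a face and a business object
    face_keypoints: np.ndarray  # Nx2 feature points in the action's face region
    object_position: tuple      # (x, y) center point of the business object
    object_confidence: float    # probability of intended effect, in [0, 1]
```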
The second convolutional network model is trained with a large number of sample images. In the embodiments of the present application, the second convolutional network model may be trained using sample images containing business objects. Those skilled in the art will understand that, in addition to business objects, the sample images used for training should also contain face action state information (i.e., the information used to determine the facial action of the face). In addition, the business objects in the sample images in the embodiments of the present application may be pre-annotated with position information, or confidence information, or both position information and confidence information. Of course, in practical applications, this information may also be obtained through other means. By annotating the business objects with the corresponding information in advance, the data and number of interactions in data processing can be effectively saved, improving data processing efficiency.
Sample images having the position information and/or confidence information of business objects, as well as a certain face attribute, are used as training samples, and feature vector extraction is performed on them to obtain feature vectors containing the position information and/or confidence information of the business objects, as well as the face feature vectors corresponding to the face feature points.
Optionally, the second convolutional network model may be used to train on faces and business objects simultaneously; in this case, the feature vectors of the sample images also contain the features of the faces.
The extraction of the feature vectors may be implemented in an appropriate manner from the related art, which is not described again in the embodiments of the present application.
(2) Perform convolution processing on the feature vectors to obtain feature vector convolution results.
In implementation, the obtained feature vector convolution results contain the position information and/or confidence information of the business objects, and the convolution results corresponding to the face feature vectors corresponding to the face action states. In the case of joint training on faces and business objects, the feature vector convolution results also contain face action state information.
The number of convolution operations applied to the feature vectors may be set according to actual needs; that is, in the second convolutional network model, the number of convolutional layers is set according to actual needs, which is not described again here.
The convolution result is the outcome of feature extraction on the feature vectors; this result can effectively characterize the business objects corresponding to the features of the faces in the video images.
In the embodiments of the present application, when the feature vectors contain both the position information and the confidence information of the business objects, that is, when both the position information and the confidence information of the business objects are trained, the feature vector convolution results are shared in the subsequent separate judgments of the convergence conditions, without requiring repeated processing and computation, which helps reduce the resource consumption caused by data processing and helps improve data processing speed and efficiency.
(3) Judge whether the position information and/or confidence information of the corresponding business objects in the feature vector convolution results satisfy the business object convergence condition, and judge whether the corresponding face feature vectors in the feature vector convolution results satisfy the face convergence condition.
The convergence conditions are appropriately set by those skilled in the art according to actual requirements. When the information satisfies the convergence conditions, the network parameters of the second convolutional network model may be considered appropriately set; when the information cannot satisfy the convergence conditions, the network parameters of the second convolutional network model may be considered inappropriately set and in need of adjustment. The adjustment may be an iterative process, continuing until the result of the convolution processing of the feature vectors with the adjusted network parameters satisfies the convergence conditions.
In one optional approach, the convergence conditions may be set according to a preset standard position and/or a preset standard confidence. For example, the convergence condition for the position information of a business object may be that the distance between the position indicated by the position information of the business object in the feature vector convolution result and the preset standard position satisfies a certain threshold; the convergence condition for the confidence information of a business object may be that the difference between the confidence indicated by the confidence information of the business object in the feature vector convolution result and the preset standard confidence satisfies a certain threshold, and so on.
Here, optionally, the preset standard position may be an average position obtained by averaging the positions of the business objects in the sample images to be trained; the preset standard confidence may be an average confidence obtained by averaging the confidences of the business objects in the sample images to be trained. Since the sample images are the samples to be trained and the data volume is large, the standard position and/or standard confidence may be set according to the positions and/or confidences of the business objects in the sample images to be trained, so that the set standard position and standard confidence are more objective and accurate.
When specifically judging whether the position information and/or confidence information of the corresponding business objects in the feature vector convolution results satisfy the convergence conditions, one optional approach includes:
obtaining the position information of the corresponding business object in the feature vector convolution result; computing the Euclidean distance between the position indicated by the position information of the corresponding business object and the preset standard position to obtain a first distance between the two; and judging, according to the first distance, whether the position information of the corresponding business object satisfies the convergence condition;
and/or,
obtaining the confidence information of the corresponding business object in the feature vector convolution result; computing the Euclidean distance between the confidence indicated by the confidence information of the corresponding business object and the preset standard confidence to obtain a second distance between the two; and judging, according to the second distance, whether the confidence information of the corresponding business object satisfies the convergence condition. Using the Euclidean distance is simple to implement and can effectively indicate whether the convergence condition is satisfied. However, the embodiments of the present application are not limited to this; other measures, such as the Mahalanobis distance or the Bhattacharyya distance, may also be used.
Optionally, as mentioned above, the preset standard position is an average position obtained by averaging the positions of the business objects in the sample images to be trained; and/or the preset standard confidence is an average confidence obtained by averaging the confidences of the business objects in the sample images to be trained.
The face convergence condition to be satisfied by the corresponding face feature vectors in the feature vector convolution results may be set by those skilled in the art according to the actual situation, which is not limited by the embodiments of the present application.
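A minimal sketch of the Euclidean-distance convergence checks above; the averaging of the standard position/confidence follows the text, while the concrete threshold values are assumptions.

```python
import numpy as np

def standard_position(positions):
    """Preset standard position: the average over the business-object
    positions in the sample images to be trained."""
    return np.mean(np.asarray(positions, dtype=np.float64), axis=0)

def position_converged(predicted_xy, standard_xy, threshold=5.0):
    # First distance: Euclidean distance between the predicted position
    # and the preset standard position, compared against a threshold.
    return np.linalg.norm(np.asarray(predicted_xy) - standard_xy) <= threshold

def confidence_converged(predicted_conf, standard_conf, threshold=0.05):
    # Second distance: for scalar confidences, the Euclidean distance
    # reduces to the absolute difference.
    return abs(predicted_conf - standard_conf) <= threshold
```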
(4) If all of the above convergence conditions are satisfied, i.e., the position information and/or confidence information of the business objects satisfies the business object convergence condition and the face feature vectors satisfy the face convergence condition, the training of the second convolutional network model is completed. Otherwise, as long as any convergence condition is not satisfied, for example the position information and/or confidence information of the business objects does not satisfy the business object convergence condition and/or the face feature vectors do not satisfy the face convergence condition, the network parameters of the second convolutional network model are adjusted, and the second convolutional network model is iteratively trained according to the adjusted network parameters, until the position information and/or confidence information of the business objects and the face feature vectors after the iterative training all satisfy the corresponding convergence conditions.
By performing the above training on the second convolutional network model, the second convolutional network model can perform feature extraction and classification on the presentation positions of business objects presented based on faces, and thus has the function of determining presentation positions of business objects in video images. When there are multiple presentation positions, through the above training on business object confidence, the second convolutional network model can also determine the ranking of the presentation effects among the multiple presentation positions, thereby determining the final presentation position. In subsequent applications, when a business object needs to be presented, a valid presentation position can be determined from the current image in the video.
In addition, before performing the above training on the second convolutional network model, the sample images may also be preprocessed in advance, which may include, for example: acquiring multiple sample images, where each sample image contains annotation information of a business object; determining the position of the business object according to the annotation information, and judging whether the distance between the determined position of the business object and a preset position is less than or equal to a set threshold; and determining the sample images corresponding to business objects whose distance is less than or equal to the set threshold as the sample images to be trained. The preset position and the set threshold may both be appropriately set by those skilled in the art in any suitable manner, for example according to statistical analysis of data, a related distance calculation formula, or manual experience, which is not limited by the embodiments of the present application.
By preprocessing the sample images in advance, sample images that do not meet the conditions can be filtered out, improving the accuracy of the training results.
Through the above process, the training of the second convolutional network model is achieved, and the trained second convolutional network model can be used to determine presentation positions of business objects in video images. For example, during a live video broadcast, if the streamer clicks a business object to instruct presentation of the business object, then after the second convolutional network model obtains the facial feature points of the streamer in the live video image, it can indicate the optimal position for presenting the business object, such as the streamer's forehead, and thereby control the live-streaming application to present the business object at that position; or, during a live video broadcast, if the streamer clicks a business object to instruct presentation of the business object, the second convolutional network model can determine the presentation position of the business object directly from the live video image.
Approach two
The presentation position of the business object to be presented in the video image is determined according to the face feature points and the type of the business object to be presented.
In this implementation, after the face feature points are acquired, the presentation position of the business object to be presented may be determined according to set rules. Determining the presentation position of the business object to be presented may include, for example, at least one or any combination of the following: the hair region, forehead region, cheek region, or chin region of a person in the video image; a body region other than the head; a background region in the video image; a region within a set range centered on the region where a hand is located in the video image; a preset region in the video image; and so on. The preset region may be set appropriately according to the actual situation, for example a region within a set range centered on the face region, or a region within a set range outside the face region, or a background region, etc., which is not limited by the embodiments of the present application.
After the presentation position is determined, the presentation position of the business object to be presented in the video image may be further refined. For example, the business object may be presented with the center point of the presentation region corresponding to the presentation position as the center point of the business object's presentation position; as another example, a certain coordinate position within the presentation region corresponding to the presentation position may be determined as the center point of the presentation position, etc., which is not limited by the embodiments of the present application.
In an optional implementation, when determining the presentation position of the business object to be presented in the video image, the presentation position may be determined not only according to the face feature points but also according to the type of the business object to be presented. The types of business objects include at least one or any combination of the following: a forehead patch type, a cheek patch type, a chin patch type, a virtual hat type, a virtual clothing type, a virtual makeup type, a virtual headwear type, a virtual hair accessory type, a virtual jewelry type, a background type, a virtual pet type, and a virtual container type. The types are not limited to these; the type of a business object may also be another suitable type, such as a virtual bottle cap type, a virtual cup type, a text type, and so on.
Thus, according to the type of the business object, an appropriate presentation position can be selected for the business object, with the face feature points as reference.
In addition, when multiple presentation positions of the business object to be presented in the video image are obtained according to the face feature points and the type of the business object to be presented, at least one presentation position may be selected from the multiple presentation positions as the presentation position of the business object to be presented in the video image. For example, a text-type business object may be presented in the background region, or on the person's forehead or body region, etc.
In addition, in another example of the embodiments of the present application, a correspondence between facial actions and presentation positions may be stored in advance. When it is determined that the detected facial action matches the corresponding predetermined facial action, the target presentation position corresponding to the predetermined facial action may be obtained from the pre-stored correspondence between facial actions and presentation positions, as the presentation position of the business object to be presented in the video image. It should be noted that although the above correspondence between facial actions and presentation positions exists, there is no necessary relationship between a facial action and a presentation position: the facial action is merely one way of triggering the presentation of the business object, and there is likewise no necessary relationship between the presentation position and the face. That is, the business object may be presented in a certain region of the face, or in a region other than the face, such as the background region of the video image.
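A minimal sketch of the pre-stored correspondence between predetermined facial actions and target presentation positions; the concrete action names and region labels are illustrative assumptions, and, as noted above, a position need not lie on the face.

```python
# Pre-stored correspondence between predetermined facial actions and
# target presentation positions (entries are illustrative assumptions).
ACTION_TO_POSITION = {
    "pout": "mouth_region",
    "open_mouth": "mouth_region",
    "smile": "cheek_region",
    "blink": "background_region",  # presentation need not be on the face
}

def target_presentation_position(matched_action):
    """Return the target presentation position for a matched predetermined
    facial action, or None if no correspondence is stored for it."""
    return ACTION_TO_POSITION.get(matched_action)
```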
In step S350, the business object to be presented is drawn at the presentation position by means of computer graphics.
In an optional example, step S350 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a business object drawing module 503 run by the processor.
Based on step S350, when the business object is a sticker containing semantic information, such as an advertisement sticker, related information of the business object, such as its identifier and size, may be obtained before drawing the business object. After the presentation position is determined, the business object may be adjusted, for example by scaling and rotation, according to the coordinates of the presentation position, and then drawn through a corresponding drawing method, such as that of an OpenGL graphics drawing engine. In some cases, the advertisement may also be presented in the form of a three-dimensional effect, for example presenting the advertisement's text or logo (LOGO) through a particle effect. For example, when the streamer opens his or her mouth, an advertising effect for a certain product may be presented by dynamically and gradually reducing the liquid in a cup; the advertising effect may consist of video frames composed of multiple display images of different states (for example, multiple frames in which the amount of liquid in the cup gradually decreases), and the corresponding images of the video frames are drawn in sequence at the presentation position through computer graphics methods such as drawing with an OpenGL graphics drawing engine, thereby presenting the dynamic effect of the liquid in the cup gradually decreasing. In this way, a dynamic presentation of the advertising effect is achieved, which can attract the audience to watch, make advertisement placement and display more engaging, and improve the efficiency of advertisement placement and display. The video image processing method provided by the embodiments of the present application triggers the presentation of business objects (such as advertisements) through facial actions. On the one hand, because the business object to be presented is drawn at the presentation position by computer graphics and is combined with the video playback, no additional video data of business objects such as advertisements unrelated to the video needs to be transmitted over the network, which helps save network resources and/or system resources of the client; on the other hand, the business object is closely combined with the facial action in the video image, which helps preserve the main image and actions of the video subject (e.g., the streamer), adds interest to the video image without disturbing the user's normal viewing, helps reduce the user's aversion to the business object presented in the video image, can attract the audience's attention to a certain extent, and improves the influence of the business object.
FIG. 4 is a flowchart of still another embodiment of the video image processing method. In this embodiment, the video image processing scheme of the embodiments of the present application is described by taking as an example a business object that is a two-dimensional sticker effect containing advertisement information. As shown in FIG. 4, the video image processing method of this embodiment includes:
In step S401, a plurality of sample images including face information are acquired as training samples, where the sample images are annotated with information on face action states.
In an optional example, step S401 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a training sample acquisition module 504 run by the processor.
In step S402, the first convolutional network model is trained using the training samples to obtain a first convolutional network model for detecting face action states in images.
In an optional example, step S402 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a first convolutional network model determination module 505 run by the processor.
The content of steps S401 to S402 above is the same as the corresponding content in the embodiment shown in FIG. 2, and is not repeated here.
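As a non-limiting illustration only, the following PyTorch sketch shows how a first convolutional network model of this kind could be trained on face images annotated with action states. The network structure, input size, and training hyperparameters are assumptions made for the sketch; the application does not prescribe them here.

import torch
import torch.nn as nn

# Illustrative stand-in for the first convolutional network model: a small
# CNN that classifies a 64x64 RGB face crop into one of N action states
# (e.g. mouth open, eyes closed). The real network structure is described
# elsewhere in the application; this architecture is an assumption.
class FaceActionStateNet(nn.Module):
    def __init__(self, num_states: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * 16 * 16, num_states)  # 64x64 inputs

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

def train_first_model(model, loader, epochs=10, lr=1e-3):
    # loader yields (face_image_batch, action_state_label_batch) pairs
    # built from the annotated training samples of step S401.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, state_labels in loader:
            opt.zero_grad()
            loss = loss_fn(model(images), state_labels)
            loss.backward()
            opt.step()
    return model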
In step S403, feature vectors of the sample images of the above training samples are acquired.
The feature vector includes position information and/or confidence information of the business object in the sample image of the training sample, and a face feature vector corresponding to the face action state in the sample image.
The face action state in each sample image may be determined when the first convolutional network model is trained.
In implementation, some sample images of the training samples do not meet the training standard of the second convolutional network model, and these sample images need to be filtered out by preprocessing the sample images.
First, in this embodiment, each sample image contains a business object, and each business object is annotated with position information and confidence information. In an optional implementation, the position information of the center point of the business object is used as the position information of that business object. In this step, the sample images are filtered according to the position information of the business objects. After the coordinates of the position indicated by the position information are obtained, the coordinates are compared with preset position coordinates for that type of business object, and the position variance between the two is calculated. If the position variance is less than or equal to a set threshold, the sample image may be used as a sample image to be trained; if the position variance is greater than the set threshold, the sample image is filtered out. The preset position coordinates and the set threshold may both be appropriately set by those skilled in the art according to the actual situation; for example, the images used for training the second convolutional network model generally have the same size, so the set threshold may be 1/20 to 1/5 of the length or width of the image, and optionally 1/10 of the length or width of the image.
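A minimal sketch of this filtering step follows, assuming the position variance is computed as the Euclidean distance between the annotated center coordinates of the business object and the preset coordinates for its type; the data layout and the exact variance measure are assumptions, and the threshold defaults to 1/10 of the image side as suggested above.

import numpy as np

def filter_training_samples(samples, preset_xy, image_size, frac=0.1):
    # samples: annotated records, each carrying the center coordinates of
    # its business object; preset_xy: preset position coordinates for this
    # type of business object; frac: fraction of the larger image side used
    # as the threshold (1/20 to 1/5 per the embodiment, 1/10 by default).
    threshold = frac * max(image_size)
    kept = []
    for sample in samples:
        deviation = np.linalg.norm(
            np.asarray(sample["center"], dtype=np.float32)
            - np.asarray(preset_xy, dtype=np.float32))
        if deviation <= threshold:
            kept.append(sample)  # meets the training standard
        # otherwise the sample image is filtered out
    return kept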
In addition, the positions and confidences of the business objects in the retained sample images may be averaged to obtain an average position and an average confidence, which may serve as a basis for subsequently determining the convergence conditions.
Taking an advertisement sticker as an example of the business object, the sample images used for training in this embodiment need to be annotated with the coordinates of the optimal advertisement position and the confidence of that advertisement slot. The confidence indicates the probability that the advertisement slot is the optimal one; for example, if the advertisement slot is largely occluded, its confidence is low. The optimal advertisement position may be annotated on the face, the foreground, the background, and other places, so that joint training of advertisement slots on facial feature points, the foreground, the background, and the like can be realized, which helps save computing resources compared with a scheme trained separately on a single technique such as facial actions.
In an optional example, step S403 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a feature vector acquisition module 506 run by the processor.
In step S404, convolution processing is performed on the feature vectors to obtain feature vector convolution results.
In this step, when the feature vector is subjected to convolution processing, convolution processing is performed on the feature vector corresponding to the position information and/or confidence information of the business object in the sample image, and convolution processing is also performed on the face feature vector corresponding to the face feature points in each sample image, to obtain the corresponding feature vector convolution results respectively.
In an optional example, step S404 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a convolution module 507 run by the processor.
In step S405, it is determined whether the position information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies a business object convergence condition, and whether the corresponding face feature vector in the feature vector convolution result satisfies a face convergence condition.
In an optional example, step S405 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a convergence condition determination module 508 run by the processor.
In step S406, if both convergence conditions in step S405 are satisfied, that is, the position information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies the business object convergence condition, and the corresponding face feature vector in the feature vector convolution result satisfies the face convergence condition, the training of the second convolutional network model is completed; otherwise, the network parameters of the second convolutional network model are adjusted, and the second convolutional network model is iteratively trained according to the adjusted network parameters until the position information and/or confidence information of the business object and the face feature vector after the iterative training both satisfy the corresponding convergence conditions.
In this embodiment, if the position information and/or confidence information of the corresponding business object in the feature vector convolution result does not satisfy the business object convergence condition, the network parameters of the second convolutional network model are adjusted according to that position information and/or confidence information, and the second convolutional network model is iteratively trained according to the adjusted network parameters until the position information and/or confidence information of the business object after the iterative training satisfies the business object convergence condition. If the corresponding face feature vector in the feature vector convolution result does not satisfy the face convergence condition, the network parameters of the second convolutional network model are adjusted according to that face feature vector, and the second convolutional network model is iteratively trained according to the adjusted network parameters until the face feature vector after the iterative training satisfies the face convergence condition.
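The dual convergence check of steps S405 to S406 can be sketched as the following PyTorch training loop, assuming a two-branch second convolutional network model, mean-squared-error losses, and fixed tolerances; none of these specifics are fixed by the application, and the loss targets (for example, the average position and average confidence mentioned above) are likewise assumptions.

import torch

def train_second_model(model, loader, opt, pos_tol, face_tol, max_iters=10000):
    # Iterate until both convergence conditions hold: the business object
    # position/confidence branch is close enough to its target, and the
    # face feature branch is close enough to its target.
    mse = torch.nn.MSELoss()
    for step, (feat, pos_target, face_target) in enumerate(loader):
        pos_pred, face_pred = model(feat)        # two output branches
        pos_loss = mse(pos_pred, pos_target)     # business object condition
        face_loss = mse(face_pred, face_target)  # face condition
        if pos_loss.item() <= pos_tol and face_loss.item() <= face_tol:
            break  # both convergence conditions satisfied; training done
        # Otherwise adjust the network parameters and keep iterating; only
        # the branch that has not yet converged contributes to the update.
        loss = (pos_loss if pos_loss.item() > pos_tol else 0) \
             + (face_loss if face_loss.item() > face_tol else 0)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step >= max_iters:
            break
    return model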
In an optional example, step S406 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a model training module 509 run by the processor.
For the specific processing of steps S404 to S406 above, refer to the related content in the embodiment shown in FIG. 3, which is not repeated here.
The trained second convolutional network model can be obtained through the processing of steps S403 to S406 above. For the structure of the second convolutional network model, refer to the structure of the first convolutional network model in the embodiment shown in FIG. 2, which is not repeated here.
The first convolutional network model and the second convolutional network model obtained through the above training can then be used to process video images accordingly, which may specifically include the following steps S407 to S411.
In step S407, a currently played video image containing face information is acquired.
In step S408, face key points are extracted from the video image, the pre-trained first convolutional network model for detecting face action states in images is used to determine the face action state in the video image, and the facial action of the face in the video image is determined according to the face action state.
In an optional example, steps S407 to S408 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a video image detection module 501 run by the processor.
In step S409, when it is determined that the detected facial action matches the corresponding predetermined facial action, face feature points in the face region corresponding to the detected facial action are extracted.
In step S410, according to the face feature points, the pre-trained second convolutional network model for determining the presentation position of a business object in a video image is used to determine the presentation position of the business object to be presented in the video image.
In an optional example, steps S409 to S410 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a presentation position determination module 502 run by the processor.
In step S411, the business object to be presented is drawn at the presentation position by a computer drawing method.
In an optional example, step S411 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a business object drawing module 503 run by the processor.
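Taken together, steps S407 to S411 amount to the following inference sketch; every callable is supplied by the caller because the application does not prescribe concrete interfaces, so the names below are purely illustrative assumptions.

def process_video_image(frame, extract_keypoints, detect_action,
                        extract_feature_points, predict_position, draw,
                        predetermined_action, business_object):
    keypoints = extract_keypoints(frame)                       # S408: face key points
    action = detect_action(frame, keypoints)                   # S408: first model, action state -> facial action
    if action != predetermined_action:                         # S409: match against the predetermined action
        return frame                                           # no match, nothing is presented
    feature_points = extract_feature_points(frame, keypoints)  # S409: feature points of the relevant face region
    position = predict_position(feature_points)                # S410: second model -> presentation position
    return draw(frame, business_object, position)              # S411: computer drawing at that position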
With the rise of Internet live streaming and short video sharing, more and more videos appear in the form of live streams or short videos. Such videos often feature people as the main subject (a single person or a small number of people), with a person plus a simple background as the main scene, and viewers mainly watch them on mobile terminals such as mobile phones. With the solution provided by this embodiment, video images can be detected in real time during video playback, and an advertisement placement position with a good effect can be given without affecting the user's viewing experience, so the placement effect is better. By combining the business object with video playback, there is no need to transmit additional advertisement video data unrelated to the video over the network, which helps save network resources and/or system resources of the client. In addition, the business object to be presented is drawn at the presentation position by a computer drawing method, and the business object is closely combined with the facial action in the video image, which preserves the main image and actions of the video subject (such as the anchor), adds interest to the video image, does not disturb the user's normal viewing of the video, helps reduce the user's aversion to the business object presented in the video image, attracts the audience's attention to a certain extent, and improves the influence of the business object. It can be understood that, in addition to advertising, the placement of business objects can be widely applied to other fields such as education, consulting, and services, where entertaining or appreciative business information can be placed to improve interaction and the user experience.
The processing of any video image provided by the embodiments of the present application may be performed by any appropriate device having data processing capability, including but not limited to a terminal device, a server, and the like. Alternatively, the processing of any video image provided by the embodiments of the present application may be performed by a processor; for example, the processor performs the processing of any video image mentioned in the embodiments of the present application by invoking corresponding instructions stored in a memory. This is not repeated below.
Those of ordinary skill in the art can understand that all or part of the steps of implementing the above method embodiments may be completed by hardware related to program instructions. The foregoing program may be stored in a computer-readable storage medium, and when the program is executed, the steps of the above method embodiments are performed. The foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
FIG. 5 is a schematic structural diagram of an embodiment of a video image processing apparatus of the present application. The video image processing apparatus of the embodiments of the present application may be used to implement the above video image processing method embodiments of the present application. Referring to FIG. 5, the video image processing apparatus of this embodiment includes a video image detection module 501, a presentation position determination module 502, and a business object drawing module 503, where:
the video image detection module 501 is configured to perform facial action detection of a face on a currently played video image containing face information;
the presentation position determination module 502 is configured to determine the presentation position of the business object to be presented in the video image when it is determined that the detected facial action matches the corresponding predetermined facial action; and
the business object drawing module 503 is configured to draw the business object to be presented at the presentation position by a computer drawing method.
With the video image processing apparatus provided by this embodiment, facial action detection is performed on a currently played video image containing face information, and the detected facial action is matched with the corresponding predetermined facial action; when the two match, the presentation position of the business object to be presented in the video image is determined, and the business object to be presented is drawn at that presentation position by a computer drawing method. In this way, when the business object is used to display an advertisement, on the one hand, since the business object to be presented is drawn at the presentation position by a computer drawing method and is combined with video playback, there is no need to transmit additional advertisement video data unrelated to the video over the network, which helps save network resources and/or system resources of the client; on the other hand, the business object is closely combined with the facial action in the video image, which helps preserve the main image and actions of the video subject (such as the anchor), adds interest to the video image, does not disturb the user's normal viewing of the video, helps reduce the user's aversion to the business object presented in the video image, can attract the audience's attention to a certain extent, and improves the influence of the business object.
In an optional example of the video image processing apparatus embodiments of the present application, the video image detection module 501 is configured to extract face key points from a currently played video image containing face information, use the pre-trained first convolutional network model for detecting face action states in images to determine the state of the facial action of the face in the video image according to the face key points, and determine the facial action of the face in the video image according to the state of the facial action.
FIG. 6 is a schematic structural diagram of another embodiment of a video image processing apparatus of the present application. Referring to FIG. 6, compared with the embodiment shown in FIG. 5, the video image processing apparatus further includes:
a training sample acquisition module 504, configured to acquire a plurality of sample images including face information as training samples, where the sample images are annotated with face attribute information; and
a first convolutional network model determination module 505, configured to train the first convolutional network model using the training samples to obtain a first convolutional network model for detecting face action states in images.
Optionally, the training sample acquisition module 504 includes: a sample image acquisition unit, configured to acquire a plurality of sample images including face information; a face positioning information determination unit, configured to, for each sample image, detect the face and face key points in the sample image and position the face in the sample image by means of the face key points to obtain face positioning information; and a training sample determination unit, configured to use the sample images containing the face positioning information as training samples.
Optionally, the presentation position determination module 502 includes: a feature point extraction unit, configured to extract face feature points in the face region corresponding to the detected facial action; and a presentation position determination unit, configured to determine the presentation position of the business object to be presented in the video image according to the face feature points.
Optionally, the presentation position determination module 502 is configured to determine the presentation position of the business object to be presented in the video image according to the face feature points, using the pre-trained second convolutional network model for determining the presentation position of a business object in a video image.
Optionally, referring again to FIG. 6, the video image processing apparatus of still another embodiment further includes:
a feature vector acquisition module 506, configured to acquire feature vectors of the sample images of the training samples, where the feature vector includes the position information and/or confidence information of the business object in the sample image, and the face feature vector corresponding to the face feature points in the face region corresponding to the facial action in the sample image;
a convolution module 507, configured to perform convolution processing on the feature vector to obtain a feature vector convolution result;
a convergence condition determination module 508, configured to determine whether the position information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies the business object convergence condition, and to determine whether the corresponding face feature vector in the feature vector convolution result satisfies the face convergence condition; and
a model training module 509, configured to complete the training of the second convolutional network model when the above convergence conditions are both satisfied, that is, when the position information and/or confidence information of the business object satisfies the business object convergence condition and the face feature vector satisfies the face convergence condition; and otherwise, when the above convergence conditions are not both satisfied, to adjust the network parameters of the second convolutional network model and iteratively train the second convolutional network model according to the adjusted network parameters until the position information and/or confidence information of the business object and the face feature vector after the iterative training both satisfy the corresponding convergence conditions.
Optionally, the presentation position determination module 502 is configured to determine the presentation position of the business object to be presented in the video image according to the face feature points and the type of the business object to be presented.
Optionally, the presentation position determination module 502 includes: a presentation position acquisition unit, configured to acquire a plurality of presentation positions of the business object to be presented in the video image according to the face feature points and the type of the business object to be presented; and a presentation position selection unit, configured to select at least one presentation position from the plurality of presentation positions as the presentation position of the business object to be presented in the video image.
Optionally, the presentation position determination module 502 is configured to acquire, from pre-stored correspondences between facial actions and presentation positions, the target presentation position corresponding to the predetermined facial action as the presentation position of the business object to be presented in the video image.
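For this variant, the pre-stored correspondence can be as simple as a lookup table; the action names and region identifiers below are illustrative assumptions only.

# Pre-stored correspondence between predetermined facial actions and target
# presentation positions (illustrative entries).
ACTION_TO_POSITION = {
    "open_mouth": "chin_region",
    "blink": "forehead_region",
    "shake_head": "background_region",
}

def target_presentation_position(detected_action):
    # Returns the stored target position for the matched facial action,
    # or None if no correspondence is stored for it.
    return ACTION_TO_POSITION.get(detected_action)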
Optionally, the business object includes a special effect containing semantic information, and the video image is a live-streaming video image.
Optionally, the special effect containing semantic information includes a special effect containing advertisement information in at least one of the following forms: a two-dimensional sticker effect, a three-dimensional effect, a particle effect, and the like.
Optionally, the presentation position includes at least one or any more of the following: the hair region, forehead region, cheek region, or chin region of a person in the video image, a body region other than the head, a background region in the video image, a region within a set range centered on the region where a hand is located in the video image, a preset region in the video image, and the like.
Optionally, the type of the business object includes at least one or any more of the following types: a forehead patch type, a cheek patch type, a chin patch type, a virtual hat type, a virtual clothing type, a virtual makeup type, a virtual headwear type, a virtual hair accessory type, a virtual jewelry type, a background type, a virtual pet type, a virtual container type, and the like.
Optionally, the facial action of the face includes at least one or any more of the following: blinking, kissing, opening the mouth, shaking the head, nodding, smiling, crying, frowning, closing the left eye, closing the right eye, closing both eyes, moving the eyeballs to the left, moving the eyeballs to the right, turning the head to the left, turning the head to the right, raising the head, lowering the head, pouting, and the like.
Referring to FIG. 7, a schematic structural diagram of an electronic device according to Embodiment 7 of the present application is shown; the specific embodiments of the present application do not limit the specific implementation of the electronic device. As shown in FIG. 7, the electronic device may include a processor 902, a communications interface 904, a memory 906, and a communication bus 908, where:
the processor 902, the communications interface 904, and the memory 906 communicate with one another via the communication bus 908;
the communications interface 904 is configured to communicate with network elements of other devices, such as other clients or servers; and
the processor 902 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), one or more integrated circuits configured to implement the embodiments of the present application, or a graphics processing unit (GPU). The one or more processors included in the terminal device may be processors of the same type, such as one or more CPUs or one or more GPUs, or may be processors of different types, such as one or more CPUs and one or more GPUs.
The memory 906 is configured to store at least one executable instruction that causes the processor 902 to perform operations corresponding to the method for presenting a business object in a video image according to any of the above embodiments of the present application. The memory 906 may include a high-speed random access memory (RAM), and may also include a non-volatile memory, such as at least one disk memory.
In addition, the embodiments of the present application further provide another electronic device, including a processor and the video image processing apparatus according to any of the above embodiments of the present application; when the processor runs the video image processing apparatus, the units in the video image processing apparatus according to any of the above embodiments of the present application are run.
FIG. 8 is a schematic structural diagram of another embodiment of an electronic device of the present invention. Referring to FIG. 8, a schematic structural diagram of an electronic device suitable for implementing a terminal device or a server of the embodiments of the present application is shown. As shown in FIG. 8, the electronic device includes one or more processors, a communication part, and the like; the one or more processors are, for example, one or more central processing units (CPUs) 801 and/or one or more graphics processing units (GPUs) 813, and the processor may perform various appropriate actions and processing according to executable instructions stored in a read-only memory (ROM) 802 or executable instructions loaded from a storage section 808 into a random access memory (RAM) 803. The communication part 812 may include, but is not limited to, a network card, and the network card may include, but is not limited to, an IB (InfiniBand) network card. The processor may communicate with the read-only memory 802 and/or the random access memory 803 to execute the executable instructions, is connected to the communication part 812 via a bus 804, and communicates with other target devices via the communication part 812, thereby completing the operations corresponding to any video image processing method provided by the embodiments of the present application, for example: performing facial action detection of a face on a currently played video image containing face information; when the detected facial action matches the corresponding predetermined facial action, determining the presentation position of the business object to be presented in the video image; and drawing the business object to be presented at the presentation position by a computer drawing method.
In addition, the RAM 803 may further store various programs and data required for the operation of the device. The CPU 801, the ROM 802, and the RAM 803 are connected to one another via the bus 804. In the case where the RAM 803 is present, the ROM 802 is an optional module. The RAM 803 stores executable instructions, or writes executable instructions into the ROM 802 at runtime, and the executable instructions cause the processor 801 to perform the operations corresponding to the above video image processing method. An input/output (I/O) interface 805 is also connected to the bus 804. The communication part 812 may be integrated, or may be provided with a plurality of sub-modules (for example, a plurality of IB network cards) linked on the bus.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a cathode-ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card or a modem. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read therefrom is installed into the storage section 808 as needed.
It should be noted that the architecture shown in FIG. 8 is only an optional implementation. In specific practice, the number and types of the components in FIG. 8 may be selected, deleted, added, or replaced according to actual needs. Different functional components may also be implemented in separate or integrated arrangements; for example, the GPU and the CPU may be arranged separately, or the GPU may be integrated on the CPU, and the communication part may be arranged separately or integrated on the CPU or the GPU, and so on. These alternative implementations all fall within the scope of protection of the present disclosure.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program tangibly embodied on a machine-readable medium; the computer program includes program code for executing the method shown in the flowchart, and the program code may include instructions corresponding to the steps of the method provided by the embodiments of the present application, for example: performing facial action detection of a face on a currently played video image containing face information; when the detected facial action matches the corresponding predetermined facial action, determining the presentation position of the business object to be presented in the video image; and drawing the business object to be presented at the presentation position by a computer drawing method.
In addition, the embodiments of the present application further provide a computer program, including computer-readable code; when the computer-readable code runs on a device, a processor in the device executes instructions for implementing the steps of the video image processing method according to any of the embodiments of the present application.
In addition, the embodiments of the present application further provide a computer-readable storage medium for storing computer-readable instructions, which, when executed, implement the operations of the steps of the video image processing method according to any of the embodiments of the present application.
In the embodiments of the present application, for the specific implementation of each step when the computer program or the computer-readable instructions are executed, refer to the corresponding descriptions of the corresponding steps and modules in the above embodiments, which are not repeated here. Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the devices and modules described above, refer to the corresponding process descriptions in the foregoing method embodiments, which are not repeated here.
The embodiments in this specification are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and for the same or similar parts between the embodiments, reference may be made to one another. For the apparatus, electronic device, program, and storage medium embodiments, since they substantially correspond to the method embodiments, the description is relatively simple, and for relevant parts, refer to the partial descriptions of the method embodiments.
It should be pointed out that, according to implementation needs, each step/component described in the present application may be split into more steps/components, and two or more steps/components or partial operations of steps/components may also be combined into new steps/components to achieve the objectives of the present application.
The above methods according to the present application may be implemented in hardware or firmware, or implemented as software or computer code that can be stored in a recording medium (such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk), or implemented as computer code that is originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded over a network, and to be stored in a local recording medium, so that the methods described herein can be processed by such software on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware (such as an ASIC or an FPGA). It can be understood that a computer, a processor, a microprocessor controller, or programmable hardware includes a storage component (for example, a RAM, a ROM, or a flash memory) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, the processor, or the hardware, the processing methods described herein are implemented. In addition, when a general-purpose computer accesses code for implementing the processing shown herein, the execution of the code converts the general-purpose computer into a special-purpose computer for performing the processing shown herein.
The above are only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any change or substitution that can easily be conceived by a person skilled in the art within the technical scope disclosed in the present application shall be covered by the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims (34)

1. A video image processing method, comprising:
    performing facial action detection of a face on a currently played video image containing face information;
    when the detected facial action matches a corresponding predetermined facial action, determining a presentation position of a business object to be presented in the video image; and
    drawing the business object to be presented at the presentation position by a computer drawing method.
2. The method according to claim 1, wherein the performing facial action detection of a face on a currently played video image containing face information comprises:
    extracting face key points from the currently played video image containing face information, using a pre-trained first convolutional network model for detecting face action states in images to determine the face action state in the video image according to the face key points, and determining the facial action of the face in the video image according to the face action state in the video image.
3. The method according to claim 2, wherein pre-training the first convolutional network model comprises:
    acquiring a plurality of sample images including face information as training samples, wherein the sample images are annotated with information on face action states; and
    training the first convolutional network model using the training samples to obtain a first convolutional network model for detecting face action states in images.
4. The method according to claim 3, wherein the acquiring a plurality of sample images including face information as training samples comprises:
    acquiring a plurality of sample images including face information;
    for each of the sample images, detecting a face and face key points in the sample image, and positioning the face in the sample image by means of the face key points to obtain face positioning information; and
    using the sample images containing the face positioning information as training samples.
5. The method according to any one of claims 1-4, wherein the determining a presentation position of the business object to be presented in the video image comprises:
    acquiring face feature points in a face region corresponding to the detected facial action; and
    determining the presentation position of the business object to be presented in the video image according to the face feature points.
6. The method according to claim 5, wherein the determining the presentation position of the business object to be presented in the video image according to the face feature points comprises:
    determining the presentation position of the business object to be presented in the video image according to the face feature points, using a pre-trained second convolutional network model for determining a presentation position of a business object in a video image.
7. The method according to claim 6, wherein pre-training the second convolutional network model comprises:
    acquiring feature vectors of sample images of training samples, wherein the feature vector includes position information and/or confidence information of the business object in the sample image, and a face feature vector corresponding to face feature points in the face region corresponding to the facial action in the sample image;
    performing convolution processing on the feature vector to obtain a feature vector convolution result;
    determining whether the position information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies a business object convergence condition, and determining whether the corresponding face feature vector in the feature vector convolution result satisfies a face convergence condition;
    if both are satisfied, completing the training of the second convolutional network model; and
    otherwise, adjusting network parameters of the second convolutional network model, and iteratively training the second convolutional network model according to the adjusted network parameters of the second convolutional network model until the position information and/or confidence information of the business object and the face feature vector after the iterative training both satisfy the corresponding convergence conditions.
8. The method according to claim 5, wherein the determining the presentation position of the business object to be presented in the video image according to the face feature points comprises:
    determining the presentation position of the business object to be presented in the video image according to the face feature points and a type of the business object to be presented.
9. The method according to claim 8, wherein the determining the presentation position of the business object to be presented in the video image according to the face feature points and the type of the business object to be presented comprises:
    acquiring a plurality of presentation positions of the business object to be presented in the video image according to the face feature points and the type of the business object to be presented; and
    selecting at least one presentation position from the plurality of presentation positions as the presentation position of the business object to be presented in the video image.
10. The method according to any one of claims 1-4, wherein the determining a presentation position of the business object to be presented in the video image comprises:
    acquiring, from pre-stored correspondences between facial actions and presentation positions, a target presentation position corresponding to the predetermined facial action as the presentation position of the business object to be presented in the video image.
11. The method according to any one of claims 1-10, wherein the business object comprises a special effect containing semantic information, and the video image comprises a live-streaming video image.
12. The method according to claim 11, wherein the special effect containing semantic information comprises a special effect containing advertisement information in one or any more of the following forms: a two-dimensional sticker effect, a three-dimensional effect, and a particle effect.
13. The method according to any one of claims 1-12, wherein the presentation position comprises one or any more of the following: a hair region, a forehead region, a cheek region, or a chin region of a person in the video image, a body region other than the head, a background region in the video image, a region within a set range centered on the region where a hand is located in the video image, and a preset region in the video image.
14. The method according to any one of claims 1-13, wherein the type of the business object comprises one or any more of the following types: a forehead patch type, a cheek patch type, a chin patch type, a virtual hat type, a virtual clothing type, a virtual makeup type, a virtual headwear type, a virtual hair accessory type, a virtual jewelry type, a background type, a virtual pet type, and a virtual container type.
15. The method according to any one of claims 1-14, wherein the facial action of the face comprises one or any more of the following: blinking, kissing, opening the mouth, shaking the head, nodding, smiling, crying, frowning, closing the left eye, closing the right eye, closing both eyes, moving the eyeballs to the left, moving the eyeballs to the right, turning the head to the left, turning the head to the right, raising the head, lowering the head, and pouting.
16. A video image processing apparatus, comprising:
    a video image detection module, configured to perform facial action detection of a face on a currently played video image containing face information;
    a presentation position determination module, configured to determine a presentation position of a business object to be presented in the video image when the facial action detected by the video image detection module matches a corresponding predetermined facial action; and
    a business object drawing module, configured to draw the business object to be presented at the presentation position by a computer drawing method.
17. The apparatus according to claim 16, wherein the video image detection module is configured to acquire face key points from the currently played video image containing face information, use a pre-trained first convolutional network model for detecting face action states in images to determine the state of the facial action of the face in the video image according to the face key points, and determine the facial action of the face in the video image according to the face action state in the video image.
18. The apparatus according to claim 17, further comprising:
    a training sample acquisition module, configured to acquire a plurality of sample images including face information as training samples, wherein the sample images are annotated with information on face action states; and
    a first convolutional network model determination module, configured to train the first convolutional network model using the training samples to obtain a first convolutional network model for detecting face action states in images.
19. The apparatus according to claim 18, wherein the training sample acquisition module comprises:
    a sample image acquisition unit, configured to acquire a plurality of sample images including face information;
    a face positioning information determination unit, configured to, for each of the sample images, detect a face and face key points in the sample image, and position the face in the sample image by means of the face key points to obtain face positioning information; and
    a training sample determination unit, configured to use the sample images containing the face positioning information as training samples.
20. The apparatus according to any one of claims 16-19, wherein the presentation position determination module comprises:
    a feature point extraction unit, configured to acquire face feature points in a face region corresponding to the detected facial action; and
    a presentation position determination unit, configured to determine the presentation position of the business object to be presented in the video image according to the face feature points.
  21. 根据权利要求20所述的装置,其特征在于,所述展现位置确定模块,用于根据所述人脸特征点,使用预先训练好的、用于确定业务对象在视频图像中的展现位置的第二卷积网络模型,确定所述待展现的业务对象在所述视频图像中的展现位置。The device according to claim 20, wherein the presentation position determining module is configured to use, according to the facial feature point, a pre-trained first for determining a presentation position of a business object in a video image. A two-convolution network model determines a presentation location of the business object to be presented in the video image.
  22. The apparatus according to claim 21, further comprising:
    a feature vector acquisition module, configured to acquire feature vectors of the sample images of the training samples, wherein the feature vectors include: position information and/or confidence information of the business object in the sample image, as well as the face feature vector corresponding to the face feature points within the face region corresponding to the facial action in the sample image;
    a convolution module, configured to perform convolution processing on the feature vectors to obtain a feature vector convolution result;
    a convergence condition judgment module, configured to judge whether the position information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies a business object convergence condition, and to judge whether the corresponding face feature vector in the feature vector convolution result satisfies a face convergence condition;
    a model training module, configured to complete the training of the second convolutional network model if both conditions are satisfied; otherwise, to adjust the network parameters of the second convolutional network model and to iteratively train the second convolutional network model according to the adjusted network parameters, until the position information and/or confidence information of the business object and the face feature vector after iterative training all satisfy the corresponding convergence conditions.
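The iterative loop of claim 22 can be sketched as follows. This is a hedged outline under stated assumptions: `model.forward`, `model.losses`, and `model.update` are hypothetical methods standing in for a real deep-learning framework, and the two loss thresholds play the role of the business-object and face convergence conditions.

```python
def train_second_network(model, feature_vectors, targets,
                         obj_tol=1e-3, face_tol=1e-3, max_iters=10_000):
    """Repeat: convolve the feature vectors, check both convergence
    conditions, and adjust the network parameters until both are met."""
    for _ in range(max_iters):
        result = model.forward(feature_vectors)           # convolution processing
        obj_loss, face_loss = model.losses(result, targets)
        if obj_loss < obj_tol and face_loss < face_tol:   # both conditions satisfied
            return model                                  # training is complete
        model.update(lr=0.01)                             # adjust network parameters
    return model                                          # budget exhausted; best-effort model
```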
  23. The apparatus according to claim 20, wherein the presentation position determining module is configured to determine the presentation position of the business object to be presented in the video image according to the face feature points and the type of the business object to be presented.
  24. The apparatus according to claim 23, wherein the presentation position determining module comprises:
    a presentation position acquisition unit, configured to acquire a plurality of presentation positions of the business object to be presented in the video image according to the face feature points and the type of the business object to be presented;
    a presentation position selection unit, configured to select at least one presentation position from the plurality of presentation positions.
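A minimal sketch of claims 23 and 24: the business object type selects among candidate presentation positions derived from the face feature points. The region keys, the type names, and the first-candidate selection rule are illustrative assumptions, not the patented policy.

```python
def candidate_positions(feature_points: dict, object_type: str) -> list:
    """feature_points: mapping from region name to an (x, y) anchor.
    Returns the candidate presentation positions for the object type."""
    anchors = {
        "forehead_patch": [feature_points["forehead"]],
        "cheek_patch": [feature_points["left_cheek"], feature_points["right_cheek"]],
        "virtual_hat": [feature_points["head_top"]],
    }
    return anchors.get(object_type, [feature_points["forehead"]])

def choose_position(positions: list):
    """Select at least one presentation position; here simply the first."""
    return positions[0]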
  25. The apparatus according to any one of claims 16-19, wherein the presentation position determining module is configured to acquire, from pre-stored correspondences between facial actions and presentation positions, the target presentation position corresponding to the predetermined facial action as the presentation position of the business object to be presented in the video image.
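Claim 25 replaces model inference with a pre-stored lookup. A minimal sketch, assuming an in-memory dictionary as the stored correspondence; the action names and region names below are illustrative only.

```python
# Hypothetical pre-stored facial-action -> presentation-position table.
ACTION_TO_POSITION = {
    "blink": "eye_region",
    "open_mouth": "mouth_region",
    "nod": "head_top_region",
}

def target_position(predetermined_action: str) -> str:
    """Look up the target presentation position for the matched action,
    falling back to a background placement when no entry exists."""
    return ACTION_TO_POSITION.get(predetermined_action, "background_region")
```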
  26. The apparatus according to any one of claims 16-25, wherein the business object comprises a special effect containing semantic information, and the video image comprises a live-streaming video image.
  27. The apparatus according to claim 26, wherein the special effect containing semantic information comprises a special effect containing advertisement information in one or any combination of the following forms: a two-dimensional sticker effect, a three-dimensional effect, and a particle effect.
  28. The apparatus according to any one of claims 16-27, wherein the presentation position comprises one or any combination of the following: the hair region, forehead region, cheek region, or chin region of a person in the video image; a body region other than the head; a background region in the video image; a region within a set range centered on the region where a hand is located in the video image; and a preset region in the video image.
  29. The apparatus according to any one of claims 16-28, wherein the type of the business object comprises one or any combination of the following types: forehead patch type, cheek patch type, chin patch type, virtual hat type, virtual clothing type, virtual makeup type, virtual headwear type, virtual hair accessory type, virtual jewelry type, background type, virtual pet type, and virtual container type.
  30. The apparatus according to any one of claims 16-29, wherein the facial action of the face comprises one or any combination of the following: blinking, kissing, opening the mouth, shaking the head, nodding, smiling, crying, frowning, closing the left eye, the right eye, or both eyes, and pouting.
  31. An electronic device, comprising: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with each other via the communication bus;
    the memory is configured to store at least one executable instruction, the executable instruction causing the processor to perform operations corresponding to the video image processing method according to any one of claims 1-15.
  32. An electronic device, comprising:
    a processor and the video image processing apparatus according to any one of claims 16-30;
    wherein, when the processor runs the video image processing apparatus, the units in the video image processing apparatus according to any one of claims 16-30 are run.
  33. A computer program, comprising computer-readable code, wherein, when the computer-readable code runs on a device, a processor in the device executes instructions for implementing each step of the video image processing method according to any one of claims 1-15.
  34. A computer-readable storage medium, configured to store computer-readable instructions, wherein, when the instructions are executed, the operations of each step of the video image processing method according to any one of claims 1-15 are implemented.
PCT/CN2017/098201 2016-08-19 2017-08-19 Video image processing method, apparatus and electronic device WO2018033155A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610697502.3A CN107341435A (en) 2016-08-19 2016-08-19 Processing method, device and the terminal device of video image
CN201610697502.3 2016-08-19

Publications (1)

Publication Number Publication Date
WO2018033155A1 (en) 2018-02-22

Family

ID=60222304

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/098201 WO2018033155A1 (en) 2016-08-19 2017-08-19 Video image processing method, apparatus and electronic device

Country Status (2)

Country Link
CN (1) CN107341435A (en)
WO (1) WO2018033155A1 (en)


Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108259496B (en) 2018-01-19 2021-06-04 北京市商汤科技开发有限公司 Method and device for generating special-effect program file package and special effect, and electronic equipment
CN108388434B (en) 2018-02-08 2021-03-02 北京市商汤科技开发有限公司 Method and device for generating special-effect program file package and special effect, and electronic equipment
CN108510523A (en) * 2018-03-16 2018-09-07 新智认知数据服务有限公司 It is a kind of to establish the model for obtaining object feature and object searching method and device
CN110314379B (en) * 2018-03-29 2022-07-26 腾讯科技(深圳)有限公司 Learning method of action output deep training model and related equipment
CN109035257B (en) * 2018-07-02 2021-08-31 百度在线网络技术(北京)有限公司 Portrait segmentation method, device and equipment
CN109068053B (en) * 2018-07-27 2020-12-04 香港乐蜜有限公司 Image special effect display method and device and electronic equipment
CN109165571B (en) * 2018-08-03 2020-04-24 北京字节跳动网络技术有限公司 Method and apparatus for inserting image
CN111488773B (en) * 2019-01-29 2021-06-11 广州市百果园信息技术有限公司 Action recognition method, device, equipment and storage medium
CN110189172B (en) * 2019-05-28 2021-10-15 广州华多网络科技有限公司 Shopping guide method and system for network live broadcast room
CN110188712B (en) * 2019-06-03 2021-10-12 北京字节跳动网络技术有限公司 Method and apparatus for processing image
CN110334698A (en) * 2019-08-30 2019-10-15 上海聚虹光电科技有限公司 Glasses detection system and method
CN110942005A (en) * 2019-11-21 2020-03-31 网易(杭州)网络有限公司 Object recognition method and device
CN112887631B (en) * 2019-11-29 2022-08-12 北京字节跳动网络技术有限公司 Method and device for displaying object in video, electronic equipment and computer-readable storage medium
CN111743524A (en) * 2020-06-19 2020-10-09 联想(北京)有限公司 Information processing method, terminal and computer readable storage medium
CN114051166B (en) * 2020-07-24 2024-03-29 北京达佳互联信息技术有限公司 Method, device, electronic equipment and storage medium for implanting advertisement in video
CN111931762B (en) * 2020-09-25 2021-07-30 广州佰锐网络科技有限公司 AI-based image recognition solution method, device and readable storage medium


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9251854B2 (en) * 2011-02-18 2016-02-02 Google Inc. Facial detection, recognition and bookmarking in videos
CN104881660B (en) * 2015-06-17 2018-01-09 吉林纪元时空动漫游戏科技集团股份有限公司 The expression recognition and interactive approach accelerated based on GPU
CN105426850B (en) * 2015-11-23 2021-08-31 深圳市商汤科技有限公司 Associated information pushing device and method based on face recognition

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1645413A (en) * 2004-01-19 2005-07-27 日本电气株式会社 Image processing apparatus, method and program
CN102455898A (en) * 2010-10-29 2012-05-16 张明 Cartoon expression based auxiliary entertainment system for video chatting
CN102737331A (en) * 2010-12-02 2012-10-17 微软公司 Targeting advertisements based on emotion

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344770A (en) * 2018-09-30 2019-02-15 新华三大数据技术有限公司 Resource allocation methods and device
CN109344770B (en) * 2018-09-30 2020-10-09 新华三大数据技术有限公司 Resource allocation method and device
CN110991220A (en) * 2019-10-15 2020-04-10 北京海益同展信息科技有限公司 Egg detection method, egg image processing method, egg detection device, egg image processing device, electronic equipment and storage medium
CN110991220B (en) * 2019-10-15 2023-11-07 京东科技信息技术有限公司 Egg detection and image processing method and device, electronic equipment and storage medium
CN111582184A (en) * 2020-05-11 2020-08-25 汉海信息技术(上海)有限公司 Page detection method, device, equipment and storage medium
CN111582184B (en) * 2020-05-11 2024-02-20 汉海信息技术(上海)有限公司 Page detection method, device, equipment and storage medium
CN112183200A (en) * 2020-08-25 2021-01-05 中电海康集团有限公司 Eye movement tracking method and system based on video image
CN112183200B (en) * 2020-08-25 2023-10-17 中电海康集团有限公司 Eye movement tracking method and system based on video image
CN112434578B (en) * 2020-11-13 2023-07-25 浙江大华技术股份有限公司 Mask wearing normalization detection method, mask wearing normalization detection device, computer equipment and storage medium
CN112434578A (en) * 2020-11-13 2021-03-02 浙江大华技术股份有限公司 Mask wearing normative detection method and device, computer equipment and storage medium
CN113780164B (en) * 2021-09-09 2023-04-28 福建天泉教育科技有限公司 Head gesture recognition method and terminal
CN113780164A (en) * 2021-09-09 2021-12-10 福建天泉教育科技有限公司 Head posture recognition method and terminal
CN113946221A (en) * 2021-11-03 2022-01-18 广州繁星互娱信息科技有限公司 Eye driving control method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN107341435A (en) 2017-11-10

Similar Documents

Publication Publication Date Title
WO2018033155A1 (en) Video image processing method, apparatus and electronic device
WO2018033143A1 (en) Video image processing method, apparatus and electronic device
WO2018033154A1 (en) Gesture control method, device, and electronic apparatus
US10776970B2 (en) Method and apparatus for processing video image and computer readable medium
US11037348B2 (en) Method and apparatus for displaying business object in video image and electronic device
US11182591B2 (en) Methods and apparatuses for detecting face, and electronic devices
US11544884B2 (en) Virtual clothing try-on
US11044295B2 (en) Data processing method, apparatus and electronic device
US11657575B2 (en) Generating augmented reality content based on third-party content
CN112513875B (en) Eye texture repair
US20210312523A1 (en) Analyzing facial features for augmented reality experiences of physical products in a messaging system
US11521334B2 (en) Augmented reality experiences of color palettes in a messaging system
US11915305B2 (en) Identification of physical products for augmented reality experiences in a messaging system
US20210312678A1 (en) Generating augmented reality experiences with physical products using profile information
US20220207875A1 (en) Machine learning-based selection of a representative video frame within a messaging application
CN115668263A (en) Identification of physical products for augmented reality experience in messaging systems
US11847756B2 (en) Generating ground truths for machine learning
US20240161179A1 (en) Identification of physical products for augmented reality experiences in a messaging system
Heo et al. Hand Segmentation for Optical See-through HMD Based on Adaptive Skin Color Model Using 2D/3D Images

Legal Events

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 17841127; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 17841127; Country of ref document: EP; Kind code of ref document: A1)