WO2018033155A1 - Video image processing method, apparatus and electronic device - Google Patents

Video image processing method, apparatus and electronic device

Info

Publication number
WO2018033155A1
Authority
WO
WIPO (PCT)
Prior art keywords
face
video image
business object
information
image
Prior art date
Application number
PCT/CN2017/098201
Other languages
French (fr)
Chinese (zh)
Inventor
栾青
彭义刚
Original Assignee
北京市商汤科技开发有限公司
Priority date
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司
Publication of WO2018033155A1 publication Critical patent/WO2018033155A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/451 Execution arrangements for user interfaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition
    • G06V 40/176 Dynamic expression

Definitions

  • the present application relates to artificial intelligence technology, and in particular, to a video image processing method, apparatus, and electronic device.
  • Internet video is considered a premium resource for ad placement because it can be an important entry point for business traffic.
  • Existing video advertisements are mainly inserted as fixed-duration spots at a given point during video playback, or placed at a fixed position within the video playback area and its surroundings.
  • the embodiment of the present application provides a solution for processing a video image.
  • a method for processing a video image includes: performing facial motion detection on a currently played video image that includes face information; when the detected facial motion matches a corresponding predetermined facial motion, determining the presentation position of a business object to be presented in the video image; and drawing the business object to be presented at the presentation position by means of computer graphics.
  • a video image processing apparatus includes: a video image detecting module, configured to perform facial motion detection on a currently played video image that includes face information; a presentation position determining module, configured to determine the presentation position of a business object to be presented in the video image when the facial motion detected by the video image detecting module matches a corresponding predetermined facial motion; and a business object rendering module, configured to draw the business object to be presented at the presentation position by means of computer graphics.
  • an electronic device includes: a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with one another through the communication bus.
  • the memory is configured to store at least one executable instruction, the executable instruction causing the processor to perform an operation corresponding to the processing method of the video image according to any of the above embodiments of the present application.
  • another electronic device includes: a processor and the video image processing apparatus according to any of the above embodiments of the present application; when the processor runs the video image processing apparatus, the units in the video image processing apparatus according to any of the above embodiments of the present application are executed.
  • a computer program includes computer readable code; when the computer readable code runs on a device, a processor in the device executes instructions for each step of the video image processing method according to any of the above embodiments of the present application.
  • a computer readable storage medium is used to store computer readable instructions which, when executed, implement the operations of each step of the video image processing method described in any of the above embodiments of the present application.
  • facial motion detection is performed on a currently played video image including face information, and the detected facial motion is matched against a corresponding predetermined facial motion; when the two match, the presentation position of the business object to be presented is determined and the business object is drawn there by computer graphics.
  • FIG. 1 is a flow chart of an embodiment of a method for processing a video image of the present application.
  • FIG. 2 is a flowchart of an embodiment of a method for acquiring a first convolutional network model in an embodiment of the present application
  • FIG. 3 is a flow chart of another embodiment of a method for processing a video image of the present application.
  • FIG. 4 is a flow chart of still another embodiment of a method for processing a video image of the present application.
  • FIG. 5 is a schematic structural diagram of an embodiment of a processing apparatus for video images of the present application.
  • FIG. 6 is a schematic structural diagram of another embodiment of a processing apparatus for video images of the present application.
  • FIG. 7 is a schematic structural diagram of an embodiment of an electronic device of the present application.
  • FIG. 8 is a schematic structural diagram of another embodiment of an electronic device of the present application.
  • Embodiments of the present application can be applied to electronic devices such as terminal devices, computer systems, servers, etc., which can operate with numerous other general purpose or special purpose computing system environments or configurations.
  • Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, small computer systems, mainframe computer systems, and distributed cloud computing environments including any of the above, and the like.
  • Electronic devices such as terminal devices, computer systems, servers, etc., can be described in the general context of computer system executable instructions (such as program modules) being executed by a computer system.
  • program modules may include routines, programs, target programs, components, logic, data structures, and the like that perform particular tasks or implement particular abstract data types.
  • the computer system/server can be implemented in a distributed cloud computing environment where tasks are performed by remote processing devices that are linked through a communication network.
  • program modules may be located on a local or remote computing system storage medium including storage devices.
  • FIG. 1 is a flow chart of an embodiment of a method for processing a video image of the present application.
  • the processing method of the video image of each embodiment of the present application may be exemplarily executed by an electronic device such as a computer system, a terminal device, or a server.
  • a method for processing a video image of this embodiment includes:
  • Step S110: facial motion detection is performed on the currently played video image that includes face information.
  • facial actions may include, but are not limited to, blinking, opening the mouth, nodding, and pouting.
  • the face information may include, for example, information related to the face, eyes, mouth, nose, and/or hair, and the like.
  • the video image may be an image in a live video that is being broadcast live, or a video image that has been recorded or is in the process of being recorded.
  • a live video is taken as an example.
  • there are multiple live broadcast platforms, such as the Huajiao live broadcast platform and the YY live broadcast platform, and each live broadcast platform includes multiple live broadcast rooms.
  • Each live broadcast room includes at least one anchor, who can broadcast live video to the fans in that room; the live video comprises multiple video images.
  • the subject of such a video image is usually a single main character (i.e., the anchor) against a simple background, and the anchor often occupies a large area of the video image.
  • the video image in the current live video can be obtained and examined by a preset face detection mechanism to determine whether it includes the anchor's face information. If it does, the video image is acquired or recorded for subsequent processing; if it does not, the same processing continues with the next frame of video image, until a video image including the anchor's face information is obtained.
  • the video image may also be a video image in a short video that has been recorded.
  • the user can play the short video on an electronic device, and during playback the electronic device can detect each frame of the video. If a frame includes the anchor's face information, the video image is acquired for subsequent processing; if it does not, the video image may be discarded or left unprocessed, and the next frame is acquired and processed in the same way.
  • likewise, during recording, the anchor can use his or her electronic device to detect whether each recorded frame includes the anchor's face information. If it does, the video image is acquired for subsequent processing; if it does not, the video image may be discarded or left unprocessed, and the next frame is acquired and processed in the same way.
  • the electronic device that plays the video image, or the electronic device used by the anchor, is provided with a mechanism for detecting the facial motion of the face in a video image; this mechanism can perform facial motion detection on the currently played video image that includes face information.
  • an optional processing procedure may be: the electronic device acquires the video image currently being played and, through a preset mechanism, crops out the image containing the face region; the face-region image is then analyzed and features are extracted, yielding feature data for each part of the face region (including the eyes, the mouth, the face, and the like). By analyzing this feature data, the meaning of the facial motion of the face in the video image is determined, that is, which of the following the facial motion belongs to: blinking, closing the left eye, closing the right eye, closing both eyes, moving the eyes to the left, moving the eyes to the right, turning the head to the left, turning the head to the right, raising the head, lowering the head, nodding, laughing, crying, frowning, opening the mouth, or pouting.
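  • As a concrete illustration of this crop-then-classify flow, the following is a minimal Python sketch. It assumes OpenCV's stock Haar-cascade face detector; `action_classifier` is a hypothetical stand-in for the trained facial-action model (cf. the first convolutional network model of FIG. 2), not the patent's own code.

```python
# A minimal sketch of the crop-then-classify flow described above, assuming
# OpenCV's stock Haar-cascade face detector. `action_classifier` is a
# hypothetical stand-in for the trained facial-action model.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_facial_action(frame, action_classifier):
    """Return the facial action of the face in `frame`, or None if no face."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                       # no face info: move to the next frame
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest face = anchor
    face_crop = frame[y:y + h, x:x + w]   # cut out the face region
    return action_classifier(face_crop)   # e.g. "blink", "open_mouth", "pout"
```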
  • step S110 may be performed by the processor invoking a corresponding instruction stored in the memory, or may be performed by video image detection module 501 being executed by the processor.
  • Step S120: when the detected facial motion matches the corresponding predetermined facial motion, the presentation position of the business object to be presented in the video image is determined.
  • the business object is an object created according to a certain business requirement and may include, for example but not limited to: advertisements, entertainment, weather forecasts, traffic forecasts, pets, and the like.
  • the presentation position may be the center position of a designated area in the video image, or the coordinates of a plurality of edge positions of the designated area, or the like.
  • feature data of a plurality of different facial actions may be pre-stored, and different facial actions are marked correspondingly to distinguish the meaning represented by each facial action.
  • the facial motion of the face can be detected from the video image, and the feature data of the detected facial motion can be compared with the pre-stored feature data of each facial action.
  • if the pre-stored feature data of the plurality of different facial actions includes a facial action whose feature data matches that of the detected facial motion, the detected facial motion can be determined to match the corresponding predetermined facial motion.
  • the matching result may be determined by calculation.
  • a matching algorithm may be set to calculate the matching degree between the feature data of any two facial actions.
  • the matching algorithm may be used to match the feature data of the detected facial motion against the pre-stored feature data of each type of facial action, yielding a matching degree value for each pair. The largest matching degree value is then selected from the obtained values; if it exceeds the predetermined matching threshold, the pre-stored facial action corresponding to the largest matching degree value is determined to match the detected facial motion. If the largest matching degree value does not exceed the predetermined matching threshold, the matching fails, that is, the detected facial motion is not a predetermined facial motion, and the processing of step S110 above continues on subsequent video images. A sketch of this computation follows below.
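```python
# A sketch of the matching-degree computation described above. Cosine
# similarity and the 0.8 threshold are illustrative assumptions; the patent
# fixes neither the matching algorithm nor the threshold value.
import numpy as np

MATCH_THRESHOLD = 0.8  # assumed "predetermined matching threshold"

def matching_degree(a, b):
    """Matching degree between two facial-action feature vectors (cosine)."""
    a, b = np.asarray(a, np.float32), np.asarray(b, np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_action(detected, stored):
    """stored: dict of predetermined action name -> pre-stored feature vector.
    Returns the best-matching action, or None so step S110 continues."""
    scores = {name: matching_degree(detected, feats)
              for name, feats in stored.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > MATCH_THRESHOLD else None
```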
  • the meaning represented by the matched facial motion may be determined first, and a presentation position corresponding to that meaning may be selected from a plurality of preset presentation positions as the presentation position of the business object to be presented in the video image. For example, in a live video, when the anchor is detected performing a pouting motion, the mouth region may be selected as the corresponding presentation position.
  • step S120 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a presentation location determining module 502 being executed by the processor.
  • Step S130: the business object to be presented is drawn at the presentation position by means of computer graphics.
  • in order to enhance the visual effect of the business object and make the video image more engaging, dynamic effects can be set for the business object.
  • the business object can be presented as a video, or displayed dynamically as a sequence of multiple images, among other presentation modes.
  • the corresponding business object, such as an advertisement image carrying a predetermined product identifier, may be drawn, for example, in the area where the anchor's mouth is located in the video image.
  • a fan can click the area where the business object is located; the fan's electronic device then obtains the network link corresponding to the business object and opens the page associated with the business object, from which the fan can obtain resources related to it.
  • the business object may be drawn by computer graphics, implemented through appropriate computer graphics image drawing or rendering, for example but not limited to: drawing based on the Open Graphics Library (OpenGL).
  • OpenGL defines a professional, cross-programming-language, cross-platform graphics programming interface specification. It is hardware-independent and can conveniently draw two-dimensional (2D) or three-dimensional (3D) graphics images.
  • the application is not limited to the drawing method based on the OpenGL graphics rendering engine, and other methods may be adopted.
  • for example, drawing methods based on the Unity graphics engine or on the Open Computing Language (OpenCL) are also applicable to the embodiments of the present application.
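  • For illustration, the sketch below draws a two-dimensional sticker-type business object at a presentation position by CPU-side alpha blending with NumPy. It is a simplified stand-in for the OpenGL/Unity/OpenCL rendering paths named above, which draw into a GPU framebuffer instead; the channel-layout assumptions are noted in the comments.

```python
# Simplified stand-in for GPU rendering: draw a 2D sticker by alpha blending.
# Assumes a 4-channel sticker whose color channels already match the frame's
# channel order, and that the sticker fits entirely inside the frame.
import numpy as np

def draw_business_object(frame, sticker_rgba, top_left):
    """Alpha-blend `sticker_rgba` (H x W x 4, uint8) onto `frame` at (x, y)."""
    x, y = top_left
    h, w = sticker_rgba.shape[:2]
    roi = frame[y:y + h, x:x + w].astype(np.float32)
    color = sticker_rgba[..., :3].astype(np.float32)
    alpha = sticker_rgba[..., 3:4].astype(np.float32) / 255.0
    frame[y:y + h, x:x + w] = (alpha * color + (1 - alpha) * roi).astype(np.uint8)
    return frame
```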
  • step S130 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a business object rendering module 503 being executed by a processor.
  • the video image processing method performs facial motion detection on the currently played video image including face information and matches the detected facial motion against the corresponding predetermined facial motion. When the two match, the presentation position of the business object to be presented in the video image is determined, and the business object to be presented is drawn at that position by computer graphics. When the business object is used to display an advertisement, on the one hand, because the business object to be presented is drawn at the presentation position by computer graphics, the business object is combined with video playback itself and no additional advertisement video data unrelated to the video needs to be transmitted over the network, saving network resources and/or system resources of the client. On the other hand, the business object is closely combined with the facial motion in the video image, which helps retain the main image and motion of the video subject (such as the anchor) in the video image, adds interest to the video image, and does not disturb the user's normal viewing of the video; this helps reduce the user's aversion to business objects displayed in the video image and can attract the audience's attention.
  • the process of detecting the facial motion of the face of the video image in step S110 may be implemented by using a corresponding feature extraction algorithm or using a neural network model such as a convolutional network model.
  • the convolutional network model is taken as an example to perform facial motion detection on a video image.
  • a first convolutional network model for detecting a facial motion state in the image may be pre-trained.
  • FIG. 2 is a flow chart of an embodiment of a method for acquiring the first convolutional network model in an embodiment of the present application.
  • the method for acquiring the first convolutional network model of the embodiment may be performed by any device having the functions of data collection, processing, and transmission, including but not limited to a mobile terminal and a personal computer (PC).
  • the training samples may be obtained in a plurality of manners. A training sample may be a plurality of sample images including face information, with information on the face action state marked in each sample image; this face action state information is used to determine the facial motion of the face.
  • the method for obtaining the first convolutional network model of this embodiment includes:
  • Step S210: a plurality of sample images including face information are acquired, where the sample images are labeled with information on the face action state.
  • the face information may include local attribute information and global attribute information, etc.
  • the local attribute information may include, for example but not limited to: hair color, hair length, eyebrow length, eyebrow thickness, eye size, whether the eyes are open or closed, the height of the nose, the size of the mouth, whether the mouth is open or closed, whether glasses are worn, whether a mask is worn, and the like.
  • the global attribute information may include, for example but not limited to: race, gender, age, and the like.
  • the sample image may be an image in a video or a plurality of images continuously photographed, or may be any image, and may include an image including a human face and an image including no human face.
  • the face action state, that is, the current action state of the face, may include, but is not limited to: blinking, closing the left eye, closing the right eye, closing both eyes, moving the eyes to the left, moving the eyes to the right, turning the head to the left, turning the head to the right, raising the head, lowering the head, nodding, laughing, crying, frowning, opening the mouth, or pouting.
  • the sample image may be an image that satisfies a preset resolution condition.
  • the above preset resolution condition may be: the longest side of the image does not exceed 640 pixels, the shortest side does not exceed 480 pixels, and the like.
  • the sample image may be obtained by an image acquisition device, wherein the image acquisition device for collecting the face information of the user may be a dedicated camera or a camera integrated in other devices or the like.
  • the acquired image may not satisfy the above preset resolution condition; in order to obtain a sample image that satisfies it, in the present application the acquired image is subjected to scaling processing to obtain a sample image that meets the condition.
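  • A minimal sketch of that scaling step, assuming OpenCV and the 640/480-pixel condition quoted above:

```python
# A minimal sketch of the scaling step, assuming OpenCV and the resolution
# condition above (longest side <= 640 px, shortest side <= 480 px).
import cv2

MAX_LONG_SIDE, MAX_SHORT_SIDE = 640, 480

def scale_to_sample_condition(image):
    """Downscale an acquired image until it satisfies the preset condition."""
    h, w = image.shape[:2]
    scale = min(MAX_LONG_SIDE / max(h, w), MAX_SHORT_SIDE / min(h, w), 1.0)
    if scale < 1.0:  # only shrink; images already within the condition pass
        image = cv2.resize(image, (int(w * scale), int(h * scale)),
                           interpolation=cv2.INTER_AREA)
    return image
```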
  • the face action state, such as smiling, pouting, left eye closed, right eye closed, or both eyes closed, can be marked in each sample image, and each sample image can be stored together with its marked face action state as training data.
  • the face in the sample image can be positioned to obtain the exact position of the face in the sample image. For details, refer to the following step S220.
  • Step S220: for each sample image, the face and the face key points in the sample image are detected, and the face in the sample image is positioned using the face key points to obtain face positioning information.
  • each face has certain feature points, such as the corners of the eyes, the ends of the eyebrows, the corners of the mouth, the tip of the nose, and boundary points of the face; these feature points constitute the face key points to be obtained.
  • the face key points can be used to calculate a mapping, such as a similarity transformation, from the face in the sample image to a preset standard face, aligning the face in the sample image with the standard face; the face in the sample image is thereby positioned, and its positioning information in the sample image is obtained.
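  • The alignment step might look like the following sketch, which fits a similarity transform from detected key points to a standard-face template with OpenCV. The five template coordinates and the 112x112 crop size are illustrative assumptions; the patent does not specify the standard face.

```python
# A sketch of positioning by alignment: fit a similarity transform from the
# detected key points to an assumed standard-face template.
import cv2
import numpy as np

STANDARD_FACE = np.float32([  # eye corners, nose tip, mouth corners (assumed)
    [38, 52], [74, 52], [56, 72], [42, 92], [70, 92]])

def align_face(image, keypoints):
    """Warp the face in `image` onto the standard face, positioning it."""
    src = np.float32(keypoints)  # the same 5 landmarks, in image coordinates
    matrix, _ = cv2.estimateAffinePartial2D(src, STANDARD_FACE)  # similarity
    return cv2.warpAffine(image, matrix, (112, 112))
```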
  • Step S230: the sample images containing face positioning information are taken as the training samples.
  • steps S210-S230 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a training sample acquisition module 504 that is executed by the processor.
  • Step S240: the first convolutional network model is trained using the training samples to obtain a first convolutional network model for detecting the face action state in an image.
  • the front end of the first convolutional network model may include a combination of multiple convolutional layers, pooling layers, and non-linear layers, and the back end may be a loss layer, for example one based on softmax and/or cross entropy.
  • the structure of the first convolutional network model may include:
  • Input layer: used to read the sample image and the information of the marked face action state.
  • the input layer may preprocess the sample image, output a face image or face information including positioning information, and the like.
  • the input layer outputs the preprocessed face image to the convolution layer, and inputs the preprocessed face information to the loss layer.
  • Convolutional layer: its input is a pre-processed face image or image features, and it outputs features of the face image obtained through a predetermined linear transformation.
  • Non-linear layer: non-linearly transforms the features output by the convolutional layer through non-linear functions, so that the output features have stronger expressive power.
  • Pooling layer: maps multiple values to a single value. The pooling layer can therefore enhance the non-linearity of the learned features and reduce the spatial size of the output features, which strengthens the learned features' invariance to translation (i.e., translation of the face).
  • the output feature of the pooling layer can be used again as the input data of the convolution layer or the input data of the fully connected layer.
  • the combination of the convolutional layer, the non-linear layer, and the pooling layer may be repeated one or more times; each time, the output data of the pooling layer can serve again as input data for a convolutional layer. Stacking multiple such combinations processes the input sample image more thoroughly, so that the features extracted from the sample image are more expressive.
  • Fully connected layer: linearly transforms the output data of the pooling layer, projecting the learned features into a subspace better suited to predicting the face action state.
  • Non-linear layer: as with the non-linear layer described above, it non-linearly transforms the features output by the fully connected layer. Its output features can serve as input data for the loss layer, or again as input data for another fully connected layer.
  • the fully connected layer and the nonlinear layer may be repeated one or more times.
  • Loss layer: one or more loss layers, mainly responsible for calculating the error between the predicted face action state and the labeled face action state.
  • the network parameters in the first convolutional network model can be trained by back-propagated gradient descent, so that, given an image at the input layer, the model outputs the face action state information corresponding to the face in the input image; the first convolutional network model is thus obtained.
  • in summary, the input layer is responsible for simple pre-processing of the input sample image; the combination of the convolutional, non-linear, and pooling layers is responsible for feature extraction from the sample image; the fully connected and non-linear layers map the extracted features to the face action state information; and the loss layer is responsible for calculating the prediction error.
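  • Putting the layers together, a minimal PyTorch sketch of such a network is shown below. Channel counts, the 112x112 input size, and the 16 action-state classes are illustrative assumptions; the patent fixes only the layer types and their ordering.

```python
# A minimal sketch of the structure described above: repeated convolution +
# non-linearity + pooling, then fully connected + non-linear layers, with a
# cross-entropy loss layer. Sizes are illustrative assumptions.
import torch.nn as nn

class FaceActionNet(nn.Module):
    def __init__(self, num_action_states=16):
        super().__init__()
        self.features = nn.Sequential(    # conv / non-linear / pooling, x3
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(  # fully connected + non-linear
            nn.Flatten(),
            nn.Linear(128 * 14 * 14, 256), nn.ReLU(),  # 112 / 2 / 2 / 2 = 14
            nn.Linear(256, num_action_states),
        )
        self.loss = nn.CrossEntropyLoss()  # the "loss layer" (softmax inside)

    def forward(self, x, labels=None):
        logits = self.classifier(self.features(x))
        # training: return the error to back-propagate (gradient descent);
        # inference: return the predicted action-state scores
        return self.loss(logits, labels) if labels is not None else logits
```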
  • step S240 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a first convolutional network model determination module 505 being executed by the processor.
  • the first convolutional network model obtained by training facilitates subsequent facial motion detection on the currently played video image including face information and matching of the detected facial motion against the corresponding predetermined facial motion. When the two match, the presentation position of the business object to be presented in the video image is determined, and the business object to be presented is drawn at that position by computer graphics. When the business object is used to display an advertisement, on the one hand, because the business object to be presented is drawn at the presentation position by computer graphics, the business object is combined with video playback itself and no additional advertisement video data unrelated to the video is transmitted over the network, which helps save network resources and/or system resources of the client. On the other hand, the business object is closely combined with the facial motion in the video image, which helps retain the main image and motion of the video subject (such as the anchor) in the video image, adds interest to the video image, and does not disturb the user's normal viewing of the video; this helps reduce the user's aversion to business objects displayed in the video image and can attract the audience's attention.
  • the business object is a special effect containing semantic information.
  • the special effect containing semantic information may include at least one of, or any combination of, the following special effects containing advertisement information: a two-dimensional sticker effect, a three-dimensional effect, a particle effect, and the like.
  • the video image may be a live video image, such as a video image from an anchor's live broadcast on the Huajiao live broadcast platform.
  • the processing method of the video image of this embodiment includes:
  • Step S310: the currently played video image containing face information is acquired.
  • For the specific processing of step S310, refer to the related content of step S110 in the embodiment shown in FIG. 1; details are not repeated here.
  • Step S320: face key points are extracted from the video image, and a first convolutional network model for detecting the face action state in an image is used to determine, according to the face key points, the facial motion of the face in the video image.
  • the video image may be detected to determine whether a human face is included in the video image. If it is determined that the video image includes a human face, the face key point is extracted in the video image.
  • the acquired video image and face key points can be input into a first convolutional network model, which is trained, for example, by the embodiment shown in FIG. 2 above.
  • the video image can then be processed separately, for example by feature extraction, mapping, and transformation, to detect the motion of the face in the video image and obtain the face action state in the video image; based on the face action state, the facial motion of the face contained in the video image can be determined.
  • each type of facial motion can be divided into a plurality of face action states. For example, blinking may be divided into an open-eye state and a closed-eye state. The above processing may specifically be: extracting face key points from the video image, using the pre-trained first convolutional network model for detecting the face action state in an image to determine the face action state in the video image, and determining the facial motion of the face in the video image according to the face action state in the video image.
  • a plurality of video images currently containing the face information may be acquired, and the continuity of the plurality of video images may be determined to determine whether the plurality of video images are continuous in space and time. If it is judged to be discontinuous, the authentication fails or the user is prompted to reacquire the video image.
  • each frame of the video image can be divided into 3x3 regions, and on each region a color histogram and the mean and variance of the gray levels are established. For two adjacent face images, the distance between their histograms, the distance between their gray means, and the distance between their gray variances are treated as a feature vector, and a linear classifier is used to determine whether its output is greater than or equal to zero.
  • the parameters of the linear classifier can be trained on sample data with annotation information. If the classifier output is greater than or equal to zero, the two adjacent video images are considered continuous in time and space; in this case, the corresponding face action state may be determined based on the face key points extracted from each video image, so as to determine the facial motion exhibited across the plurality of consecutive video images. If the classifier output is less than zero, the two adjacent video images are considered discontinuous in time and space, and, taking the current video image as a starting point, the processing of step S310 above continues on subsequent video images.
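  • The per-region continuity features could be computed as in the following sketch, assuming OpenCV histograms and a chi-square histogram distance; the patent names the histogram, gray-mean, and gray-variance distances but not the exact metrics.

```python
# A sketch of the 3x3-grid continuity features described above; the distance
# metrics are assumptions.
import cv2
import numpy as np

def continuity_features(prev, curr):
    """Per-region comparison of two adjacent face images: color-histogram
    distance plus absolute differences of gray mean and gray variance
    (27 values in total)."""
    feats = []
    h, w = prev.shape[:2]
    for i in range(3):
        for j in range(3):
            a = prev[i * h // 3:(i + 1) * h // 3, j * w // 3:(j + 1) * w // 3]
            b = curr[i * h // 3:(i + 1) * h // 3, j * w // 3:(j + 1) * w // 3]
            ha = cv2.calcHist([a], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
            hb = cv2.calcHist([b], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
            ga = cv2.cvtColor(a, cv2.COLOR_BGR2GRAY).astype(np.float32)
            gb = cv2.cvtColor(b, cv2.COLOR_BGR2GRAY).astype(np.float32)
            feats += [cv2.compareHist(ha, hb, cv2.HISTCMP_CHISQR),
                      abs(float(ga.mean() - gb.mean())),
                      abs(float(ga.var() - gb.var()))]
    return np.float32(feats)

# A trained linear classifier (w, b) then decides continuity:
# continuous = (w @ continuity_features(prev, curr) + b) >= 0
```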
  • the first convolutional network model may be used to determine the face action state of the face in each frame of the video image based on the face key points extracted from that frame; for example, in the case of blinking, the probability of the open-eye state or of the closed-eye state can be calculated to determine the face action state in the video image.
  • an image block (containing face information) can be extracted around the center of the key points corresponding to the blinking action, and the face action state can be obtained from it through the first convolutional network model. The facial motion of the face in the video image can then be determined based on the face action state in each video image.
  • some facial motions can be determined from a single face action state (e.g., smiling, opening the mouth, or pouting); for a video image in which such a face action state is detected, the facial motion of the corresponding face can be determined directly according to the processing of step S320 above.
  • step S320 may be performed by the processor invoking a corresponding instruction stored in the memory, or may be performed by video image detection module 501 being executed by the processor.
  • Step S330: when it is determined that the detected facial motion matches the corresponding predetermined facial motion, the face feature points in the face region corresponding to the detected facial motion are extracted.
  • for each video image including face information, the face includes certain feature points, such as the eyes, the nose, the mouth, and the facial contour.
  • the detection of the face in the video image and the determination of the feature point can be implemented in any suitable related art, which is not limited in the embodiment of the present application.
  • for example, linear feature extraction methods such as principal component analysis (PCA), linear discriminant analysis (LDA), and independent component analysis (ICA) may be used;
  • non-linear feature extraction methods such as kernel principal component analysis (Kernel PCA) and manifold learning may also be used; alternatively, a trained neural network model, such as the convolutional network model in the embodiments of the present application, can be used. The embodiments of the present application do not limit this.
  • for example, during a live broadcast, the face is detected in the live video image and the face feature points are determined; during playback of a recorded video, the face is detected in the played video image and the face feature points are determined; and during recording of a video, the face is detected in the recorded video image and the face feature points are determined.
  • Step S340: the presentation position of the business object to be presented in the video image is determined according to the face feature points.
  • steps S330-S340 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a presentation location determining module 502 that is executed by the processor.
  • on this basis, one or more presentation positions of the business object to be presented in the video image may be determined.
  • optional implementations for determining the presentation position of the business object to be presented in the video image according to the face feature points include, but are not limited to, the following: manner one, determining the presentation position of the business object to be presented in the video image according to the face feature points, using a pre-trained second convolutional network model for determining the presentation position of a business object in a video image; manner two, determining the presentation position of the business object to be presented in the video image according to the face feature points and the type of the business object to be presented.
  • a convolutional network model, namely the second convolutional network model, can be pre-trained so that the trained model has the function of determining the presentation position of a business object in a video image; alternatively, a convolutional network model already trained by a third party to have this function can be used directly.
  • here, training on the business object is taken as an example, but those skilled in the art should understand that the second convolutional network model can also be trained on the face at the same time as the business object, achieving joint training of faces and business objects.
  • an optional training method includes the following process:
  • the feature vector includes position information and/or confidence information of the business object in the sample image of the training sample, and a face feature vector corresponding to the face feature point in the face region corresponding to the facial motion in the sample image.
  • the confidence information of the business object indicates the probability that the business object achieves the intended effect (such as being noticed, clicked, or viewed) when displayed at the current position.
  • this probability may be set according to the statistical analysis results of historical data, according to the results of simulation experiments, or according to human experience.
  • according to actual needs, only the location information of the business object may be trained, or only the confidence information of the business object may be trained, or both the location information and the confidence information may be trained.
  • the training of the location information and the confidence information enables the trained second convolutional network model to more effectively and accurately determine the location information and confidence information of the business object, so as to provide a basis for the display of the business object.
  • the second convolutional network model is trained by a large number of sample images.
  • the second convolutional network model can be trained using sample images containing the business object; those skilled in the art should understand that, in addition to the business object, the training sample images should also contain information on the face action state (i.e., information for determining the facial motion of the face).
  • the business object in the sample image in the embodiments of the present application may be pre-labeled with the location information, or the confidence information, or both. Of course, in practical applications, this information can also be obtained through other means. By labeling the business object with the corresponding information in advance, the amount of data processing and the number of interactions can be effectively reduced, improving data processing efficiency.
  • sample images labeled with the location information and/or confidence information of the business object and with a certain face attribute are used as training samples, and feature vectors are extracted from them, containing the location information and/or confidence information of the business object as well as the face feature vector corresponding to the face feature points.
  • the second convolutional network model can be used to simultaneously train the face and the business object.
  • the feature vector of the sample image also includes the features of the face.
  • the obtained feature vector convolution result includes the location information and/or confidence information of the business object, as well as the convolution result of the face feature vector corresponding to the face action state.
  • the feature vector convolution result also contains information on the state of the face action.
  • the number of times of convolution processing on the feature vector can be set according to actual needs, that is, in the second convolutional network model, the number of layers of the convolution layer is set according to actual needs, and details are not described herein again.
  • the convolution result is the result of feature extraction of the feature vector, which can effectively represent the business object corresponding to the feature of the face in the video image.
  • when the feature vector includes both the location information and the confidence information of the business object, that is, when both are trained, the feature vector convolution result is shared in the subsequent judgments of the convergence conditions, with no repeated processing and calculation required; this helps reduce the resource cost of data processing and improve the speed and efficiency of data processing.
  • the convergence condition is appropriately set by a person skilled in the art according to actual needs.
  • when the information satisfies the convergence condition, the network parameters in the second convolutional network model can be considered properly set; when the information cannot satisfy the convergence condition, the network parameters are considered improperly set and need to be adjusted. The adjustment may be an iterative process, repeated until the result of convolving the feature vector with the adjusted network parameters satisfies the convergence condition.
  • the convergence condition may be set according to a preset standard position and/or a preset standard confidence. For example, the distance between the position indicated by the location information of the business object in the feature vector convolution result and the preset standard position satisfying a certain threshold may serve as the convergence condition for the location information of the business object; the difference between the confidence indicated by the confidence information of the business object in the feature vector convolution result and the preset standard confidence satisfying a certain threshold may serve as the convergence condition for the confidence information of the business object; and so on.
  • the preset standard position may be the average position obtained by averaging the positions of the business objects in the sample images to be trained; the preset standard confidence may be the average confidence obtained by averaging the confidences of the business objects in the sample images to be trained. Since the sample images are the samples to be trained and their quantity is large, setting the standard position and/or standard confidence according to the positions and/or confidences of the business objects in the sample images to be trained makes the resulting standards more objective and precise.
  • an optional manner includes:
  • obtain the confidence information of the corresponding business object from the feature vector convolution result, and calculate the Euclidean distance between the confidence indicated by that confidence information and the preset standard confidence, thereby obtaining the convergence measure for the confidence of the corresponding business object.
  • the Euclidean distance method is adopted, and the implementation is simple and can effectively indicate whether the convergence condition is satisfied.
  • the embodiments of the present application are not limited thereto; other metrics, such as the Mahalanobis distance or the Bhattacharyya distance, may also be adopted.
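  • As a sketch of the Euclidean-distance convergence test (the threshold value is assumed, and the same test applies to scalar confidences as to positions):

```python
# A sketch of the Euclidean-distance convergence test described above;
# the threshold value is an assumption.
import numpy as np

def standard_position(sample_positions):
    """Preset standard position: the mean over the training-sample positions."""
    return np.mean(np.asarray(sample_positions, np.float32), axis=0)

def converged(predicted, standard, threshold=1e-3):
    """True when the Euclidean distance to the preset standard is small."""
    predicted = np.atleast_1d(np.asarray(predicted, np.float32))
    standard = np.atleast_1d(np.asarray(standard, np.float32))
    return float(np.linalg.norm(predicted - standard)) <= threshold
```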
  • the method for determining whether the corresponding face feature vector in the feature vector convolution result satisfies the face convergence condition can be set by a person skilled in the art according to the actual situation, which is not limited by the embodiments of the present application.
  • if all the convergence conditions are satisfied, training of the second convolutional network model is complete; otherwise, as long as any convergence condition is not satisfied (for example, the location information and/or confidence information of the business object does not satisfy the convergence condition of the business object, and/or the face feature vector does not satisfy the face convergence condition), the network parameters of the second convolutional network model are adjusted, and the second convolutional network model is iteratively trained according to the adjusted network parameters until the location information and/or confidence information of the business object and the face feature vector after iterative training all satisfy their corresponding convergence conditions.
  • through the above training, the second convolutional network model can extract features from, and classify, business-object presentation positions based on the presented face, and thus acquires the function of determining the presentation position of a business object in a video image.
  • where there are multiple presentation positions, the second convolutional network model may also determine the ranking of the presentation effects among them, thereby determining the final presentation position. In subsequent applications, when a business object needs to be displayed, a valid presentation position can be determined based on the current image in the video.
  • the sample images may be pre-processed before training, which may include, for example: acquiring a plurality of sample images, each containing annotation information for the business object; determining the position of the business object according to the annotation information, and judging whether the distance between the determined position of the business object and the preset position is less than or equal to a set threshold; and determining the sample images whose business objects fall within the set threshold as the sample images to be trained.
  • the preset position and the set threshold may be appropriately set by any suitable means by a person skilled in the art, for example, according to the statistical analysis result of the data or the related distance calculation formula or the artificial experience, etc., which is not limited by the embodiment of the present application.
  • the sample image that does not meet the conditions can be filtered out to improve the accuracy of the training result.
  • the training of the second convolutional network model is implemented by the above process, and the trained second convolutional network model can be used to determine the presentation position of the business object in the video image.
  • for example, the second convolutional network model may indicate the optimal position for displaying the business object, such as the anchor's forehead, and control the live application to display the business object at that position; or, during a live broadcast, if the anchor clicks the business object to request that it be displayed, the second convolutional network model can determine the presentation position of the business object directly from the live video image.
  • the presentation position of the business object to be presented may be determined according to the set rules.
  • the presentation position of the business object to be presented may include, for example, at least one of, or any combination of, the following: the hair region of the character in the video image, the forehead region, the cheek region, the chin region, a body region other than the head, and a preset region in the video image.
  • the preset area may be appropriately set according to an actual situation, for example, an area within a setting range centering on a face area, or an area within a setting range other than a face area, or a background area, or the like.
  • the embodiment of the present application does not limit this.
  • the presentation location of the business object to be presented in the video image may be further determined.
  • for example, the center point of the presentation area corresponding to the presentation position may be used as the center point at which the business object is displayed; or a certain coordinate position within the presentation area corresponding to the presentation position may be determined as the center point of the presentation position, and so on. The embodiments of the present application do not limit this.
  • in manner two, the presentation position of the business object to be displayed may be determined not only according to the face feature points but also according to the type of the business object to be presented.
  • the type of the business object includes at least one of, or any combination of, the following: a forehead patch type, a cheek patch type, a chin patch type, a virtual hat type, a virtual clothing type, a virtual makeup type, a virtual headwear type, a virtual hair ornament type, a virtual jewelry type, a background type, a virtual pet type, and a virtual container type.
  • the type of the business object may be other suitable types, such as a virtual cap type, a virtual cup type, a text type, and the like.
  • an appropriate presentation position can be selected for the business object with reference to the face feature point.
  • at least one presentation position may be selected from the plurality of presentation positions as the presentation position of the business object to be presented in the video image. For example, a text-type business object can be displayed in the background area, or on the person's forehead or body area.
  • in addition, the correspondence between facial motions and presentation positions may be stored in advance; when the detected facial motion matches a corresponding predetermined facial motion, the target presentation position corresponding to that predetermined facial motion may be obtained from the pre-stored correspondence between facial motions and presentation positions, and used as the presentation position of the business object to be presented in the video image.
  • it should be noted that the facial motion is not necessarily related to the presentation position; the facial motion may merely be a way to trigger the presentation of the business object, with the presentation position independent of the face. The business object can be displayed in a certain area of the face, or in an area other than the face, such as the background area of the video image.
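  • A pre-stored correspondence of this kind reduces to a simple lookup table; the sketch below uses assumed action-to-region pairs for illustration, with a face-independent fallback since the action may be only a trigger.

```python
# Assumed facial-action -> presentation-position correspondence; the patent
# pre-stores such a mapping but does not fix its contents.
ACTION_TO_POSITION = {
    "pout":       "mouth_region",
    "blink":      "eye_region",
    "nod":        "forehead_region",
    "open_mouth": "mouth_region",
}

def presentation_position(matched_action):
    """Look up the target presentation position for a matched facial action,
    falling back to a face-independent region (the action may be a trigger
    only, unrelated to where the business object is drawn)."""
    return ACTION_TO_POSITION.get(matched_action, "background_region")
```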
  • Step S350: the business object to be presented is drawn at the presentation position by means of computer graphics.
  • step S350 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a business object rendering module 503 being executed by a processor.
  • when the business object is a sticker containing semantic information, such as an advertisement sticker, the related information of the business object, such as its identifier and size, can be obtained in advance. After the presentation position is determined, the business object can be scaled, rotated, and so on according to the coordinates of the presentation position, and then drawn through a corresponding drawing method, such as drawing with the OpenGL graphics rendering engine.
  • advertisements can also be displayed as 3D special effects, for example advertisement text or logos (LOGOs) displayed through particle effects.
  • for example, the advertising effect of a certain product can be displayed by dynamically decreasing the liquid in a cup. The advertising effect may include a plurality of display images of different states (for example, multiple frames in which the amount of liquid in the cup gradually decreases); the video frames composed of these images are drawn in sequence at the presentation position by a computer drawing method, such as drawing with the OpenGL graphics rendering engine, thereby displaying the dynamic effect of the liquid in the cup gradually decreasing.
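  • Such a multi-image dynamic effect amounts to advancing through the state images as frames are rendered; the sketch below reuses the hypothetical draw_business_object helper from the earlier alpha-blending sketch.

```python
# A sketch of the multi-image dynamic effect: each rendered video frame
# advances through the effect's state images (e.g. ever-lower liquid levels).
def render_effect_on_frame(frame, effect_states, frame_index, top_left):
    state_image = effect_states[frame_index % len(effect_states)]
    return draw_business_object(frame, state_image, top_left)
```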
  • the dynamic display of the advertising effect can attract viewers, make the advertisement and its display more engaging, and improve the efficiency of advertisement display.
  • the method for processing a video image provided by the embodiment of the present application triggers the presentation of a business object (such as an advertisement) by a facial action.
  • the business object to be presented is drawn at the presentation position by using a computer drawing method, and the business object is played with the video.
  • video data of a business object such as an advertisement that is not related to the video through the network, thereby saving network resources and/or system resources of the client;
  • the business object is closely combined with the facial motion in the video image, It is beneficial to preserve the main image and motion of the video subject (such as the anchor) in the video image, and can add interest to the video image, and does not disturb the user to watch the video normally, which is beneficial to reducing the user's business object displayed in the video image. Resentment can attract the attention of the audience to a certain extent and improve the influence of business objects.
  • FIG. 4 is a flow chart of still another embodiment of a method of processing a video image.
• in this embodiment, the video image processing solution of the present application is described by taking a two-dimensional sticker special effect containing advertisement information as an example.
  • the processing method of the video image in this embodiment includes:
• In step S401, a plurality of sample images containing face information are acquired as training samples, wherein the sample images are annotated with face action state information.
  • step S401 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a training sample acquisition module 504 being executed by the processor.
• In step S402, the first convolutional network model is trained using the training samples, obtaining a first convolutional network model for detecting the facial action state in an image.
  • step S402 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a first convolutional network model determination module 505 that is executed by the processor.
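The present application does not fix an architecture or training procedure for the first convolutional network model; purely as a hedged sketch, a PyTorch-style classification loop over the annotated face action states might look as follows, where the model, data loader, optimizer choice, and label set are all assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary of annotated face action states (step S401).
ACTION_STATES = ["neutral", "blink", "open_mouth", "pout", "nod"]

def train_first_model(model, loader, epochs=10, lr=1e-3):
    """Sketch of step S402: fit a CNN that predicts the face action state
    of a sample image."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, state_labels in loader:  # labels index into ACTION_STATES
            logits = model(images)           # shape: (batch, len(ACTION_STATES))
            loss = criterion(logits, state_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```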
• In step S403, feature vectors of the sample images of the above training samples are acquired.
  • the feature vector includes position information and/or confidence information of the business object in the sample image of the training sample, and a face feature vector corresponding to the face action state in the sample image.
  • the face action state in each sample image may be determined when training the first convolutional network model.
• there may be sample images among the training samples that do not meet the training standard of the second convolutional network model; this part of the sample images needs to be filtered out by preprocessing.
  • each sample image includes a business object, and each business object is marked with location information and confidence information.
  • the location information of the central point of the business object is used as the location information of the business object.
• the sample images are filtered according to the location information of the business object: after the coordinates of the location indicated by the location information are obtained, they are compared with the preset location coordinates for a business object of that type, and the position variance of the two is calculated. If the position variance is less than or equal to a set threshold, the sample image may be used as a sample image to be trained; if the position variance is greater than the set threshold, the sample image is filtered out.
  • the preset position coordinates and the set thresholds may be appropriately set by a person skilled in the art according to actual conditions.
• the images used for training the second convolutional network model generally have the same size, so the set threshold may be 1/20 to 1/5 of the length or width of the image; optionally, it may be 1/10 of the length or width of the image.
• the locations and confidences of the business objects in the retained sample images may be averaged to obtain an average position and an average confidence, which may later serve as a basis for determining the convergence condition; see the sketch below.
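Combining the filtering rule and the averaging step above, one possible sketch is shown below; the Euclidean distance test, the data layout, and the default 1/10 threshold ratio are assumptions standing in for the position-variance comparison described in the text.

```python
import numpy as np

def filter_and_average(samples, image_size, preset_xy, ratio=0.1):
    """Keep samples whose labelled business-object position lies within
    `ratio` * (larger image dimension) of the preset position for that object
    type, then average the retained positions and confidences; the means can
    later serve as a reference for the convergence condition."""
    threshold = ratio * max(image_size)  # e.g. 1/10 of image length or width
    kept = [s for s in samples
            if np.linalg.norm(np.asarray(s["xy"]) - np.asarray(preset_xy)) <= threshold]
    if not kept:
        return kept, None, None
    mean_xy = np.mean([s["xy"] for s in kept], axis=0)
    mean_conf = float(np.mean([s["confidence"] for s in kept]))
    return kept, mean_xy, mean_conf
```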
  • the sample image used for training in this embodiment needs to be labeled with the coordinates of the optimal advertisement position and the confidence of the advertisement position.
  • the size of the confidence indicates the probability that this ad slot is the best ad slot. For example, if this ad slot is mostly occluded, the confidence is low.
• the optimal advertisement position can be labeled on the face, in the foreground and background, and so on, so that joint training over advertisement points on facial feature points, foreground, background, and the like can be realized; compared with techniques that train separately on facial motion and the like, this solution helps save computing resources.
• step S403 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a feature vector acquisition module 506 that is executed by the processor.
• In step S404, the feature vector is subjected to convolution processing to obtain a feature vector convolution result.
  • step S404 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a convolution module 507 executed by the processor.
• In step S405, it is determined whether the location information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies the business object convergence condition, and whether the corresponding face feature vector in the feature vector convolution result satisfies the face convergence condition.
  • step S405 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a convergence condition determination module 508 that is executed by the processor.
• In step S406, if the convergence conditions in step S405 are satisfied, that is, the location information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies the business object convergence condition, and the corresponding face feature vector in the feature vector convolution result satisfies the face convergence condition, the training of the second convolutional network model is completed; otherwise, the network parameters of the second convolutional network model are adjusted, and the second convolutional network model is iteratively trained according to the adjusted network parameters until the location information and/or confidence information of the business object and the face feature vector after iterative training satisfy the corresponding convergence conditions.
• the two conditions may also be handled separately: if the location information and/or confidence information of the corresponding business object in the feature vector convolution result does not satisfy the business object convergence condition, the network parameters of the second convolutional network model are adjusted according to that location information and/or confidence information, and the model is iteratively trained with the adjusted parameters until the location information and/or confidence information of the business object after iterative training satisfies the business object convergence condition; if the corresponding face feature vector in the feature vector convolution result does not satisfy the face convergence condition, the network parameters are adjusted according to that face feature vector, and the model is iteratively trained with the adjusted parameters until the face feature vector after iterative training satisfies the face convergence condition. A training-loop sketch follows.
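As a hedged sketch of steps S404 to S406 only, the following PyTorch-style loop trains a model with one head for the business-object position/confidence and another for the face feature vector, stopping when both convergence conditions hold; the model interface, losses, and thresholds are assumptions, not the method fixed by the present application.

```python
import torch
import torch.nn.functional as F

def train_second_model(model, loader, optimizer,
                       obj_eps=1e-3, face_eps=1e-3, max_epochs=100):
    """Iteratively adjust the network parameters until both the business
    object convergence condition and the face convergence condition hold."""
    for _ in range(max_epochs):
        obj_loss_sum, face_loss_sum, batches = 0.0, 0.0, 0
        for images, pos_label, conf_label, face_label in loader:
            pos_pred, conf_pred, face_pred = model(images)
            obj_loss = (F.mse_loss(pos_pred, pos_label)
                        + F.mse_loss(conf_pred, conf_label))
            face_loss = F.mse_loss(face_pred, face_label)
            optimizer.zero_grad()
            (obj_loss + face_loss).backward()
            optimizer.step()  # adjust the network parameters, then iterate
            obj_loss_sum += obj_loss.item()
            face_loss_sum += face_loss.item()
            batches += 1
        if (obj_loss_sum / batches < obj_eps
                and face_loss_sum / batches < face_eps):
            break  # both convergence conditions are satisfied
```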
  • step S406 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a model training module 509 executed by the processor.
• through the above process, the trained second convolutional network model can be obtained.
  • the first convolutional network model and the second convolutional network model obtained by the above training may perform corresponding processing on the video image, and may specifically include the following steps S407 to S411.
• In step S407, the currently played video image containing face information is acquired.
• In step S408, face key points are extracted from the video image, and the first convolutional network model for detecting the facial action state in an image is used to determine the facial action state of the face in the video image according to the face key points; the facial motion of the face in the video image is then determined according to that state.
  • steps S407-S408 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a video image detection module 501 that is executed by the processor.
• In step S409, when it is determined that the detected facial motion matches the corresponding predetermined facial motion, facial feature points in the face region corresponding to the detected facial motion are extracted.
• In step S410, according to the facial feature points, the pre-trained second convolutional network model for determining the presentation position of a business object in a video image is used to determine the presentation position of the business object to be presented in the video image.
  • steps S409-S410 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a presentation location determining module 502 that is executed by the processor.
• In step S411, the business object to be presented is drawn at the presentation position in a computer drawing manner.
• step S411 may be performed by the processor invoking a corresponding instruction stored in the memory, or may be performed by the business object drawing module 503 run by the processor. A schematic pipeline sketch of steps S407 to S411 follows.
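Putting steps S407 to S411 together, a schematic Python sketch is shown below; the model interfaces, the keypoint detector, the trigger set, and the reuse of the `draw_sticker` routine sketched earlier are all assumptions for illustration.

```python
PREDETERMINED_ACTIONS = {"open_mouth", "blink", "pout"}  # assumed trigger set

def process_frame(frame, action_model, position_model, sticker,
                  detect_face_keypoints):
    """Schematic pipeline for steps S407 to S411 under assumed interfaces."""
    keypoints = detect_face_keypoints(frame)           # S408: face key points
    if keypoints is None:
        return frame                                   # no face information
    action = action_model.classify(frame, keypoints)   # first model: action
    if action not in PREDETERMINED_ACTIONS:
        return frame                                   # S409: no match, skip
    position = position_model.predict(keypoints)       # S410: second model
    return draw_sticker(frame, sticker, position)      # S411: computer drawing
```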
• with this embodiment, video images during video playback can be detected in real time, and an advertisement placement position with a better effect can be given without affecting the user's viewing experience, so the delivery effect is better; by combining the business object with video playback, there is no need to transmit over the network additional advertisement video data unrelated to the video, which helps save network resources and/or client system resources;
• because the business object to be presented is drawn at the presentation position by computer drawing, the business object is closely combined with the facial motion in the video image, which preserves the main image and motion of the video subject (such as the anchor), adds interest to the video image without disturbing normal viewing, helps reduce the user's aversion to business objects displayed in the video image, and can to some extent attract the audience's attention and increase the influence of the business object.
• business objects can also be widely applied to other fields, such as entertainment, appreciation, education, consulting, and services.
• any video image processing method provided by the embodiments of the present application may be performed by any suitable device having data processing capabilities, including but not limited to a terminal device, a server, and the like.
• any video image processing method provided by the embodiments of the present application may be performed by a processor; for example, the processor may perform any video image processing method mentioned in the embodiments of the present application by executing corresponding instructions stored in a memory. This will not be repeated below.
• the foregoing program may be stored in a computer readable storage medium; when executed, the program performs the steps of the foregoing method embodiments. The foregoing storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
  • FIG. 5 is a schematic structural diagram of an embodiment of a processing apparatus for video images of the present application.
• the video image processing apparatus of the embodiments of the present application can be used to implement each of the foregoing video image processing method embodiments of the present application.
• the video image processing apparatus of this embodiment includes: a video image detecting module 501, a presentation position determining module 502, and a business object drawing module 503, where:
  • the video image detecting module 501 is configured to perform facial motion detection of the face on the currently played video image including the face information.
  • the presentation location determining module 502 is configured to determine a presentation location of the business object to be presented in the video image when it is determined that the detected facial motion matches the corresponding predetermined facial motion.
  • the business object drawing module 503 is configured to draw a business object to be presented by using a computer drawing manner at the presentation position.
• the video image processing apparatus performs facial motion detection on the currently played video image containing face information and matches the detected facial motion against the corresponding predetermined facial motion; when the two match, it determines the presentation position of the business object to be presented in the video image and draws the business object at that position in a computer drawing manner. When the business object is used to display an advertisement, on the one hand, because the business object is drawn at the presentation position by computer drawing and is thereby combined with video playback, no additional advertisement video data unrelated to the video needs to be transmitted over the network, which helps save network resources and/or client system resources; on the other hand, the business object is closely combined with the facial motion in the video image, which helps preserve the main image and motion of the video subject (such as the anchor), adds interest to the video image without disturbing normal viewing, helps reduce the user's aversion to business objects displayed in the video image, and can to some extent attract the audience's attention and increase the influence of the business object.
• in an optional example, the video image detecting module 501 is configured to extract face key points from the currently played video image containing face information, use the pre-trained first convolutional network model for detecting the facial action state in an image to determine the facial action state of the face in the video image according to the face key points, and determine the facial motion of the face in the video image according to that state.
  • FIG. 6 is a schematic structural diagram of another embodiment of a processing apparatus for video images of the present application.
  • the processing apparatus of the video image further includes:
• the training sample obtaining module 504 is configured to acquire a plurality of sample images containing face information as training samples, wherein the sample images are annotated with face action state information.
• the first convolutional network model determining module 505 is configured to train the first convolutional network model using the training samples, obtaining a first convolutional network model for detecting the facial action state in an image.
• the training sample obtaining module 504 includes: a sample image acquiring unit, configured to acquire a plurality of sample images containing face information; a face positioning information determining unit, configured to detect a face and face key points in each sample image, and to position the face in the sample image using the face key points to obtain face positioning information; and a training sample determining unit, configured to use the sample images containing face positioning information as training samples.
• the presentation location determining module 502 includes: a feature point extraction unit, configured to extract facial feature points in the face region corresponding to the detected facial motion; and a presentation location determining unit, configured to determine, according to the facial feature points, the presentation position of the business object to be presented in the video image.
• the presentation location determining module 502 is configured to use, according to the facial feature points, the pre-trained second convolutional network model for determining the presentation position of a business object in a video image, to determine the presentation position of the business object to be presented in the video image.
  • the video image processing apparatus of still another embodiment further includes:
• the feature vector obtaining module 506 is configured to acquire feature vectors of the sample images of the training samples, where a feature vector includes: location information and/or confidence information of the business object in the sample image, and a face feature vector corresponding to the facial feature points in the face region corresponding to the facial motion in the sample image;
  • the convolution module 507 is configured to perform convolution processing on the feature vector to obtain a feature vector convolution result
• the convergence condition determining module 508 is configured to determine whether the location information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies the business object convergence condition, and to determine whether the corresponding face feature vector in the feature vector convolution result satisfies the face convergence condition;
• the model training module 509 is configured to: when the above convergence conditions are satisfied, that is, the location information and/or confidence information of the business object satisfies the business object convergence condition and the face feature vector satisfies the face convergence condition, complete the training of the second convolutional network model; otherwise, when the above convergence conditions are not satisfied, adjust the network parameters of the second convolutional network model and iteratively train the second convolutional network model according to the adjusted network parameters until the location information and/or confidence information of the business object and the face feature vector after iterative training satisfy the corresponding convergence conditions.
  • the presentation location determining module 502 is configured to determine, according to the facial feature point and the type of the business object to be presented, a presentation location of the business object to be presented in the video image.
  • the presentation location determining module 502 includes: a presentation location obtaining unit, configured to acquire, according to the facial feature point and the type of the business object to be presented, a plurality of presentation locations of the business object to be presented in the video image; And a presentation location selection unit, configured to select at least one presentation location from the plurality of presentation locations as a presentation location of the business object to be presented in the video image.
  • the presentation location determining module 502 is configured to obtain, from a correspondence between the pre-stored facial motion and the presentation location, a target presentation location corresponding to the predetermined facial motion as a presentation of the business object to be presented in the video image. position.
• the business object includes: a special effect containing semantic information; the video image is a live video image.
• the foregoing special effect containing semantic information includes at least one of the following forms of special effect containing advertisement information: a two-dimensional sticker special effect, a three-dimensional special effect, a particle special effect, and the like.
• the display location comprises at least one of, or any combination of, the following: a hair area, a forehead area, a cheek area, a chin area, or a body area other than the head of a person in the video image; a background area in the video image; an area within a set range centered on the area where a hand is located in the video image; a predetermined area in the video image; and the like.
• the type of the business object includes at least one of, or any combination of, the following types: a forehead patch type, a cheek patch type, a chin patch type, a virtual hat type, a virtual clothing type, a virtual makeup type, a virtual headwear type, a virtual hair accessory type, a virtual jewelry type, a background type, a virtual pet type, a virtual container type, and the like.
• the facial motion of the face includes at least one of, or any combination of, the following: blinking, kissing, opening the mouth, shaking the head, nodding, laughing, crying, frowning, closing the left eye, closing the right eye, closing both eyes, moving the eyeballs to the left, moving the eyeballs to the right, turning the head to the left, turning the head to the right, raising the head, lowering the head, and pouting.
• the electronic device can include a processor 902, a communication interface 904, a memory 906, and a communication bus 908, where:
  • Processor 902, communication interface 904, and memory 906 complete communication with one another via communication bus 908.
  • the communication interface 904 is configured to communicate with network elements of other devices, such as other clients or servers.
• the processor 902 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), one or more integrated circuits configured to implement the embodiments of the present application, or a graphics processing unit (GPU).
• the one or more processors included in the terminal device may be processors of the same type, such as one or more CPUs or one or more GPUs; or they may be processors of different types, such as one or more CPUs and one or more GPUs.
• the memory 906 is configured to store at least one executable instruction that causes the processor 902 to perform operations corresponding to the method of presenting a business object in a video image according to any of the above embodiments of the present application.
  • the memory 906 may include a high speed random access memory (RAM), and may also include a non-volatile memory such as at least one disk memory.
• the embodiment of the present application further provides another electronic device, including: a processor and the video image processing apparatus according to any of the foregoing embodiments; when the processor runs the video image processing apparatus, the units in the video image processing apparatus described in any of the above embodiments are operated.
• FIG. 8 is a schematic structural diagram of another embodiment of an electronic device according to the present application.
• the electronic device includes one or more processors and a communication unit, for example, one or more central processing units (CPUs) 801 and/or one or more graphics processors (GPUs) 813; the processor may execute various appropriate actions and processes according to executable instructions stored in a read only memory (ROM) 802 or executable instructions loaded from a storage portion 808 into a random access memory (RAM) 803.
  • Communication portion 812 can include, but is not limited to, a network card, which can include, but is not limited to, an IB (Infiniband) network card, and the processor can communicate with read only memory 802 and/or random access memory 803 to execute executable instructions over bus 804.
• the processor is connected to the communication unit 812 through the bus 804 and communicates with other target devices through the communication unit 812, thereby completing operations corresponding to any video image processing method provided by the embodiments of the present application, for example: performing facial motion detection on the currently played video image containing face information; determining the presentation position of the business object to be presented in the video image when the detected facial motion matches the corresponding predetermined facial motion; and drawing the business object to be presented at the presentation position in a computer drawing manner.
• the RAM 803 can store various programs and data required for the operation of the device.
  • the CPU 801, the ROM 802, and the RAM 803 are connected to each other through a bus 804.
  • ROM 802 is an optional module.
• the RAM 803 stores executable instructions, or executable instructions are written into the ROM 802 at runtime; the executable instructions cause the processor 801 to perform operations corresponding to the video image processing method described above.
  • An input/output (I/O) interface 805 is also coupled to bus 804.
• the communication unit 812 may be provided integrally, or may be provided with a plurality of sub-modules (for example, a plurality of IB network cards) linked on the bus.
• the following components are connected to the I/O interface 805: an input portion 806 including a keyboard, a mouse, and the like; an output portion 807 including, for example, a cathode ray tube (CRT), a liquid crystal display (LCD), and the like; a storage portion 808 including a hard disk and the like; and a communication portion 809 including a network interface card such as a LAN card or a modem. The communication portion 809 performs communication processing via a network such as the Internet.
• a drive 810 is also connected to the I/O interface 805 as needed.
• a removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read therefrom is installed into the storage portion 808 as needed.
• it should be noted that the architecture shown in FIG. 8 is only an optional implementation.
• in practice, the number and types of the components in FIG. 8 may be selected, deleted, added, or replaced according to actual needs; different functional components may be implemented separately or in an integrated manner, for example, the GPU and the CPU may be provided separately, or the GPU may be integrated on the CPU; the communication portion may be provided separately, or may be integrated on the CPU or the GPU; and so on. These alternative embodiments all fall within the scope of the present disclosure.
• an embodiment of the present disclosure includes a computer program product comprising a computer program tangibly embodied on a machine readable medium; the computer program comprises program code for executing the method illustrated in the flowchart, the program code including instructions corresponding to the method steps provided by the embodiments of the present application, for example: performing facial motion detection on the currently played video image containing face information; determining the presentation position of the business object to be presented in the video image when the detected facial motion matches the corresponding predetermined facial motion; and drawing the business object to be presented at the presentation position in a computer drawing manner.
• the embodiment of the present application further provides a computer program, including computer readable code; when the computer readable code runs on a device, a processor in the device executes instructions for implementing the steps of the video image processing method of any embodiment of the present application.
• the embodiment of the present application further provides a computer readable storage medium for storing computer readable instructions that, when executed, implement the operations of the steps in the video image processing method of any embodiment of the present application.
• the above method according to the present application can be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium (such as a CD ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk), or as computer code originally stored in a remote recording medium or a non-transitory machine readable medium and downloaded through a network to be stored in a local recording medium, so that the methods described herein can be processed by software stored on a recording medium using a general purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or an FPGA.
• as can be appreciated, a computer, processor, microprocessor controller, or programmable hardware includes a storage component (for example, RAM, ROM, or flash memory) that can store or receive software or computer code; the processing methods described herein are implemented when the software or computer code is accessed and executed by the computer, processor, or hardware. Moreover, when a general purpose computer accesses code for implementing the processing shown herein, execution of the code converts the general purpose computer into a special purpose computer for performing the processing shown herein.

Abstract

Provided in the embodiments of the present application are a video image processing method, apparatus and terminal device. The method comprises: performing human face facial movement detection on a video image currently being played back which contains human face information; determining a presentation position of a service object to be presented in the video image when a detected facial movement matches a corresponding predetermined facial movement; and drawing the service object to be presented in the presentation position using a computer drawing mode. Using the embodiments of the present application may save network resources and/or system resources of a client, make a video image more interesting, and avoid bothering a user when normally watching a video, thereby reducing the user's feelings of opposition to a service object presented in the video image, while attracting the attention of an audience to a certain extent, and increasing the impact of the service object.

Description

Video image processing method, device and electronic device
This application claims priority to Chinese patent application No. CN201610697502.3, entitled "Video image processing method, device and terminal device" and filed with the Chinese Patent Office on August 19, 2016, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to artificial intelligence technology, and in particular to a video image processing method, apparatus, and electronic device.
Background
With the development of Internet technology, people increasingly use the Internet to watch video, and Internet video offers business opportunities for many new services. Because it can serve as an important entry point for business traffic, Internet video is considered a premium resource for advertisement placement.
Existing video advertisements are mainly placed by implantation: an advertisement of fixed duration is inserted at a certain time during video playback, or an advertisement is placed at a fixed position in the playback area or its surroundings.
Summary of the invention
The embodiments of the present application provide a solution for processing a video image.
According to one aspect of the embodiments of the present application, a video image processing method is provided, including: performing facial motion detection on a currently played video image containing face information; when the detected facial motion matches a corresponding predetermined facial motion, determining a presentation position of a business object to be presented in the video image; and drawing the business object to be presented at the presentation position in a computer drawing manner.
According to another aspect of the embodiments of the present application, a video image processing apparatus is provided, including: a video image detection module, configured to perform facial motion detection on a currently played video image containing face information; a presentation position determination module, configured to determine a presentation position of a business object to be presented in the video image when the facial motion detected by the video image detection module matches a corresponding predetermined facial motion; and a business object drawing module, configured to draw the business object to be presented at the presentation position in a computer drawing manner.
According to still another aspect of the embodiments of the present application, an electronic device is provided, including: a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with one another through the communication bus; the memory is configured to store at least one executable instruction that causes the processor to perform operations corresponding to the video image processing method of any of the above embodiments of the present application.
According to yet another aspect of the embodiments of the present application, another electronic device is provided, including: a processor and the video image processing apparatus of any of the above embodiments of the present application; when the processor runs the video image processing apparatus, the units in the video image processing apparatus of any of the above embodiments of the present application are operated.
According to a further aspect of the embodiments of the present application, a computer program is provided, including computer readable code; when the computer readable code runs on a device, a processor in the device executes instructions for implementing the steps of the video image processing method of any of the above embodiments of the present application.
According to a still further aspect of the embodiments of the present application, a computer readable storage medium is provided for storing computer readable instructions that, when executed, implement the operations of the steps of the video image processing method of any of the above embodiments of the present application.
According to the video image processing method, apparatus, and electronic device provided by the embodiments of the present application, facial motion detection is performed on the currently played video image containing face information, and the detected facial motion is matched against the corresponding predetermined facial motion; when the two match, the presentation position of the business object to be presented in the video image is determined, and the business object is drawn at that position in a computer drawing manner. When the business object is used to display an advertisement, on the one hand, because the business object is drawn at the presentation position by computer drawing and is thereby combined with video playback, no additional advertisement video data unrelated to the video needs to be transmitted over the network, which helps save network resources and/or client system resources; on the other hand, the business object is closely combined with the facial motion in the video image, which helps preserve the main image and motion of the video subject (such as the anchor), adds interest to the video image without disturbing normal viewing, helps reduce the user's aversion to business objects displayed in the video image, and can to some extent attract the audience's attention and increase the influence of the business object.
The technical solutions of the present application are further described in detail below with reference to the accompanying drawings and embodiments.
Brief description of the drawings
The accompanying drawings, which constitute a part of the specification, describe embodiments of the present application and, together with the description, serve to explain the principles of the present application.
With reference to the accompanying drawings, the present application can be understood more clearly from the following detailed description, in which:
FIG. 1 is a flowchart of an embodiment of the video image processing method of the present application;
FIG. 2 is a flowchart of an embodiment of a method for obtaining the first convolutional network model in an embodiment of the present application;
FIG. 3 is a flowchart of another embodiment of the video image processing method of the present application;
FIG. 4 is a flowchart of still another embodiment of the video image processing method of the present application;
FIG. 5 is a schematic structural diagram of an embodiment of the video image processing apparatus of the present application;
FIG. 6 is a schematic structural diagram of another embodiment of the video image processing apparatus of the present application;
FIG. 7 is a schematic structural diagram of an embodiment of an electronic device of the present application;
FIG. 8 is a schematic structural diagram of another embodiment of an electronic device of the present application.
Detailed description
Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present application.
Meanwhile, it should be understood that, for ease of description, the dimensions of the parts shown in the drawings are not drawn to actual scale.
The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the present application or its application or use.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and devices should be considered part of the specification.
It should be noted that similar reference numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it need not be further discussed in subsequent figures.
Those skilled in the art can understand that terms such as "first" and "second" in the embodiments of the present application are used only to distinguish different steps, devices, modules, and the like; they denote neither any specific technical meaning nor a necessary logical order among them.
The embodiments of the present application can be applied to electronic devices such as terminal devices, computer systems, and servers, which can operate with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, small computer systems, mainframe computer systems, distributed cloud computing environments including any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, and servers can be described in the general context of computer system executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, target programs, components, logic, data structures, and the like that perform particular tasks or implement particular abstract data types. The computer system/server can be implemented in a distributed cloud computing environment, where tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media including storage devices.
FIG. 1 is a flowchart of an embodiment of the video image processing method of the present application. The video image processing method of each embodiment of the present application may be executed, for example, by an electronic device such as a computer system, a terminal device, or a server. Referring to FIG. 1, the video image processing method of this embodiment includes:
In step S110, facial motion detection is performed on the currently played video image containing face information.
In the embodiments of the present application, facial motions may include, for example, but are not limited to: blinking, opening the mouth, nodding, pouting, and the like. Face information may include, for example, information related to the face, eyes, mouth, nose, and/or hair. The video image may be an image in a live video being broadcast, or a video image that has been recorded or is being recorded.
In an optional example of the embodiments of the present application, taking live video as an example: there are currently multiple live broadcast platforms, such as the Huajiao live platform and the YY live platform; each live platform includes multiple live broadcast rooms, and each room includes at least one anchor, who can broadcast video through the camera of an electronic device to the fans in that room; the live video includes multiple video images. The subject of such a video image is usually one main person (the anchor) against a simple background, and the anchor often occupies a large area of the image. When a business object (such as an advertisement) needs to be inserted during the live video, the video image in the current live video can be obtained, and face detection can be performed on it through a preset face detection mechanism to determine whether the video image contains the anchor's face information; if it does, the video image is acquired or recorded for subsequent processing; if it does not, the same processing can be performed on the next frame of video image until a video image containing the anchor's face information is obtained.
In addition, the video image may also be a video image in a short video that has already been recorded. In this case, the user can play the short video using an electronic device; during playback, the electronic device can detect whether each frame of the video image contains the anchor's face information: if it does, the video image is acquired for subsequent processing; if it does not, the video image may be discarded or left unprocessed, and the next frame of video image is acquired to continue the above processing.
Likewise, where the video image is a video image being recorded, the user can use an electronic device during recording to detect whether each recorded frame contains the anchor's face information: if it does, the video image is acquired for subsequent processing; if it does not, the video image may be discarded or left unprocessed, and the next frame of video image is acquired to continue the above processing.
The electronic device playing the video image, or the electronic device used by the anchor, is provided with a mechanism for performing facial motion detection on video images. Through this mechanism, facial motion detection can be performed on the currently played video image containing face information to obtain the facial motion of the face detected from the video image. An optional process is as follows: the electronic device acquires the frame of video image currently being played and, through a preset mechanism, crops out the image of the face region from the video image; the image of the face region is then analyzed and features are extracted to obtain feature data of the various parts of the face region (including the eyes, mouth, face, and so on); by analyzing this feature data, the meaning of the facial motion of the face in the video image is determined, that is, which one or more of the following the facial motion belongs to: blinking, closing the left eye, closing the right eye, closing both eyes, moving the eyeballs to the left, moving the eyeballs to the right, turning the head to the left, turning the head to the right, raising the head, lowering the head, nodding, laughing, crying, frowning, opening the mouth, or pouting.
In an optional example, step S110 may be performed by the processor invoking corresponding instructions stored in the memory, or may be performed by the video image detection module 501 run by the processor.
In step S120, when the detected facial motion matches the corresponding predetermined facial motion, the presentation position of the business object to be presented in the video image is determined.
In the embodiments of the present application, a business object is an object created according to a certain business requirement, and may include, for example, but is not limited to, information related to advertising, entertainment, weather forecasts, traffic forecasts, pets, and the like. The presentation position may be the center position of a specified area in the video image, or the coordinates of multiple edge positions of the specified area, and so on.
In an optional example of the embodiments of the present application, feature data of multiple different facial motions may be stored in advance, and the different facial motions may be labeled accordingly to distinguish the meaning each represents. Through the processing of step S110, the facial motion of the face can be detected from the video image, and the feature data of the detected facial motion can be compared with the pre-stored feature data of each facial motion; if the pre-stored facial motions include one whose feature data matches that of the detected facial motion, it can be determined that the detected facial motion matches the corresponding predetermined facial motion.
To improve matching accuracy, the above matching result may be determined by calculation. For example, a matching algorithm may be configured to compute the degree of match between the feature data of any two facial motions: using the configured matching algorithm, the feature data of the detected facial motion is matched against the pre-stored feature data of each facial motion, yielding a matching degree value for each pair. The largest matching degree value is selected from the results; if it exceeds a predetermined matching threshold, the pre-stored facial motion corresponding to the largest value is determined to match the detected facial motion. If the largest matching degree value does not exceed the predetermined matching threshold, the match fails, that is, the detected facial motion is not a predetermined facial motion, and the processing of step S110 can continue on subsequent video images.
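As a hedged illustration of this matching-degree computation only: the present application does not specify the metric, so the cosine similarity and the threshold value below are assumptions, and the feature data are assumed to be one-dimensional vectors.

```python
import numpy as np

def match_predetermined_action(detected_vec, stored_vecs, threshold=0.8):
    """Compare the detected facial-motion feature data against each pre-stored
    predetermined facial motion; return the best-matching action name if its
    matching degree reaches the threshold, otherwise None (match failed)."""
    best_name, best_score = None, -1.0
    detected = np.asarray(detected_vec, dtype=np.float64)
    for name, vec in stored_vecs.items():
        v = np.asarray(vec, dtype=np.float64)
        score = float(detected @ v /
                      (np.linalg.norm(detected) * np.linalg.norm(v) + 1e-9))
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```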
可选地,当确定检测到的面部动作与对应的预定面部动作相匹配时,可以先确定匹配到的面部动作所代表的含义,可以在预先设定的多个展现位置中选取与其含义相关或相应的展现位置,作为待展现的业务对象在视频图像中的展现位置。例如,以直播类视频为例,当检测到主播进行嘟嘴的面部动作时,可以将嘴部区域选取为与其相关或相应的展现位置。Optionally, when it is determined that the detected facial motion matches the corresponding predetermined facial motion, the meaning represented by the matched facial motion may be determined first, and the meaning of the matched facial motion may be selected in a plurality of preset display positions or The corresponding presentation position is used as the presentation position of the business object to be presented in the video image. For example, taking a live video as an example, when detecting that the anchor performs a facial motion of the beep, the mouth region may be selected as a related or related presentation position.
在一个可选示例中,步骤S120可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的展现位置确定模块502执行。In an alternative example, step S120 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a presentation location determining module 502 being executed by the processor.
In step S130, the business object to be presented is drawn at the presentation position by means of computer graphics.
It should be noted that, in order to enhance the visual effect of the business object and make the video image more engaging, a dynamic effect may be configured for the business object; for example, the business object may be presented as a video clip, or as a dynamic presentation composed of multiple display images, and so on.
For example, taking live-streaming video as an example, when a mouth-opening facial action by the streamer is detected, a corresponding business object, such as an advertisement image bearing a predetermined product logo, may be drawn in the region of the video image where the streamer's mouth is located. If a viewer is interested in the business object, the viewer may click the region where the business object is located; the viewer's electronic device may then obtain the network link corresponding to the business object, follow that link to a page related to the business object, and obtain resources related to the business object on that page.
In an optional example of the embodiments of the present application, the business object may be drawn by means of computer graphics, which may be implemented through appropriate computer graphics drawing or rendering, including but not limited to drawing based on an Open Graphics Library (OpenGL) graphics drawing engine. OpenGL defines a professional, cross-language, cross-platform graphics programming interface specification; it is hardware-independent and allows convenient drawing of two-dimensional (2D) and three-dimensional (3D) graphics. With an OpenGL graphics drawing engine, not only 2D effects such as 2D stickers can be drawn, but also 3D effects, particle effects, and so on. However, the present application is not limited to drawing based on an OpenGL graphics drawing engine; other approaches, such as drawing based on a game engine (e.g., Unity) or on the Open Computing Language (OpenCL), are equally applicable to the embodiments of the present application.
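For the 2D-sticker case, the following minimal sketch composites a sticker with an alpha channel onto a video frame at a given presentation position. It uses plain NumPy alpha blending rather than an OpenGL engine, purely to show the drawing step; the coordinate convention and the assumption that the sticker lies fully inside the frame are illustrative.

```python
import numpy as np

def draw_sticker(frame, sticker_rgba, top_left):
    """Alpha-blend an RGBA sticker onto a 3-channel video frame.

    frame: HxWx3 uint8 video image.
    sticker_rgba: hxwx4 uint8 sticker with alpha channel.
    top_left: (x, y) of the presentation position in frame coordinates.
    Assumes the sticker region lies fully inside the frame.
    """
    x, y = top_left
    h, w = sticker_rgba.shape[:2]
    roi = frame[y:y + h, x:x + w].astype(np.float32)
    rgb = sticker_rgba[..., :3].astype(np.float32)
    alpha = sticker_rgba[..., 3:4].astype(np.float32) / 255.0
    frame[y:y + h, x:x + w] = (alpha * rgb + (1.0 - alpha) * roi).astype(np.uint8)
    return frame
```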
In an optional example, step S130 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a business object drawing module 503 run by the processor.
In the video image processing method provided by the embodiments of the present application, facial action detection is performed on the currently played video image containing face information, and the detected facial action is matched against the corresponding predetermined facial action. When the two match, the presentation position of the business object to be presented in the video image is determined, and the business object to be presented is drawn at that position by means of computer graphics. Thus, when the business object is used to display an advertisement: on the one hand, because the business object to be presented is drawn at the presentation position by computer graphics and is combined with the video playback, no additional advertisement video data unrelated to the video needs to be transmitted over the network, which helps save network resources and/or system resources of the client; on the other hand, the business object is closely combined with the facial action in the video image, which helps preserve the main image and actions of the video subject (e.g., the streamer), adds interest to the video image without disturbing the user's normal viewing, helps reduce the user's aversion to the business object presented in the video image, can attract the audience's attention to a certain extent, and improves the influence of the business object.
In the first embodiment of the present application described above, the facial action detection on the video image in step S110 may be implemented with a corresponding feature extraction algorithm, or with a neural network model such as a convolutional network model. This embodiment takes a convolutional network model as an example for performing facial action detection on the video image; to this end, a first convolutional network model for detecting face action states in images may be trained in advance. FIG. 2 is a flowchart of one embodiment of a method for obtaining the first convolutional network model in an embodiment of the present application. The method for obtaining the first convolutional network model of this embodiment may be performed by any device having data collection, processing, and transmission functions, including but not limited to a mobile terminal and a personal computer (PC), which is not limited by the embodiments of the present application. To train the first convolutional network model, training samples may be obtained in various ways; the training samples may be multiple sample images containing face information, each annotated with face action state information, where the face action state information is used to determine the facial action of the face. Referring to FIG. 2, the method for obtaining the first convolutional network model of this embodiment includes:
In step S210, multiple sample images containing face information are acquired, where the sample images are annotated with face action state information.
In the embodiments of the present application, the face information may include local attribute information and global attribute information, among others. The local attribute information may include, for example but not limited to: hair color, hair length, eyebrow length, thick or sparse eyebrows, eye size, eyes open or closed, nose bridge height, mouth size, mouth open or closed, whether glasses are worn, whether a mask is worn, and so on. The global attribute information may include, for example but not limited to: ethnicity, gender, age, and so on. A sample image may be an image from a video or one of multiple continuously captured images, or any other image, and may include images containing faces and images not containing faces. The face action state is the current action state of the face, and may include, for example but not limited to, facial actions such as blinking, closing the left eye, closing the right eye, closing both eyes, eyes moving left, eyes moving right, turning the head left, turning the head right, raising the head, lowering the head, nodding, smiling, crying, frowning, opening the mouth, or pouting.
In the embodiments of the present application, the larger an image's resolution, the larger its data volume; when detecting face action states, more computing resources are needed and detection is slower. In view of this, in an optional implementation of the present application, the above sample images may be images satisfying a preset resolution condition. For example, the preset resolution condition may be: the longest side of the image does not exceed 640 pixels, the shortest side does not exceed 480 pixels, and so on.
Sample images may be obtained by an image capture device, where the image capture device used to capture the user's face information may be a dedicated camera or a camera integrated in another device, etc. In practice, however, due to differences in the hardware parameters and settings of image capture devices, the captured images may not satisfy the above preset resolution condition. To obtain sample images satisfying this condition, in an optional implementation of the present application, the captured images may be scaled after capture to obtain sample images that meet the condition.
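A minimal sketch of this scaling step, using the example condition above (longest side at most 640 pixels, shortest side at most 480 pixels); the choice of interpolation method is an assumption, not fixed by the text.

```python
import cv2

def scale_to_resolution_condition(img, max_long=640, max_short=480):
    """Downscale an image so that its longest side <= max_long and its
    shortest side <= max_short, preserving the aspect ratio."""
    h, w = img.shape[:2]
    long_side, short_side = max(h, w), min(h, w)
    scale = min(max_long / long_side, max_short / short_side, 1.0)
    if scale < 1.0:
        # INTER_AREA is an assumed choice, suitable for downscaling.
        img = cv2.resize(img, (int(w * scale), int(h * scale)),
                         interpolation=cv2.INTER_AREA)
    return img
```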
After the sample images are obtained, the face action state, such as smiling, pouting, closing the left eye, closing the right eye, or closing both eyes, may be annotated in each sample image, and each sample image may be stored together with its annotated face action state as training data.
To make the detection of face action states in the sample images more accurate, the face in each sample image may be localized so as to obtain the exact position of the face in the sample image; see the processing of step S220 below for details.
In step S220, for each sample image, the face and face key points in the sample image are detected, and the face in the sample image is localized through the face key points to obtain face localization information.
In the embodiments of the present application, every face has certain feature points, such as the corners of the eyes, the ends of the eyebrows, the corners of the mouth, and the tip of the nose, as well as boundary points of the face, etc. After the feature points of the face are obtained, a mapping or similarity transformation from the face in the sample image to a preset standard face can be computed through the face key points, and the face in the sample image is aligned with the standard face, thereby localizing the face in the sample image and obtaining the localization information of the face in the sample image.
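A minimal sketch of this alignment step, assuming key points are given as (x, y) arrays of matching landmarks; OpenCV's partial affine estimate is used here because it is restricted to a similarity transform (rotation, scale, translation), which is one way to realize the similarity transformation the text describes.

```python
import cv2
import numpy as np

def align_to_standard_face(img, face_keypoints, standard_keypoints,
                           out_size=(128, 128)):
    """Estimate a similarity transform from detected face key points to a
    preset standard face, and warp the image so the face is aligned.

    face_keypoints / standard_keypoints: Nx2 float arrays of corresponding
    landmarks (e.g., eye corners, nose tip, mouth corners); N >= 2.
    """
    src = np.asarray(face_keypoints, dtype=np.float32)
    dst = np.asarray(standard_keypoints, dtype=np.float32)
    # estimateAffinePartial2D fits rotation + uniform scale + translation.
    M, _inliers = cv2.estimateAffinePartial2D(src, dst)
    return cv2.warpAffine(img, M, out_size)
```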
In step S230, the sample images containing the face localization information are used as training samples.
In an optional example, steps S210 to S230 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a training sample acquisition module 504 run by the processor.
In step S240, the first convolutional network model is trained using the training samples, obtaining a first convolutional network model for detecting face action states in images.
In the embodiments of the present application, the front end of the first convolutional network model may include a combination of multiple convolutional layers, pooling layers, and non-linear layers, and its back end may be a loss layer, for example a loss layer based on algorithms such as a softmax cost function and/or a cross-entropy function.
The structure of the first convolutional network model may include:
Input layer: the input layer is used to read in the sample images and the annotated face action state information, etc. The input layer may preprocess the sample images and output face images containing localization information, face information, and the like. The input layer outputs the preprocessed face images to the convolutional layer, and at the same time inputs the preprocessed face information to the loss layer.
Convolutional layer: its input is a preprocessed face image or image features, and it outputs features of the face image through a predetermined linear transformation.
Non-linear layer: the non-linear layer may apply a non-linear transformation, through a non-linear function, to the features input from the convolutional layer, so that the features it outputs have stronger expressive power.
Pooling layer: the pooling layer can map multiple values to a single value. It can therefore strengthen the non-linearity of the learned features and reduce the spatial size of the output features, thereby enhancing the translation invariance of the learned features (i.e., invariance to face translation) while keeping the extracted features unchanged. The output features of the pooling layer can in turn serve as the input data of a convolutional layer or of a fully connected layer.
Here, the convolutional, non-linear, and pooling layers may be repeated one or more times, that is, the combination of convolutional layer, non-linear layer, and pooling layer may be repeated one or more times, where each time the output data of the pooling layer may again serve as input data to a convolutional layer. Repeating the three-layer combination of convolutional, non-linear, and pooling layers multiple times allows the input sample images to be processed better, so that the features of the sample images have the best expressive power.
Fully connected layer: it applies a linear transformation to the input data from the pooling layer, projecting the learned features into a better subspace to facilitate face action state prediction.
Non-linear layer: with the same function as the aforementioned non-linear layer, it applies a non-linear transformation to the input features of the fully connected layer. Its output features may serve as the input data of the loss layer, or again as the input data of a fully connected layer.
Here, the fully connected layer and the non-linear layer may be repeated one or more times.
One or more loss layers: mainly responsible for computing the error between the predicted face action state and the input (annotated) face action state.
The network parameters of the first convolutional network model may be trained through a backward-propagating gradient descent algorithm, so that the input layer need only take an image as input for the model to output the face action state information corresponding to the face in the input image, thereby obtaining the first convolutional network model.
Through the above process, the input layer is responsible for simple processing of the input sample images; the combination of convolutional, non-linear, and pooling layers is responsible for feature extraction from the sample images; the fully connected and non-linear layers map the extracted features to the face information; and the loss layer is responsible for computing the prediction error. Through the multi-layer design of the first convolutional network model described above, the extracted features can have rich expressive power and better predict face action states. In addition, connecting multiple pieces of face information to the loss layer at the same time ensures that multiple tasks are learned simultaneously, sharing the features learned by the convolutional network.
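A minimal PyTorch sketch of the layer arrangement described above: repeated (convolution, non-linearity, pooling) blocks at the front end, fully connected plus non-linear layers, and a softmax/cross-entropy loss layer at training time. The channel counts, kernel sizes, input resolution (96x96 aligned face crops), and number of action-state classes are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class FaceActionStateNet(nn.Module):
    """Front end: repeated (conv -> non-linear -> pooling) blocks; then
    fully connected + non-linear layers mapping features to predictions."""
    def __init__(self, num_states=16):  # number of action states is assumed
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Assumes 96x96 inputs, which the pooling reduces to 12x12 maps.
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(128 * 12 * 12, 256), nn.ReLU(),
            nn.Linear(256, num_states),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# One gradient-descent training step with a softmax/cross-entropy loss layer.
model = FaceActionStateNet()
loss_layer = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
images = torch.randn(4, 3, 96, 96)    # preprocessed, aligned face images
labels = torch.randint(0, 16, (4,))   # annotated face action states
loss = loss_layer(model(images), labels)
loss.backward()
optimizer.step()
```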
In an optional example, step S240 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a first convolutional network model determining module 505 run by the processor.
In this embodiment, the first convolutional network model obtained through training facilitates subsequent facial action detection on the currently played video image containing face information, and matching of the detected facial action against the corresponding predetermined facial action. When the two match, the presentation position of the business object to be presented in the video image is determined, and the business object to be presented is drawn at that position by means of computer graphics. Thus, when the business object is used to display an advertisement: on the one hand, because the business object to be presented is drawn at the presentation position by computer graphics and is combined with the video playback, no additional advertisement video data unrelated to the video needs to be transmitted over the network, which helps save network resources and/or system resources of the client; on the other hand, the business object is closely combined with the facial action in the video image, which helps preserve the main image and actions of the video subject (e.g., the streamer), adds interest to the video image without disturbing the user's normal viewing, helps reduce the user's aversion to the business object presented in the video image, can attract the audience's attention to a certain extent, and improves the influence of the business object.
FIG. 3 is a flowchart of another embodiment of the video image processing method of the present application. In this embodiment, the business object is a special effect containing semantic information. Illustratively, the special effect containing semantic information may include a special effect containing advertisement information in at least one or any combination of the following forms: a two-dimensional sticker effect, a three-dimensional effect, a particle effect, and so on. The video image may be a live-streaming video image, for example a video image of a streamer broadcasting live on the Huajiao live-streaming platform. As shown in FIG. 3, the video image processing method of this embodiment includes:
In step S310, the currently played video image containing face information is acquired.
For the specific processing of step S310, refer to the related content of step S110 in the embodiment shown in FIG. 1 above, which is not repeated here.
In step S320, face key points are extracted from the video image, and the facial action of the face in the video image is determined according to the face key points, using the pre-trained first convolutional network model for detecting face action states in images.
In implementation, the video image may be examined to determine whether it contains a face. If it is determined that the video image contains a face, face key points are extracted from the video image. The acquired video image and face key points may be input into the first convolutional network model, which is trained, for example, through the embodiment shown in FIG. 2 above. Through the network parameters of the first convolutional network model, processing such as feature extraction, mapping, and transformation may be applied to the video image to perform facial action detection on the face in the video image and obtain the face action state in the video image; based on the face action state, the facial action of the face contained in the video image can then be determined.
It should be noted that for a facial action obtained by combining multiple face action states (for example, a blink may be composed of eyes open, eyes closed, and eyes open, or of eyes closed, eyes open, and eyes closed), this type of facial action may be divided into multiple face action states. For example, a blink may be divided into an eyes-open state and an eyes-closed state. The above processing may then specifically be: extracting face key points from the video image; determining the face action states in the video images using the pre-trained first convolutional network model for detecting face action states in images; and determining the facial action of the face in the video images according to the face action states in the video images.
In this embodiment, multiple currently played video images containing face information may be acquired, and the continuity of the multiple video images may be judged to determine whether they are continuous in space and time. If they are judged to be discontinuous, the verification fails or the user is reminded to reacquire video images. When judging video image continuity, for example, each video frame may be divided into 3x3 regions; a color histogram and the mean and variance of the gray levels are computed for each region; the distances between the histograms, between the gray-level means, and between the gray-level variances of two adjacent face images are taken as a feature vector; and a linear classifier is used to judge whether its output is greater than or equal to zero. The parameters of the linear classifier may be trained from sample data carrying annotation information. If the linear classifier output is judged to be greater than or equal to zero, the two adjacent video images are considered continuous in time and space; in this case, the corresponding face action state may be determined based on the face key points extracted from each video image, so as to determine the facial action exhibited across the consecutive video images. If the linear classifier output is judged to be less than zero, the two adjacent video images are considered discontinuous in time and space; in this case, the processing of step S310 above may continue to be performed on subsequent video images, taking the current video image as the starting point.
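A minimal sketch of the continuity features described above; the histogram bin count (16 per channel) and the trained linear-classifier parameters (w, b) are assumptions.

```python
import cv2
import numpy as np

def region_stats(img, bins=16):
    """Per-region color histogram plus gray-level mean/variance over a
    3x3 grid of one frame (img: HxWx3 uint8)."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape
    feats = []
    for i in range(3):
        for j in range(3):
            patch = img[i*h//3:(i+1)*h//3, j*w//3:(j+1)*w//3]
            gpatch = gray[i*h//3:(i+1)*h//3, j*w//3:(j+1)*w//3]
            hist = cv2.calcHist([patch], [0, 1, 2], None, [bins]*3,
                                [0, 256]*3).flatten()
            hist /= hist.sum() + 1e-12
            feats.append((hist, gpatch.mean(), gpatch.var()))
    return feats

def continuity_features(frame_a, frame_b):
    """Feature vector: distances between histograms, gray means, and gray
    variances of corresponding regions in two adjacent frames."""
    vec = []
    for (ha, ma, va), (hb, mb, vb) in zip(region_stats(frame_a),
                                          region_stats(frame_b)):
        vec += [np.linalg.norm(ha - hb), abs(ma - mb), abs(va - vb)]
    return np.array(vec)

def is_continuous(frame_a, frame_b, w, b):
    # w, b: linear-classifier parameters trained on annotated sample data.
    return float(np.dot(w, continuity_features(frame_a, frame_b)) + b) >= 0.0
```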
If the multiple video images are continuous, the first convolutional network model may be used to judge the state of the facial action of the face in a given video frame, based on the face key points extracted from each video image. For example, taking blinking as an example, the probability of the eyes-open state or of the eyes-closed state may be computed to judge the face action state in that video image. To this end, an image patch (containing face information) may be extracted near the center of the key points corresponding to the blink action, and the judgment of the face action state may be obtained through the first convolutional network model. The facial action of the face in the video images may then be determined based on the face action state in each video image.
For cases where the corresponding facial action (such as smiling, opening the mouth, or pouting) can be determined from a single face action state, the facial action of the corresponding face can be determined, through the processing of step S320 above, from a detected video image bearing a face action state such as smiling, opening the mouth, or pouting.
In an optional example, step S320 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a video image detection module 501 run by the processor.
In step S330, when it is determined that the detected facial action matches the corresponding predetermined facial action, face feature points within the face region corresponding to the detected facial action are extracted.
In the embodiments of the present application, for each video image containing face information, the face contains certain feature points, such as the eyes, nose, mouth, and facial contour. Detecting the face in the video image and determining the feature points may be implemented in any appropriate manner from the related art, which is not limited by the embodiments of the present application. Examples include linear feature extraction methods such as principal component analysis (PCA), linear discriminant analysis (LDA), and independent component analysis (ICA), and non-linear feature extraction methods such as kernel principal component analysis (Kernel PCA) and manifold learning. A trained neural network model, such as the convolutional network model in the embodiments of the present application, may also be used to extract face feature points, which is not limited by the embodiments of the present application.
Taking live-streaming video as an example: during a live video broadcast, a face is detected and face feature points are determined from the video images of the live-streaming video; as another example, during playback of an already recorded video, a face is detected and face feature points are determined from the played video images; as yet another example, during recording of a video, a face is detected and face feature points are determined from the recorded video images, and so on.
In step S340, the presentation position of the business object to be presented in the video image is determined according to the face feature points.
In an optional example, steps S330 to S340 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a presentation position determining module 502 run by the processor.
In the embodiments of the present application, after the face feature points are determined, one or more presentation positions of the business object to be presented in the video image may be determined on that basis.
In the embodiments of the present application, when determining the presentation position of the business object to be presented in the video image according to the face feature points, optional implementations include but are not limited to the following: Approach one: using a pre-trained second convolutional network model for determining presentation positions of business objects in video images, determine the presentation position of the business object to be presented in the video image according to the face feature points; Approach two: determine the presentation position of the business object to be presented in the video image according to the face feature points and the type of the business object to be presented.
The two approaches are described separately below.
Approach one
When using approach one to determine the presentation position of the business object to be presented in the video image, a convolutional network model, i.e., the second convolutional network model, may be trained in advance; the trained second convolutional network model has the function of determining presentation positions of business objects in video images. Alternatively, a convolutional network model that has already been trained by a third party and has the function of determining presentation positions of business objects in video images may be used directly.
It should be noted that in the embodiments of the present application, training on business objects is taken as an example for description, but those skilled in the art will understand that the second convolutional network model may also be trained on faces while being trained on business objects, realizing joint training of faces and business objects.
When training the second convolutional network model, an optional training approach includes the following process:
(1) Obtain the feature vectors of the sample images of the training samples.
Here, the feature vectors contain the position information and/or confidence information of the business objects in the sample images of the training samples, as well as the face feature vectors corresponding to the face feature points within the face regions corresponding to the facial actions in the sample images. The confidence information of a business object indicates the probability that, when the business object is presented at the current position, the intended effect (e.g., being noticed, clicked, or viewed) can be achieved; this probability may be set according to statistical analysis of historical data, according to the results of simulation experiments, or according to manual experience. In practical applications, depending on actual needs, only the position information of the business objects may be trained, only the confidence information may be trained, or both the position information and the confidence information may be trained. Training both the position information and the confidence information enables the trained second convolutional network model to determine the position information and confidence information of business objects more effectively and accurately, so as to provide a basis for presenting business objects.
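As an illustration of what one annotated training sample might carry under this scheme, the following sketch is purely hypothetical: the field names are assumptions, and the convention of using the business object's center point as its position follows the optional implementation described later in this document.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class BusinessObjectSample:
    """One annotated training sample for the second convolutional network
    model (field names are illustrative assumptions)."""
    image: np.ndarray           # sample image with a face and a business object
    face_keypoints: np.ndarray  # Nx2 feature points in the action's face region
    object_position: tuple      # (x, y) center point of the business object
    object_confidence: float    # probability of intended effect, in [0, 1]
```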
The second convolutional network model is trained with a large number of sample images. In the embodiments of the present application, the second convolutional network model may be trained using sample images containing business objects. Those skilled in the art will understand that, in addition to business objects, the sample images used for training should also contain face action state information (i.e., the information used to determine the facial action of the face). In addition, the business objects in the sample images in the embodiments of the present application may be pre-annotated with position information, or confidence information, or both position information and confidence information. Of course, in practical applications, this information may also be obtained through other means. By annotating the business objects with the corresponding information in advance, the data and number of interactions in data processing can be effectively saved, improving data processing efficiency.
Sample images having the position information and/or confidence information of business objects, as well as a certain face attribute, are used as training samples, and feature vector extraction is performed on them to obtain feature vectors containing the position information and/or confidence information of the business objects, as well as the face feature vectors corresponding to the face feature points.
Optionally, the second convolutional network model may be used to train on faces and business objects simultaneously; in this case, the feature vectors of the sample images also contain the features of the faces.
The extraction of the feature vectors may be implemented in an appropriate manner from the related art, which is not described again in the embodiments of the present application.
(2) Perform convolution processing on the feature vectors to obtain feature vector convolution results.
In implementation, the obtained feature vector convolution results contain the position information and/or confidence information of the business objects, and the convolution results corresponding to the face feature vectors corresponding to the face action states. In the case of joint training on faces and business objects, the feature vector convolution results also contain face action state information.
The number of convolution operations applied to the feature vectors may be set according to actual needs; that is, in the second convolutional network model, the number of convolutional layers is set according to actual needs, which is not described again here.
The convolution result is the outcome of feature extraction on the feature vectors; this result can effectively characterize the business objects corresponding to the features of the faces in the video images.
In the embodiments of the present application, when the feature vectors contain both the position information and the confidence information of the business objects, that is, when both the position information and the confidence information of the business objects are trained, the feature vector convolution results are shared in the subsequent separate judgments of the convergence conditions, without requiring repeated processing and computation, which helps reduce the resource consumption caused by data processing and helps improve data processing speed and efficiency.
(3) Judge whether the position information and/or confidence information of the corresponding business objects in the feature vector convolution results satisfy the business object convergence condition, and judge whether the corresponding face feature vectors in the feature vector convolution results satisfy the face convergence condition.
The convergence conditions are appropriately set by those skilled in the art according to actual requirements. When the information satisfies the convergence conditions, the network parameters of the second convolutional network model may be considered appropriately set; when the information cannot satisfy the convergence conditions, the network parameters of the second convolutional network model may be considered inappropriately set and in need of adjustment. The adjustment may be an iterative process, continuing until the result of the convolution processing of the feature vectors with the adjusted network parameters satisfies the convergence conditions.
In one optional approach, the convergence conditions may be set according to a preset standard position and/or a preset standard confidence. For example, the convergence condition for the position information of a business object may be that the distance between the position indicated by the position information of the business object in the feature vector convolution result and the preset standard position satisfies a certain threshold; the convergence condition for the confidence information of a business object may be that the difference between the confidence indicated by the confidence information of the business object in the feature vector convolution result and the preset standard confidence satisfies a certain threshold, and so on.
Here, optionally, the preset standard position may be an average position obtained by averaging the positions of the business objects in the sample images to be trained; the preset standard confidence may be an average confidence obtained by averaging the confidences of the business objects in the sample images to be trained. Since the sample images are the samples to be trained and the data volume is large, the standard position and/or standard confidence may be set according to the positions and/or confidences of the business objects in the sample images to be trained, so that the set standard position and standard confidence are more objective and accurate.
When specifically judging whether the position information and/or confidence information of the corresponding business objects in the feature vector convolution results satisfy the convergence conditions, one optional approach includes:
obtaining the position information of the corresponding business object in the feature vector convolution result; computing the Euclidean distance between the position indicated by the position information of the corresponding business object and the preset standard position to obtain a first distance between the two; and judging, according to the first distance, whether the position information of the corresponding business object satisfies the convergence condition;
and/or,
obtaining the confidence information of the corresponding business object in the feature vector convolution result; computing the Euclidean distance between the confidence indicated by the confidence information of the corresponding business object and the preset standard confidence to obtain a second distance between the two; and judging, according to the second distance, whether the confidence information of the corresponding business object satisfies the convergence condition. Using the Euclidean distance is simple to implement and can effectively indicate whether the convergence condition is satisfied. However, the embodiments of the present application are not limited to this; other measures, such as the Mahalanobis distance or the Bhattacharyya distance, may also be used.
Optionally, as mentioned above, the preset standard position is an average position obtained by averaging the positions of the business objects in the sample images to be trained; and/or the preset standard confidence is an average confidence obtained by averaging the confidences of the business objects in the sample images to be trained.
The face convergence condition to be satisfied by the corresponding face feature vectors in the feature vector convolution results may be set by those skilled in the art according to the actual situation, which is not limited by the embodiments of the present application.
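A minimal sketch of the Euclidean-distance convergence checks above; the averaging of the standard position/confidence follows the text, while the concrete threshold values are assumptions.

```python
import numpy as np

def standard_position(positions):
    """Preset standard position: the average over the business-object
    positions in the sample images to be trained."""
    return np.mean(np.asarray(positions, dtype=np.float64), axis=0)

def position_converged(predicted_xy, standard_xy, threshold=5.0):
    # First distance: Euclidean distance between the predicted position
    # and the preset standard position, compared against a threshold.
    return np.linalg.norm(np.asarray(predicted_xy) - standard_xy) <= threshold

def confidence_converged(predicted_conf, standard_conf, threshold=0.05):
    # Second distance: for scalar confidences, the Euclidean distance
    # reduces to the absolute difference.
    return abs(predicted_conf - standard_conf) <= threshold
```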
(4) If all of the above convergence conditions are satisfied, i.e., the position information and/or confidence information of the business objects satisfies the business object convergence condition and the face feature vectors satisfy the face convergence condition, the training of the second convolutional network model is completed. Otherwise, as long as any convergence condition is not satisfied, for example the position information and/or confidence information of the business objects does not satisfy the business object convergence condition and/or the face feature vectors do not satisfy the face convergence condition, the network parameters of the second convolutional network model are adjusted, and the second convolutional network model is iteratively trained according to the adjusted network parameters, until the position information and/or confidence information of the business objects and the face feature vectors after the iterative training all satisfy the corresponding convergence conditions.
By performing the above training on the second convolutional network model, the second convolutional network model can perform feature extraction and classification on the presentation positions of business objects presented based on faces, and thus has the function of determining presentation positions of business objects in video images. When there are multiple presentation positions, through the above training on business object confidence, the second convolutional network model can also determine the ranking of the presentation effects among the multiple presentation positions, thereby determining the final presentation position. In subsequent applications, when a business object needs to be presented, a valid presentation position can be determined from the current image in the video.
In addition, before performing the above training on the second convolutional network model, the sample images may also be preprocessed in advance, which may include, for example: acquiring multiple sample images, where each sample image contains annotation information of a business object; determining the position of the business object according to the annotation information, and judging whether the distance between the determined position of the business object and a preset position is less than or equal to a set threshold; and determining the sample images corresponding to business objects whose distance is less than or equal to the set threshold as the sample images to be trained. The preset position and the set threshold may both be appropriately set by those skilled in the art in any suitable manner, for example according to statistical analysis of data, a related distance calculation formula, or manual experience, which is not limited by the embodiments of the present application.
By preprocessing the sample images in advance, sample images that do not meet the conditions can be filtered out, improving the accuracy of the training results.
Through the above process, the training of the second convolutional network model is achieved, and the trained second convolutional network model can be used to determine presentation positions of business objects in video images. For example, during a live video broadcast, if the streamer clicks a business object to instruct presentation of the business object, then after the second convolutional network model obtains the facial feature points of the streamer in the live video image, it can indicate the optimal position for presenting the business object, such as the streamer's forehead, and thereby control the live-streaming application to present the business object at that position; or, during a live video broadcast, if the streamer clicks a business object to instruct presentation of the business object, the second convolutional network model can determine the presentation position of the business object directly from the live video image.
Approach two
The presentation position of the business object to be presented in the video image is determined according to the face feature points and the type of the business object to be presented.
In this implementation, after the face feature points are acquired, the presentation position of the business object to be presented may be determined according to set rules. Determining the presentation position of the business object to be presented may include, for example, at least one or any combination of the following: the hair region, forehead region, cheek region, or chin region of a person in the video image; a body region other than the head; a background region in the video image; a region within a set range centered on the region where a hand is located in the video image; a preset region in the video image; and so on. The preset region may be set appropriately according to the actual situation, for example a region within a set range centered on the face region, or a region within a set range outside the face region, or a background region, etc., which is not limited by the embodiments of the present application.
After the presentation position is determined, the presentation position of the business object to be presented in the video image may be further refined. For example, the business object may be presented with the center point of the presentation region corresponding to the presentation position as the center point of the business object's presentation position; as another example, a certain coordinate position within the presentation region corresponding to the presentation position may be determined as the center point of the presentation position, etc., which is not limited by the embodiments of the present application.
In an optional implementation, when determining the presentation position of the business object to be presented in the video image, the presentation position may be determined not only according to the face feature points but also according to the type of the business object to be presented. The types of business objects include at least one or any combination of the following: a forehead patch type, a cheek patch type, a chin patch type, a virtual hat type, a virtual clothing type, a virtual makeup type, a virtual headwear type, a virtual hair accessory type, a virtual jewelry type, a background type, a virtual pet type, and a virtual container type. The types are not limited to these; the type of a business object may also be another suitable type, such as a virtual bottle cap type, a virtual cup type, a text type, and so on.
Thus, according to the type of the business object, an appropriate presentation position can be selected for the business object, with the face feature points as reference.
In addition, when multiple presentation positions of the business object to be presented in the video image are obtained according to the face feature points and the type of the business object to be presented, at least one presentation position may be selected from the multiple presentation positions as the presentation position of the business object to be presented in the video image. For example, a text-type business object may be presented in the background region, or on the person's forehead or body region, etc.
In addition, in another example of the embodiments of the present application, a correspondence between facial actions and presentation positions may be stored in advance. When it is determined that the detected facial action matches the corresponding predetermined facial action, the target presentation position corresponding to the predetermined facial action may be obtained from the pre-stored correspondence between facial actions and presentation positions, as the presentation position of the business object to be presented in the video image. It should be noted that although the above correspondence between facial actions and presentation positions exists, there is no necessary relationship between a facial action and a presentation position: the facial action is merely one way of triggering the presentation of the business object, and there is likewise no necessary relationship between the presentation position and the face. That is, the business object may be presented in a certain region of the face, or in a region other than the face, such as the background region of the video image.
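A minimal sketch of the pre-stored correspondence between predetermined facial actions and target presentation positions; the concrete action names and region labels are illustrative assumptions, and, as noted above, a position need not lie on the face.

```python
# Pre-stored correspondence between predetermined facial actions and
# target presentation positions (entries are illustrative assumptions).
ACTION_TO_POSITION = {
    "pout": "mouth_region",
    "open_mouth": "mouth_region",
    "smile": "cheek_region",
    "blink": "background_region",  # presentation need not be on the face
}

def target_presentation_position(matched_action):
    """Return the target presentation position for a matched predetermined
    facial action, or None if no correspondence is stored for it."""
    return ACTION_TO_POSITION.get(matched_action)
```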
In step S350, the business object to be presented is drawn at the presentation position by means of computer graphics.
In an optional example, step S350 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a business object drawing module 503 run by the processor.
Based on step S350, when the business object is a sticker containing semantic information, such as an advertisement sticker, related information of the business object, such as its identifier and size, may be obtained before drawing the business object. After the presentation position is determined, the business object may be adjusted, for example by scaling and rotation, according to the coordinates of the presentation position, and then drawn through a corresponding drawing method, such as that of an OpenGL graphics drawing engine. In some cases, the advertisement may also be presented in the form of a three-dimensional effect, for example presenting the advertisement's text or logo (LOGO) through a particle effect. For example, when the streamer opens his or her mouth, an advertising effect for a certain product may be presented by dynamically and gradually reducing the liquid in a cup; the advertising effect may consist of video frames composed of multiple display images of different states (for example, multiple frames in which the amount of liquid in the cup gradually decreases), and the corresponding images of the video frames are drawn in sequence at the presentation position through computer graphics methods such as drawing with an OpenGL graphics drawing engine, thereby presenting the dynamic effect of the liquid in the cup gradually decreasing. In this way, a dynamic presentation of the advertising effect is achieved, which can attract the audience to watch, make advertisement placement and display more engaging, and improve the efficiency of advertisement placement and display. The video image processing method provided by the embodiments of the present application triggers the presentation of business objects (such as advertisements) through facial actions. On the one hand, because the business object to be presented is drawn at the presentation position by computer graphics and is combined with the video playback, no additional video data of business objects such as advertisements unrelated to the video needs to be transmitted over the network, which helps save network resources and/or system resources of the client; on the other hand, the business object is closely combined with the facial action in the video image, which helps preserve the main image and actions of the video subject (e.g., the streamer), adds interest to the video image without disturbing the user's normal viewing, helps reduce the user's aversion to the business object presented in the video image, can attract the audience's attention to a certain extent, and improves the influence of the business object.
FIG. 4 is a flowchart of still another embodiment of the video image processing method. In this embodiment, the video image processing scheme of the embodiments of the present application is described by taking as an example a business object that is a two-dimensional sticker effect containing advertisement information. As shown in FIG. 4, the video image processing method of this embodiment includes:
In step S401, a plurality of sample images including face information are acquired as training samples, where the sample images are annotated with information on face action states.
In an optional example, step S401 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a training sample acquisition module 504 run by the processor.
In step S402, the first convolutional network model is trained using the training samples to obtain a first convolutional network model for detecting face action states in images.
In an optional example, step S402 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a first convolutional network model determination module 505 run by the processor.
The content of steps S401 to S402 above is the same as the corresponding content in the embodiment shown in FIG. 2, and is not repeated here.
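As a non-limiting illustration only, the following PyTorch sketch shows how a first convolutional network model of this kind could be trained on face images annotated with action states. The network structure, input size, and training hyperparameters are assumptions made for the sketch; the application does not prescribe them here.

import torch
import torch.nn as nn

# Illustrative stand-in for the first convolutional network model: a small
# CNN that classifies a 64x64 RGB face crop into one of N action states
# (e.g. mouth open, eyes closed). The real network structure is described
# elsewhere in the application; this architecture is an assumption.
class FaceActionStateNet(nn.Module):
    def __init__(self, num_states: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * 16 * 16, num_states)  # 64x64 inputs

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

def train_first_model(model, loader, epochs=10, lr=1e-3):
    # loader yields (face_image_batch, action_state_label_batch) pairs
    # built from the annotated training samples of step S401.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, state_labels in loader:
            opt.zero_grad()
            loss = loss_fn(model(images), state_labels)
            loss.backward()
            opt.step()
    return model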
In step S403, feature vectors of the sample images of the above training samples are acquired.
The feature vector includes position information and/or confidence information of the business object in the sample image of the training sample, and a face feature vector corresponding to the face action state in the sample image.
The face action state in each sample image may be determined when the first convolutional network model is trained.
In implementation, some sample images of the training samples do not meet the training standard of the second convolutional network model, and these sample images need to be filtered out by preprocessing the sample images.
First, in this embodiment, each sample image contains a business object, and each business object is annotated with position information and confidence information. In an optional implementation, the position information of the center point of the business object is used as the position information of that business object. In this step, the sample images are filtered according to the position information of the business objects. After the coordinates of the position indicated by the position information are obtained, the coordinates are compared with preset position coordinates for that type of business object, and the position variance between the two is calculated. If the position variance is less than or equal to a set threshold, the sample image may be used as a sample image to be trained; if the position variance is greater than the set threshold, the sample image is filtered out. The preset position coordinates and the set threshold may both be appropriately set by those skilled in the art according to the actual situation; for example, the images used for training the second convolutional network model generally have the same size, so the set threshold may be 1/20 to 1/5 of the length or width of the image, and optionally 1/10 of the length or width of the image.
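A minimal sketch of this filtering step follows, assuming the position variance is computed as the Euclidean distance between the annotated center coordinates of the business object and the preset coordinates for its type; the data layout and the exact variance measure are assumptions, and the threshold defaults to 1/10 of the image side as suggested above.

import numpy as np

def filter_training_samples(samples, preset_xy, image_size, frac=0.1):
    # samples: annotated records, each carrying the center coordinates of
    # its business object; preset_xy: preset position coordinates for this
    # type of business object; frac: fraction of the larger image side used
    # as the threshold (1/20 to 1/5 per the embodiment, 1/10 by default).
    threshold = frac * max(image_size)
    kept = []
    for sample in samples:
        deviation = np.linalg.norm(
            np.asarray(sample["center"], dtype=np.float32)
            - np.asarray(preset_xy, dtype=np.float32))
        if deviation <= threshold:
            kept.append(sample)  # meets the training standard
        # otherwise the sample image is filtered out
    return kept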
In addition, the positions and confidences of the business objects in the retained sample images may be averaged to obtain an average position and an average confidence, which may serve as a basis for subsequently determining the convergence conditions.
Taking an advertisement sticker as an example of the business object, the sample images used for training in this embodiment need to be annotated with the coordinates of the optimal advertisement position and the confidence of that advertisement slot. The confidence indicates the probability that the advertisement slot is the optimal one; for example, if the advertisement slot is largely occluded, its confidence is low. The optimal advertisement position may be annotated on the face, the foreground, the background, and other places, so that joint training of advertisement slots on facial feature points, the foreground, the background, and the like can be realized, which helps save computing resources compared with a scheme trained separately on a single technique such as facial actions.
In an optional example, step S403 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a feature vector acquisition module 506 run by the processor.
In step S404, convolution processing is performed on the feature vectors to obtain feature vector convolution results.
In this step, when the feature vector is subjected to convolution processing, convolution processing is performed on the feature vector corresponding to the position information and/or confidence information of the business object in the sample image, and convolution processing is also performed on the face feature vector corresponding to the face feature points in each sample image, to obtain the corresponding feature vector convolution results respectively.
In an optional example, step S404 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a convolution module 507 run by the processor.
In step S405, it is determined whether the position information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies a business object convergence condition, and whether the corresponding face feature vector in the feature vector convolution result satisfies a face convergence condition.
In an optional example, step S405 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a convergence condition determination module 508 run by the processor.
In step S406, if both convergence conditions in step S405 are satisfied, that is, the position information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies the business object convergence condition, and the corresponding face feature vector in the feature vector convolution result satisfies the face convergence condition, the training of the second convolutional network model is completed; otherwise, the network parameters of the second convolutional network model are adjusted, and the second convolutional network model is iteratively trained according to the adjusted network parameters until the position information and/or confidence information of the business object and the face feature vector after the iterative training both satisfy the corresponding convergence conditions.
In this embodiment, if the position information and/or confidence information of the corresponding business object in the feature vector convolution result does not satisfy the business object convergence condition, the network parameters of the second convolutional network model are adjusted according to that position information and/or confidence information, and the second convolutional network model is iteratively trained according to the adjusted network parameters until the position information and/or confidence information of the business object after the iterative training satisfies the business object convergence condition. If the corresponding face feature vector in the feature vector convolution result does not satisfy the face convergence condition, the network parameters of the second convolutional network model are adjusted according to that face feature vector, and the second convolutional network model is iteratively trained according to the adjusted network parameters until the face feature vector after the iterative training satisfies the face convergence condition.
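The dual convergence check of steps S405 to S406 can be sketched as the following PyTorch training loop, assuming a two-branch second convolutional network model, mean-squared-error losses, and fixed tolerances; none of these specifics are fixed by the application, and the loss targets (for example, the average position and average confidence mentioned above) are likewise assumptions.

import torch

def train_second_model(model, loader, opt, pos_tol, face_tol, max_iters=10000):
    # Iterate until both convergence conditions hold: the business object
    # position/confidence branch is close enough to its target, and the
    # face feature branch is close enough to its target.
    mse = torch.nn.MSELoss()
    for step, (feat, pos_target, face_target) in enumerate(loader):
        pos_pred, face_pred = model(feat)        # two output branches
        pos_loss = mse(pos_pred, pos_target)     # business object condition
        face_loss = mse(face_pred, face_target)  # face condition
        if pos_loss.item() <= pos_tol and face_loss.item() <= face_tol:
            break  # both convergence conditions satisfied; training done
        # Otherwise adjust the network parameters and keep iterating; only
        # the branch that has not yet converged contributes to the update.
        loss = (pos_loss if pos_loss.item() > pos_tol else 0) \
             + (face_loss if face_loss.item() > face_tol else 0)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step >= max_iters:
            break
    return model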
In an optional example, step S406 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a model training module 509 run by the processor.
For the specific processing of steps S404 to S406 above, refer to the related content in the embodiment shown in FIG. 3, which is not repeated here.
The trained second convolutional network model can be obtained through the processing of steps S403 to S406 above. For the structure of the second convolutional network model, refer to the structure of the first convolutional network model in the embodiment shown in FIG. 2, which is not repeated here.
The first convolutional network model and the second convolutional network model obtained through the above training can then be used to process video images accordingly, which may specifically include the following steps S407 to S411.
In step S407, a currently played video image containing face information is acquired.
In step S408, face key points are extracted from the video image, the pre-trained first convolutional network model for detecting face action states in images is used to determine the face action state in the video image, and the facial action of the face in the video image is determined according to the face action state.
In an optional example, steps S407 to S408 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a video image detection module 501 run by the processor.
In step S409, when it is determined that the detected facial action matches the corresponding predetermined facial action, face feature points in the face region corresponding to the detected facial action are extracted.
In step S410, according to the face feature points, the pre-trained second convolutional network model for determining the presentation position of a business object in a video image is used to determine the presentation position of the business object to be presented in the video image.
In an optional example, steps S409 to S410 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a presentation position determination module 502 run by the processor.
In step S411, the business object to be presented is drawn at the presentation position by a computer drawing method.
In an optional example, step S411 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a business object drawing module 503 run by the processor.
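Taken together, steps S407 to S411 amount to the following inference sketch; every callable is supplied by the caller because the application does not prescribe concrete interfaces, so the names below are purely illustrative assumptions.

def process_video_image(frame, extract_keypoints, detect_action,
                        extract_feature_points, predict_position, draw,
                        predetermined_action, business_object):
    keypoints = extract_keypoints(frame)                       # S408: face key points
    action = detect_action(frame, keypoints)                   # S408: first model, action state -> facial action
    if action != predetermined_action:                         # S409: match against the predetermined action
        return frame                                           # no match, nothing is presented
    feature_points = extract_feature_points(frame, keypoints)  # S409: feature points of the relevant face region
    position = predict_position(feature_points)                # S410: second model -> presentation position
    return draw(frame, business_object, position)              # S411: computer drawing at that position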
With the rise of Internet live streaming and short video sharing, more and more videos appear in the form of live streams or short videos. Such videos often feature people as the main subject (a single person or a small number of people), with a person plus a simple background as the main scene, and viewers mainly watch them on mobile terminals such as mobile phones. With the solution provided by this embodiment, video images can be detected in real time during video playback, and an advertisement placement position with a good effect can be given without affecting the user's viewing experience, so the placement effect is better. By combining the business object with video playback, there is no need to transmit additional advertisement video data unrelated to the video over the network, which helps save network resources and/or system resources of the client. In addition, the business object to be presented is drawn at the presentation position by a computer drawing method, and the business object is closely combined with the facial action in the video image, which preserves the main image and actions of the video subject (such as the anchor), adds interest to the video image, does not disturb the user's normal viewing of the video, helps reduce the user's aversion to the business object presented in the video image, attracts the audience's attention to a certain extent, and improves the influence of the business object. It can be understood that, in addition to advertising, the placement of business objects can be widely applied to other fields such as education, consulting, and services, where entertaining or appreciative business information can be placed to improve interaction and the user experience.
The processing of any video image provided by the embodiments of the present application may be performed by any appropriate device having data processing capability, including but not limited to a terminal device, a server, and the like. Alternatively, the processing of any video image provided by the embodiments of the present application may be performed by a processor; for example, the processor performs the processing of any video image mentioned in the embodiments of the present application by invoking corresponding instructions stored in a memory. This is not repeated below.
Those of ordinary skill in the art can understand that all or part of the steps of implementing the above method embodiments may be completed by hardware related to program instructions. The foregoing program may be stored in a computer-readable storage medium, and when the program is executed, the steps of the above method embodiments are performed. The foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
FIG. 5 is a schematic structural diagram of an embodiment of a video image processing apparatus of the present application. The video image processing apparatus of the embodiments of the present application may be used to implement the above video image processing method embodiments of the present application. Referring to FIG. 5, the video image processing apparatus of this embodiment includes a video image detection module 501, a presentation position determination module 502, and a business object drawing module 503, where:
the video image detection module 501 is configured to perform facial action detection of a face on a currently played video image containing face information;
the presentation position determination module 502 is configured to determine the presentation position of the business object to be presented in the video image when it is determined that the detected facial action matches the corresponding predetermined facial action; and
the business object drawing module 503 is configured to draw the business object to be presented at the presentation position by a computer drawing method.
With the video image processing apparatus provided by this embodiment, facial action detection is performed on a currently played video image containing face information, and the detected facial action is matched with the corresponding predetermined facial action; when the two match, the presentation position of the business object to be presented in the video image is determined, and the business object to be presented is drawn at that presentation position by a computer drawing method. In this way, when the business object is used to display an advertisement, on the one hand, since the business object to be presented is drawn at the presentation position by a computer drawing method and is combined with video playback, there is no need to transmit additional advertisement video data unrelated to the video over the network, which helps save network resources and/or system resources of the client; on the other hand, the business object is closely combined with the facial action in the video image, which helps preserve the main image and actions of the video subject (such as the anchor), adds interest to the video image, does not disturb the user's normal viewing of the video, helps reduce the user's aversion to the business object presented in the video image, can attract the audience's attention to a certain extent, and improves the influence of the business object.
In an optional example of the video image processing apparatus embodiments of the present application, the video image detection module 501 is configured to extract face key points from a currently played video image containing face information, use the pre-trained first convolutional network model for detecting face action states in images to determine the state of the facial action of the face in the video image according to the face key points, and determine the facial action of the face in the video image according to the state of the facial action.
FIG. 6 is a schematic structural diagram of another embodiment of a video image processing apparatus of the present application. Referring to FIG. 6, compared with the embodiment shown in FIG. 5, the video image processing apparatus further includes:
a training sample acquisition module 504, configured to acquire a plurality of sample images including face information as training samples, where the sample images are annotated with face attribute information; and
a first convolutional network model determination module 505, configured to train the first convolutional network model using the training samples to obtain a first convolutional network model for detecting face action states in images.
Optionally, the training sample acquisition module 504 includes: a sample image acquisition unit, configured to acquire a plurality of sample images including face information; a face positioning information determination unit, configured to, for each sample image, detect the face and face key points in the sample image and position the face in the sample image by means of the face key points to obtain face positioning information; and a training sample determination unit, configured to use the sample images containing the face positioning information as training samples.
Optionally, the presentation position determination module 502 includes: a feature point extraction unit, configured to extract face feature points in the face region corresponding to the detected facial action; and a presentation position determination unit, configured to determine the presentation position of the business object to be presented in the video image according to the face feature points.
Optionally, the presentation position determination module 502 is configured to determine the presentation position of the business object to be presented in the video image according to the face feature points, using the pre-trained second convolutional network model for determining the presentation position of a business object in a video image.
Optionally, referring again to FIG. 6, the video image processing apparatus of still another embodiment further includes:
a feature vector acquisition module 506, configured to acquire feature vectors of the sample images of the training samples, where the feature vector includes the position information and/or confidence information of the business object in the sample image, and the face feature vector corresponding to the face feature points in the face region corresponding to the facial action in the sample image;
a convolution module 507, configured to perform convolution processing on the feature vector to obtain a feature vector convolution result;
a convergence condition determination module 508, configured to determine whether the position information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies the business object convergence condition, and to determine whether the corresponding face feature vector in the feature vector convolution result satisfies the face convergence condition; and
a model training module 509, configured to complete the training of the second convolutional network model when the above convergence conditions are both satisfied, that is, when the position information and/or confidence information of the business object satisfies the business object convergence condition and the face feature vector satisfies the face convergence condition; and otherwise, when the above convergence conditions are not both satisfied, to adjust the network parameters of the second convolutional network model and iteratively train the second convolutional network model according to the adjusted network parameters until the position information and/or confidence information of the business object and the face feature vector after the iterative training both satisfy the corresponding convergence conditions.
Optionally, the presentation position determination module 502 is configured to determine the presentation position of the business object to be presented in the video image according to the face feature points and the type of the business object to be presented.
Optionally, the presentation position determination module 502 includes: a presentation position acquisition unit, configured to acquire a plurality of presentation positions of the business object to be presented in the video image according to the face feature points and the type of the business object to be presented; and a presentation position selection unit, configured to select at least one presentation position from the plurality of presentation positions as the presentation position of the business object to be presented in the video image.
Optionally, the presentation position determination module 502 is configured to acquire, from pre-stored correspondences between facial actions and presentation positions, the target presentation position corresponding to the predetermined facial action as the presentation position of the business object to be presented in the video image.
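For this variant, the pre-stored correspondence can be as simple as a lookup table; the action names and region identifiers below are illustrative assumptions only.

# Pre-stored correspondence between predetermined facial actions and target
# presentation positions (illustrative entries).
ACTION_TO_POSITION = {
    "open_mouth": "chin_region",
    "blink": "forehead_region",
    "shake_head": "background_region",
}

def target_presentation_position(detected_action):
    # Returns the stored target position for the matched facial action,
    # or None if no correspondence is stored for it.
    return ACTION_TO_POSITION.get(detected_action)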
Optionally, the business object includes a special effect containing semantic information, and the video image is a live-streaming video image.
Optionally, the special effect containing semantic information includes a special effect containing advertisement information in at least one of the following forms: a two-dimensional sticker effect, a three-dimensional effect, a particle effect, and the like.
Optionally, the presentation position includes at least one or any more of the following: the hair region, forehead region, cheek region, or chin region of a person in the video image, a body region other than the head, a background region in the video image, a region within a set range centered on the region where a hand is located in the video image, a preset region in the video image, and the like.
Optionally, the type of the business object includes at least one or any more of the following types: a forehead patch type, a cheek patch type, a chin patch type, a virtual hat type, a virtual clothing type, a virtual makeup type, a virtual headwear type, a virtual hair accessory type, a virtual jewelry type, a background type, a virtual pet type, a virtual container type, and the like.
Optionally, the facial action of the face includes at least one or any more of the following: blinking, kissing, opening the mouth, shaking the head, nodding, smiling, crying, frowning, closing the left eye, closing the right eye, closing both eyes, moving the eyeballs to the left, moving the eyeballs to the right, turning the head to the left, turning the head to the right, raising the head, lowering the head, pouting, and the like.
Referring to FIG. 7, a schematic structural diagram of an electronic device according to Embodiment 7 of the present application is shown; the specific embodiments of the present application do not limit the specific implementation of the electronic device. As shown in FIG. 7, the electronic device may include a processor 902, a communications interface 904, a memory 906, and a communication bus 908, where:
the processor 902, the communications interface 904, and the memory 906 communicate with one another via the communication bus 908;
the communications interface 904 is configured to communicate with network elements of other devices, such as other clients or servers; and
the processor 902 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), one or more integrated circuits configured to implement the embodiments of the present application, or a graphics processing unit (GPU). The one or more processors included in the terminal device may be processors of the same type, such as one or more CPUs or one or more GPUs, or may be processors of different types, such as one or more CPUs and one or more GPUs.
The memory 906 is configured to store at least one executable instruction that causes the processor 902 to perform operations corresponding to the method for presenting a business object in a video image according to any of the above embodiments of the present application. The memory 906 may include a high-speed random access memory (RAM), and may also include a non-volatile memory, such as at least one disk memory.
In addition, the embodiments of the present application further provide another electronic device, including a processor and the video image processing apparatus according to any of the above embodiments of the present application; when the processor runs the video image processing apparatus, the units in the video image processing apparatus according to any of the above embodiments of the present application are run.
FIG. 8 is a schematic structural diagram of another embodiment of an electronic device of the present invention. Referring to FIG. 8, a schematic structural diagram of an electronic device suitable for implementing a terminal device or a server of the embodiments of the present application is shown. As shown in FIG. 8, the electronic device includes one or more processors, a communication part, and the like; the one or more processors are, for example, one or more central processing units (CPUs) 801 and/or one or more graphics processing units (GPUs) 813, and the processor may perform various appropriate actions and processing according to executable instructions stored in a read-only memory (ROM) 802 or executable instructions loaded from a storage section 808 into a random access memory (RAM) 803. The communication part 812 may include, but is not limited to, a network card, and the network card may include, but is not limited to, an IB (InfiniBand) network card. The processor may communicate with the read-only memory 802 and/or the random access memory 803 to execute the executable instructions, is connected to the communication part 812 via a bus 804, and communicates with other target devices via the communication part 812, thereby completing the operations corresponding to any video image processing method provided by the embodiments of the present application, for example: performing facial action detection of a face on a currently played video image containing face information; when the detected facial action matches the corresponding predetermined facial action, determining the presentation position of the business object to be presented in the video image; and drawing the business object to be presented at the presentation position by a computer drawing method.
In addition, the RAM 803 may further store various programs and data required for the operation of the device. The CPU 801, the ROM 802, and the RAM 803 are connected to one another via the bus 804. In the case where the RAM 803 is present, the ROM 802 is an optional module. The RAM 803 stores executable instructions, or writes executable instructions into the ROM 802 at runtime, and the executable instructions cause the processor 801 to perform the operations corresponding to the above video image processing method. An input/output (I/O) interface 805 is also connected to the bus 804. The communication part 812 may be integrated, or may be provided with a plurality of sub-modules (for example, a plurality of IB network cards) linked on the bus.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a cathode-ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card or a modem. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read therefrom is installed into the storage section 808 as needed.
It should be noted that the architecture shown in FIG. 8 is only an optional implementation. In specific practice, the number and types of the components in FIG. 8 may be selected, deleted, added, or replaced according to actual needs. Different functional components may also be implemented in separate or integrated arrangements; for example, the GPU and the CPU may be arranged separately, or the GPU may be integrated on the CPU, and the communication part may be arranged separately or integrated on the CPU or the GPU, and so on. These alternative implementations all fall within the scope of protection of the present disclosure.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program tangibly embodied on a machine-readable medium; the computer program includes program code for executing the method shown in the flowchart, and the program code may include instructions corresponding to the steps of the method provided by the embodiments of the present application, for example: performing facial action detection of a face on a currently played video image containing face information; when the detected facial action matches the corresponding predetermined facial action, determining the presentation position of the business object to be presented in the video image; and drawing the business object to be presented at the presentation position by a computer drawing method.
In addition, the embodiments of the present application further provide a computer program, including computer-readable code; when the computer-readable code runs on a device, a processor in the device executes instructions for implementing the steps of the video image processing method according to any of the embodiments of the present application.
In addition, the embodiments of the present application further provide a computer-readable storage medium for storing computer-readable instructions, which, when executed, implement the operations of the steps of the video image processing method according to any of the embodiments of the present application.
In the embodiments of the present application, for the specific implementation of each step when the computer program or the computer-readable instructions are executed, refer to the corresponding descriptions of the corresponding steps and modules in the above embodiments, which are not repeated here. Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the devices and modules described above, refer to the corresponding process descriptions in the foregoing method embodiments, which are not repeated here.
The embodiments in this specification are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and for the same or similar parts between the embodiments, reference may be made to one another. For the apparatus, electronic device, program, and storage medium embodiments, since they substantially correspond to the method embodiments, the description is relatively simple, and for relevant parts, refer to the partial descriptions of the method embodiments.
It should be pointed out that, according to implementation needs, each step/component described in the present application may be split into more steps/components, and two or more steps/components or partial operations of steps/components may also be combined into new steps/components to achieve the objectives of the present application.
The above methods according to the present application may be implemented in hardware or firmware, or implemented as software or computer code that can be stored in a recording medium (such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk), or implemented as computer code that is originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded over a network, and to be stored in a local recording medium, so that the methods described herein can be processed by such software on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware (such as an ASIC or an FPGA). It can be understood that a computer, a processor, a microprocessor controller, or programmable hardware includes a storage component (for example, a RAM, a ROM, or a flash memory) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, the processor, or the hardware, the processing methods described herein are implemented. In addition, when a general-purpose computer accesses code for implementing the processing shown herein, the execution of the code converts the general-purpose computer into a special-purpose computer for performing the processing shown herein.
The above are only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any change or substitution that can easily be conceived by a person skilled in the art within the technical scope disclosed in the present application shall be covered by the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims (34)

1. A video image processing method, comprising:
    performing facial action detection of a face on a currently played video image containing face information;
    when the detected facial action matches a corresponding predetermined facial action, determining a presentation position of a business object to be presented in the video image; and
    drawing the business object to be presented at the presentation position by a computer drawing method.
2. The method according to claim 1, wherein the performing facial action detection of a face on a currently played video image containing face information comprises:
    extracting face key points from the currently played video image containing face information, using a pre-trained first convolutional network model for detecting face action states in images to determine the face action state in the video image according to the face key points, and determining the facial action of the face in the video image according to the face action state in the video image.
3. The method according to claim 2, wherein pre-training the first convolutional network model comprises:
    acquiring a plurality of sample images including face information as training samples, wherein the sample images are annotated with information on face action states; and
    training the first convolutional network model using the training samples to obtain a first convolutional network model for detecting face action states in images.
4. The method according to claim 3, wherein the acquiring a plurality of sample images including face information as training samples comprises:
    acquiring a plurality of sample images including face information;
    for each of the sample images, detecting a face and face key points in the sample image, and positioning the face in the sample image by means of the face key points to obtain face positioning information; and
    using the sample images containing the face positioning information as training samples.
5. The method according to any one of claims 1-4, wherein the determining a presentation position of the business object to be presented in the video image comprises:
    acquiring face feature points in a face region corresponding to the detected facial action; and
    determining the presentation position of the business object to be presented in the video image according to the face feature points.
6. The method according to claim 5, wherein the determining the presentation position of the business object to be presented in the video image according to the face feature points comprises:
    determining the presentation position of the business object to be presented in the video image according to the face feature points, using a pre-trained second convolutional network model for determining a presentation position of a business object in a video image.
7. The method according to claim 6, wherein pre-training the second convolutional network model comprises:
    acquiring feature vectors of sample images of training samples, wherein the feature vector includes position information and/or confidence information of the business object in the sample image, and a face feature vector corresponding to face feature points in the face region corresponding to the facial action in the sample image;
    performing convolution processing on the feature vector to obtain a feature vector convolution result;
    determining whether the position information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies a business object convergence condition, and determining whether the corresponding face feature vector in the feature vector convolution result satisfies a face convergence condition;
    if both are satisfied, completing the training of the second convolutional network model; and
    otherwise, adjusting network parameters of the second convolutional network model, and iteratively training the second convolutional network model according to the adjusted network parameters of the second convolutional network model until the position information and/or confidence information of the business object and the face feature vector after the iterative training both satisfy the corresponding convergence conditions.
8. The method according to claim 5, wherein the determining the presentation position of the business object to be presented in the video image according to the face feature points comprises:
    determining the presentation position of the business object to be presented in the video image according to the face feature points and a type of the business object to be presented.
9. The method according to claim 8, wherein the determining the presentation position of the business object to be presented in the video image according to the face feature points and the type of the business object to be presented comprises:
    acquiring a plurality of presentation positions of the business object to be presented in the video image according to the face feature points and the type of the business object to be presented; and
    selecting at least one presentation position from the plurality of presentation positions as the presentation position of the business object to be presented in the video image.
10. The method according to any one of claims 1-4, wherein the determining a presentation position of the business object to be presented in the video image comprises:
    acquiring, from pre-stored correspondences between facial actions and presentation positions, a target presentation position corresponding to the predetermined facial action as the presentation position of the business object to be presented in the video image.
11. The method according to any one of claims 1-10, wherein the business object comprises a special effect containing semantic information, and the video image comprises a live-streaming video image.
12. The method according to claim 11, wherein the special effect containing semantic information comprises a special effect containing advertisement information in one or any more of the following forms: a two-dimensional sticker effect, a three-dimensional effect, and a particle effect.
13. The method according to any one of claims 1-12, wherein the presentation position comprises one or any more of the following: a hair region, a forehead region, a cheek region, or a chin region of a person in the video image, a body region other than the head, a background region in the video image, a region within a set range centered on the region where a hand is located in the video image, and a preset region in the video image.
14. The method according to any one of claims 1-13, wherein the type of the business object comprises one or any more of the following types: a forehead patch type, a cheek patch type, a chin patch type, a virtual hat type, a virtual clothing type, a virtual makeup type, a virtual headwear type, a virtual hair accessory type, a virtual jewelry type, a background type, a virtual pet type, and a virtual container type.
15. The method according to any one of claims 1-14, wherein the facial action of the face comprises one or any more of the following: blinking, kissing, opening the mouth, shaking the head, nodding, smiling, crying, frowning, closing the left eye, closing the right eye, closing both eyes, moving the eyeballs to the left, moving the eyeballs to the right, turning the head to the left, turning the head to the right, raising the head, lowering the head, and pouting.
16. A video image processing apparatus, comprising:
    a video image detection module, configured to perform facial action detection of a face on a currently played video image containing face information;
    a presentation position determination module, configured to determine a presentation position of a business object to be presented in the video image when the facial action detected by the video image detection module matches a corresponding predetermined facial action; and
    a business object drawing module, configured to draw the business object to be presented at the presentation position by a computer drawing method.
17. The apparatus according to claim 16, wherein the video image detection module is configured to acquire face key points from the currently played video image containing face information, use a pre-trained first convolutional network model for detecting face action states in images to determine the state of the facial action of the face in the video image according to the face key points, and determine the facial action of the face in the video image according to the face action state in the video image.
18. The apparatus according to claim 17, further comprising:
    a training sample acquisition module, configured to acquire a plurality of sample images including face information as training samples, wherein the sample images are annotated with information on face action states; and
    a first convolutional network model determination module, configured to train the first convolutional network model using the training samples to obtain a first convolutional network model for detecting face action states in images.
19. The apparatus according to claim 18, wherein the training sample acquisition module comprises:
    a sample image acquisition unit, configured to acquire a plurality of sample images including face information;
    a face positioning information determination unit, configured to, for each of the sample images, detect a face and face key points in the sample image, and position the face in the sample image by means of the face key points to obtain face positioning information; and
    a training sample determination unit, configured to use the sample images containing the face positioning information as training samples.
20. The apparatus according to any one of claims 16-19, wherein the presentation position determination module comprises:
    a feature point extraction unit, configured to acquire face feature points in a face region corresponding to the detected facial action; and
    a presentation position determination unit, configured to determine the presentation position of the business object to be presented in the video image according to the face feature points.
  21. 根据权利要求20所述的装置,其特征在于,所述展现位置确定模块,用于根据所述人脸特征点,使用预先训练好的、用于确定业务对象在视频图像中的展现位置的第二卷积网络模型,确定所述待展现的业务对象在所述视频图像中的展现位置。The device according to claim 20, wherein the presentation position determining module is configured to use, according to the facial feature point, a pre-trained first for determining a presentation position of a business object in a video image. A two-convolution network model determines a presentation location of the business object to be presented in the video image.
  22. The apparatus according to claim 21, further comprising:
    a feature vector acquisition module, configured to acquire feature vectors of the sample images of the training samples, wherein the feature vectors include: position information and/or confidence information of the business object in the sample image, as well as the face feature vector corresponding to the face feature points within the face region corresponding to the facial action in the sample image;
    a convolution module, configured to perform convolution processing on the feature vectors to obtain a feature vector convolution result;
    a convergence condition judgment module, configured to judge whether the position information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies a business object convergence condition, and to judge whether the corresponding face feature vector in the feature vector convolution result satisfies a face convergence condition;
    a model training module, configured to complete the training of the second convolutional network model if both conditions are satisfied; otherwise, to adjust the network parameters of the second convolutional network model and to iteratively train the second convolutional network model according to the adjusted network parameters, until the position information and/or confidence information of the business object and the face feature vector after iterative training all satisfy the corresponding convergence conditions.
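The iterative loop of claim 22 can be sketched as follows. This is a hedged outline under stated assumptions: `model.forward`, `model.losses`, and `model.update` are hypothetical methods standing in for a real deep-learning framework, and the two loss thresholds play the role of the business-object and face convergence conditions.

```python
def train_second_network(model, feature_vectors, targets,
                         obj_tol=1e-3, face_tol=1e-3, max_iters=10_000):
    """Repeat: convolve the feature vectors, check both convergence
    conditions, and adjust the network parameters until both are met."""
    for _ in range(max_iters):
        result = model.forward(feature_vectors)           # convolution processing
        obj_loss, face_loss = model.losses(result, targets)
        if obj_loss < obj_tol and face_loss < face_tol:   # both conditions satisfied
            return model                                  # training is complete
        model.update(lr=0.01)                             # adjust network parameters
    return model                                          # budget exhausted; best-effort model
```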
  23. The apparatus according to claim 20, wherein the presentation position determining module is configured to determine the presentation position of the business object to be presented in the video image according to the face feature points and the type of the business object to be presented.
  24. The apparatus according to claim 23, wherein the presentation position determining module comprises:
    a presentation position acquisition unit, configured to acquire a plurality of presentation positions of the business object to be presented in the video image according to the face feature points and the type of the business object to be presented;
    a presentation position selection unit, configured to select at least one presentation position from the plurality of presentation positions.
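A minimal sketch of claims 23 and 24: the business object type selects among candidate presentation positions derived from the face feature points. The region keys, the type names, and the first-candidate selection rule are illustrative assumptions, not the patented policy.

```python
def candidate_positions(feature_points: dict, object_type: str) -> list:
    """feature_points: mapping from region name to an (x, y) anchor.
    Returns the candidate presentation positions for the object type."""
    anchors = {
        "forehead_patch": [feature_points["forehead"]],
        "cheek_patch": [feature_points["left_cheek"], feature_points["right_cheek"]],
        "virtual_hat": [feature_points["head_top"]],
    }
    return anchors.get(object_type, [feature_points["forehead"]])

def choose_position(positions: list):
    """Select at least one presentation position; here simply the first."""
    return positions[0]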
  25. The apparatus according to any one of claims 16-19, wherein the presentation position determining module is configured to acquire, from pre-stored correspondences between facial actions and presentation positions, the target presentation position corresponding to the predetermined facial action as the presentation position of the business object to be presented in the video image.
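Claim 25 replaces model inference with a pre-stored lookup. A minimal sketch, assuming an in-memory dictionary as the stored correspondence; the action names and region names below are illustrative only.

```python
# Hypothetical pre-stored facial-action -> presentation-position table.
ACTION_TO_POSITION = {
    "blink": "eye_region",
    "open_mouth": "mouth_region",
    "nod": "head_top_region",
}

def target_position(predetermined_action: str) -> str:
    """Look up the target presentation position for the matched action,
    falling back to a background placement when no entry exists."""
    return ACTION_TO_POSITION.get(predetermined_action, "background_region")
```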
  26. The apparatus according to any one of claims 16-25, wherein the business object comprises a special effect containing semantic information, and the video image comprises a live-streaming video image.
  27. The apparatus according to claim 26, wherein the special effect containing semantic information comprises a special effect containing advertisement information in one or any combination of the following forms: a two-dimensional sticker effect, a three-dimensional effect, and a particle effect.
  28. The apparatus according to any one of claims 16-27, wherein the presentation position comprises one or any combination of the following: the hair region, forehead region, cheek region, or chin region of a person in the video image; a body region other than the head; a background region in the video image; a region within a set range centered on the region where a hand is located in the video image; and a preset region in the video image.
  29. The apparatus according to any one of claims 16-28, wherein the type of the business object comprises one or any combination of the following types: forehead patch type, cheek patch type, chin patch type, virtual hat type, virtual clothing type, virtual makeup type, virtual headwear type, virtual hair accessory type, virtual jewelry type, background type, virtual pet type, and virtual container type.
  30. The apparatus according to any one of claims 16-29, wherein the facial action of the face comprises one or any combination of the following: blinking, kissing, opening the mouth, shaking the head, nodding, smiling, crying, frowning, closing the left eye, the right eye, or both eyes, and pouting.
  31. An electronic device, comprising: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with each other via the communication bus;
    the memory is configured to store at least one executable instruction, the executable instruction causing the processor to perform operations corresponding to the video image processing method according to any one of claims 1-15.
  32. An electronic device, comprising:
    a processor and the video image processing apparatus according to any one of claims 16-30;
    wherein, when the processor runs the video image processing apparatus, the units in the video image processing apparatus according to any one of claims 16-30 are run.
  33. A computer program, comprising computer-readable code, wherein, when the computer-readable code runs on a device, a processor in the device executes instructions for implementing each step of the video image processing method according to any one of claims 1-15.
  34. A computer-readable storage medium, configured to store computer-readable instructions, wherein, when the instructions are executed, the operations of each step of the video image processing method according to any one of claims 1-15 are implemented.
PCT/CN2017/098201 2016-08-19 2017-08-19 Video image processing method, apparatus and electronic device WO2018033155A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610697502.3A CN107341435A (en) 2016-08-19 2016-08-19 Processing method, device and the terminal device of video image
CN201610697502.3 2016-08-19

Publications (1)

Publication Number Publication Date
WO2018033155A1 (en) 2018-02-22

Family

ID=60222304

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/098201 WO2018033155A1 (en) 2016-08-19 2017-08-19 Video image processing method, apparatus and electronic device

Country Status (2)

Country Link
CN (1) CN107341435A (en)
WO (1) WO2018033155A1 (en)


Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108259496B (en) 2018-01-19 2021-06-04 北京市商汤科技开发有限公司 Method and device for generating special-effect program file package and special effect, and electronic equipment
CN108388434B (en) 2018-02-08 2021-03-02 北京市商汤科技开发有限公司 Method and device for generating special-effect program file package and special effect, and electronic equipment
CN108510523A (en) * 2018-03-16 2018-09-07 新智认知数据服务有限公司 It is a kind of to establish the model for obtaining object feature and object searching method and device
CN110314379B (en) * 2018-03-29 2022-07-26 腾讯科技(深圳)有限公司 Learning method of action output deep training model and related equipment
CN109035257B (en) * 2018-07-02 2021-08-31 百度在线网络技术(北京)有限公司 Portrait segmentation method, device and equipment
CN109068053B (en) * 2018-07-27 2020-12-04 香港乐蜜有限公司 Image special effect display method and device and electronic equipment
CN109165571B (en) * 2018-08-03 2020-04-24 北京字节跳动网络技术有限公司 Method and apparatus for inserting image
CN111488773B (en) * 2019-01-29 2021-06-11 广州市百果园信息技术有限公司 Action recognition method, device, equipment and storage medium
CN110189172B (en) * 2019-05-28 2021-10-15 广州华多网络科技有限公司 Shopping guide method and system for network live broadcast room
CN110188712B (en) * 2019-06-03 2021-10-12 北京字节跳动网络技术有限公司 Method and apparatus for processing image
CN110334698A (en) * 2019-08-30 2019-10-15 上海聚虹光电科技有限公司 Glasses detection system and method
CN110942005A (en) * 2019-11-21 2020-03-31 网易(杭州)网络有限公司 Object recognition method and device
CN112887631B (en) * 2019-11-29 2022-08-12 北京字节跳动网络技术有限公司 Method and device for displaying object in video, electronic equipment and computer-readable storage medium
CN111743524A (en) * 2020-06-19 2020-10-09 联想(北京)有限公司 Information processing method, terminal and computer readable storage medium
CN114051166B (en) * 2020-07-24 2024-03-29 北京达佳互联信息技术有限公司 Method, device, electronic equipment and storage medium for implanting advertisement in video
CN111931762B (en) * 2020-09-25 2021-07-30 广州佰锐网络科技有限公司 AI-based image recognition solution method, device and readable storage medium


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9251854B2 (en) * 2011-02-18 2016-02-02 Google Inc. Facial detection, recognition and bookmarking in videos
CN104881660B (en) * 2015-06-17 2018-01-09 吉林纪元时空动漫游戏科技集团股份有限公司 The expression recognition and interactive approach accelerated based on GPU
CN105426850B (en) * 2015-11-23 2021-08-31 深圳市商汤科技有限公司 Associated information pushing device and method based on face recognition

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1645413A (en) * 2004-01-19 2005-07-27 日本电气株式会社 Image processing apparatus, method and program
CN102455898A (en) * 2010-10-29 2012-05-16 张明 Cartoon expression based auxiliary entertainment system for video chatting
CN102737331A (en) * 2010-12-02 2012-10-17 微软公司 Targeting advertisements based on emotion

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344770A (en) * 2018-09-30 2019-02-15 新华三大数据技术有限公司 Resource allocation methods and device
CN109344770B (en) * 2018-09-30 2020-10-09 新华三大数据技术有限公司 Resource allocation method and device
CN110991220A (en) * 2019-10-15 2020-04-10 北京海益同展信息科技有限公司 Egg detection method, egg image processing method, egg detection device, egg image processing device, electronic equipment and storage medium
CN110991220B (en) * 2019-10-15 2023-11-07 京东科技信息技术有限公司 Egg detection and image processing method and device, electronic equipment and storage medium
CN111582184A (en) * 2020-05-11 2020-08-25 汉海信息技术(上海)有限公司 Page detection method, device, equipment and storage medium
CN111582184B (en) * 2020-05-11 2024-02-20 汉海信息技术(上海)有限公司 Page detection method, device, equipment and storage medium
CN112183200A (en) * 2020-08-25 2021-01-05 中电海康集团有限公司 Eye movement tracking method and system based on video image
CN112183200B (en) * 2020-08-25 2023-10-17 中电海康集团有限公司 Eye movement tracking method and system based on video image
CN112434578B (en) * 2020-11-13 2023-07-25 浙江大华技术股份有限公司 Mask wearing normalization detection method, mask wearing normalization detection device, computer equipment and storage medium
CN112434578A (en) * 2020-11-13 2021-03-02 浙江大华技术股份有限公司 Mask wearing normative detection method and device, computer equipment and storage medium
CN113780164B (en) * 2021-09-09 2023-04-28 福建天泉教育科技有限公司 Head gesture recognition method and terminal
CN113780164A (en) * 2021-09-09 2021-12-10 福建天泉教育科技有限公司 Head posture recognition method and terminal
CN113946221A (en) * 2021-11-03 2022-01-18 广州繁星互娱信息科技有限公司 Eye driving control method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN107341435A (en) 2017-11-10

Similar Documents

Publication Publication Date Title
WO2018033155A1 (en) Video image processing method, apparatus and electronic device
WO2018033143A1 (en) Video image processing method, apparatus and electronic device
WO2018033154A1 (en) Gesture control method, device, and electronic apparatus
US10776970B2 (en) Method and apparatus for processing video image and computer readable medium
US11037348B2 (en) Method and apparatus for displaying business object in video image and electronic device
US11182591B2 (en) Methods and apparatuses for detecting face, and electronic devices
US11544884B2 (en) Virtual clothing try-on
US11044295B2 (en) Data processing method, apparatus and electronic device
US11657575B2 (en) Generating augmented reality content based on third-party content
CN112513875B (en) Eye texture repair
US20210312523A1 (en) Analyzing facial features for augmented reality experiences of physical products in a messaging system
US11521334B2 (en) Augmented reality experiences of color palettes in a messaging system
US11915305B2 (en) Identification of physical products for augmented reality experiences in a messaging system
US20210312678A1 (en) Generating augmented reality experiences with physical products using profile information
US20220207875A1 (en) Machine learning-based selection of a representative video frame within a messaging application
CN115668263A (en) Identification of physical products for augmented reality experience in messaging systems
US11847756B2 (en) Generating ground truths for machine learning
US20240161179A1 (en) Identification of physical products for augmented reality experiences in a messaging system
Heo et al. Hand Segmentation for Optical See-through HMD Based on Adaptive Skin Color Model Using 2D/3D Images

Legal Events

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 17841127; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 17841127; Country of ref document: EP; Kind code of ref document: A1)