CN113989925A - Face brushing interaction method and device - Google Patents

Face brushing interaction method and device

Info

Publication number
CN113989925A
CN113989925A
Authority
CN
China
Prior art keywords
image
human body
target human
body image
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111235139.0A
Other languages
Chinese (zh)
Inventor
吕瑞
杨成平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202111235139.0A priority Critical patent/CN113989925A/en
Publication of CN113989925A publication Critical patent/CN113989925A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of this specification provide a face brushing interaction method and device. The method includes: receiving, in real time, at least two images sent by a camera; detecting a target human body image in each image; positioning the target human body image; performing behavior classification according to the target human body image in the at least two images; and displaying an interactive video, in which the target human body image is mapped using a first image according to its positioning, and the action of the first image is controlled according to the behavior classification result. The scheme of this specification makes it easier to start the face brushing service, thereby expanding its development and application.

Description

Face brushing interaction method and device
Technical Field
One or more embodiments of the present disclosure relate to electronic information technology, and more particularly, to a method and apparatus for face brushing interaction.
Background
The face brushing service is a new kind of service that takes face recognition as its core and can be applied in many fields. For example, face brushing verification is performed when a user passes through an access control gate, and face brushing payment is performed when a user pays.
However, at present the face brushing service often fails to be started, which limits its development and application.
Disclosure of Invention
One or more embodiments of this specification describe a face brushing interaction method and apparatus that make it easier to start the face brushing service, thereby expanding its development and application.
According to a first aspect, there is provided a face brushing interaction method, comprising:
receiving at least two images sent by a camera in real time;
detecting a target human body image in each image;
positioning the target human body image;
performing behavior classification according to the target human body image in the at least two images;
displaying the interactive video; in the interactive video, according to the positioning of a target human body image, mapping the target human body image by using a first image; and controlling the action of the first image in the interactive video according to the behavior classification result.
The positioning the target human body image comprises: segmenting key part areas of the target human body image in each image; performing UV estimation on each of the divided key part areas to obtain U, V coordinates of each element in each key part area;
the mapping the target human body image using the first image according to the positioning of the target human body image includes: and obtaining the pixel position of the first image in the interactive video corresponding to the target human body image according to the divided key part areas and the U, V coordinates of each element, and displaying the first image in the interactive video according to the pixel position.
Wherein the key part areas comprise: at least one of a head region, a left arm region, a right arm region, an upper body region, a lower body region, a left leg region, and a right leg region.
The behavior classification according to the target human body image in the at least two images includes:
overlapping the same key part area segmented from at least two images to obtain a first feature vector;
overlapping the U coordinates representing the same element in at least two images to obtain a second feature vector;
superposing V coordinates representing the same element in at least two images to obtain a third feature vector;
splicing the first feature vector, the second feature vector and the third feature vector;
extracting the characteristics of the spliced vectors;
and performing behavior classification on the target human body image according to the feature extraction result.
Wherein, before splicing, the method further comprises the following steps: converting each feature vector into a C-dimensional feature vector;
the splicing comprises: and splicing in the C dimension.
Wherein the first image comprises: the decorated target human body image; and/or a preset cartoon image.
Wherein, each image sent by the camera comprises at least two human body images;
the detecting of the target human body image in each image includes:
estimating the depth of the human body corresponding to each human body image;
obtaining a depth value corresponding to each human body according to the depth estimation result; and
and determining the human body image of the human body corresponding to the minimum depth value as a target human body image.
After the performing the behavior classification, further comprising:
and executing the business processing corresponding to the behavior classification result according to the behavior classification result.
According to a second aspect, there is provided a face brushing interaction device, comprising:
the image receiving module is configured to receive at least two images sent by the camera in real time;
the target determining module is configured to detect a target human body image in each image;
the positioning module is configured to position the target human body image;
the classification module is configured to perform behavior classification according to the target human body image in the at least two images;
the interactive module is configured to display interactive videos; in the interactive video, according to the positioning of a target human body image, mapping the target human body image by using a first image; and controlling the action of the first image in the interactive video according to the behavior classification result.
According to a third aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements a method as described in any of the embodiments of the present specification.
In the face brushing interaction method and device provided by the embodiments of this specification, an interactive video is presented to the user. In the interactive video, instead of displaying the user's human body image in the usual way, a first image is displayed; that is, the user's human body image is mapped to a first image that interests the user or attracts the user's attention. Likewise, the user's actions (such as walking, waving, or shaking the head) are not displayed directly; instead, the action of the first image is controlled and displayed according to the classification result of the user's behavior. In other words, the user's actions are mirrored by the actions of the first image. This draws the user's attention to the face brushing service, so that the service can be started more smoothly.
Drawings
To describe the embodiments of this specification or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of this specification; other drawings can be derived from them by those skilled in the art without creative effort.
FIG. 1 is a flow chart of a face brushing interaction method in one embodiment of the present disclosure.
Fig. 2 is a flowchart illustrating a method for behavior classification of a target human body image according to an embodiment of the present disclosure.
Fig. 3 is a schematic diagram illustrating behavior classification of a target human body image according to an embodiment of the present disclosure.
Fig. 4 is a schematic structural diagram of a face brushing interaction device in an embodiment of the present disclosure.
Fig. 5 is a schematic structural diagram of a face brushing interaction device in another embodiment of the present disclosure.
Detailed Description
As described above, the face brushing service often fails to be started at present, which limits its development and application. Taking face brushing payment as an example, in an unmanned store a user may fail to start face brushing payment because the user does not stand in a suitable position when paying. For another example, in an unmanned store the user may not notice that payment can be made with the face brushing function, so face brushing payment is not started. As another example, out of habit the user pays by scanning a code rather than by brushing the face.
The scheme provided by the specification is described below with reference to the accompanying drawings.
FIG. 1 is a flow chart illustrating a face brushing interaction method in one embodiment of the present disclosure. The method is executed by a face brushing interaction device. It can be understood that the method may also be performed by any apparatus, device, platform, or device cluster with computing and processing capabilities. Referring to fig. 1, the method includes:
step 101: and receiving at least two images sent by the camera in real time.
Step 103: and detecting the target human body image in each image.
Step 105: and positioning the target human body image.
Step 107: and performing behavior classification according to the target human body image in the at least two images.
Step 109: displaying the interactive video; in the interactive video, according to the positioning of a target human body image, mapping the target human body image by using a first image; and controlling the action of the first image in the interactive video according to the behavior classification result.
It can be seen that, in the face brushing interaction method shown in fig. 1, an interactive video is presented to the user. In the interactive video, instead of displaying the user's human body image in the usual way, a first image is displayed; that is, the user's human body image is mapped to a first image that interests the user or attracts the user's attention. Likewise, the user's actions (such as walking, waving, or shaking the head) are not displayed directly; instead, the action of the first image is controlled and displayed according to the classification result of the user's behavior. In other words, the user's actions are mirrored by the actions of the first image. This draws the user's attention to the face brushing service, so that the service can be started more smoothly.
Each step shown in fig. 1 is explained below.
First, in step 101, at least two images from a camera are received in real time.
In step 101, the user may be photographed with a camera device such as an RGB camera to obtain images that contain the user's body, including the face. The RGB camera collects RGB images; in some implementations, depth images may also be collected.
In one embodiment of this specification, 8 consecutive images may be processed together as one batch; that is, the processing of the steps shown in fig. 1, including positioning and behavior classification, is performed on a window of 8 consecutive images.
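As a non-authoritative illustration of this batching step, the Python sketch below shows how 8 consecutive frames might be buffered before one pass of the pipeline; the window size of 8 comes from this embodiment, while the function names `camera_stream` and `process_window` are hypothetical placeholders, not part of the disclosure.

```python
from collections import deque

WINDOW = 8  # number of consecutive frames processed together in this embodiment


class FrameBatcher:
    """Buffers frames from the camera and releases them in fixed-size windows."""

    def __init__(self, window=WINDOW):
        self.window = window
        self.frames = deque(maxlen=window)

    def push(self, frame):
        """Add one incoming frame; return the full window once enough frames have arrived."""
        self.frames.append(frame)
        if len(self.frames) == self.window:
            batch = list(self.frames)
            self.frames.clear()  # start collecting the next window
            return batch
        return None


# Hypothetical usage inside the receiving loop (step 101):
# batcher = FrameBatcher()
# for frame in camera_stream():          # camera_stream() is an assumed helper
#     batch = batcher.push(frame)
#     if batch is not None:
#         process_window(batch)          # steps 103-109 on the 8-frame window
```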
Next, in step 103, the target human body image in each image is detected.
In step 103, an image recognition technology may be used to recognize a target human body image of a user in front of the camera in the image.
In an actual service scenario, several human bodies may appear simultaneously within the shooting range of the camera. In that case, the target human body image among the multiple human body images needs to be recognized (for example, the image of the person standing closest to the front is taken as the target human body image), so that the subsequent face brushing interaction is performed for the target user corresponding to that image. In this situation, each image received from the camera in step 101 includes at least two human body images, and one implementation of step 103 includes:
step 1031: estimating the depth of the human body corresponding to each human body image;
step 1033: obtaining a depth value corresponding to each human body according to the depth estimation result; and
step 1035: and determining the human body image of the human body corresponding to the minimum depth value as a target human body image.
In an embodiment of the present specification, an implementation procedure of the step 1031 may include:
First, panoptic segmentation is performed on each image: the H×W×3 pixel array of the image is input into a pre-trained panoptic segmentation model, which outputs a one-dimensional label for each pixel belonging to a human body image.
Then, each label output by the panoptic segmentation model is mapped to an N-dimensional vector through the embedding layer of a pre-trained depth estimation model, and these N-dimensional vectors form a segmentation vector map. A convolutional neural network in the depth estimation model extracts features from the segmentation vector map to obtain a depth map and a confidence map corresponding to each human body image; each pixel value in the depth map represents the distance from the corresponding position to the camera, and each pixel value in the confidence map represents the confidence of the depth value at that pixel.
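The PyTorch sketch below is one possible reading of this depth estimation model, assuming the panoptic labels are given as an integer map; the number of labels, the embedding width, and the small convolutional backbone are illustrative assumptions, and a per-body depth map could be obtained by masking the output with each body's segmentation.

```python
import torch
import torch.nn as nn


class DepthEstimator(nn.Module):
    """Sketch of step 1031's depth estimation model: panoptic labels -> depth + confidence."""

    def __init__(self, num_labels=32, embed_dim=16):
        super().__init__()
        # Embedding layer: maps each per-pixel panoptic label to an N-dimensional vector.
        self.embed = nn.Embedding(num_labels, embed_dim)
        # Convolutional network over the resulting segmentation vector map.
        self.cnn = nn.Sequential(
            nn.Conv2d(embed_dim, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 1),  # channel 0: depth, channel 1: confidence logit
        )

    def forward(self, label_map):
        # label_map: (N, H, W) integer panoptic-segmentation labels
        x = self.embed(label_map).permute(0, 3, 1, 2)  # (N, embed_dim, H, W) segmentation vector map
        out = self.cnn(x)
        depth_map = out[:, 0]                # distance from each pixel's position to the camera
        conf_map = torch.sigmoid(out[:, 1])  # confidence of the depth value at each pixel
        return depth_map, conf_map


# Hypothetical usage on one panoptic label map:
# model = DepthEstimator()
# depth, conf = model(torch.randint(0, 32, (1, 64, 64)))
```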
Accordingly, one implementation of step 1033 includes the following, performed for each human body image (see the sketch after these steps):
selecting, from the depth map corresponding to the human body image, the pixels whose confidence in the corresponding confidence map is higher than 0.5; and
calculating the average of the depth values of the selected pixels, and taking this average as the depth value of the human body corresponding to the human body image.
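Under the same assumptions, the NumPy sketch below illustrates steps 1033 and 1035: confidence-filtered mean depth per body, then selection of the body with the smallest depth value as the target. The dictionary layout of `bodies` is an illustrative assumption, not the patent's data structure.

```python
import numpy as np


def body_depth(depth_map: np.ndarray, conf_map: np.ndarray, threshold: float = 0.5) -> float:
    """Step 1033: mean depth over pixels whose confidence exceeds the threshold."""
    mask = conf_map > threshold
    if not mask.any():            # no reliable pixels: treat this body as infinitely far away
        return float("inf")
    return float(depth_map[mask].mean())


def pick_target(bodies):
    """Step 1035: choose the body with the smallest depth value as the target.

    `bodies` is assumed to be a list of dicts with keys 'image', 'depth_map'
    and 'conf_map' for each detected human body.
    """
    depths = [body_depth(b["depth_map"], b["conf_map"]) for b in bodies]
    return bodies[int(np.argmin(depths))], min(depths)
```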
Next, in step 105, the target human body image is located.
In one embodiment of the present specification, one implementation of this step 105 includes:
step 1051: segmenting key part areas of the target human body image in each image;
Step 1053: performing UV estimation on each segmented key part region to obtain the U, V coordinates of each element in each key part region.
In this way, the key part regions and the UV coordinates can be used to locate the target human body image.
The key part regions may include: at least one of a head region, a left arm region, a right arm region, an upper body region, a lower body region, a left leg region, and a right leg region. In step 1051, the target human body image in the image may be segmented into the head region, left arm region, right arm region, upper body region, lower body region, left leg region, and right leg region of the human body.
Step 1053 uses UV coordinates. In short, UV coordinates associate each element of an image with a point on the surface of a model object. In this specification, the target human body image is mapped using the first image. The target human body image is 2-dimensional, while the first image generated in the system usually adopts a 3-dimensional model so that it can be displayed vividly and in three dimensions; through the UV coordinates obtained in step 1053, the vertices of the polygons of the 3-dimensional model (i.e., the first image) can be mapped to the pixels of the key part regions of the image.
In an actual business implementation, a dense pose estimation model may be trained in advance, and after the target human body image is detected in step 103, the target human body image may be input into the dense pose estimation model, so that, in step 105, the segmentation of each key part region of the target human body image and U, V coordinates of each element in each key part region may be output by performing estimation by the dense pose estimation model.
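As a rough sketch of how the dense pose estimation output described above might be consumed, the snippet below assumes a hypothetical `dense_pose_model` callable that returns a part-index map (0 for background, 1..K for the key part regions) and per-pixel U and V maps; this data layout is an assumption for illustration, not the patent's specification.

```python
import numpy as np

# Assumed key part regions (K = 7), matching the regions listed in the text.
PART_NAMES = ["head", "left_arm", "right_arm", "upper_body",
              "lower_body", "left_leg", "right_leg"]


def locate_target(target_image, dense_pose_model):
    """Step 105: segment key part regions and estimate per-pixel UV coordinates.

    `dense_pose_model` is a hypothetical callable returning:
      part_map: (H, W) int array, 0 = background, 1..K = key part region index
      u_map, v_map: (H, W) float arrays with the U and V coordinate of each pixel
    """
    part_map, u_map, v_map = dense_pose_model(target_image)
    regions = {}
    for k, name in enumerate(PART_NAMES, start=1):
        mask = part_map == k
        regions[name] = {
            "mask": mask,  # which pixels belong to this key part region
            "uv": np.stack([u_map[mask], v_map[mask]], axis=-1),  # U, V of each element
        }
    return regions
```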
Next, in step 107, behavior classification is performed based on the target human body image in the at least two images.
Here, behavior classification is performed on the plurality of images in order to recognize the behavior of the corresponding user, such as nodding, shaking the head, waving, or walking, so that the action of the first image can be controlled accordingly.
In step 107, behavior classification can be performed using the key part regions and UV coordinates obtained in step 105. Because behavior classification requires a plurality of temporally consecutive images, the outputs of the dense pose estimation model for these consecutive images can be stacked. Referring to fig. 2, a specific implementation includes:
step 201: overlapping the same key part area segmented from at least two images to obtain a first feature vector;
step 203: overlapping the U coordinates which are segmented from at least two images and represent the same element in the same key part area to obtain a second feature vector;
step 205: superposing V coordinates which are segmented from at least two images and represent the same element in the same key part area to obtain a third feature vector;
step 207: splicing the first feature vector, the second feature vector and the third feature vector;
step 209: extracting the characteristics of the spliced vectors;
step 211: and performing behavior classification on the target human body image according to the feature extraction result.
Before step 207, the method further includes converting each feature vector into a C-dimensional feature vector, and the concatenation of step 207 is performed on the C dimension. C is a natural number not less than 1; for example, C may be 3 or 4. The larger the value of C (i.e., the higher the dimension), the larger the representation space and the more accurate the representation. The value of C can be set according to the trade-off between available computing power and representation accuracy.
The implementation of step 107 can also be seen in the schematic diagram of fig. 3. Referring to fig. 3, a time-series behavior classification model can be trained in advance. The input of the time-series behavior classification model is the output of the dense pose estimation model, and its output is the behavior classification result of the target human body image, such as nodding, shaking the head, or waving.
Referring to fig. 3, each input stream of the time-series behavior classification model has size N × (F × K) × H × W, where N is set to 1 during forward inference and F × K is the number of temporal channels (F is the number of frames and K is the number of channels, i.e., the number of segmented key part regions). Each of the three input streams passes through its own convolution (Conv) layer that converts its channels to the C dimension; the three streams are then concatenated (Concat) on the C dimension into 3C channels for feature fusion, fed into a convolutional neural network for feature extraction, and finally classified by a softmax layer that outputs the behavior category.
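The PyTorch sketch below is one possible reading of this three-stream structure under stated assumptions: the layer sizes, the small backbone, and the number of behavior classes are illustrative choices, not the patent's prescribed architecture.

```python
import torch
import torch.nn as nn


class TemporalBehaviorClassifier(nn.Module):
    """Three-stream temporal classifier: key-part masks, U maps, and V maps."""

    def __init__(self, frames=8, parts=7, c=4, num_classes=4):
        super().__init__()
        in_ch = frames * parts  # F*K temporal channels per stream
        # One 1x1 convolution per stream converts its channels to the C dimension.
        self.proj_mask = nn.Conv2d(in_ch, c, kernel_size=1)
        self.proj_u = nn.Conv2d(in_ch, c, kernel_size=1)
        self.proj_v = nn.Conv2d(in_ch, c, kernel_size=1)
        # Small convolutional backbone for feature extraction on the fused 3C channels.
        self.backbone = nn.Sequential(
            nn.Conv2d(3 * c, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_classes)  # followed by softmax over behavior categories

    def forward(self, masks, u_maps, v_maps):
        # Each input has shape (N, F*K, H, W); N is 1 at inference time per the text.
        fused = torch.cat(
            [self.proj_mask(masks), self.proj_u(u_maps), self.proj_v(v_maps)], dim=1
        )  # concatenate on the C dimension -> 3C channels
        logits = self.head(self.backbone(fused))
        return torch.softmax(logits, dim=-1)  # probability per behavior category


# Hypothetical usage with an 8-frame window and 7 key part regions (F*K = 56 channels):
# model = TemporalBehaviorClassifier(frames=8, parts=7, c=4, num_classes=4)
# probs = model(torch.rand(1, 56, 64, 64), torch.rand(1, 56, 64, 64), torch.rand(1, 56, 64, 64))
```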
Next, in step 109, the interactive video is displayed; in the interactive video, the target human body image is mapped using a first image according to the positioning of the target human body image, and the action of the first image is controlled according to the behavior classification result.
In the embodiments of the present disclosure, the first image may be any image that can generate interest or draw attention of the user.
In one approach, the first image may be a decorated target human body image, such as a beautified target human body image or a retro-styled target human body image.
In another approach, the first image may be a preset cartoon image, such as a character from an animated film or another cartoon figure. In one embodiment of this specification, face recognition may be performed on the obtained target human body image to identify the user; the user's historical behavior data, such as animations or cartoons the user often watches, is obtained according to the user's identity; a cartoon image the user is interested in is determined by analyzing this historical behavior data; and the resulting personalized cartoon image is used as the first image to map the user's target human body image.
In an embodiment of the present specification, the mapping of the target human body image by using the first image in step 109 may be implemented by using the UV coordinates of each key region and each element obtained in step 105, and one specific implementation process includes: and obtaining the pixel position of the first image in the interactive video corresponding to the target human body image according to the key part areas in the divided image and the U, V coordinates of each element in each key part area, and displaying the first image in the interactive video according to the pixel position.
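One hedged way to realize this mapping is sketched below: for every pixel of a key part region, the estimated (U, V) coordinate is used to look up the corresponding texel of the first image's texture, so the first image is drawn at the pixel positions occupied by the target human body. The per-part texture layout and array shapes are assumptions for illustration, not the patent's prescribed rendering pipeline.

```python
import numpy as np


def render_first_image(frame, regions, textures):
    """Draw the first image over the target human body using key part regions and UV coordinates.

    frame:    (H, W, 3) uint8 video frame shown in the interactive video
    regions:  {part_name: {"mask": (H, W) bool, "uv": (M, 2) floats in [0, 1]}}
    textures: {part_name: (Th, Tw, 3) uint8 texture of the first image for that part} (assumed layout)
    """
    out = frame.copy()
    for name, region in regions.items():
        tex = textures[name]
        th, tw = tex.shape[:2]
        ys, xs = np.nonzero(region["mask"])  # pixel positions of this key part region
        u, v = region["uv"][:, 0], region["uv"][:, 1]
        # UV lookup: each body pixel takes the colour of the corresponding texel.
        ti = np.clip((v * (th - 1)).astype(int), 0, th - 1)
        tj = np.clip((u * (tw - 1)).astype(int), 0, tw - 1)
        out[ys, xs] = tex[ti, tj]
    return out
```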
For example, in a face brushing payment service, an interactive video is displayed to the user, and a cartoon character is used in the interactive video to map the user's image. The position of the cartoon character in the interactive video moves with the user's position in front of the camera, and the character's actions follow the user's actions: when the user walks, the cartoon character walks; when the user nods, the cartoon character nods; and when the user waves or turns around, the cartoon character waves or turns around as well.
In this way, the user's human body image is mapped by a first image that interests the user or attracts the user's attention, and the first image interacts with the user, so the user pays attention to the face brushing service and the service can be started more smoothly. Taking face brushing payment as an example, in an unmanned store, even if the user does not stand in a suitable position when paying, the interaction of the first image (for example, having the cartoon character guide the user toward the correct spot) can lead the user to a suitable payment position, so that face brushing payment is started. For another example, in an unmanned store, the user may not notice that payment can be made with the face brushing function, but the interaction of the first image in the interactive video may draw the user's attention to the face brushing interaction device, so face brushing payment can be started. As another example, even a user who is used to paying by scanning a code may become interested in face brushing payment because of the interaction effect of the first image, and may then pay by brushing the face.
In one embodiment of this specification, after the behavior classification result of the target human body image is obtained in step 107, the business process corresponding to that result may further be executed. For example, if the user's nodding behavior is preset to correspond to starting face brushing payment, and the behavior classification result obtained in step 107 is nodding, the corresponding face brushing payment process is executed, for example recognizing the user's face and deducting the payment.
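A minimal dispatch sketch, assuming a hypothetical mapping in which a nod starts face brushing payment; the handler names and the set of behaviors are illustrative only.

```python
def on_behavior(behavior: str, services) -> None:
    """Trigger the business process that corresponds to the behavior classification result."""
    handlers = {
        "nod": services.start_face_payment,   # e.g. recognize the face and deduct the payment
        "wave": services.show_greeting,       # assumed auxiliary interaction
    }
    handler = handlers.get(behavior)
    if handler is not None:
        handler()
```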
In one embodiment of this specification, referring to fig. 4, there is provided a face brushing interaction device, comprising:
the image receiving module 401 is configured to receive at least two images sent by a camera in real time;
a target determination module 402 configured to detect a target human body image in each image;
a positioning module 403 configured to position the target human body image;
a classification module 404 configured to perform behavior classification according to the target human body image in the at least two images;
an interactive module 405 configured to display an interactive video; in the interactive video, according to the positioning of a target human body image, mapping the target human body image by using a first image; and controlling the action of the first image in the interactive video according to the behavior classification result.
In one embodiment of the apparatus of the present description, the positioning module 403 is configured to perform: segmenting key part areas of the target human body image in each image; performing UV estimation on each of the divided key part areas to obtain U, V coordinates of each element in each key part area;
accordingly, the interaction module 405 is configured to perform: and obtaining the pixel position of the first image in the interactive video corresponding to the target human body image according to the divided key part areas and the U, V coordinates of each element, and displaying the first image in the interactive video according to the pixel position.
In one embodiment of the apparatus of the present description, the classification module 404 is configured to perform:
overlapping the same key part area segmented from at least two images to obtain a first feature vector;
overlapping the U coordinates representing the same element in at least two images to obtain a second feature vector;
superposing V coordinates representing the same element in at least two images to obtain a third feature vector;
splicing the first feature vector, the second feature vector and the third feature vector;
extracting the characteristics of the spliced vectors;
and performing behavior classification on the target human body image according to the feature extraction result.
In one embodiment of the apparatus of the present description, the classification module 404 is further configured to: before the splicing is carried out, converting each feature vector into a C-dimensional feature vector; and performing the above splicing in the C dimension.
In one embodiment of the apparatus of this specification, each image sent by the camera includes at least two human body images;
a targeting module 402 configured to perform:
estimating the depth of the human body corresponding to each human body image;
obtaining a depth value corresponding to each human body according to the depth estimation result; and
and determining the human body image of the human body corresponding to the minimum depth value as a target human body image.
In one embodiment of the device of this specification, referring to fig. 5, the face brushing interaction device further comprises: a service processing module 501, configured to execute, according to the behavior classification result, the service processing corresponding to that result.
An embodiment of the present specification provides a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of the embodiments of the specification.
One embodiment of the present specification provides a computing device comprising a memory and a processor, the memory having stored therein executable code, the processor implementing a method in accordance with any one of the embodiments of the specification when executing the executable code.
It should be understood that the structures illustrated in the embodiments of this specification do not constitute a specific limitation on the apparatus. In other embodiments of this specification, the apparatus may include more or fewer components than illustrated, some components may be combined or split, or a different arrangement of components may be used. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
For the information exchange between the modules of the above apparatus and the execution processes within it, since they are based on the same concept as the method embodiments of this specification, details can be found in the description of the method embodiments and are not repeated here.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this disclosure may be implemented in hardware, software, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above embodiments further describe the objects, technical solutions, and advantages of the present invention in detail. It should be understood that the above are only exemplary embodiments of the present invention and are not intended to limit its scope; any modifications, equivalent substitutions, improvements, and the like made on the basis of the technical solutions of the present invention shall fall within its scope.

Claims (10)

1. A face brushing interaction method, comprising:
receiving at least two images sent by a camera in real time;
detecting a target human body image in each image;
positioning the target human body image;
performing behavior classification according to the target human body image in the at least two images;
displaying the interactive video; in the interactive video, according to the positioning of a target human body image, mapping the target human body image by using a first image; and controlling the action of the first image in the interactive video according to the behavior classification result.
2. The method of claim 1, wherein,
the positioning the target human body image comprises: segmenting key part areas of the target human body image in each image; performing UV estimation on each of the divided key part areas to obtain U, V coordinates of each element in each key part area;
the mapping the target human body image using the first image according to the positioning of the target human body image includes: and obtaining the pixel position of the first image in the interactive video corresponding to the target human body image according to the divided key part areas and the U, V coordinates of each element, and displaying the first image in the interactive video according to the pixel position.
3. The method of claim 2, wherein the key part areas comprise: at least one of a head region, a left arm region, a right arm region, an upper body region, a lower body region, a left leg region, and a right leg region.
4. The method of claim 2, wherein the performing behavior classification according to the target human body image in the at least two images comprises:
overlapping the same key part area segmented from at least two images to obtain a first feature vector;
overlapping the U coordinates representing the same element in at least two images to obtain a second feature vector;
superposing V coordinates representing the same element in at least two images to obtain a third feature vector;
splicing the first feature vector, the second feature vector and the third feature vector;
extracting the characteristics of the spliced vectors;
and performing behavior classification on the target human body image according to the feature extraction result.
5. The method of claim 4, wherein prior to splicing, further comprising: converting each feature vector into a C-dimensional feature vector;
the splicing comprises: and splicing in the C dimension.
6. The method of any of claims 1 to 5, wherein the first image comprises: the decorated target human body image; and/or a preset cartoon image.
7. The method according to any one of claims 1 to 5, wherein each image sent by the camera comprises at least two human body images;
the detecting of the target human body image in each image includes:
estimating the depth of the human body corresponding to each human body image;
obtaining a depth value corresponding to each human body according to the depth estimation result; and
and determining the human body image of the human body corresponding to the minimum depth value as a target human body image.
8. The method of any of claims 1 to 5, further comprising, after the performing behavior classification:
and executing the business processing corresponding to the behavior classification result according to the behavior classification result.
9. A face brushing interaction device, comprising:
the image receiving module is configured to receive at least two images sent by the camera in real time;
the target determining module is configured to detect a target human body image in each image;
the positioning module is configured to position the target human body image;
the classification module is configured to perform behavior classification according to the target human body image in the at least two images;
the interactive module is configured to display interactive videos; in the interactive video, according to the positioning of a target human body image, mapping the target human body image by using a first image; and controlling the action of the first image in the interactive video according to the behavior classification result.
10. A computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any of claims 1-8.
CN202111235139.0A 2021-10-22 2021-10-22 Face brushing interaction method and device Pending CN113989925A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111235139.0A CN113989925A (en) 2021-10-22 2021-10-22 Face brushing interaction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111235139.0A CN113989925A (en) 2021-10-22 2021-10-22 Face brushing interaction method and device

Publications (1)

Publication Number Publication Date
CN113989925A true CN113989925A (en) 2022-01-28

Family

ID=79740536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111235139.0A Pending CN113989925A (en) 2021-10-22 2021-10-22 Face brushing interaction method and device

Country Status (1)

Country Link
CN (1) CN113989925A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090244309A1 (en) * 2006-08-03 2009-10-01 Benoit Maison Method and Device for Identifying and Extracting Images of multiple Users, and for Recognizing User Gestures
CN102479388A (en) * 2010-11-22 2012-05-30 北京盛开互动科技有限公司 Expression interaction method based on face tracking and analysis
US20150362733A1 (en) * 2014-06-13 2015-12-17 Zambala Lllp Wearable head-mounted display and camera system with multiple modes
CN107924579A (en) * 2015-08-14 2018-04-17 麦特尔有限公司 The method for generating personalization 3D head models or 3D body models
US20200090408A1 (en) * 2018-09-14 2020-03-19 Virkar Hemant Systems and methods for augmented reality body movement guidance and measurement
CN109754464A (en) * 2019-01-31 2019-05-14 北京字节跳动网络技术有限公司 Method and apparatus for generating information
CN110175514A (en) * 2019-04-15 2019-08-27 阿里巴巴集团控股有限公司 A kind of brush face payment reminding method, device and equipment
CN112235635A (en) * 2019-07-15 2021-01-15 腾讯科技(北京)有限公司 Animation display method, animation display device, electronic equipment and storage medium
CN112637692A (en) * 2019-10-09 2021-04-09 阿里巴巴集团控股有限公司 Interaction method, device and equipment
CN111638784A (en) * 2020-05-26 2020-09-08 浙江商汤科技开发有限公司 Facial expression interaction method, interaction device and computer storage medium
CN112967214A (en) * 2021-02-18 2021-06-15 深圳市慧鲤科技有限公司 Image display method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吕伟龙; 曾宪晖; 孙亚旺: "360全景视频渲染与压缩方法探索" (Exploration of 360-Degree Panoramic Video Rendering and Compression Methods), 网络新媒体技术 (Network New Media Technology), no. 01, 15 January 2017 (2017-01-15) *

Similar Documents

Publication Publication Date Title
CN111328396B (en) Pose estimation and model retrieval for objects in images
CN109584276B (en) Key point detection method, device, equipment and readable medium
US10949649B2 (en) Real-time tracking of facial features in unconstrained video
KR101906431B1 (en) Method and system for 3d modeling based on 2d image recognition
CN108256479B (en) Face tracking method and device
US10410089B2 (en) Training assistance using synthetic images
US11748888B2 (en) End-to-end merge for video object segmentation (VOS)
US6016148A (en) Automated mapping of facial images to animation wireframes topologies
US20220301295A1 (en) Recurrent multi-task convolutional neural network architecture
CN109345510A (en) Object detecting method, device, equipment, storage medium and vehicle
WO2021257210A1 (en) Computing images of dynamic scenes
Lee et al. Real-time depth estimation using recurrent CNN with sparse depth cues for SLAM system
CN111062263B (en) Method, apparatus, computer apparatus and storage medium for hand gesture estimation
KR101887637B1 (en) Robot system
Hashmi et al. FashionFit: Analysis of mapping 3D pose and neural body fit for custom virtual try-on
US11417007B2 (en) Electronic apparatus and method for controlling thereof
KR20140026629A (en) Dynamic gesture recognition process and authoring system
WO2023168957A1 (en) Pose determination method and apparatus, electronic device, storage medium, and program
CN111327888B (en) Camera control method and device, computer equipment and storage medium
Shi et al. An improved lightweight deep neural network with knowledge distillation for local feature extraction and visual localization using images and LiDAR point clouds
KR20160049191A (en) Wearable device
CN112396654A (en) Method and device for determining pose of tracking object in image tracking process
CN117252947A (en) Image processing method, image processing apparatus, computer, storage medium, and program product
CN112655021A (en) Image processing method, image processing device, electronic equipment and storage medium
CN113989925A (en) Face brushing interaction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination