CN115205925A - Expression coefficient determining method and device, electronic equipment and storage medium - Google Patents

Expression coefficient determining method and device, electronic equipment and storage medium

Info

Publication number
CN115205925A
Authority
CN
China
Prior art keywords
expression coefficient
image information
information
current image
expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210641776.6A
Other languages
Chinese (zh)
Inventor
叶奎
张国鑫
马里千
刘晓强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202210641776.6A priority Critical patent/CN115205925A/en
Publication of CN115205925A publication Critical patent/CN115205925A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G06V40/176 Dynamic expression
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Image Processing (AREA)

Abstract

The disclosure relates to an expression coefficient determination method and apparatus, an electronic device, and a storage medium, in the field of internet technologies. For the current image information, when the current image information contains face information but the face information does not satisfy a preset condition, a first expression coefficient is extracted from the face information and a second expression coefficient is predicted from the audio information corresponding to the current image information, where the current image information is any frame of image information. A target expression coefficient corresponding to the current image information is then obtained from the first expression coefficient and the second expression coefficient. Because the target expression coefficient combines the first expression coefficient extracted from the face information contained in the image information with the second expression coefficient predicted from the audio information, the accuracy of the recognized expression coefficient of the user object can be improved.

Description

Expression coefficient determining method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to an expression coefficient determining method and apparatus, an electronic device, and a storage medium.
Background
Facial expression recognition refers to using a computer to extract expression features from a detected facial image to obtain an expression coefficient, where the expression coefficient is a description of a facial expression that a computer can understand. Based on expression coefficients, a friendlier and more intelligent human-computer interaction environment can be established.
Driving an avatar's expression is one application scenario of facial expression recognition. For example, in a virtual live-streaming scene, the live frame of the anchor object and the anchor object's avatar are usually displayed simultaneously in the virtual live interface. The computer extracts features from the anchor object's expression to obtain the anchor object's expression coefficient, and then uses that coefficient to drive the expression of the anchor object's avatar. When the anchor object's expression changes, the avatar's expression changes with it.
The accuracy of the expression coefficient recognized by the computer determines how well the avatar's expression is driven; an inaccurate coefficient can make the avatar's expression unnatural or mismatched with the anchor object's expression. How to improve the accuracy of the recognized expression coefficient of the anchor object has therefore become an urgent technical problem.
Disclosure of Invention
The present disclosure provides an expression coefficient determination method and apparatus, an electronic device, and a storage medium, which can improve the accuracy of the recognized expression coefficient.
The technical solutions of the embodiments of the present disclosure are as follows:
According to a first aspect of the embodiments of the present disclosure, there is provided an expression coefficient determination method, including: acquiring video information, where the video information includes multiple frames of image information and audio information corresponding to each frame of image information; for the current image information, when the current image information contains face information and the face information does not satisfy a preset condition, extracting a first expression coefficient from the face information and predicting a second expression coefficient from the audio information corresponding to the current image information, the current image information being any frame of image information; and obtaining a target expression coefficient corresponding to the current image information according to the first expression coefficient and the second expression coefficient.
Optionally, obtaining the target expression coefficient corresponding to the current image information according to the first expression coefficient and the second expression coefficient includes: fusing the first expression coefficient and the second expression coefficient to obtain a fused expression coefficient corresponding to the current image information; and optimizing the fused expression coefficient corresponding to the current image information according to a preset smoothing coefficient and the target expression coefficient corresponding to the previous frame of image information, to obtain the target expression coefficient corresponding to the current image information.
Optionally, fusing the first expression coefficient and the second expression coefficient to obtain the fused expression coefficient corresponding to the current image information includes: weighting the first expression coefficient and the second expression coefficient according to a preset weight of the first expression coefficient and a preset weight of the second expression coefficient to obtain the fused expression coefficient, where the preset weight of the first expression coefficient is smaller than the preset weight of the second expression coefficient.
Optionally, the method further includes: when the current image information does not contain face information, predicting the second expression coefficient from the audio information corresponding to the current image information and taking the second expression coefficient as the fused expression coefficient corresponding to the current image information; and optimizing the fused expression coefficient corresponding to the current image information according to the preset smoothing coefficient and the target expression coefficient corresponding to the previous frame of image information, to obtain the target expression coefficient corresponding to the current image information.
Optionally, the method further includes: when the image information contains face information and the face information satisfies the preset condition, extracting the first expression coefficient from the face information and taking the first expression coefficient as the fused expression coefficient corresponding to the current image information; and optimizing the fused expression coefficient corresponding to the current image information according to the preset smoothing coefficient and the target expression coefficient corresponding to the previous frame of image information, to obtain the target expression coefficient corresponding to the current image information.
Optionally, after the video information is acquired, the method further includes: recognizing face information in the current image information; when face information is recognized, determining the pose angle and/or integrity corresponding to the face information, the integrity indicating whether the face information is occluded; if the pose angle corresponding to the face information does not satisfy a preset angle and/or the integrity does not satisfy a preset integrity, determining that the image information contains face information and the face information does not satisfy the preset condition; and if the pose angle corresponding to the face information satisfies the preset angle and/or the integrity satisfies the preset integrity, determining that the image information contains face information and the face information satisfies the preset condition.
According to a second aspect of the embodiments of the present disclosure, there is provided an expression coefficient determination apparatus, including: an information acquisition unit configured to acquire video information including multiple frames of image information and audio information corresponding to each frame of image information; a first determining unit configured to, when the current image information contains face information and the face information does not satisfy a preset condition, extract a first expression coefficient from the face information and predict a second expression coefficient from the audio information corresponding to the current image information, the current image information being any frame of image information; and a second determining unit configured to obtain a target expression coefficient corresponding to the current image information according to the first expression coefficient and the second expression coefficient.
Optionally, the second determining unit is specifically configured to: fuse the first expression coefficient and the second expression coefficient to obtain a fused expression coefficient corresponding to the current image information; and optimize the fused expression coefficient corresponding to the current image information according to a preset smoothing coefficient and the target expression coefficient corresponding to the previous frame of image information, to obtain the target expression coefficient corresponding to the current image information.
Optionally, the second determining unit is further configured to: weight the first expression coefficient and the second expression coefficient according to a preset weight of the first expression coefficient and a preset weight of the second expression coefficient to obtain the fused expression coefficient, where the preset weight of the first expression coefficient is smaller than the preset weight of the second expression coefficient.
Optionally, the first determining unit is further configured to: when the current image information does not contain face information, predict the second expression coefficient from the audio information corresponding to the current image information and take the second expression coefficient as the fused expression coefficient corresponding to the current image information; and optimize the fused expression coefficient corresponding to the current image information according to the preset smoothing coefficient and the target expression coefficient corresponding to the previous frame of image information, to obtain the target expression coefficient corresponding to the current image information.
Optionally, the second determining unit is further configured to: when the image information contains face information and the face information satisfies the preset condition, extract the first expression coefficient from the face information and take the first expression coefficient as the fused expression coefficient corresponding to the current image information; and optimize the fused expression coefficient corresponding to the current image information according to the preset smoothing coefficient and the target expression coefficient corresponding to the previous frame of image information, to obtain the target expression coefficient corresponding to the current image information.
Optionally, after the video information is acquired, the information acquisition unit is further configured to: recognize face information in the current image information; when face information is recognized, determine the pose angle and/or integrity corresponding to the face information, the integrity indicating whether the face information is occluded; if the pose angle corresponding to the face information does not satisfy a preset angle and/or the integrity does not satisfy a preset integrity, determine that the image information contains face information and the face information does not satisfy the preset condition; and if the pose angle corresponding to the face information satisfies the preset angle and/or the integrity satisfies the preset integrity, determine that the image information contains face information and the face information satisfies the preset condition.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device, which may include: a processor and a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement any one of the above-described optional expression coefficient determination methods of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having instructions stored thereon, which, when executed by a processor of an electronic device, enable the electronic device to perform any one of the above-mentioned optional expression coefficient determination methods of the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product which, when executed by a processor, implements the expression coefficient determination method of any one of the optional implementations of the first aspect.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
Based on any one of the above aspects of the present disclosure, an expression coefficient determination method is provided, including: acquiring video information, where the video information includes multiple frames of image information and audio information corresponding to each frame of image information; for the current image information, when the current image information contains face information and the face information does not satisfy a preset condition, extracting a first expression coefficient from the face information and predicting a second expression coefficient from the audio information corresponding to the current image information, the current image information being any frame of image information; and obtaining a target expression coefficient corresponding to the current image information according to the first expression coefficient and the second expression coefficient. Because the target expression coefficient combines the first expression coefficient extracted from the face information contained in the image information with the second expression coefficient predicted from the audio information, the accuracy of the recognized expression coefficient of the user object can be improved. For example, in a virtual live-streaming scene, the accuracy of the recognized expression coefficient of the anchor object can be improved, ensuring the driving effect on the avatar's expression and avoiding an avatar expression that is unnatural or mismatched with the anchor object's expression.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 shows a schematic structural diagram of a live broadcast system provided by an embodiment of the present disclosure;
fig. 2 is a schematic flowchart illustrating an expression coefficient determining method according to an embodiment of the present disclosure;
fig. 3 is a schematic flow chart illustrating another method for determining an expression coefficient according to an embodiment of the present disclosure;
fig. 4 is a flowchart illustrating another method for determining an expression coefficient according to an embodiment of the present disclosure;
fig. 5 is a schematic flowchart illustrating another expression coefficient determining method according to an embodiment of the disclosure;
fig. 6 is a schematic flowchart illustrating another expression coefficient determining method according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram illustrating an expression coefficient determining apparatus according to an embodiment of the present disclosure;
fig. 8 shows a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the disclosure, as detailed in the appended claims.
It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, and/or components.
It should be noted that, the user information (including but not limited to user device information, user personal information, user behavior information, etc.) and data (including but not limited to page data corresponding to a dynamic page, etc.) related to the present disclosure are data authorized by the user or fully authorized by each party.
Facial expression recognition refers to using a computer to extract expression features from a detected facial image to obtain an expression coefficient (also called a 3D expression coefficient), where the expression coefficient is a description of a facial expression that a computer can understand. Based on expression coefficients, a friendlier and more intelligent human-computer interaction environment can be established.
Driving an avatar's expression is one application scenario of facial expression recognition. For example, in a virtual live-streaming scene, the live frame of the anchor object and the anchor object's avatar are usually displayed simultaneously in the virtual live interface. The computer extracts features from the anchor object's expression to obtain the anchor object's expression coefficient, and then uses that coefficient to drive the expression of the anchor object's avatar. When the anchor object's expression changes, the avatar's expression changes with it.
The accuracy of the expression coefficient recognized by the computer determines how well the avatar's expression is driven. How to improve the accuracy of the recognized expression coefficient of the anchor object has therefore become an urgent technical problem.
Based on this, an embodiment of the present disclosure provides an expression coefficient determination method, including: acquiring video information, where the video information includes multiple frames of image information and audio information corresponding to each frame of image information; for the current image information, when the current image information contains face information and the face information does not satisfy a preset condition, extracting a first expression coefficient from the face information and predicting a second expression coefficient from the audio information corresponding to the current image information, the current image information being any frame of image information; and obtaining a target expression coefficient corresponding to the current image information according to the first expression coefficient and the second expression coefficient. Because the target expression coefficient combines the first expression coefficient extracted from the face information contained in the image information with the second expression coefficient predicted from the audio information, the accuracy of the recognized expression coefficient of the user object can be improved. For example, in a virtual live-streaming scene, the accuracy of the recognized expression coefficient of the anchor object can be improved, ensuring the driving effect on the avatar's expression and avoiding an avatar expression that is unnatural or mismatched with the anchor object's expression.
An application scenario of the expression coefficient determination method provided by the embodiment of the present disclosure is exemplarily described below:
fig. 1 is a schematic view of a live broadcast system according to an embodiment of the present disclosure, where the live broadcast system may implement the virtual live broadcast scene. As shown in fig. 1, the live system includes: server 110, first terminal device 120, and second terminal device 130. The server 110 may establish a connection with the first terminal device 120 and the second terminal device 130 through a wired network or a wireless network.
The server 110 may be configured to receive live content from the first terminal device 120 and send it to the second terminal device 130 of the audience object, so that the audience object can watch the anchor object's live stream. The live content is used to display a live interface that includes the live frame of the anchor object and the anchor object's avatar determined according to the anchor object's expression coefficient; the avatar's expression changes as the anchor object's expression changes, that is, the avatar matches the anchor object's expression. In some embodiments, the server 110 may also determine the anchor object's expression coefficient from the live frame in the live content sent by the first terminal device 120 and drive the expression of the anchor object's avatar according to that coefficient.
In some embodiments, the server 110 may be a single server, or may be a server cluster composed of a plurality of servers (or micro servers). The server cluster may also be a distributed cluster. The present disclosure is also not limited to a particular implementation of the server 110.
The first terminal device 120 may be configured to obtain video information of the user object, where the video information includes multiple frames of image information and audio information corresponding to each frame of image information. The first terminal device 120 includes an image input device and an audio input device, for example, the image input device may be a camera, and the audio input device may be a microphone. The multi-frame image information is acquired through the image input device, and the audio information is acquired through the audio input device.
In one implementation of the present disclosure, the terminal device first determines a first expression coefficient from the image information and a second expression coefficient from the audio information, then determines a target expression coefficient from the first and second expression coefficients, generates the anchor object's avatar according to the target expression coefficient, finally obtains live content including the anchor object's live frame and the avatar, and sends the live content to the server 110.
In another implementation of the present disclosure, the terminal device sends the multi-frame image information and the audio information corresponding to each frame of image information to the server 110. The server 110 determines a first expression coefficient from the image information and a second expression coefficient from the audio information, then determines a target expression coefficient from the first and second expression coefficients, generates the anchor object's avatar according to the target expression coefficient, finally obtains live content including the anchor object's live frame and the avatar, and sends the live content to the first terminal device 120 and the second terminal device 130, where the first terminal device 120 may be a device used by the anchor object and the second terminal device 130 may be a device used by an audience object.
In some embodiments, the terminal device may be a mobile phone, tablet computer, desktop computer, laptop computer, handheld computer, notebook computer, ultra-mobile personal computer (UMPC), netbook, cellular phone, personal digital assistant (PDA), augmented reality (AR) device, virtual reality (VR) device, or the like that can install and use various applications (such as Kuaishou); the present disclosure does not limit the specific form of the terminal. The terminal device can interact with the user through one or more of a keyboard, a touchpad, a touch screen, a remote control, voice interaction, or a handwriting device.
Alternatively, the server 110 in the communication system shown in fig. 1 may be connected to the first terminal device 120 and the second terminal device 130, where the first terminal device 120 may be a device used by the anchor object and the second terminal device 130 may be a device used by an audience object. The present disclosure does not limit the number or type of terminal devices.
The method for determining the expression coefficient provided by the embodiment of the present disclosure may be applied to the first terminal device 120 shown in fig. 1, and may also be applied to the server 110 or other electronic devices.
In some embodiments, an executing subject of the expression coefficient determining method provided by the present disclosure may be an expression coefficient determining apparatus, and the expression coefficient determining apparatus may be built in an electronic device, such as the terminal device 120 described above, or built in the server 110.
Fig. 2 is a flowchart of an expression coefficient determining method provided in an embodiment of the present disclosure, and as shown in fig. 2, the expression coefficient determining method may include:
s201, video information is obtained, wherein the video information comprises multi-frame image information and audio information corresponding to each frame of image information.
In this implementation, the image input device collects image information in real time and the audio input device collects audio information in real time to obtain the video information. The video information is specifically an image sequence composed of multiple frames of images. Optionally, the image sequence is evenly divided into M groups of images, where each group includes N frames of image information and corresponds to one segment of audio information. The image information at a preset frame position within the N frames is taken as the image information to be processed, and the audio information corresponding to each group corresponds to that group's image information to be processed. For example, during a virtual live stream of the anchor object, 30 frames of image information are collected within a 1 s window. The 30 frames are first split into 3 groups of 10 frames each; the image information with frame index 1 in each group is taken as the image information to be processed; and the 1 s of audio information is split into 3 parts, each corresponding to the image information to be processed in one group.
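As an illustration only, the grouping described above could be implemented as follows; the function and variable names are assumptions for the sketch, not taken from the disclosure.

```python
# Minimal sketch: split a 1 s window of 30 frames into 3 groups of 10, keep
# the frame at a preset index in each group, and pair it with one third of
# the audio. All names here are illustrative assumptions.
def group_frames(frames, audio, num_groups=3, key_index=0):
    """frames: per-frame image data; audio: audio samples for the window."""
    group_size = len(frames) // num_groups   # e.g. 30 // 3 = 10
    audio_size = len(audio) // num_groups    # one audio segment per group
    pairs = []
    for g in range(num_groups):
        group = frames[g * group_size:(g + 1) * group_size]
        segment = audio[g * audio_size:(g + 1) * audio_size]
        # The frame at the preset index is the image information to be processed.
        pairs.append((group[key_index], segment))
    return pairs
```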
S202, for the current image information, under the condition that the current image information contains face information and the face information does not meet preset conditions, extracting a first expression coefficient from the face information, and predicting a second expression coefficient from audio information corresponding to the current image information, wherein the current image information is any frame of image information.
In the above implementation, the expression coefficient is a description of a facial expression that a computer can understand. The first expression coefficient is determined by extracting features from the face information in the current image information; the second expression coefficient is predicted from the audio information by feeding the audio information into a prediction model. The prediction model is trained on audio samples and their corresponding expression coefficients. In a virtual live-streaming scene, features are extracted from the face information in the anchor object's image information to determine the anchor object's first expression coefficient, and the anchor object's audio information is fed into the prediction model to obtain the anchor object's second expression coefficient.
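The disclosure does not specify the prediction model's architecture; the following sketch only shows the shape of the call, with a generic PyTorch module standing in for the trained audio-to-expression model.

```python
# Hedged sketch: predict the second expression coefficient from audio features
# with a trained model. The architecture and feature format are assumptions.
import torch

def predict_second_coefficient(model: torch.nn.Module,
                               audio_features: torch.Tensor) -> torch.Tensor:
    """audio_features: e.g. a (1, T, F) tensor for one audio segment."""
    model.eval()
    with torch.no_grad():
        # Output of shape (1, K): one value per expression dimension.
        return model(audio_features)
```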
In this implementation, the preset condition is that the pose angle corresponding to the face information satisfies a preset angle and the integrity corresponding to the face information satisfies a preset integrity. The pose angle is the angle between the plane of the face and the image plane. It should be understood that when the pose angle does not satisfy the preset angle, the face information in the current image information cannot be accurately recognized, and an expression coefficient extracted from inaccurate face information is likewise inaccurate. For example, when the anchor object raises or lowers the head, or turns it left or right during a live stream, the image input device captures only part of the anchor object's face region; because the pose angle of the face information in the current image information does not satisfy the preset angle, the full content of the face information cannot be obtained, recognition is inaccurate, and the resulting face information is untrustworthy. The integrity indicates whether the face information is occluded. If the face information is occluded, then even when the pose angle satisfies the preset angle, the face information in the current image information cannot be accurately recognized, and the extracted expression coefficient is likewise inaccurate. For example, the anchor object's face region may be occluded by other objects during the live stream; although the image input device still captures a partial face region, the occluded portion means the full content of the face information cannot be obtained, recognition is inaccurate, and the face information is again untrustworthy. Therefore, the second expression coefficient of the anchor object is predicted from the anchor object's audio information, and combining the first and second expression coefficients improves the accuracy of the recognized expression coefficient of the anchor object, ensuring the driving effect on the avatar's expression and avoiding unnatural or mismatched avatar expressions.
In a scene where the anchor object is streaming virtually, the video information input by the anchor object must be processed to obtain the anchor object's avatar. Because the video information consists of continuous multi-frame image information and the audio information corresponding to it, the frames are processed in sequence, and the current image information is the image information currently being processed.
In one implementation, an application (e.g., a live-streaming application) typically includes a video driving module and an audio driving module. The video driving module extracts the first expression coefficient from the face information, and the audio driving module predicts the second expression coefficient from the audio information corresponding to the current image information. In this case, both the video driving module and the audio driving module are enabled.
S203, obtaining a target expression coefficient corresponding to the current image information according to the first expression coefficient and the second expression coefficient.
As can be seen from S201-S203 above, the target expression coefficient combines the first expression coefficient extracted from the face information contained in the image information with the second expression coefficient predicted from the audio information, which can improve the accuracy of the recognized expression coefficient of the user object. For example, in a virtual live-streaming scene, the accuracy of the recognized expression coefficient of the anchor object can be improved, ensuring the driving effect on the avatar's expression and avoiding an avatar expression that is unnatural or mismatched with the anchor object's expression.
In an implementation manner, referring to fig. 3, the step S203 specifically includes:
s301, fusing the first expression coefficient and the second expression coefficient to obtain a fused expression coefficient corresponding to the current image information.
In one implementation, fusing the first expression coefficient and the second expression coefficient to obtain the fused expression coefficient corresponding to the current image information includes: weighting the first expression coefficient and the second expression coefficient according to the preset weight of the first expression coefficient and the preset weight of the second expression coefficient, to obtain the fused expression coefficient.
Illustratively, the preset first weight of the first expression coefficient is 0.2 and the preset second weight of the second expression coefficient is 0.8. The fused expression coefficient is then the sum of the product of the first expression coefficient and the first weight and the product of the second expression coefficient and the second weight, as sketched below.
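A minimal sketch of this weighted fusion, assuming each expression coefficient is a vector of blendshape weights (the disclosure does not fix the representation):

```python
import numpy as np

# Weighted fusion with the example weights 0.2 / 0.8. Treating a coefficient
# as a NumPy vector is an assumption for the sketch.
W_FACE, W_AUDIO = 0.2, 0.8  # preset weights; face weight < audio weight

def fuse(first_coeff: np.ndarray, second_coeff: np.ndarray) -> np.ndarray:
    return W_FACE * first_coeff + W_AUDIO * second_coeff
```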
The above provides a concrete way of determining the fused expression coefficient. A fused coefficient determined from both the first and second expression coefficients improves recognition accuracy and avoids unnatural avatar expressions.
S302, optimizing the fused expression coefficient corresponding to the current image information according to a preset smoothing coefficient and the target expression coefficient corresponding to the previous frame of image information, to obtain the target expression coefficient corresponding to the current image information.
The smoothing coefficient is preset and serves to make the anchor object's avatar expression smoother and more natural in the virtual live scene, improving the user experience.
In a specific implementation of S302, the product of the target expression coefficient of the previous frame and a preset first smoothing coefficient is computed, the product of the fused expression coefficient of the current frame and a second smoothing coefficient is computed, and the two products are added to obtain the target expression coefficient of the current frame. Because the target expression coefficient fuses the first expression coefficient extracted from the face information contained in the image information with the second expression coefficient predicted from the audio information, the accuracy of the recognized expression coefficient of the anchor object is improved, the driving effect on the avatar's expression is ensured, and unnatural or mismatched avatar expressions are avoided.
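Written out, this smoothing step is target_t = s1 * target_{t-1} + s2 * fused_t. A sketch follows; the particular values of s1 and s2, and the assumption that they sum to 1 (making this an exponential moving average), are illustrative only.

```python
import numpy as np

S1, S2 = 0.6, 0.4  # illustrative smoothing coefficients; the disclosure
                   # only says they are preset.

def smooth(prev_target: np.ndarray, fused: np.ndarray) -> np.ndarray:
    # target_t = s1 * target_{t-1} + s2 * fused_t
    return S1 * prev_target + S2 * fused
```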
In one implementation, referring to fig. 4, the method further includes:
s401, under the condition that the current image information does not contain face information, predicting a second expression coefficient from the audio information corresponding to the current image information, and taking the second expression coefficient as a fusion expression coefficient corresponding to the current image information.
S402, optimizing a fusion expression coefficient corresponding to the current image information according to a preset smooth coefficient and a target expression coefficient corresponding to the previous frame of image information of the current image information to obtain a target expression coefficient corresponding to the current image information.
According to the above contents, under the condition that the current image information does not contain face information, the face information in the current image information cannot be acquired, so that the determined first expression coefficient is inaccurate, and further the fusion expression coefficient determined according to the first expression coefficient is necessarily inaccurate, the second expression coefficient is predicted from the audio information corresponding to the current image information to be used as the fusion expression coefficient corresponding to the current image information, so that the problem that the final fusion expression coefficient is determined according to the unavailable face information to cause unnatural and inaccurate expression of the virtual image can be avoided, meanwhile, the accuracy of the expression of the virtual image can be effectively improved by optimizing the preset smooth coefficient and the target expression coefficient corresponding to the previous frame of image information of the current image information, the expression is more natural and smooth, and the use experience of a user is improved.
In one implementation, the preset weight of the first expression coefficient is smaller than the preset weight of the second expression coefficient.
When the face information in the current image information does not satisfy the preset condition, that is, when the pose angle corresponding to the face information does not satisfy the preset angle or the integrity does not satisfy the preset integrity, the face information is untrustworthy: the accuracy of the first expression coefficient extracted from the face information in the current image information is low, while the accuracy of the second expression coefficient predicted from the corresponding audio information is higher. Lowering the preset weight of the first expression coefficient and raising the preset weight of the second expression coefficient therefore effectively improves recognition accuracy, avoids unnatural avatar expressions, and improves the user experience.
In one implementation, referring to fig. 5, the method further comprises:
s501, under the condition that the image information contains face information and the face information meets preset conditions, extracting a first expression coefficient from the face information, and taking the first expression coefficient as a fusion expression coefficient corresponding to the current image information;
s502, optimizing a fusion expression coefficient corresponding to the current image information according to a preset smooth coefficient and a target expression coefficient corresponding to the previous frame of image information of the current image information to obtain the target expression coefficient corresponding to the current image information.
In the above implementation manner, the preset condition is that the posture angle corresponding to the face information satisfies the preset angle, and the integrity corresponding to the face information satisfies the preset integrity. And under the condition that the image information contains the face information and the face information meets the preset condition, the obtained face information is credible. Because the face information of the current image information is credible, the terminal device extracts the first expression coefficient from the face information according to the video driving module at the moment, and does not need to predict a second expression coefficient from the audio information corresponding to the current image information according to the audio driving module again. The audio driver is now in the off state.
According to the above, under the condition that the current image information contains face information and the face information meets the preset conditions, namely under the condition that the face information is credible, the first expression coefficient is extracted from the face information according to the video drive, the second expression coefficient does not need to be predicted from the audio information corresponding to the current image information according to the audio drive again, and the audio drive is in a closed state at the moment, so that resources can be effectively saved.
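Taken together, figs. 2 through 5 describe three per-frame cases. The sketch below summarizes that dispatch; the callables and their signatures are assumptions, not the disclosure's API.

```python
import numpy as np
from typing import Callable, Optional

def fused_coefficient(face: Optional[np.ndarray],
                      audio: np.ndarray,
                      condition_met: bool,
                      extract_from_face: Callable[[np.ndarray], np.ndarray],
                      predict_from_audio: Callable[[np.ndarray], np.ndarray],
                      w_face: float = 0.2,
                      w_audio: float = 0.8) -> np.ndarray:
    if face is None:
        # No face detected: use the audio-predicted coefficient alone (fig. 4).
        return predict_from_audio(audio)
    if condition_met:
        # Face present and trustworthy: use the face coefficient alone (fig. 5).
        return extract_from_face(face)
    # Face present but untrustworthy: weighted fusion (figs. 2 and 3).
    return w_face * extract_from_face(face) + w_audio * predict_from_audio(audio)
```

In every case the result is then smoothed against the previous frame's target coefficient as in S302.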
In one implementation, referring to fig. 6, after the step S201, the method further includes the following steps:
s601, identifying face information in current image information;
s602, under the condition that the face information is identified, determining a posture angle and/or integrity corresponding to the face information, and indicating whether the face information is shielded or not by the integrity;
s603, if the posture angle corresponding to the face information does not meet the preset angle and/or the integrity does not meet the preset integrity, determining that the image information contains the face information and the face information does not meet the preset conditions;
s604, if the posture angle corresponding to the face information meets a preset angle and/or the integrity meets a preset integrity, determining that the image information contains the face information and the face information meets a preset condition.
In the above implementation, the face information in the current image information is recognized while the anchor object is streaming. If face information is present, its pose angle and integrity are determined; when the pose angle satisfies the preset angle and the integrity satisfies the preset integrity, it is determined that the image information contains face information that satisfies the preset condition, and the obtained face information is trustworthy.
As described above, face information whose pose angle satisfies the preset angle and whose integrity satisfies the preset integrity is determined to satisfy the preset condition and is treated as trustworthy, which effectively improves recognition accuracy.
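A sketch of the preset-condition check in S601-S604; the threshold values and the yaw/pitch/roll pose representation are assumptions for illustration.

```python
MAX_ANGLE_DEG = 30.0   # assumed preset angle
MIN_INTEGRITY = 0.9    # assumed preset integrity (1.0 = fully unoccluded)

def face_satisfies_preset_condition(yaw: float, pitch: float, roll: float,
                                    integrity: float) -> bool:
    angle_ok = max(abs(yaw), abs(pitch), abs(roll)) <= MAX_ANGLE_DEG
    integrity_ok = integrity >= MIN_INTEGRITY
    return angle_ok and integrity_ok
```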
It is understood that, in practical implementation, the terminal/server of the embodiments of the present disclosure may include one or more hardware structures and/or software modules implementing the corresponding expression coefficient determination method, and these hardware structures and/or software modules may constitute an electronic device. Those skilled in the art will readily appreciate that the present disclosure can be implemented in hardware or in a combination of hardware and computer software for the exemplary algorithm steps described in connection with the embodiments disclosed herein. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
Based on such understanding, the embodiment of the present disclosure also provides an expression coefficient determining apparatus, which may be applied to an electronic device.
Fig. 7 shows a schematic structural diagram of an expression coefficient determination apparatus provided in an embodiment of the present disclosure. As shown in fig. 7, the expression coefficient determination apparatus may include: an information acquisition unit 710, a first determining unit 720, and a second determining unit 730. The information acquisition unit 710 is configured to acquire video information including multiple frames of image information and audio information corresponding to each frame of image information, e.g., to perform step S201 in the above method. The first determining unit 720 is configured to, when the current image information contains face information and the face information does not satisfy a preset condition, extract a first expression coefficient from the face information and predict a second expression coefficient from the audio information corresponding to the current image information, the current image information being any frame of image information, e.g., to perform step S202 in the above method. The second determining unit 730 is configured to obtain a target expression coefficient corresponding to the current image information according to the first expression coefficient and the second expression coefficient, e.g., to perform step S203 in the above method.
Optionally, the second determining unit 730 is specifically configured to: fuse the first expression coefficient and the second expression coefficient to obtain a fused expression coefficient corresponding to the current image information; and optimize the fused expression coefficient corresponding to the current image information according to a preset smoothing coefficient and the target expression coefficient corresponding to the previous frame of image information, to obtain the target expression coefficient corresponding to the current image information, e.g., to perform steps S301-S302 in the above method.
Optionally, the second determining unit 730 is further configured to: weight the first expression coefficient and the second expression coefficient according to a preset weight of the first expression coefficient and a preset weight of the second expression coefficient to obtain the fused expression coefficient, where the preset weight of the first expression coefficient is smaller than the preset weight of the second expression coefficient.
Optionally, the first determining unit 720 is further configured to: when the current image information does not contain face information, predict the second expression coefficient from the audio information corresponding to the current image information and take the second expression coefficient as the fused expression coefficient corresponding to the current image information; and optimize the fused expression coefficient corresponding to the current image information according to the preset smoothing coefficient and the target expression coefficient corresponding to the previous frame of image information, to obtain the target expression coefficient corresponding to the current image information, e.g., to perform steps S401-S402 in the above method.
Optionally, the second determining unit 730 is further configured to: when the image information contains face information and the face information satisfies the preset condition, extract the first expression coefficient from the face information and take the first expression coefficient as the fused expression coefficient corresponding to the current image information; and optimize the fused expression coefficient corresponding to the current image information according to the preset smoothing coefficient and the target expression coefficient corresponding to the previous frame of image information, to obtain the target expression coefficient corresponding to the current image information, e.g., to perform steps S501-S502 in the above method.
Optionally, after acquiring the video information, the information acquisition unit 710 is further configured to: recognize face information in the current image information; when face information is recognized, determine the pose angle and/or integrity corresponding to the face information, the integrity indicating whether the face information is occluded; if the pose angle corresponding to the face information does not satisfy a preset angle and/or the integrity does not satisfy a preset integrity, determine that the image information contains face information and the face information does not satisfy the preset condition; and if the pose angle corresponding to the face information satisfies the preset angle and/or the integrity satisfies the preset integrity, determine that the image information contains face information and the face information satisfies the preset condition, e.g., to perform steps S601-S604 in the above method.
As above, the embodiments of the present disclosure may divide the electronic device into functional modules according to the above method example. An integrated module may be implemented in hardware or as a software functional module. It should also be noted that the division of modules in the embodiments of the present disclosure is schematic and is merely a logical functional division; other divisions are possible in actual implementation. For example, functional blocks may be divided per function, or two or more functions may be integrated into one processing block.
With regard to the expression coefficient determining apparatus in the foregoing embodiment, the specific manner in which each module performs the operation and the beneficial effects thereof have been described in detail in the foregoing method embodiment, and are not described again here.
The embodiment of the disclosure also provides an electronic device. Fig. 8 shows a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure. The electronic device may be an expression coefficient determination apparatus and may include at least one processor 810, a communication bus 820, a memory 830, and at least one communication interface 840.
The processor 810 may be a Central Processing Unit (CPU), a micro-processing unit, an ASIC, or one or more integrated circuits for controlling the execution of programs according to the present disclosure. As an example, in conjunction with fig. 7, the information acquisition unit 710, the first determination unit 720, and the second determination unit 730 in the electronic device implement the same functions as the processor 810 in fig. 8.
Communication bus 820 may include a path that conveys information between the aforementioned components.
Communication interface 840, using any transceiver or similar device, may be used to communicate with other devices or communication networks, such as a server, an Ethernet network, a radio access network (RAN), or a wireless local area network (WLAN).
The memory 830 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device capable of storing static information and instructions, a random access memory (RAM) or other type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may be self-contained and connected to the processor by the bus, or it may be integrated with the processor. The memory 830 is used to store the application program code for implementing the disclosed aspects, and execution is controlled by the processor 810. The processor 810 is configured to execute the application code stored in the memory 830 to implement the functions of the disclosed methods.
In a particular implementation, as an embodiment, the processor 810 may include one or more CPUs, such as CPU0 and CPU1 in Fig. 8.
In a particular implementation, as an embodiment, the electronic device may include multiple processors, such as the processor 810 and the processor 850 in Fig. 8. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
In a particular implementation, as an embodiment, the electronic device may also include an input device 860 and an output device 870. The input device 860 is in communication with the processor 810 and may accept user input in a variety of ways; for example, the input device 860 may be a mouse, a keyboard, a touch screen device, or a sensing device. The output device 870 is in communication with the processor 810 and may display information in a variety of ways; for example, the output device 870 may be a liquid crystal display (LCD), a light-emitting diode (LED) display device, or the like.
Those skilled in the art will appreciate that the structure shown in Fig. 8 does not constitute a limitation of the electronic device, which may include more or fewer components than shown, combine certain components, or employ a different arrangement of components.
The embodiment of the disclosure also provides an electronic device, which may be an expression coefficient determination apparatus. Electronic devices may vary widely in configuration and performance, and may include one or more processors and one or more memories. The memory stores at least one instruction, which is loaded and executed by the processor to implement the expression coefficient determining method provided by the above method embodiments. Of course, the electronic device may further include components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, as well as other components for implementing the functions of the device, which are not described here again.
The present disclosure also provides a computer-readable storage medium having instructions stored thereon which, when executed by a processor of a computer device, enable the computer device to perform the expression coefficient determining method provided by the above-described embodiments. For example, the computer-readable storage medium may be the memory 830 comprising instructions executable by the processor 810 of the terminal to perform the above-described method. Likewise, the computer-readable storage medium may be a memory comprising instructions executable by a processor of an electronic device to perform the above-described method. Alternatively, the computer-readable storage medium may be a non-transitory computer-readable storage medium, such as a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device.
The present disclosure also provides a computer program product comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the above-mentioned method of determining an expression coefficient.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. An expression coefficient determination method, characterized by comprising:
acquiring video information, wherein the video information comprises a plurality of frames of image information and audio information corresponding to each frame of the image information;
for current image information, under the condition that the current image information contains face information and the face information does not meet preset conditions, extracting a first expression coefficient from the face information, and predicting a second expression coefficient from audio information corresponding to the current image information, wherein the current image information is image information of any frame;
and obtaining a target expression coefficient corresponding to the current image information according to the first expression coefficient and the second expression coefficient.
2. The method for determining the expression coefficient according to claim 1, wherein obtaining the target expression coefficient corresponding to the current image information according to the first expression coefficient and the second expression coefficient includes:
fusing the first expression coefficient and the second expression coefficient to obtain a fused expression coefficient corresponding to the current image information;
and optimizing the fused expression coefficient corresponding to the current image information according to a preset smoothing coefficient and a target expression coefficient corresponding to the previous frame of image information of the current image information, to obtain the target expression coefficient corresponding to the current image information.
3. The method for determining the expression coefficient according to claim 2, wherein the fusing the first expression coefficient and the second expression coefficient to obtain a fused expression coefficient corresponding to the current image information includes:
weighting the first expression coefficient and the second expression coefficient according to a preset weight of the first expression coefficient and a preset weight of the second expression coefficient to obtain the fused expression coefficient, wherein the preset weight of the first expression coefficient is smaller than the preset weight of the second expression coefficient.
4. The method of claim 1, further comprising:
under the condition that the current image information does not contain face information, predicting a second expression coefficient from the audio information corresponding to the current image information, and taking the second expression coefficient as a fused expression coefficient corresponding to the current image information;
and optimizing the fused expression coefficient corresponding to the current image information according to a preset smoothing coefficient and a target expression coefficient corresponding to the previous frame of image information of the current image information, to obtain the target expression coefficient corresponding to the current image information.
5. The method of claim 1, further comprising:
under the condition that the current image information contains face information and the face information meets the preset conditions, extracting a first expression coefficient from the face information, and taking the first expression coefficient as a fused expression coefficient corresponding to the current image information;
and optimizing the fused expression coefficient corresponding to the current image information according to a preset smoothing coefficient and a target expression coefficient corresponding to the previous frame of image information of the current image information, to obtain the target expression coefficient corresponding to the current image information.
6. The method of claim 1, wherein after the acquiring of the video information, the method further comprises:
identifying the face information in the current image information;
under the condition that the face information is identified, determining a pose angle and/or an integrity corresponding to the face information, wherein the integrity characterizes whether the face information is occluded;
if the pose angle corresponding to the face information does not meet a preset angle and/or the integrity does not meet a preset integrity, determining that the current image information contains the face information and the face information does not meet the preset conditions;
and if the pose angle corresponding to the face information meets the preset angle and/or the integrity meets the preset integrity, determining that the current image information contains the face information and the face information meets the preset conditions.
7. An expression coefficient determination apparatus, characterized in that the apparatus comprises:
an information acquiring unit configured to acquire video information, wherein the video information comprises a plurality of frames of image information and audio information corresponding to each frame of the image information;
a first determining unit configured to, for current image information, extract a first expression coefficient from the face information and predict a second expression coefficient from the audio information corresponding to the current image information under the condition that the current image information contains face information and the face information does not meet preset conditions, wherein the current image information is any frame of the image information;
and a second determining unit configured to obtain a target expression coefficient corresponding to the current image information according to the first expression coefficient and the second expression coefficient.
8. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1-6.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of determining an expression coefficient of any of claims 1-6.
10. A computer program product, characterized in that it comprises computer instructions which, when run on an electronic device, cause the electronic device to carry out the method of determining an expression coefficient according to any one of claims 1-6.
CN202210641776.6A 2022-06-08 2022-06-08 Expression coefficient determining method and device, electronic equipment and storage medium Pending CN115205925A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210641776.6A CN115205925A (en) 2022-06-08 2022-06-08 Expression coefficient determining method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210641776.6A CN115205925A (en) 2022-06-08 2022-06-08 Expression coefficient determining method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115205925A true CN115205925A (en) 2022-10-18

Family

ID=83576681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210641776.6A Pending CN115205925A (en) 2022-06-08 2022-06-08 Expression coefficient determining method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115205925A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023241010A1 (en) * 2022-06-14 2023-12-21 Oppo广东移动通信有限公司 Virtual image generation method and apparatus, electronic device and storage medium
CN116137673A (en) * 2023-02-22 2023-05-19 广州欢聚时代信息科技有限公司 Digital human expression driving method and device, equipment and medium thereof
CN116843805A (en) * 2023-06-19 2023-10-03 上海奥玩士信息技术有限公司 Method, device, equipment and medium for generating virtual image containing behaviors
CN116843805B (en) * 2023-06-19 2024-03-19 上海奥玩士信息技术有限公司 Method, device, equipment and medium for generating virtual image containing behaviors
CN117392292A (en) * 2023-10-20 2024-01-12 联通在线信息科技有限公司 3D digital person generation method and system
CN117392292B (en) * 2023-10-20 2024-04-30 联通在线信息科技有限公司 3D digital person generation method and system

Similar Documents

Publication Publication Date Title
CN115205925A (en) Expression coefficient determining method and device, electronic equipment and storage medium
JP6154075B2 (en) Object detection and segmentation method, apparatus, and computer program product
CN108197618B (en) Method and device for generating human face detection model
CN109993150B (en) Method and device for identifying age
CN111476871B (en) Method and device for generating video
US10440284B2 (en) Determination of exposure time for an image frame
CN111523413B (en) Method and device for generating face image
CN111552888A (en) Content recommendation method, device, equipment and storage medium
CN110059623B (en) Method and apparatus for generating information
CN110162604B (en) Statement generation method, device, equipment and storage medium
US20210295015A1 (en) Method and apparatus for processing information, device, and medium
US20230143452A1 (en) Method and apparatus for generating image, electronic device and storage medium
CN110084317B (en) Method and device for recognizing images
CN108665510B (en) Rendering method and device of continuous shooting image, storage medium and terminal
CN114792355B (en) Virtual image generation method and device, electronic equipment and storage medium
CN111881740B (en) Face recognition method, device, electronic equipment and medium
US11659181B2 (en) Method and apparatus for determining region of interest
CN110211017B (en) Image processing method and device and electronic equipment
US20230245429A1 (en) Method and apparatus for training lane line detection model, electronic device and storage medium
CN110619602B (en) Image generation method and device, electronic equipment and storage medium
CN110335237B (en) Method and device for generating model and method and device for recognizing image
WO2020124454A1 (en) Font switching method and related product
CN114092608B (en) Expression processing method and device, computer readable storage medium and electronic equipment
CN113033475B (en) Target object tracking method, related device and computer program product
JP2022068146A (en) Method for annotating data, apparatus, storage medium, and computer program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination