CN116152400A - Expression processing method and related device


Info

Publication number
CN116152400A
Authority
CN
China
Prior art keywords
target
information
text
image
text information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111389932.6A
Other languages
Chinese (zh)
Inventor
孟婉婷
卢胤婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202111389932.6A
Publication of CN116152400A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/20 - 3D [Three Dimensional] animation
    • G06T 13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 - 2D [Two Dimensional] image generation
    • G06T 11/60 - Editing figures and text; Combining figures or text

Abstract

The embodiment of the application discloses an expression processing method and a related device, wherein the method comprises the following steps: displaying a target avatar; collecting characteristic information of a user, wherein the characteristic information comprises voice information and image information; determining semantic information and voice characteristic parameters of the user according to the voice information, wherein the voice characteristic parameters are used for indicating the audio characteristics of the spoken language of the user; and generating a target dynamic expression of the target virtual image according to the image information, the voice characteristic parameters and target text information, wherein the target text information is used for representing part or all of text information corresponding to the semantic information. The method and the device are beneficial to improving the playability and the interestingness of the dynamic expression of the virtual image and improving the user experience.

Description

Expression processing method and related device
Technical Field
The application relates to the technical field of data processing, in particular to an expression processing method and a related device.
Background
With the development and application of artificial intelligence technology and the growing entertainment demands of users, avatar display technology has been developed and widely applied. An avatar can be presented on an electronic device through operations such as image rendering and model processing, and a user can control and record the avatar's expressions and actions to produce avatar expression packages and the like.
Disclosure of Invention
The embodiment of the application provides an expression processing method and a related device, so as to improve the playability and the interestingness of dynamic expression of an avatar and improve user experience.
In a first aspect, an embodiment of the present application provides an expression processing method, including:
displaying a target avatar;
collecting characteristic information of a user, wherein the characteristic information comprises voice information and image information;
determining semantic information and voice characteristic parameters of the user according to the voice information, wherein the voice characteristic parameters are used for indicating the audio characteristics of the spoken language of the user;
and generating a target dynamic expression of the target virtual image according to the image information, the voice characteristic parameters and target text information, wherein the target text information is used for representing part or all of text information corresponding to the semantic information.
In a second aspect, an embodiment of the present application provides an expression processing method, including:
displaying a target avatar;
detecting the selection operation of a user for a recording button, and recording voice information and image information of the user;
adjusting and displaying the picture content of each frame of image of the target virtual image according to the voice information and the image information;
Displaying a target frame image in the multi-frame images of the adjusted target avatar, wherein the target frame image comprises an expression state image and a text state image of the target avatar in a current frame, and the target frame is any frame in the multi-frame images.
In a third aspect, an embodiment of the present application provides an expression processing apparatus, including:
a display unit for displaying a target avatar;
the device comprises an acquisition unit, a display unit and a display unit, wherein the acquisition unit is used for acquiring characteristic information of a user, and the characteristic information comprises voice information and image information;
a determining unit, configured to determine semantic information and voice feature parameters of the user according to the voice information, where the voice feature parameters are used to indicate audio features of spoken language of the user;
the generation unit is used for generating a target dynamic expression of the target virtual image according to the image information, the voice characteristic parameters and target text information, wherein the target text information is used for representing part or all of text information corresponding to the semantic information.
In a fourth aspect, an embodiment of the present application provides an expression processing apparatus, including:
a first display unit for displaying a target avatar;
The recording unit is used for detecting the selection operation of a user for the recording button and recording the voice information and the image information of the user;
a second display unit for adjusting and displaying a picture content of each frame image of the target avatar according to the voice information and the image information;
and the third display unit is used for displaying a target frame image in the multi-frame images of the adjusted target virtual image, wherein the target frame image comprises an expression state image and a text state image of the target virtual image in a current frame, and the target frame is any frame in the multi-frame images.
In a fifth aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, and a communication interface that are connected to each other, where the communication interface is configured to receive or send data, the memory is configured to store application program code for the electronic device to execute the above methods, and the processor is configured to execute the program code to perform any of the methods of the first or second aspect.
In a sixth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program for electronic data exchange, where the computer program causes a computer to perform some or all of the steps described in any of the methods of the first or second aspects of the embodiments of the present application.
In a seventh aspect, embodiments of the present application provide a computer program product comprising a computer program which when executed by a processor performs part or all of the steps as described in any of the methods of the first or second aspects of embodiments of the present application. The computer program product may be a software installation package.
In the embodiment of the application, the electronic equipment displays the target virtual image, acquires the voice information and the image information of the user, determines the semantic information and the voice characteristic parameters of the user according to the voice information, and generates the target dynamic expression of the target virtual image according to the image information, the voice characteristic parameters and the text information representing part or all of the text information corresponding to the semantic information. Therefore, when generating the target dynamic expression of the virtual image, the electronic equipment collects the image information and the voice information of the user, determines the semantic information and the voice characteristic parameters according to the voice information, and finally generates the dynamic expression according to the image information, the voice characteristic parameters and the text information representing the semantic information, which is beneficial to improving the playability and the interestingness of the dynamic expression of the virtual image and improving the user experience.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the following description will briefly introduce the drawings that are needed in the embodiments or the description of the prior art, it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1A is a schematic diagram of an electronic device according to an embodiment of the present application;
fig. 1B is a diagram illustrating an example of a composition structure of an electronic device according to an embodiment of the present application;
fig. 2 is a schematic flow chart of an expression processing method according to an embodiment of the present application;
fig. 3A is a flowchart of another expression processing method according to an embodiment of the present application;
FIG. 3B is a diagram illustrating an example reference avatar selection interface provided by embodiments of the present application;
FIG. 3C is an exemplary diagram of an interface for displaying a target avatar provided in an embodiment of the present application;
FIG. 3D is a diagram illustrating an exemplary display interface for a target avatar target frame image according to an embodiment of the present application;
FIG. 3E is a diagram illustrating an exemplary multi-frame image save interface for a target avatar provided in accordance with an embodiment of the present application;
Fig. 3F is an exemplary diagram of a multi-frame image displaying a target avatar in an album interface according to an embodiment of the present application;
fig. 4A is a functional unit composition block diagram of an expression processing apparatus provided in an embodiment of the present application;
fig. 4B is a functional unit block diagram of another expression processing apparatus according to an embodiment of the present application;
fig. 5A is a functional unit block diagram of another expression processing apparatus according to an embodiment of the present application;
fig. 5B is a functional unit composition block diagram of another expression processing apparatus provided in an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The terms "first", "second", and the like in the description, the claims, and the above-described figures of the present application are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have", as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to those listed steps or elements, but may include other steps or elements not expressly listed or inherent to such a process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
"plurality" means two or more. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship.
Embodiments of the present application are described below with reference to the accompanying drawings.
Referring to fig. 1A, fig. 1A is a schematic diagram of an electronic device according to an embodiment of the present application. The electronic device 100 according to the embodiment of the present application may include an expression processing module, which may be configured to generate a dynamic expression of a target avatar. Specifically, the expression processing module may display the target avatar, collect voice information and image information of a user, determine semantic information and voice feature parameters of the user according to the voice information, and generate a target dynamic expression of the target avatar according to the image information, the voice feature parameters, and text information representing part or all of the text corresponding to the semantic information. The expression processing module may collect the image information of the user through an image collection unit such as a camera on the electronic device, and may collect the voice information of the user through an audio collection unit such as a microphone on the electronic device.
The electronic device 100 in the present application may have the composition shown in fig. 1B. The electronic device 100 may include a processor 110, a memory 120, a communication interface 130, and one or more programs 121, where the one or more programs 121 are stored in the memory 120 and configured to be executed by the processor 110, and the one or more programs 121 include instructions for performing any of the steps of the method embodiments described below. The communication interface 130 is used to support communication between the electronic device 100 and other devices. In particular implementations, the processor 110 is configured to perform any of the steps performed by the electronic device in the method embodiments described below and, when performing data transmission such as sending, optionally invokes the communication interface 130 to perform the corresponding operations. It should be noted that the above schematic structural diagram of the electronic device 100 is merely an example; more or fewer components may be included, which is not limited herein.
Referring to fig. 2, fig. 2 is a flowchart of an expression processing method provided in an embodiment of the present application, and as shown in fig. 2, the expression processing method includes the following steps:
s201, the electronic device displays the target avatar.
Before step S201, the electronic device may output a reference avatar selection interface in which a plurality of pre-stored reference avatars may be displayed; when a user selection operation for a target avatar among the plurality of reference avatars is detected, the electronic device may jump to a first interface and display the target avatar in the first interface. Alternatively, before step S201, the electronic device may create the target avatar upon detecting an avatar creation request from the user (e.g., collect image information of the user to create the target avatar, or generate the target avatar in response to an adjustment operation on an initial avatar).
S202, the electronic equipment collects the characteristic information of the user.
Wherein the characteristic information comprises voice information and image information.
In a specific implementation, the electronic device may collect the feature information of the user when detecting the dynamic expression recording request for the target avatar (e.g., detecting that the recording button is selected after the target avatar is displayed in step S201). Accordingly, the electronic device may end collecting the feature information of the user when detecting a dynamic expression recording end request for the target avatar (e.g., detecting that a recording end button is selected, or detecting that the recording button is not selected again after detecting that the recording button is selected).
Further, before step S202, the electronic device may also output a prompt message; the prompt message may include the type of feature information to be collected and may be used to prompt the user to perform an operation. For example, the prompt message may be information such as "try speaking to generate a dynamic expression" or "try changing your expression or action", which is not specifically limited herein.
S203, the electronic equipment determines semantic information and voice characteristic parameters of the user according to the voice information.
Wherein the speech feature parameter is used to indicate an audio feature of the spoken language of the user.
In a specific implementation, determining the semantic information of the user according to the voice information means recognizing and converting the voice information to determine the meaning expressed by the user. The audio features of the user's spoken language may be, for example, information such as the user's volume and speech rate.
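As an illustration only, the following Python sketch shows one way the volume and speech-rate parameters described above could be derived from already-segmented audio; the RMS-based volume measure, the segment and timestamp representation, and the function names are assumptions of this sketch rather than details specified by the method.

```python
import numpy as np

def extract_voice_features(segments, timestamps):
    """Derive simple volume and speech-rate features from segmented audio.

    segments:   list of 1-D numpy arrays, one per detected piece of sub-voice
                information (already split, e.g. word by word)
    timestamps: start time of each segment in seconds, same length as segments
    """
    volumes = []
    for samples in segments:
        # RMS amplitude as a rough per-segment volume measure.
        rms = float(np.sqrt(np.mean(np.square(samples.astype(np.float64)))))
        volumes.append(rms)

    # Speech-rate parameter: gaps between consecutive pieces of sub-voice info.
    intervals = [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]

    return {"volume_per_segment": volumes, "segment_intervals": intervals}
```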
S204, the electronic equipment generates a target dynamic expression of the target virtual image according to the image information, the voice characteristic parameters and the target text information.
The target text information represents part or all of the text corresponding to the semantic information. The target dynamic expression of the target avatar is generated according to the user's image information, the voice feature parameters, and the target text information, so the finally generated target dynamic expression can include display content corresponding to the user's image, to the voice feature parameters of the user's voice information, and to the text corresponding to the user's voice information; compared with a dynamic expression of the target avatar generated from the user's image information alone, it can display richer content and is more playable and interesting.
In addition, after generating the target dynamic expression, the electronic device may store it in a preset storage area (for example, a storage area corresponding to an album), and the user can subsequently use the dynamic expression as a personalized expression package and the like. Specifically, the electronic device may save the target dynamic expression directly after generating it, or may save it after detecting a save request for the target dynamic expression; the storage area in which the target dynamic expression is saved may be specified by the user, that is, the save request may carry the storage location of the target dynamic expression.
It can be seen that, in the embodiment of the application, the electronic device displays the target avatar, collects the voice information and the image information of the user, determines the semantic information and the voice feature parameters of the user according to the voice information, and generates the target dynamic expression of the target avatar according to the image information, the voice feature parameters and the text information representing part or all of the text information corresponding to the semantic information. Therefore, when generating the target dynamic expression of the avatar, the electronic device collects the image information and the voice information of the user, determines the semantic information and the voice feature parameters according to the voice information, and finally generates the dynamic expression according to the image information, the voice feature parameters and the text information representing the semantic information, which is beneficial to improving the playability and the interestingness of the dynamic expression of the avatar and improving the user experience.
In one possible example, the generating the target dynamic expression of the target avatar according to the image information, the voice feature parameter, and the target text information includes: controlling the facial expression of the target virtual image according to the image information, and generating a target expression state image of the target virtual image; generating a target text state image of the target virtual image according to the target text information, wherein the display effect of the target text information in the target text state image is determined according to the voice characteristic parameters; and generating the target dynamic expression according to the target expression state image and the target text state image.
In a specific implementation, the image information may include a facial image of the user, and the electronic device may control the target expression of the target avatar by recognizing the facial expression of the user. In other embodiments, the image information may further include a limb image of the user, and when generating the target expression state image, the electronic device may also control the limb motion of the target avatar according to the limb image of the user, so that the target avatar synchronously presents the limb motions and expression actions of the user.
In a specific implementation, the electronic device generates the target dynamic expression according to the target expression state image and the target text state image, that is, the target expression state image layer and the target text state image layer are superimposed to obtain the target dynamic expression, which finally includes both the target expression state image and the target text state image.
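A minimal sketch of the layer superposition described above, assuming the expression-state frames and text-state frames are available as equally sized RGBA images; the use of Pillow and the animated-image export are illustrative choices, not part of the claimed method.

```python
from PIL import Image

def compose_dynamic_expression(expression_frames, text_frames):
    """Overlay each text-state frame onto the matching expression-state frame.

    Both inputs are lists of equally sized RGBA PIL images; the output is the
    frame sequence of the target dynamic expression.
    """
    return [Image.alpha_composite(expr, text)
            for expr, text in zip(expression_frames, text_frames)]

def save_as_animation(frames, path, frame_ms=40):
    """Persist the composed frames as an animated image (e.g. into the album)."""
    frames[0].save(path, save_all=True, append_images=frames[1:],
                   duration=frame_ms, loop=0)
```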
In this example, the electronic device generates the target expression state image and the target text state image of the target avatar according to the image information and the target text information, and then generates the target dynamic expression according to the target expression state image and the target text state image, wherein the display effect of the target text information in the target text state image is determined according to the voice feature parameters, so that the display content of the dynamic expression is enriched, the playability and the interestingness of the dynamic expression of the avatar are improved, and the user experience is improved.
In one possible example, the target text information includes at least one sub-text information, and the voice feature parameter includes a speech rate parameter; the generating the target text state image of the target avatar according to the target text information comprises the following steps: determining display start time of each piece of sub-text information in the at least one piece of sub-text information according to the speech rate parameter; and generating the target text state image according to the at least one piece of sub-text information and the display starting time, wherein different pieces of sub-text information in the target text state image are sequentially started to be displayed according to the display starting time.
In a specific implementation, the speech rate parameter may specifically be the time interval between two pieces of sub-voice information. When a piece of sub-voice information is collected, the time interval between it and the previous piece of sub-voice information can be determined and used to control the display time interval between the two corresponding pieces of sub-text information. After the electronic device determines the display time of the first piece of sub-text information (for example, the time at which voice data is first collected), the display time of each subsequent piece of sub-text information can be determined from the interval between every two adjacent pieces of sub-voice information. The sub-voice collection intervals and the sub-text display intervals may have a certain proportional relationship, for example a one-to-one relationship, that is, the pieces of sub-text information are displayed in sequence at the same pace as the user's speech. Taking a single word as each piece of sub-text information as an example, all the words follow the user's speech rate and are displayed word by word.
The pieces of sub-text information are ordered from front to back according to the collection times of the corresponding pieces of sub-voice information, and the display time of each piece of sub-text information can be the display time of the previous piece plus the determined display time interval between the two pieces of sub-text information.
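The timing rule above can be summarised in a few lines. The function below is a sketch under the assumption that the gaps between sub-voice segments have already been measured (e.g. as in the earlier feature-extraction sketch), with `ratio` standing for the proportional relationship mentioned above.

```python
def display_start_times(segment_intervals, first_display_time=0.0, ratio=1.0):
    """Compute when each piece of sub-text starts to be displayed.

    segment_intervals: gaps between consecutive pieces of sub-voice info (seconds)
    ratio:             proportional relationship between speech gaps and display
                       gaps (1.0 means the text follows the user's pace exactly)
    """
    times = [first_display_time]
    for gap in segment_intervals:
        # Each piece of sub-text appears one scaled gap after the previous one.
        times.append(times[-1] + ratio * gap)
    return times
```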
In this example, the electronic device determines a display start time of each sub-text message in the at least one sub-text message according to the speech rate parameter, and generates a target text state image according to the at least one sub-text message and the display start time, and different sub-text messages in the target text state image start to be displayed sequentially according to the display start time, which is favorable for improving the matching between the text message and the audio feature of the user in the dynamic expression, and improving the dynamic expression display effect.
In one possible example, the method further comprises: if the display time interval of the first sub-text information and the second sub-text information with adjacent display time in the at least one sub-text information is larger than the preset time interval, determining that the display area of the second sub-text content is a first text box, wherein the first text box is different from a second text box corresponding to the first sub-text information.
In a specific implementation, the recognized text information may be quite long; if it were all displayed in the same text box, the displayed content might not be clear enough. Since the display start time of each piece of sub-text information is determined according to the user's speech rate, the display time intervals between different pieces of sub-text information correspond to the intervals between different pieces of sub-voice information in the user's speech. Therefore, if the display time interval between two pieces of sub-text information is greater than the preset time interval, the interval between the corresponding two pieces of sub-voice information is greater than a certain duration, and displaying the two pieces of sub-text information in different text boxes makes the break in the displayed text correspond to the pause in the user's speech.
Specifically, the display positions of the first text box and the second text box may be set so that they do not overlap. In addition, the electronic device may set a display end time for each piece of sub-text information, after which that sub-text information is no longer displayed. For example, after determining the display time of the second sub-text information, the electronic device may determine that the first sub-text information and the other information displayed before it are no longer shown; that is, the display positions of the first text box and the second text box may be the same while their display times differ, and when the first text box is displayed, the display of the second text box and of the sub-text information in it ends.
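A sketch of the text-box assignment this paragraph describes, assuming the display start times computed earlier; `preset_interval` plays the role of the preset time interval and its value here is illustrative.

```python
def split_into_text_boxes(sub_texts, start_times, preset_interval=1.0):
    """Group sub-texts into text boxes; a long pause opens a new box."""
    boxes = [[sub_texts[0]]]
    for prev_t, cur_t, text in zip(start_times, start_times[1:], sub_texts[1:]):
        if cur_t - prev_t > preset_interval:
            boxes.append([text])      # pause in speech: start a new text box
        else:
            boxes[-1].append(text)    # continue in the current text box
    return boxes
```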
In this example, if the display time interval between the first sub-text information and the second sub-text information with adjacent display time in the at least one sub-text information is greater than the preset time interval, the electronic device determines that the display area of the second sub-text content is a first text box, where the first text box is different from the second text box corresponding to the first sub-text information, that is, the user text break is identified, and text information before and after the text break is displayed through different texts, which is favorable for improving the intelligence of text information display in dynamic expression.
In one possible example, the method further comprises: when a language conversion request carrying a target language mark aiming at the target dynamic expression is detected, converting the target text information into third text information, wherein the language of the third text information is the target language; determining mouth shape information of the target virtual image according to the third text information; controlling the facial expression of the target virtual image according to the mouth shape information and the image information, and generating a first expression state image of the target virtual image; generating a first text state image of the target virtual image according to the third text information, wherein the display effect of the third text information in the first text state image is determined according to the voice characteristic parameters; and generating a first dynamic expression of the target virtual image according to the first expression state image and the first text state image.
In a specific implementation, a user may need to send the same dynamic expression to users who speak different languages, and re-recording a new dynamic expression for each new language would be troublesome. Therefore, when a language conversion request is detected, the electronic device may translate the target text information into third text information in the target language. If only the text information were changed, the avatar's expression and the text information might no longer correspond to each other, giving an uncoordinated visual impression. The electronic device therefore further determines the mouth shape information according to the third text information, and controls the facial expression of the target avatar according to the original image information and the mouth shape information to generate a first expression state image of the target avatar. That is, in the first expression state image, the facial state of the target avatar other than the mouth shape is consistent with the target expression state image, while the mouth shape portion is adapted to how a user speaking the target language would utter the third text information, so that the converted dynamic expression remains coordinated.
In addition, the display effect of the third text information may be determined according to the voice feature parameters in the same way as the display effect of the target text information. The electronic device may determine the association between the different pieces of sub-translated text in the third text information and the different pieces of sub-text in the target text information, so that the display effect corresponding to a given piece of sub-text in the target text information can be directly used as the display effect of the piece of sub-translated text associated with it.
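The flow might look roughly as follows; `translate` and `align` are hypothetical placeholders for a translation service and a sub-text alignment step, since the embodiment does not prescribe how the translation or the association between sub-texts is obtained.

```python
def convert_language(target_text, sub_text_effects, target_lang, translate, align):
    """Sketch of converting an existing dynamic expression to another language.

    translate(text, lang) returns the third text information in the target
    language; align(source_text, translated_text) returns, for each translated
    sub-text, the index of the source sub-text it is associated with. Both are
    assumed helpers, not part of the described method.
    """
    third_text = translate(target_text, target_lang)
    mapping = align(target_text, third_text)
    # Reuse the display effect of the associated source sub-text directly.
    third_text_effects = [sub_text_effects[i] for i in mapping]
    return third_text, third_text_effects
```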
In this example, when the electronic device detects a language conversion request for a target dynamic expression, the electronic device may convert the target text information into third text information of a target language, determine mouth shape information of the target virtual image according to the third text information, control facial expression of the target virtual image according to the mouth shape information and the image information, generate a first expression state image, generate a first text state image according to the third text information, and finally generate a first dynamic expression of the target virtual image according to the first expression state image and the first text state image, so that the user does not need to record again to realize language conversion of the dynamic expression, and meanwhile, the display picture of the whole dynamic expression after conversion is coordinated, thereby being beneficial to improving intelligence of expression processing.
In one possible example, the method further comprises: outputting the target text information when a cutting request aiming at the target dynamic expression is detected; when a selection request for fourth text information in the target text information is detected, determining an expression fragment corresponding to the fourth text information from the target dynamic expression; and when a determination request for the expression segment is detected, generating a second dynamic expression according to the expression segment.
In a specific implementation, a user may need to edit the target dynamic expression, and searching the frames one by one for the required segment is difficult and time-consuming. Therefore, the electronic device may first output the target text information; compared with the video frames, the text information is more direct and clearer and can be selected more simply and quickly. The user directly selects, from the text information, the text content to be cut out, the electronic device automatically extracts the expression segment corresponding to the fourth text information, and finally the second dynamic expression, which differs from the original target dynamic expression, is generated by cutting.
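A sketch of mapping the selected fourth text information back to a frame range, assuming the display start times and the frame rate of the recorded expression are known; indices rather than raw strings are used here to keep the example unambiguous.

```python
def cut_expression_segment(frames, start_times, frame_rate, first_idx, last_idx):
    """Extract the frame range covering the selected fourth text information.

    first_idx and last_idx index the selected span within the sub-texts, whose
    display start times (in seconds) are given by start_times.
    """
    start_frame = int(start_times[first_idx] * frame_rate)
    if last_idx + 1 < len(start_times):
        end_frame = int(start_times[last_idx + 1] * frame_rate)
    else:
        end_frame = len(frames)          # selection runs to the end
    return frames[start_frame:end_frame]
```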
In this example, when the electronic device detects a cutting request for a target dynamic expression, outputting target text information, when detecting a selection request for fourth text information in the target text information, determining an expression segment corresponding to the fourth text information from the target dynamic expression, and when detecting a determination request for the expression segment, generating a second dynamic expression according to the expression segment, thereby being beneficial to improving convenience and efficiency of expression processing.
The embodiment of the present application provides another expression processing method, which is substantially the same as the expression processing method provided in the foregoing embodiment, so that details of the same are not repeated. The difference is that, in this embodiment, the voice information includes at least one piece of sub-voice information, the target text information includes at least one piece of sub-text information, and the voice feature parameters include: the volume parameters corresponding to the sub-voice information respectively; the generating the target text state image of the target avatar according to the target text information comprises the following steps: determining target sub-voice information corresponding to current sub-text information in the at least one sub-text information; determining the highlighting effect of the current sub-text information according to the volume parameter corresponding to the target sub-voice information, wherein the greater the volume parameter is, the more obvious the highlighting effect of the current sub-text information is; and generating the target text state image according to the current sub-text information and the highlighting effect of the current sub-text information.
Because the sub-text information corresponds to the semantic information, and the semantic information is determined according to the user's voice information, each piece of sub-text information can ultimately be matched to one piece of sub-voice information in the user's voice information. The electronic device may first determine the target sub-semantic information corresponding to the current sub-text information within the semantic information, and then determine the target sub-voice information from the user's voice information according to that target sub-semantic information.
The highlighting effect may include, for example, font size and/or font weight and/or font color. Specifically, if the highlighting effect is font size, the larger the volume parameter of the target sub-voice corresponding to the current sub-text information, the larger the font of that sub-text information; if the highlighting effect is font weight, the larger the volume parameter of the target sub-voice, the bolder the corresponding font; and if the highlighting effect is font color, the larger the volume parameter of the target sub-voice, the higher the color saturation of the corresponding font. Alternatively, when the highlighting effect includes all of the foregoing display effects, the font size, font weight, and font color of the current sub-text information may be determined simultaneously according to the volume parameter of the target sub-voice.
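One possible mapping from a sub-voice volume parameter to the highlighting effects named above; the concrete font-size range, bold threshold, and saturation scale are illustrative values chosen for this sketch.

```python
def highlight_style(volume, vol_min, vol_max):
    """Map a sub-voice volume parameter to a highlighting style.

    Louder speech produces a larger, bolder, more saturated font; the concrete
    ranges below are illustrative values only.
    """
    scale = 0.0 if vol_max == vol_min else (volume - vol_min) / (vol_max - vol_min)
    return {
        "font_size": round(14 + scale * 26),           # louder -> larger font
        "font_weight": "bold" if scale > 0.6 else "normal",
        "color_saturation": round(scale, 2),           # louder -> richer colour
    }
```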
In the embodiment of the application, the electronic device generates the target expression state image and the target text state image of the target avatar according to the image information and the target text information, and then generates the target dynamic expression according to the target expression state image and the target text state image. When generating the target text state image, the highlighting effect of the current sub-text information is determined according to the volume parameter corresponding to the target sub-voice information, and the target text state image is generated according to the current sub-text information and its highlighting effect, with a larger volume parameter producing a more obvious highlighting effect. This enriches the display content of the dynamic expression, improves the playability and interestingness of the avatar's dynamic expression, lets the highlighting of the target text information change with the volume parameter, improves the intelligence and flexibility of target dynamic expression generation, and further improves user experience.
In addition, in other embodiments, the voice feature parameter may include the above-mentioned speech speed parameter and the above-mentioned volume parameter, that is, when the electronic device generates the target text status image according to the voice feature parameter and at least one sub-text information, the electronic device may determine a display start time of each sub-text information according to the speech speed parameter, and determine a highlighting effect of each sub-text information according to the volume parameter.
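Tying the two parameters together, a per-sub-text description of the text state could be assembled as below, reusing the `display_start_times` and `highlight_style` sketches from earlier; again, this is an illustration under those assumptions rather than the claimed implementation.

```python
def build_text_states(sub_texts, segment_intervals, volumes):
    """Combine both feature parameters: timing from speech rate, style from volume."""
    times = display_start_times(segment_intervals)          # earlier sketch
    vol_min, vol_max = min(volumes), max(volumes)
    return [
        {"text": t, "start_time": s, "style": highlight_style(v, vol_min, vol_max)}
        for t, s, v in zip(sub_texts, times, volumes)
    ]
```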
Referring to fig. 3A, fig. 3A is a flowchart of another expression processing method according to an embodiment of the present application, and as shown in fig. 3A, the expression processing method includes the following steps:
s401, the electronic equipment displays the target avatar.
Before step S401, the electronic device may output a reference avatar selection interface in which a plurality of pre-stored reference avatars may be displayed. Fig. 3B is an example of such a reference avatar selection interface, in which a user may view different reference avatars by sliding or in other ways; the interface may further include a selection button 01, and when a selection operation of the user on the selection button 01 is detected, the electronic device may jump to the first interface and display the selected target avatar and the recording button.
In fig. 3B, the avatar a is a currently displayed reference avatar, and the user may display another avatar B after sliding the interface, and in addition, the selection interface may further include an avatar edit button 02, and the electronic device may jump to the avatar edit interface after detecting that the avatar edit button 02 is selected, and the user may adjust the target avatar.
S402, the electronic equipment detects the selection operation of a user for a recording button, and records voice information and image information of the user.
In step S402, the interface in which the electronic device displays the target avatar may be as shown in fig. 3C, with the recording button 03 displayed in the interface. When a selection operation of the recording button by the user is detected, recording of the user's voice information and image information may be started, and after it is detected that the user has canceled the selection operation on the recording button, collection of the feature information may be ended. The words "start recording" shown in the recording button 03 in fig. 3C are merely exemplary; in practical applications the wording may, for example, be "start recording an expression", which is not limited herein.
In addition, a return button 04 and the like may be displayed in the interface, and the reference avatar selection interface may be displayed again by jumping when a selection operation on the return button 04 is detected. Color selection buttons 05 may also be provided in the interface, and the user may click any one of the color buttons 05 to switch the background color of the avatar image (for example, the shaded area a).
S403, the electronic equipment adjusts and displays the picture content of each frame of image of the target virtual image according to the voice information and the image information.
S404, the electronic equipment displays the target frame image in the multi-frame images of the adjusted target avatar.
Referring to fig. 3D, fig. 3D illustrates two frame images of the target avatar displayed by the electronic device (the color selection buttons 05 and the recording button 03 in fig. 3D are the same as those in fig. 3C and are therefore not specifically labeled in fig. 3D). Each frame image includes an expression state image of the target avatar a and a text state image (the text in the text box C identified in the drawing). It should be noted that the bubble-style shape of the text box in fig. 3D is merely an example; in practical applications the text box may also be of any shape, such as a rectangle or a diamond, which is not specifically limited herein.
The target frame image comprises an expression state image and a text state image of the target virtual image in a current frame, and the target frame is any frame in the multi-frame image.
In the embodiment of the application, the electronic device displays the target virtual image, detects the selection operation of a user for the recording button, records the voice information and the image information of the user, adjusts and displays the picture content of each frame of image of the target virtual image according to the voice information and the image information, and finally displays the adjusted target frame image in the multi-frame image of the target virtual image.
In one possible example, the method further comprises: and detecting a selection operation of a user for a storage button, and storing the adjusted multi-frame image into a preset storage area.
The preset storage area may be a storage area corresponding to an album in the electronic device.
In a specific implementation, as shown in fig. 3E, the interface displaying the target avatar image may further include a save button 06, and when the electronic device detects a selection operation of the save button 06 by the user, it may perform a save operation on the multi-frame image. Further, the save button may indicate the save location to the user; for example, in fig. 3E, the save button displays the text "save to album". Of course, the text in the save button may be different, for example "save", which is not limited herein.
In this example, the electronic device detects a selection operation of the save button by the user, and saves the adjusted multi-frame image to a preset storage area, so that the multi-frame image is convenient for multiple subsequent use.
In one possible example, after the detecting the selection operation of the save button by the user, the method further includes: and jumping to an album interface and displaying the multi-frame images.
For example, referring to fig. 3F, fig. 3F shows the album interface after the jump, in which the user may view the multi-frame image. The user may also click a sharing button 07, a collection button 08, an image editing button 09, a deletion button 010, and the like in the album, and after detecting that one of these buttons is selected, the electronic device may correspondingly perform a sharing, collection, image editing, or deletion operation on the multi-frame image. The content in the dashed box of fig. 3F is an album thumbnail display area, where the user can choose to play or pause the multi-frame image.
In this example, after detecting the selection operation of the user on the save button, the electronic device may jump to the album interface and display the multi-frame image, so that the user can check the saved dynamic expression.
Referring to fig. 4A, fig. 4A is a functional unit block diagram of an expression processing apparatus according to an embodiment of the present application, and the apparatus 50 is applied to the electronic device shown in fig. 1A, and includes:
a display unit 501 for displaying a target avatar;
the acquisition unit 502 is configured to acquire feature information of a user, where the feature information includes voice information and image information;
a determining unit 503, configured to determine semantic information and voice feature parameters of the user according to the voice information, where the voice feature parameters are used to indicate audio features of spoken language of the user;
And a generating unit 504, configured to generate a target dynamic expression of the target avatar according to the image information, the voice feature parameter, and target text information, where the target text information is used to represent part or all of text information corresponding to the semantic information.
In one possible example, the determining unit 503 is specifically configured to: controlling the facial expression of the target virtual image according to the image information, and generating a target expression state image of the target virtual image; generating a target text state image of the target virtual image according to the target text information, wherein the display effect of the target text information in the target text state image is determined according to the voice characteristic parameters; and generating the target dynamic expression according to the target expression state image and the target text state image.
In one possible example, the target text information includes at least one sub-text information, and the voice feature parameter includes a speech rate parameter; in terms of the generating the target text state image of the target avatar according to the target text information, the determining unit 503 is specifically configured to: determining display start time of each piece of sub-text information in the at least one piece of sub-text information according to the speech rate parameter; and generating the target text state image according to the at least one piece of sub-text information and the display starting time, wherein different pieces of sub-text information in the target text state image are sequentially started to be displayed according to the display starting time.
In one possible example, the speech information includes at least one sub-speech information, the target text information includes at least one sub-text information, and the speech feature parameters include: the volume parameters corresponding to the sub-voice information respectively; in terms of the generating the target text state image of the target avatar according to the target text information, the determining unit 503 is specifically configured to: determining target sub-voice information corresponding to current sub-text information in the at least one sub-text information; determining the highlighting effect of the current sub-text information according to the volume parameter corresponding to the target sub-voice information, wherein the greater the volume parameter is, the more obvious the highlighting effect of the current sub-text information is; and generating the target text state image according to the current sub-text information and the highlighting effect of the current sub-text information.
In a possible example, the apparatus 50 further includes a first processing unit, configured to determine, if a display time interval between a first sub-text message and a second sub-text message, where display times are adjacent in the at least one sub-text message, is greater than a preset time interval, that a display area of the second sub-text content is a first text box, where the first text box is different from a second text box corresponding to the first sub-text message.
In a possible example, the apparatus 50 further includes a second processing unit, configured to, when detecting a language conversion request carrying a target language identifier for the target dynamic expression, convert the target text information into third text information, where a language of the third text information is the target language; determining mouth shape information of the target virtual image according to the third text information; controlling the facial expression of the target virtual image according to the mouth shape information and the image information, and generating a first expression state image of the target virtual image; generating a first text state image of the target virtual image according to the third text information, wherein the display effect of the third text information in the first text state image is determined according to the voice characteristic parameters; and generating a first dynamic expression of the target virtual image according to the first expression state image and the first text state image.
In one possible example, the apparatus 50 further includes a third processing unit configured to output the target text information when a cut request for the target dynamic expression is detected; when a selection request for fourth text information in the target text information is detected, determining an expression fragment corresponding to the fourth text information from the target dynamic expression; and when a determination request for the expression segment is detected, generating a second dynamic expression according to the expression segment.
In the case of using an integrated unit, a functional unit composition block diagram of another expression processing apparatus provided in the embodiment of the present application is shown in fig. 4B. In fig. 4B, the expression processing apparatus includes: a processing module 510 and a communication module 511. The processing module 510 is configured to control and manage actions of the expression processing apparatus, for example, steps performed by the display unit 501, the acquisition unit 502, the determination unit 503, the generation unit 504, and/or other processes for performing the techniques described herein. The communication module 511 is used to support interaction between the expression processing apparatus and other devices. As shown in fig. 4B, the expression processing apparatus may further include a storage module 512, where the storage module 512 is configured to store program codes and data of the expression processing apparatus.
The processing module 510 may be a processor or a controller, for example a central processing unit (Central Processing Unit, CPU), a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and it may implement or perform the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor may also be a combination implementing computing functions, for example a combination including one or more microprocessors, a combination of a DSP and a microprocessor, and the like. The communication module 511 may be a transceiver, an RF circuit, a communication interface, or the like. The storage module 512 may be a memory.
For all relevant details of the scenarios involved in the above method embodiments, reference may be made to the functional descriptions of the corresponding functional modules, which are not repeated here. The expression processing apparatus may perform the steps performed by the electronic device in the expression processing method shown in fig. 2.
Referring to fig. 5A, fig. 5A is a functional unit block diagram of yet another expression processing apparatus according to an embodiment of the present application, and the apparatus 60 is applied to the electronic device shown in fig. 1A, and includes:
a first display unit 601 for displaying a target avatar;
a recording unit 602, configured to detect a selection operation of a recording button by a user, and record voice information and image information of the user;
a second display unit 603 for adjusting and displaying a picture content of each frame image of the target avatar according to the voice information and the image information;
and a third display unit 604, configured to display a target frame image in the adjusted multi-frame images of the target avatar, where the target frame image includes an expression state image and a text state image of the target avatar in a current frame, and the target frame is any frame in the multi-frame images.
In a possible example, the apparatus 60 further includes a first processing unit, configured to detect a selection operation of the save button by a user, and save the adjusted multi-frame image to a preset storage area.
In a possible example, the apparatus 60 further includes a second processing unit for jumping to an album interface and displaying the multi-frame images after the selection operation of the save button by the user is detected.
When an integrated unit is used, fig. 5B shows a functional unit block diagram of still another expression processing apparatus provided in an embodiment of the present application. In fig. 5B, the expression processing apparatus includes a processing module 610 and a communication module 611. The processing module 610 is configured to control and manage the actions of the expression processing apparatus, for example the steps performed by the first display unit 601, the recording unit 602, the second display unit 603 and the third display unit 604, and/or other processes for performing the techniques described herein. The communication module 611 is configured to support interaction between the expression processing apparatus and other devices. As shown in fig. 5B, the expression processing apparatus may further include a storage module 612 configured to store the program codes and data of the expression processing apparatus.
The processing module 610 may be a processor or a controller, such as a central processing unit (Central Processing Unit, CPU), a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or perform the various exemplary logical blocks, modules, and circuits described in connection with this disclosure. The processor may also be a combination that implements a computing function, for example a combination including one or more microprocessors, or a combination of a DSP and a microprocessor. The communication module 611 may be a transceiver, an RF circuit, a communication interface, or the like. The storage module 612 may be a memory.
For all relevant details of the scenarios involved in the above method embodiment, reference may be made to the functional descriptions of the corresponding functional modules, which are not repeated here. The expression processing apparatus may perform the steps performed by the electronic device in the expression processing method shown in fig. 3A.
The present application also provides a computer storage medium storing a computer program for electronic data exchange, the computer program causing a computer to execute some or all of the steps of any one of the methods described in the method embodiments above.
Embodiments of the present application also provide a computer program product, including a computer program, which when executed by a processor, implements some or all of the steps of any of the methods described in the embodiments of the methods described above. The computer program product may be a software installation package.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but those skilled in the art should understand that the present application is not limited by the described order of actions, as some steps may be performed in another order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the actions and elements involved are not necessarily required by the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative; the division into units described above is merely a division of logical functions, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical or in other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units described above, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned memory includes various media capable of storing program codes, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in the various methods of the above embodiments may be implemented by a program that instructs associated hardware, and the program may be stored in a computer-readable memory, which may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the above description of the embodiments is provided only to help understand the methods of the present application and their core ideas. Meanwhile, a person skilled in the art may make changes to the specific implementations and the application scope according to the ideas of the present application. In view of the above, the contents of this specification should not be construed as limiting the present application.

Claims (14)

1. An expression processing method, characterized by comprising:
displaying a target avatar;
collecting characteristic information of a user, wherein the characteristic information comprises voice information and image information;
determining semantic information and voice characteristic parameters of the user according to the voice information, wherein the voice characteristic parameters are used for indicating the audio characteristics of the spoken language of the user;
and generating a target dynamic expression of the target virtual image according to the image information, the voice characteristic parameters and target text information, wherein the target text information is used for representing part or all of text information corresponding to the semantic information.
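To illustrate the step of deriving the voice characteristic parameters, the Python sketch below computes a speech rate and an overall volume from hypothetical speech-recognition output (timed words) and raw audio samples. The function names and the choice of words-per-second and RMS volume as the parameters are assumptions made for this example only.

```python
import math
from typing import List, Tuple


def voice_characteristic_parameters(words: List[Tuple[str, float, float]],
                                    samples: List[float]) -> dict:
    """words: (text, start_s, end_s) triples from a speech recognizer;
    samples: audio samples normalized to [-1, 1]."""
    duration = max(end for _, _, end in words) - min(start for _, start, _ in words)
    speech_rate = len(words) / duration if duration > 0 else 0.0                       # words per second
    volume = math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0  # RMS volume
    return {"speech_rate": speech_rate, "volume": volume}


# The semantic information would be the recognized text itself; the parameters come out as numbers.
timed_words = [("nice", 0.0, 0.3), ("to", 0.3, 0.4), ("meet", 0.4, 0.7), ("you", 0.7, 0.9)]
print(voice_characteristic_parameters(timed_words, [0.1, -0.2, 0.05, 0.3]))
```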
2. The method of claim 1, wherein the generating the target dynamic expression of the target avatar from the image information, the voice feature parameters, and target text information comprises:
controlling the facial expression of the target virtual image according to the image information, and generating a target expression state image of the target virtual image;
generating a target text state image of the target virtual image according to the target text information, wherein the display effect of the target text information in the target text state image is determined according to the voice characteristic parameters;
and generating the target dynamic expression according to the target expression state image and the target text state image.
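A minimal sketch of the final step of claim 2, assuming the per-frame expression and text layers have already been rendered and composed with Pillow: the composed frames are simply written out as an animated image that serves as the target dynamic expression. The file name, frame duration, and placeholder frame contents are illustrative assumptions.

```python
from PIL import Image

# Placeholder composed frames (expression state image already merged with the text state image per frame).
composed_frames = [Image.new("RGBA", (320, 320), (255, 255, 255, 255)) for _ in range(25)]

# Convert to palette mode for GIF output and assemble the animation: 40 ms per frame = 25 fps.
gif_frames = [frame.convert("P") for frame in composed_frames]
gif_frames[0].save("target_dynamic_expression.gif", save_all=True,
                   append_images=gif_frames[1:], duration=40, loop=0)
```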
3. The method of claim 2, wherein the target text information comprises at least one piece of sub-text information and the voice characteristic parameters comprise a speech rate parameter; the generating the target text state image of the target avatar according to the target text information comprises the following steps:
determining a display start time of each piece of sub-text information in the at least one piece of sub-text information according to the speech rate parameter;
and generating the target text state image according to the at least one piece of sub-text information and the display starting time, wherein different pieces of sub-text information in the target text state image are sequentially started to be displayed according to the display starting time.
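As a non-limiting illustration of claim 3, the sketch below assumes the speech rate parameter is expressed as characters per second and that sub-texts are shown one after another; the computed start times are when each piece of sub-text begins to be displayed. Units and values are hypothetical.

```python
from typing import List


def display_start_times(sub_texts: List[str], chars_per_second: float) -> List[float]:
    """Each sub-text starts displaying when the previous one has been 'spoken'."""
    starts, t = [], 0.0
    for sub in sub_texts:
        starts.append(t)
        t += len(sub) / chars_per_second   # duration of a sub-text scales with the speech rate
    return starts


print(display_start_times(["nice", "to", "meet", "you"], chars_per_second=8.0))
# -> [0.0, 0.5, 0.75, 1.25]
```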
4. The method of claim 2, wherein the voice information comprises at least one piece of sub-voice information, the target text information comprises at least one piece of sub-text information, and the voice characteristic parameters comprise volume parameters respectively corresponding to the pieces of sub-voice information; the generating the target text state image of the target avatar according to the target text information comprises the following steps:
determining target sub-voice information corresponding to current sub-text information in the at least one sub-text information;
determining the highlighting effect of the current sub-text information according to the volume parameter corresponding to the target sub-voice information, wherein the greater the volume parameter is, the more obvious the highlighting effect of the current sub-text information is;
and generating the target text state image according to the current sub-text information and the highlighting effect of the current sub-text information.
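By way of illustration of claim 4, one simple way to make the highlighting effect grow with the volume parameter is to scale rendering attributes of the current sub-text. The attribute names, ranges, and thresholds below are assumptions made for this sketch, not prescribed values.

```python
def highlight_style(volume: float, vol_min: float = 0.0, vol_max: float = 1.0) -> dict:
    """Map a volume parameter to rendering attributes for the current sub-text."""
    ratio = (volume - vol_min) / (vol_max - vol_min)
    ratio = min(max(ratio, 0.0), 1.0)                  # clamp to [0, 1]
    return {
        "font_size": int(24 + ratio * 24),             # 24 px when quiet, 48 px when loud
        "bold": ratio > 0.5,                           # bold once speech is loud enough
        "opacity": 0.6 + 0.4 * ratio,                  # louder text is rendered more opaque
    }


print(highlight_style(0.2))   # subtle highlighting
print(highlight_style(0.9))   # obvious highlighting
```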
5. The method of claim 3, wherein the method further comprises:
if a display time interval between first sub-text information and second sub-text information with adjacent display times in the at least one piece of sub-text information is greater than a preset time interval, determining that a display area of the second sub-text information is a first text box, wherein the first text box is different from a second text box corresponding to the first sub-text information.
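A small sketch of the rule in claim 5, under the assumption that each piece of sub-text carries its display start time: whenever the gap to the previous sub-text exceeds a preset interval, the next sub-text opens a new text box. The threshold value is illustrative only.

```python
from typing import List, Tuple


def assign_text_boxes(timed_subs: List[Tuple[str, float]],
                      preset_interval_s: float = 1.0) -> List[int]:
    """Return a text-box index for each (sub_text, display_start_time) pair."""
    boxes, box = [], 0
    for i, (_, start) in enumerate(timed_subs):
        if i > 0 and start - timed_subs[i - 1][1] > preset_interval_s:
            box += 1                      # long pause: open a new text box
        boxes.append(box)
    return boxes


print(assign_text_boxes([("hi", 0.0), ("there", 0.4), ("bye", 2.0)]))
# -> [0, 0, 1]   "bye" appears in a different text box after the pause
```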
6. The method according to claim 2, wherein the method further comprises:
when a language conversion request carrying a target language mark aiming at the target dynamic expression is detected, converting the target text information into third text information, wherein the language of the third text information is the target language;
determining mouth shape information of the target virtual image according to the third text information;
controlling the facial expression of the target virtual image according to the mouth shape information and the image information, and generating a first expression state image of the target virtual image;
generating a first text state image of the target virtual image according to the third text information, wherein the display effect of the third text information in the first text state image is determined according to the voice characteristic parameters;
and generating a first dynamic expression of the target virtual image according to the first expression state image and the first text state image.
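The flow of claim 6 can be summarized as: translate the target text, derive mouth shape (viseme) information from the translated text, and then re-drive the facial expression with both the mouth shapes and the captured image information. The sketch below mirrors that flow with hypothetical placeholder functions; translate_text and text_to_visemes stand in for a real translation engine and mouth-shape model and are not actual APIs.

```python
from typing import List


def translate_text(text: str, target_language: str) -> str:
    # Placeholder: in practice a translation engine would be called here.
    return {"fr": {"hello": "bonjour"}}.get(target_language, {}).get(text, text)


def text_to_visemes(text: str) -> List[str]:
    # Placeholder: derive mouth-shape (viseme) labels from the translated text.
    return [ch for ch in text if ch.isalpha()]


def convert_dynamic_expression(target_text: str, target_language: str,
                               image_info: List[dict]) -> dict:
    third_text = translate_text(target_text, target_language)   # third text information
    mouth_shapes = text_to_visemes(third_text)                   # mouth shape information
    # The facial expression of each frame would be driven jointly by the captured
    # image information and the derived mouth shapes before re-rendering.
    return {"text": third_text, "visemes": mouth_shapes, "frames": len(image_info)}


print(convert_dynamic_expression("hello", "fr", [{"frame": 0}, {"frame": 1}]))
```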
7. The method according to any one of claims 1-6, further comprising:
outputting the target text information when a cutting request for the target dynamic expression is detected;
when a selection request for fourth text information in the target text information is detected, determining an expression fragment corresponding to the fourth text information from the target dynamic expression;
and when a determination request for the expression fragment is detected, generating a second dynamic expression according to the expression fragment.
8. An expression processing method, characterized by comprising:
displaying a target avatar;
detecting the selection operation of a user for a recording button, and recording voice information and image information of the user;
adjusting and displaying the picture content of each frame of image of the target virtual image according to the voice information and the image information;
displaying a target frame image in the multi-frame images of the adjusted target avatar, wherein the target frame image comprises an expression state image and a text state image of the target avatar in a current frame, and the target frame is any frame in the multi-frame images.
9. The method of claim 8, wherein the method further comprises:
and detecting a selection operation of a user for a storage button, and storing the adjusted multi-frame image into a preset storage area.
10. The method of claim 9, wherein after the detecting the selection operation of the save button by the user, the method further comprises:
and jumping to an album interface and displaying the multi-frame images.
11. An expression processing apparatus, characterized by comprising:
a display unit for displaying a target avatar;
an acquisition unit, configured to collect characteristic information of a user, wherein the characteristic information comprises voice information and image information;
a determining unit, configured to determine semantic information and voice characteristic parameters of the user according to the voice information, wherein the voice characteristic parameters are used for indicating audio features of the spoken language of the user;
and a generation unit, configured to generate a target dynamic expression of the target avatar according to the image information, the voice characteristic parameters and target text information, wherein the target text information is used for representing part or all of text information corresponding to the semantic information.
12. An expression processing apparatus, characterized by comprising:
a first display unit, configured to display a target avatar;
a recording unit, configured to detect a selection operation of a user for a recording button, and record voice information and image information of the user;
a second display unit, configured to adjust and display the picture content of each frame of image of the target avatar according to the voice information and the image information;
and a third display unit, configured to display a target frame image in the adjusted multi-frame images of the target avatar, wherein the target frame image comprises an expression state image and a text state image of the target avatar in a current frame, and the target frame image is any frame in the multi-frame images.
13. An electronic device comprising a processor, a memory, a communication interface, and one or more programs stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps in the method of any of claims 1-7 or 8-10.
14. A computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to perform the steps in the method according to any one of claims 1-7 or 8-10.
CN202111389932.6A 2021-11-22 2021-11-22 Expression processing method and related device Pending CN116152400A (en)

Priority Applications (1)

Application Number: CN202111389932.6A · Priority Date: 2021-11-22 · Filing Date: 2021-11-22 · Title: Expression processing method and related device

Applications Claiming Priority (1)

Application Number: CN202111389932.6A · Priority Date: 2021-11-22 · Filing Date: 2021-11-22 · Title: Expression processing method and related device

Publications (1)

Publication Number: CN116152400A · Publication Date: 2023-05-23

Family

ID=86351164

Family Applications (1)

Application Number: CN202111389932.6A · Title: Expression processing method and related device · Priority Date: 2021-11-22 · Filing Date: 2021-11-22 · Status: Pending

Country Status (1)

Country: CN · Link: CN116152400A (en)

Similar Documents

Publication Publication Date Title
US11645804B2 (en) Dynamic emoticon-generating method, computer-readable storage medium and computer device
CN111817943B (en) Data processing method and device based on instant messaging application
CN109819313B (en) Video processing method, device and storage medium
CN110968736B (en) Video generation method and device, electronic equipment and storage medium
CN107294838B (en) Animation generation method, device and system for social application and terminal
CN109474845B (en) Bullet screen control method, bullet screen processing server and computer readable storage medium
CN112367551B (en) Video editing method and device, electronic equipment and readable storage medium
CN110557678A (en) Video processing method, device and equipment
CN112188266A (en) Video generation method and device and electronic equipment
US20230047858A1 (en) Method, apparatus, electronic device, computer-readable storage medium, and computer program product for video communication
CN113691854A (en) Video creation method and device, electronic equipment and computer program product
CN103796181A (en) Playing method of sending message, system and related equipment thereof
US20230317117A1 (en) Video generation method and apparatus, device, and storage medium
KR20210118428A (en) Systems and methods for providing personalized video
KR20230021640A (en) Customize soundtracks and hair styles in editable videos for multimedia messaging applications
CN114880062A (en) Chat expression display method and device, electronic device and storage medium
CN113806570A (en) Image generation method and generation device, electronic device and storage medium
US9697632B2 (en) Information processing apparatus, information processing method, and program
CN108845741A (en) A kind of generation method, client, terminal and the storage medium of AR expression
US20230030502A1 (en) Information play control method and apparatus, electronic device, computer-readable storage medium and computer program product
CN116152400A (en) Expression processing method and related device
CN113875227A (en) Information processing apparatus, information processing method, and program
JP2017045374A (en) Information processing device and program
CN108052578B (en) Method and apparatus for information processing
CN117459789A (en) Expression package generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination