CN117372588A - Method and device for generating mouth shape image

Method and device for generating mouth shape image

Info

Publication number
CN117372588A
Authority
CN
China
Prior art keywords
pinyin
phonemes
mouth shape
basic
virtual objects
Prior art date
Legal status
Pending
Application number
CN202210770424.0A
Other languages
Chinese (zh)
Inventor
吴贺康
Current Assignee
Perfect World Beijing Software Technology Development Co Ltd
Original Assignee
Perfect World Beijing Software Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Perfect World Beijing Software Technology Development Co Ltd
Priority to CN202210770424.0A
Publication of CN117372588A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/80 - 2D [Two Dimensional] animation, e.g. using sprites
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/205 - Parsing
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/24 - Speech recognition using non-acoustical features
    • G10L 15/25 - Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 - Monomedia components thereof
    • H04N 21/816 - Monomedia components thereof involving special video data, e.g. 3D video

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The embodiment of the invention provides a method and a device for generating mouth shape images. The method comprises the following steps: for dubbing material to be processed, identifying a plurality of virtual objects contained in the dubbing material and the text information corresponding to each of those virtual objects; for the text information corresponding to each virtual object, parsing the text information to obtain the pinyin phonemes that form the pronunciation of each character in it; obtaining the mouth shapes corresponding to the pinyin phonemes according to a preset mapping relation between pinyin phonemes and mouth shapes; and mapping the mouth shapes into the skeletal model of each virtual object through a deformer to generate mouth shape images matching the virtual objects. The method converts the dubbing material of multiple virtual objects into the pinyin phonemes that form each character's pronunciation and converts those phonemes into mouth shape images of the virtual objects, which improves the efficiency of mouth shape image generation, brings the mouth shapes closer to the actual mouth shapes of Chinese pronunciation, and optimizes the audio-visual effect of the mouth shape images.

Description

Method and device for generating mouth shape image
Technical Field
The present invention relates to the field of image technologies, and in particular, to a method and an apparatus for generating a mouth shape image.
Background
In scenes such as games, videos and live webcasts, mouth shape animation corresponding to character audio needs to be adapted for a virtual character, so that the mouth movements in the animation match the pronunciation in the audio and the virtual character appears more authentic. Virtual characters are, for example, game characters, characters in film and television works, and the avatars of hosts in webcasts.
In the related art, most approaches do not support Chinese pronunciation rules, which makes the resulting mouth shape animation of virtual characters poor, so such animation is currently produced manually by technicians. In this production scheme, technicians collect the facial data of actors through face capture technology and then produce the mouth shape animation on the basis of that data, combined with the design of the virtual character. This way of generating mouth shape animation has a low degree of automation and poor production efficiency, and it is difficult to cope with large-scale generation of virtual character mouth shape animation. In summary, how to automatically generate mouth shape animation for virtual characters is a technical problem to be solved.
Disclosure of Invention
The embodiment of the invention provides a method and a device for generating mouth shape images, which realize automatic conversion from dubbing material to mouth shape images, improve the efficiency of mouth shape image generation, and optimize the audio-visual effect of the resulting images.
In a first aspect, an embodiment of the present invention provides a method for generating a mouth shape image, including:
for dubbing material to be processed, identifying a plurality of virtual objects contained in the dubbing material and the text information corresponding to each of those virtual objects, wherein the dubbing material comprises audio data and/or text data produced for the plurality of virtual objects;
for the text information corresponding to each virtual object, parsing the text information to obtain the pinyin phonemes that form the pronunciation of each character in it, wherein the pinyin phonemes comprise initial (consonant) phonemes and/or final (vowel) phonemes;
obtaining the mouth shapes corresponding to the pinyin phonemes according to a preset mapping relation between pinyin phonemes and mouth shapes, wherein the mouth shapes comprise initial mouth shapes and/or final mouth shapes;
mapping the mouth shapes into the skeletal model of each of the plurality of virtual objects through a deformer to generate mouth shape images matching the plurality of virtual objects.
In one possible embodiment, identifying, for the dubbing material to be processed, the plurality of virtual objects contained in the dubbing material and the text information corresponding to each of them includes:
receiving audio data input by a user, identifying a plurality of virtual objects from the input audio data through cloud computing, and extracting the text information corresponding to each virtual object from the audio data; or receiving text data edited and input by a user, and extracting the text information corresponding to each virtual object from the text data.
In one possible embodiment, the method further comprises:
acquiring the dubbing speech rate corresponding to the dubbing material; and, if the dubbing speech rate is greater than a preset threshold, selecting keywords from the text information extracted from the audio data or the text data, and taking the selected keywords as the text information.
In one possible embodiment, obtaining the mouth shapes corresponding to the pinyin phonemes according to the preset mapping relation between pinyin phonemes and mouth shapes includes:
judging whether a pinyin phoneme is a combined sound; obtaining, based on the judgment result, a basic pinyin phoneme consistent with or approximate to the pinyin phoneme; and obtaining the basic mouth shape corresponding to that basic pinyin phoneme based on the mapping relation between basic pinyin phonemes and basic mouth shapes, taking it as the mouth shape corresponding to the pinyin phoneme.
In one possible embodiment, obtaining a basic pinyin phoneme consistent with or approximate to the pinyin phoneme based on the judgment result includes:
if the pinyin phoneme is a combined sound, splitting it to obtain the several component phonemes that form its pronunciation; searching, for each component phoneme, a matching target phoneme in the pinyin phoneme groups of a basic pinyin configuration table, where each group comprises a basic pinyin phoneme together with approximate pinyin phonemes whose mouth shapes during pronunciation resemble it; and, for each matched target phoneme, obtaining the basic pinyin phoneme of the group in which that target phoneme lies, and combining these basic pinyin phonemes into the basic pinyin phonemes corresponding to the original pinyin phoneme.
In one possible embodiment, obtaining a basic pinyin phoneme consistent with or approximate to the pinyin phoneme based on the judgment result includes:
if the pinyin phoneme is not a combined sound, searching a target phoneme matching it in the pinyin phoneme groups of the basic pinyin configuration table, where each group comprises a basic pinyin phoneme together with approximate pinyin phonemes whose mouth shapes during pronunciation resemble it; and taking the basic pinyin phoneme of the group in which the target phoneme lies as the basic pinyin phoneme corresponding to the pinyin phoneme.
In one possible embodiment, the basic pinyin phonemes in the basic pinyin configuration table include at least one of b, d, g, i, zh, z and u, where the pinyin phoneme group corresponding to b contains the approximate pinyin phonemes b, p, m and f; the group corresponding to d contains d, t, n and l; the group corresponding to g contains g, k and h; the group corresponding to i contains i, j, q, x and y; the group corresponding to zh contains zh, ch, sh and r; the group corresponding to z contains z, c and s; and the group corresponding to u contains u and w.
In one possible embodiment, the method further comprises: detecting whether pinyin phonemes adjacent in position meet a set condition; if they do, judging whether the pronunciation type of the adjacent pinyin phonemes is a plosive or a stop sound; and, if the pronunciation type is neither a plosive nor a stop, joining the mouth shapes corresponding to the adjacent pinyin phonemes into a continuous mouth shape.
In one possible embodiment, the set condition includes at least one of the following: the adjacent pinyin phonemes form a reduplicated word; the adjacent pinyin phonemes contain the same final phonemes, or final phonemes with a similar degree of mouth opening during pronunciation.
In one possible embodiment, in response to an instruction to export the mouth shape images, the mouth shape images are exported according to the animation export scheme corresponding to the project in which the plurality of virtual objects are located, the plurality of virtual objects being a plurality of game characters in that project.
In one possible embodiment, the method further comprises: identifying pause marks in the dubbing material; segmenting the text information according to the pause marks to obtain the character immediately preceding each pause mark; and applying delayed-closing processing to the mouth shape corresponding to that character, so that the mouth shape image matches the pause rhythm of the dubbing material.
In one possible embodiment, the method further comprises: acquiring volume information of the dubbing material, and adjusting the mouth shape variation amplitude of the virtual object in the mouth shape image according to the volume information, where the larger the volume, the larger the variation amplitude.
In one possible embodiment, the method further comprises: performing semantic recognition on the dubbing material; judging, based on the recognition result, whether the dubbing material meets a preset condition; and, if it does, adding to the mouth shape image a specific visual element associated with the virtual object, where the specific visual element comprises a facial expression and/or an action bound to the skeletal model.
In one possible embodiment, the association of the virtual object with the specific visual element includes: an association between the virtual object itself and the specific visual element; and/or an association between a preset sentence of the virtual object and the specific visual element; and/or an association between a preset scenario in the dubbing material and the specific visual element.
In a second aspect, an embodiment of the present invention provides a mouth shape image generating device, including:
the identification module, used for identifying, for dubbing material to be processed, a plurality of virtual objects contained in the dubbing material and the text information corresponding to each of them, wherein the dubbing material comprises audio data and/or text data produced for the plurality of virtual objects;
the parsing module, used for parsing the text information corresponding to each virtual object to obtain the pinyin phonemes forming the pronunciation of each character in it, wherein the pinyin phonemes comprise initial phonemes and/or final phonemes;
the acquisition module, used for obtaining the mouth shapes corresponding to the pinyin phonemes according to a preset mapping relation between pinyin phonemes and mouth shapes, wherein the mouth shapes comprise initial mouth shapes and/or final mouth shapes;
and the generation module, used for mapping the mouth shapes into the skeletal model of each of the plurality of virtual objects through a deformer, so as to generate mouth shape images matching the plurality of virtual objects.
Embodiments of the present invention also provide a system including a processor and a memory having at least one instruction, at least one program, code set, or instruction set stored therein, the at least one instruction, at least one program, code set, or instruction set being loaded and executed by the processor to implement the method of generating a mouth shape image as described above.
Embodiments of the present invention provide a computer readable medium having stored thereon at least one instruction, at least one program, code set, or instruction set, loaded and executed by a processor to implement the method of generating a mouth shape image described above.
In the embodiment of the invention, audio data and/or text data produced for virtual objects are referred to as dubbing material. For the dubbing material to be processed, the plurality of virtual objects it contains and the text information corresponding to each of them are first identified. The text information corresponding to each virtual object is then parsed to obtain the pinyin phonemes forming the pronunciation of each character, where the pinyin phonemes comprise initial phonemes and/or final phonemes. Because the pinyin phonemes forming each character's pronunciation conform to the basic rules of Chinese pinyin, the mouth shapes of the virtual object generated from them are closer to the actual mouth shapes of Chinese pronunciation. On this basis, the mouth shapes corresponding to the pinyin phonemes, comprising initial mouth shapes and/or final mouth shapes, are obtained according to a preset mapping relation between pinyin phonemes and mouth shapes; the mouth shapes are then mapped through a deformer into the skeletal model of each of the plurality of virtual objects to generate mouth shape images matching those virtual objects. By converting the text information of each of the plurality of virtual objects in the dubbing material into pinyin phonemes conforming to Chinese pinyin rules, and converting those phonemes into mouth shape images, the embodiment generates mouth shape images that obey Chinese pronunciation rules and completes the automatic conversion from dubbing material to mouth shape images. This effectively avoids the poor production efficiency of manually produced mouth shape images in the related art, greatly improves production efficiency, and meets the demand for batch production of mouth shape images in practical applications. Moreover, because the conversion runs from the pinyin phonemes forming each character's pronunciation to the mouth shapes of the virtual object, the mouth shapes are closer to the actual mouth shapes of Chinese pronunciation and are more natural and smooth, which greatly improves the synchronization and accuracy between the mouth shape images and the dubbing material and optimizes the audio-visual effect of the mouth shape images.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for generating a mouth shape image according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a mouth shape image according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a mouth shape image generating device according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device corresponding to the mouth shape image generating device provided in the embodiment shown in fig. 3.
Detailed Description
The present disclosure will now be discussed with reference to several exemplary embodiments. It should be understood that these embodiments are discussed only to enable those of ordinary skill in the art to better understand and thus practice the teachings of the present invention, and are not meant to imply any limitation on the scope of the invention.
As used herein, the term "comprising" and its variants are to be interpreted as the open-ended term "including but not limited to". The term "based on" is to be interpreted as "based at least in part on". The terms "one embodiment" and "an embodiment" are to be interpreted as "at least one embodiment". The term "another embodiment" is to be interpreted as "at least one other embodiment".
In addition, the sequence of steps in the method embodiments described below is only an example and is not strictly limited.
At present, in scenes such as games, videos and live webcasts, mouth shape animation corresponding to character audio needs to be adapted for a virtual character, so that the mouth movements in the animation match the pronunciation in the audio and the virtual character appears more authentic. Virtual characters are, for example, game characters, characters in film and television works, and the avatars of hosts in webcasts.
The applicant found that most of the related art does not support Chinese pronunciation rules, which results in poor mouth shape animation for virtual characters, so manual production of virtual character mouth shape animation by technicians remains the main scheme at present. In this production scheme, technicians collect the facial data of actors through face capture technology and then produce the mouth shape animation on the basis of that data, combined with the design of the virtual character.
The applicant also found that this way of generating mouth shape animation has a low degree of automation and poor production efficiency, and that it is difficult to cope with large-scale generation of virtual character mouth shape animation. For example, in a game development project, because the facial expression changes of different game characters have different styles, their mouth shape animations cannot be reused; technicians must produce mouth shape animation separately for each game character, so animation production efficiency is poor and game development efficiency is greatly reduced.
In summary, how to automatically generate mouth shape animation for virtual characters is a technical problem to be solved.
The mouth shape image generation scheme provided by the embodiment of the invention can be executed by an electronic device, and the electronic device can be a terminal device such as a smart phone, a tablet computer, a PC (personal computer), a notebook computer and the like. In an alternative embodiment, the electronic device may have an application installed thereon for executing the mouth shape image generation scheme. Alternatively, in another alternative embodiment, the mouth shape image generating scheme may be performed by the server device and the terminal device in cooperation.
For example, assume that a first service program loads a virtual scene. The scheme may be implemented as a second service program for displaying virtual characters (i.e., virtual objects) in the virtual scene. The second service program can connect to the first service program; based on the second service program, the mouth shape images of the virtual characters loaded by the first service program can be produced and adjusted, and the virtual objects loaded by the first service program displayed in real time. Real-time display is understood here to mean presenting each frame of the virtual character's mouth shape image in real time.
In practical applications, the first service program is, for example, a virtual scene editor or a game editor, and the second service program is, for example, a plug-in installed on the first service program. Of course, the second service program may be an application program independent of the first service program, besides the plug-in, and the present invention is not limited thereto.
The scheme provided by the embodiment of the invention is suitable for various mouth shape image making scenes, such as a virtual object mouth shape image generating scene, a modifying scene and an optimizing scene. For example, mouth-shaped image production scenes in the fields of games, movie works, web living broadcast, and the like.
In view of the foregoing technical problems, in some embodiments of the present invention, a solution is provided, and in the following, the technical solutions provided by the embodiments of the present invention are described in detail with reference to the accompanying drawings.
The following describes the execution of the mouth shape image producing method with reference to the following embodiments. Fig. 1 is a flowchart of a method for generating a mouth shape image according to an embodiment of the present invention. As shown in fig. 1, the mouth shape image generation method includes the steps of:
101. for the dubbing material to be processed, identifying a plurality of virtual objects contained in the dubbing material and text information corresponding to each of the plurality of virtual objects;
102. for the text information corresponding to each virtual object, analyzing the text information to obtain pinyin phonemes for forming pronunciation of each text in the text information;
103. acquiring a mouth shape corresponding to the pinyin phonemes according to a preset mapping relation between the pinyin phonemes and the mouth shape;
104. The mouth shape is mapped into a skeletal model of each of the plurality of virtual objects by a deformer to generate mouth shape images that match the plurality of virtual objects.
The method for generating mouth shape images in the embodiment of the invention is applied in an application program, which can be installed on a terminal device. The application loads virtual objects in a virtual scene. A virtual object is implemented, for example, as a game character in a game, a character in a film or television work, or the avatar of a host in a webcast.
In 101, for the dubbing material to be processed, the plurality of virtual objects it contains and the text information corresponding to each of them are identified. In the embodiment of the invention, the dubbing material includes, but is not limited to, audio data and/or text data corresponding to the plurality of virtual objects. Taking virtual characters in a game as an example, the dubbing material can be the dubbing files corresponding to the characters or the dialogue text corresponding to them. Optionally, dubbing material may be imported per game level (checkpoint) or per virtual character.
In the embodiment of the invention, after receiving the audio data and/or text data input by the user, the plurality of virtual objects may be identified from that data, and the data segments corresponding to the virtual objects split out of it as the dubbing material. For example, the several characters contained in a dubbing file are identified through cloud computing, and the dubbing text corresponding to each character is extracted from the file. Further optionally, different characters correspond to different types of skeletal model parameters, or different skeletal models are bound to different characters, so that, through those parameters or models, the mouth shape animation of each character carries the action style of its character type.
In an alternative embodiment, audio data input by the user is received, a plurality of virtual objects are identified from it through cloud computing, and the text information corresponding to each virtual object is extracted from the audio data. In practice, the audio data is, for example, an audio file of each virtual character in a game or a film or television work.
Specifically, in the above embodiment, dubbing files for multiple characters may be imported and the dubbing text corresponding to each character obtained through speech recognition. Alternatively, dubbing files for multiple characters may be imported, such as the dialogue audio of several characters at a given game level, or the guiding voice lines triggered by different characters at the same level, and the corresponding dubbing text extracted per character through cloud computing. This way of acquiring material extracts the corresponding text information from the audio data, which reduces the difficulty of subsequent material processing and further improves the generation efficiency of virtual character mouth shape images. Optionally, other parameter information may also be extracted from the audio data, including but not limited to the name, path, duration and volume of the audio file, and presented to the user; for example, the dubbing text and such parameters are extracted from the dubbing file so that they can be viewed and edited in an audio panel.
In another alternative embodiment, text data edited and input by the user for the plurality of virtual objects is received, and the text information corresponding to each virtual object is extracted from it. For example, in a text panel, text information input by the user is received and aligned with the timeline. Optionally, a start time, an end time or a display duration for the text information is received from the user, and the correspondence between the text information and the timeline is adjusted based on this timing information. Optionally, the automatically acquired correspondence between text information and timeline can also be adjusted in the text panel to align the two, further improving the synchronization of the mouth shape images.
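To make the recognition step concrete, the following minimal sketch (in Python) shows the kind of per-line record such a pipeline could produce; the class and field names (DubbingLine, character_id and so on) are illustrative assumptions, not terminology from the patent.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DubbingLine:
    """One recognized line of dubbing material for a single virtual object."""
    character_id: str                 # which virtual object speaks this line
    text: str                         # text information extracted from audio or script
    start_time: float                 # alignment with the timeline, in seconds
    end_time: float
    audio_path: Optional[str] = None  # source audio file, if the material is audio

@dataclass
class DubbingMaterial:
    """All lines recognized from one imported dubbing file or script."""
    lines: List[DubbingLine] = field(default_factory=list)

    def lines_for(self, character_id: str) -> List[DubbingLine]:
        # Group the material per virtual object, as step 101 requires.
        return [ln for ln in self.lines if ln.character_id == character_id]
```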
In practice, if the text information contains too many characters, the mouth shapes may jitter, because not enough display time can be allocated to the mouth shape image of each character; in short, the transitions between mouth shape images are too fast and look visually shaky. For example, the dubbing speech rate may be too fast because the character's emotion is agitated or the character is designed to speak quickly, which easily puts the mouth shape images out of sync with the audio.
In this case, the duration of the mouth shape images needs to be optimized so that each mouth movement gets sufficient display time, for example by displaying only the mouth shape images corresponding to the keywords in the text information, thereby reducing the total duration required. In the embodiment of the invention, optionally, after the text information is obtained in 101, the dubbing speech rate corresponding to the dubbing material is also obtained. If the dubbing speech rate is greater than a preset threshold, keywords are selected from the text information extracted from the audio data or the text data, and the selected keywords are taken as the text information. The out-of-sync problem is thus solved through keyword extraction, making the mouth shape images smoother. For example, if the text information is "do this wholeheartedly" and the sentence is dubbed too fast, the keywords in it, such as "wholeheartedly" and "do", are extracted.
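Continuing the sketch above, the keyword fallback for fast dubbing could look as follows; the rate threshold and the way keywords are supplied (here, a pre-selected list) are assumptions for illustration.

```python
def effective_text(line: DubbingLine, keywords: list[str],
                   rate_threshold: float = 5.0) -> str:
    """Return the text actually used for mouth shape generation.

    If the dubbing speech rate (characters per second) exceeds the
    threshold, keep only the pre-selected keywords so that each mouth
    shape still receives enough display time.
    """
    duration = line.end_time - line.start_time
    rate = len(line.text) / duration if duration > 0 else float("inf")
    if rate <= rate_threshold:
        return line.text
    # Keep keyword characters only, preserving their order in the sentence.
    kept = [ch for ch in line.text if any(ch in kw for kw in keywords)]
    return "".join(kept) or line.text
```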
Further, after the text information is obtained, in 102 the text information corresponding to each virtual object is parsed to obtain the pinyin phonemes for the pronunciation of each character in it. Specifically, each character contained in the text information is identified, including but not limited to Chinese characters, digits and foreign-language words. The following mainly takes Chinese characters as the example: after a Chinese character is identified, the pinyin corresponding to it is looked up to obtain the pinyin phonemes for that character. If several readings are found, as with polyphones, the appropriate pinyin is selected according to the character's context. In practical applications, pinyin phoneme lists are also preconfigured for digits and foreign-language words, so that the phonemes corresponding to digits or foreign words in the text information can be looked up directly from those lists. Because the pinyin phonemes forming each character's pronunciation conform to the basic rules of Chinese pinyin, mouth shapes generated from them are closer to the actual mouth shapes of Chinese pronunciation. In the embodiment of the invention, following the basic rules of Chinese pinyin, the pinyin phonemes comprise initial phonemes and/or final phonemes. For example, the pinyin "ai" comprises the two final phonemes a and i, and the pinyin "yi" comprises the initial phoneme y and the final phoneme i.
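A minimal sketch of this parsing step is given below. A tiny stand-in table replaces a full pinyin lexicon, and polyphones are naively resolved to their first reading; a production system would resolve them from context as described above.

```python
# A tiny stand-in for a full character-to-pinyin dictionary; a real system
# would use a complete lexicon plus context rules for polyphones.
PINYIN_LOOKUP = {
    "心": ["xin"],
    "了": ["le", "liao"],  # polyphone: the context decides which reading applies
}

# Two-letter initials must be tested before single letters.
INITIALS = ["zh", "ch", "sh",
            "b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
            "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_syllable(syllable: str) -> tuple[str, str]:
    """Split a pinyin syllable into (initial, final); the initial may be ''."""
    for ini in INITIALS:
        if syllable.startswith(ini):
            return ini, syllable[len(ini):]
    return "", syllable  # zero-initial syllables such as 'ai'

def to_phonemes(text: str) -> list[tuple[str, str]]:
    """Map each character to its (initial, final) pinyin phonemes."""
    result = []
    for ch in text:
        readings = PINYIN_LOOKUP.get(ch)
        if not readings:
            continue  # digits/foreign words would use their own phoneme lists
        result.append(split_syllable(readings[0]))  # naive polyphone choice
    return result
```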
103, according to the mapping relation between the pre-set pinyin phonemes and the mouth shapes, the mouth shapes corresponding to the pinyin phonemes are obtained. In the embodiment of the invention, the mouth shape comprises an initial mouth shape and/or a final mouth shape.
To meet the demands of large-scale mouth shape image production, the embodiment of the invention uses basic pinyin phonemes as the building blocks of each character's pronunciation: the pinyin phonemes of each pronunciation are composed from basic pinyin phonemes by substitution or combination, which reduces the number of basic mouth shapes that must be matched to phonemes and further improves animation production efficiency. For example, the pinyin "ai" can be formed from the two basic pinyin phonemes a and i, and its mouth shape obtained by combining the mouth shapes those two phonemes map to, so no separate mouth shape needs to be produced for "ai".
On this basis, the basic pinyin phonemes in the basic pinyin configuration table may optionally include, but are not limited to, one or more of b, d, g, i, zh, z and u. The approximate pinyin phonemes referred to in the invention are pinyin phonemes whose mouth shapes during pronunciation closely resemble that of the corresponding basic pinyin phoneme. For example, the phonemes b ("wave"), p ("slope"), m ("touch") and f ("Buddha") produce highly similar mouth shapes when pronounced, so b is taken as the basic pinyin phoneme and p, m and f as its approximate phonemes, together forming the pinyin phoneme group corresponding to b.
In the embodiment of the invention, the basic pinyin configuration table comprises at least one pinyin phoneme group. A group contains a basic pinyin phoneme and the approximate phonemes whose mouth shapes during pronunciation resemble it. Specifically, the group for b contains b itself and its approximate phonemes p, m and f; the group for d contains d and t, n, l; the group for g contains g and k, h; the group for i contains i and j, q, x, y; the group for zh contains zh and ch, sh, r; the group for z contains z and c, s; and the group for u contains u and w.
In addition, compound finals can be restored from single finals and/or single initials by splitting the combined sound: for example, ai can be obtained by combining the single finals a and i, iou by combining i, o and u, and ing by combining the single final i with the initials n and g. In this way, by generalizing over approximate pronunciations and splitting combined sounds, the basic pinyin configuration table restores 47 initials and finals using only 10 pinyin phonemes.
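Expressed as data, the basic pinyin configuration table could be a simple mapping like the following sketch. The patent names only seven basic phonemes explicitly, so the extra base vowels here (a, o, e), added to round the set out to the ten phonemes mentioned above, are assumptions.

```python
# Basic pinyin configuration table: each group pairs a basic pinyin phoneme
# with the approximate phonemes whose mouth shapes match it during pronunciation.
BASE_PINYIN_GROUPS = {
    "b":  ["b", "p", "m", "f"],
    "d":  ["d", "t", "n", "l"],
    "g":  ["g", "k", "h"],
    "i":  ["i", "j", "q", "x", "y"],
    "zh": ["zh", "ch", "sh", "r"],
    "z":  ["z", "c", "s"],
    "u":  ["u", "w"],
    "a":  ["a"],   # assumed base vowel, not named in the patent
    "o":  ["o"],   # assumed
    "e":  ["e"],   # assumed
}

# Reverse index: any phoneme -> the basic phoneme of its group.
PHONEME_TO_BASE = {p: base
                   for base, group in BASE_PINYIN_GROUPS.items()
                   for p in group}
```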
In an alternative embodiment, 103, according to a preset mapping relationship between pinyin phonemes and mouth shapes, the mouth shapes corresponding to the pinyin phonemes are obtained, which may be implemented as:
judging whether the pinyin phonemes belong to a combination sound; acquiring a basic pinyin phoneme consistent with or similar to the pinyin phoneme based on the judging result; and acquiring a basic mouth shape corresponding to the basic pinyin phonemes based on the mapping relation between the basic pinyin phonemes and the basic mouth shape, and taking the basic mouth shape corresponding to the basic pinyin phonemes as the mouth shape corresponding to the pinyin phonemes.
Specifically, in the above steps, it must be determined whether the pinyin phoneme currently being processed is a combined sound. If it is, it must be split further so that a more accurate mouth shape can be obtained later; if not, the mapped mouth shape can be obtained directly. Through these steps, mouth shape combination conforming to Chinese pinyin rules is realized, so that the mouth shape mapping process better matches actual pronunciation and the mouth shapes are more natural and smooth.
In the above step, if the pinyin phonemes belong to a combination sound, the basic pinyin phonemes that are consistent with or approximate to the pinyin phonemes are obtained based on the determination result, which may be implemented as: splitting the pinyin phonemes to obtain a plurality of combined phonemes forming corresponding pronunciation of the pinyin phonemes; searching target phonemes matched with each of the plurality of combined phonemes from a pinyin phoneme set contained in the basic pinyin configuration table, wherein the pinyin phoneme set comprises a basic pinyin phoneme and an approximate pinyin phoneme similar to the basic pinyin phoneme in mouth shape during pronunciation; and for each matched target phoneme of the plurality of combined phonemes, acquiring a basic pinyin phoneme corresponding to the pinyin phoneme group where each target phoneme is located, and combining each basic pinyin phoneme into a basic pinyin phoneme corresponding to the pinyin phoneme.
For example, assume the pinyin phoneme is ling. In this case the phoneme must be split into the 4 component phonemes that form its pronunciation, namely l, i, n and g. Then, from the pinyin phoneme groups contained in the basic pinyin configuration table, the target phonemes matching each of the 4 component phonemes are found. Taking the basic pinyin configuration table described above as an example, the basic pinyin phonemes of the groups in which l, i, n and g lie are, respectively, d (d is the basic phoneme of the group containing l), i (i is itself a basic phoneme), d (d is the basic phoneme of the group containing n) and g (g is itself a basic phoneme); combined, d, i, d, g form the basic pinyin phonemes corresponding to ling.
In the above step, if the pinyin phonemes do not belong to the combination sounds, the basic pinyin phonemes consistent with or similar to the pinyin phonemes are obtained based on the determination result, which may be implemented as: searching a target phoneme matched with the pinyin phonemes from the pinyin phoneme group contained in the basic pinyin configuration table; and taking the basic pinyin phonemes corresponding to the pinyin phoneme group in which the target phonemes are positioned as the basic pinyin phonemes corresponding to the pinyin phonemes.
For example, assuming that the pinyin phone is a, in this case, the target phone matching a may be directly found from the pinyin phone group included in the basic pinyin configuration table; and taking a corresponding to the Pinyin phoneme group (namely a basic Pinyin phoneme) of the target phoneme as a basic Pinyin phoneme corresponding to the Pinyin phoneme.
In either of the two modes, the basic pinyin phonemes corresponding to a pinyin phoneme can be obtained accurately, so that any pinyin can be composed from a small set of basic pinyin phonemes; this reduces the number of mouth shapes that must be configured for the mouth shape image and further improves generation efficiency.
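Both modes can be sketched in one small function over the PHONEME_TO_BASE index defined above; the greedy two-letter-first split is an assumed implementation detail.

```python
def to_base_phonemes(phoneme: str) -> list[str]:
    """Reduce a pinyin phoneme to basic phonemes via the configuration table.

    Non-combined phonemes are looked up directly; combined sounds are first
    split into single finals / initials (preferring two-letter pieces such
    as zh), and each piece is looked up in turn.
    """
    if phoneme in PHONEME_TO_BASE:          # not a combined sound
        return [PHONEME_TO_BASE[phoneme]]
    pieces, i = [], 0
    while i < len(phoneme):                 # split the combined sound
        two = phoneme[i:i + 2]
        if two in PHONEME_TO_BASE:
            pieces.append(two)
            i += 2
        else:
            pieces.append(phoneme[i])
            i += 1
    return [PHONEME_TO_BASE[p] for p in pieces if p in PHONEME_TO_BASE]

# The example from the description: ling -> l, i, n, g -> d, i, d, g
assert to_base_phonemes("ling") == ["d", "i", "d", "g"]
```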
In practical applications, a Chinese character usually corresponds not to a single mouth shape but to several, so the embodiment of the invention must combine mouth shapes to obtain the mouth shape image. For this purpose, special cases in the text information, such as reduplicated words and stop sounds, also require corresponding processing. A reduplicated word contains at least two adjacent pinyin phonemes with the same pronunciation (for example, reduplicated expressions meaning "happily" or "comfortably"), across which the mouth shape changes very little. Besides reduplicated words, adjacent pinyin phonemes containing the same final are also similar in mouth shape: for example, if the phonemes of several successive characters all contain the final i, the mouth shape hardly changes across them, and likewise when adjacent finals share a similar degree of opening, such as a. Reduplicated words do have exceptions, such as "mama" and "baba": the lips must close between the syllables, so these are not continuous sounds with small mouth shape changes. Plosives are further divided into aspirated and unaspirated sounds, collectively referred to as stops, so stop detection is also required.
For the above reasons, in the embodiment of the invention, it may optionally also be detected whether pinyin phonemes adjacent in position meet a set condition. Optionally, the set condition may be that the adjacent pinyin phonemes form a reduplicated word, or that they contain the same final phonemes or finals with similar mouth openings. If the adjacent pinyin phonemes meet the set condition, it is judged whether their pronunciation type is a plosive or a stop; if it is neither, the mouth shapes corresponding to the adjacent phonemes are joined into a continuous mouth shape. In this way the pronunciation mouth shapes of reduplicated words better match actual mouth movement, and the mouth shape image looks more natural.
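A rough sketch of this check, reusing split_syllable from the earlier sketch, might read as follows. The exact membership of the stop-sound set is an assumption; m is included so that a case like "mama" is treated as the exception described above.

```python
# Assumed set of sounds that force the mouth to close between syllables
# (plosives and stops, plus the bilabial m for the 'mama' exception).
PLOSIVE_OR_STOP = {"b", "p", "m", "d", "t", "g", "k"}

def can_join_mouth_shapes(syll_a: str, syll_b: str) -> bool:
    """Check whether two adjacent characters can share a continuous mouth shape.

    Set condition: a reduplicated word, or identical finals. Sounds that
    force a closure never merge, matching the exceptions in the description.
    """
    ini_a, fin_a = split_syllable(syll_a)
    ini_b, fin_b = split_syllable(syll_b)
    condition_met = (syll_a == syll_b) or (fin_a == fin_b)
    if not condition_met:
        return False
    return ini_a not in PLOSIVE_OR_STOP and ini_b not in PLOSIVE_OR_STOP
```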
In 104, the mouth shapes are mapped into the skeletal model of each of the plurality of virtual objects through a deformer to generate mouth shape images matching those virtual objects. Specifically, a deformer corresponding to the mouth shape maps the mouth shape into the skeletal model of each virtual object to obtain the skeletal model parameters, such as vertex parameters, corresponding to each virtual object; the mouth shape image is then computed from those parameters. For example, the mouth shape image matching the dubbing material may be the one shown in fig. 2. Optionally, the virtual object can be bound to the skeletal model, so that, by configuring skeletal model parameters or different skeletal models, the mouth shape animation of the virtual object better matches its designed action style.
In the embodiment of the invention, the deformer holds the mapping relation between pronunciation mouth shapes and the skeletal model. Matching a deformer to a virtual object can be a visual-style match: the mouth shape image style of the virtual object matches the mapping set in the deformer, so the deformer produces mouth shape images unified with the object's style. That is, by setting the parameters of the mapping between pronunciation mouth shapes and the skeletal model, a deformer can take on the mouth shape image styles required by different virtual objects and thus be reused across them. Mouth shape image styles are, for example, cartoon, hand-drawn, ink-wash or pixel styles.
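As a sketch of the deformer idea, the mapping from a base mouth shape to skeletal parameters can be held in a small reusable object; the bone parameter names (jaw_open, lip_press) are illustrative assumptions, not names from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class Deformer:
    """Maps basic mouth shapes onto one character's skeletal model.

    shape_to_bones holds the preset mapping between a pronunciation mouth
    shape and bone parameters in this character's visual style.
    """
    shape_to_bones: dict = field(default_factory=dict)

    def apply(self, base_shape: str, skeleton_pose: dict,
              amplitude: float = 1.0) -> dict:
        # Write scaled bone parameters into the character's current pose.
        for bone, value in self.shape_to_bones.get(base_shape, {}).items():
            skeleton_pose[bone] = value * amplitude
        return skeleton_pose

# Usage: a cartoon-style deformer, reusable across characters of that style.
cartoon = Deformer({"b": {"jaw_open": 0.1, "lip_press": 0.9},
                    "a": {"jaw_open": 0.8, "lip_press": 0.0}})
pose = cartoon.apply("a", {}, amplitude=1.0)
```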
In practical applications, pronunciation often pauses or stops at the end of a sentence or in a gap in speech, and most speakers habitually lengthen the pronunciation of the last word before a pause, or prolong its articulation. Therefore, the embodiment of the invention also expresses this sense of pause in the mouth shape image.
For this case, optionally, pause marks in the dubbing material are identified; the text information is segmented according to the pause marks to obtain the character immediately preceding each pause mark; and delayed-closing processing is applied to the mouth shape corresponding to that character, so that the mouth shape image matches the pause rhythm of the dubbing material. Delayed closing can be achieved, for example, by extending the display time of the last word's mouth shape image: for "are you happy today?", the preset display time of the final question particle can be extended to simulate the sense of pause.
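A minimal sketch of the delayed-closing processing over mouth shape keyframes, with an assumed hold length:

```python
def apply_pause_delays(keyframes, pause_positions, hold_frames=6):
    """Hold the mouth shape of the character just before each pause.

    keyframes: list of (char_index, mouth_shape) pairs in playback order.
    pause_positions: char indices immediately preceding a pause mark.
    The shape at such an index is repeated for hold_frames extra frames,
    producing the delayed-closing effect described above.
    """
    out = []
    for char_index, shape in keyframes:
        out.append((char_index, shape))
        if char_index in pause_positions:
            out.extend([(char_index, shape)] * hold_frames)
    return out
```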
It should be noted that, in practical applications, the specific implementation of the delayed-closing processing may also be adjusted according to the language style of the virtual object. For example, if the language style set for the virtual object calls for frequent pauses, pause marks can be inserted into the text information dynamically according to the object's current state; for instance, if the object's current emotion is agitated, pause marks are inserted at a preset multiple to simulate the corresponding effect. For another example, if the language style set for the virtual object is impatient, the display time of the last word before each pause can be shortened, simulating the corresponding mouth shapes to match a faster speech rate.
Besides the sense of pause, the embodiment of the invention can also process the characters in the text information that need emphasis, to simulate the mouth shapes of stressed speech. For example, the characters to be emphasized in the dubbing material are identified and marked, for example as stressed; the variation amplitude of the corresponding mouth shape movements is then adjusted according to the marks in the text information, for example enlarged for characters marked as stressed.
In practical applications, producing a larger volume usually requires a mouth shape with a larger variation amplitude, so the embodiment of the invention can also extract the volume from the dubbing material to adjust the mouth shape variation amplitude in the mouth shape image. Specifically, the volume information of the dubbing material is obtained, and the variation amplitude of the virtual object's mouth shape in the image adjusted according to it: the larger the volume, the larger the variation amplitude.
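A sketch of the volume-to-amplitude adjustment; the dB endpoints and the output range are assumptions.

```python
def amplitude_from_volume(volume_db: float,
                          quiet_db: float = -40.0,
                          loud_db: float = 0.0) -> float:
    """Map dubbing volume to a mouth shape variation amplitude.

    Linear mapping into [0.3, 1.0]: the louder the audio, the wider the
    mouth movement, as described above.
    """
    t = (volume_db - quiet_db) / (loud_db - quiet_db)
    t = max(0.0, min(1.0, t))
    return 0.3 + 0.7 * t
```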
In the above or following embodiments, optionally, a specific visual element is associated with the virtual object, including but not limited to facial expressions and/or actions bound to the skeletal model, thereby establishing an association relation between the virtual object and that element. Specifically, the association includes, but is not limited to, one or more of the following: an association between the virtual object itself and the visual element, an association between a preset sentence of the virtual object and the visual element, and an association between a preset scenario in the dubbing material and the visual element. A specific visual element such as a facial expression and/or action can be realized by setting skeletal model parameters in the skeletal model. Taking a game development project as an example, for each game character in the project, associations can be established with the facial expressions bound to that character's skeletal model, yielding a facial expression list associated with the character.
It will be appreciated that the facial expressions and/or actions bound to the skeletal model can be dedicated to particular virtual objects. Beyond setting exclusive expressions and/or actions per object, if game characters meet a set condition, for example belonging to the same series or the same scenario branch, the expressions and/or actions bound to their skeletal models can be reused between them, which facilitates migration of expressions and actions across characters and further improves production efficiency. In practice, the same facial expression and/or action, once bound to the skeletal models of different virtual objects (e.g., game characters), may be associated with different sentences in the different objects' dubbing material. For example, a raised-eyebrow expression associated with preset sentences of several virtual objects may be tied to a different sentence per object, say to "Really?" in the dubbing material of virtual object a and to "No way?" in that of another object. Of course, the same expression and/or action can also be associated with the same sentence or the same scenario across multiple virtual objects: a raised-eyebrow expression can be associated with "Really?" for several objects, or with the scenario of encountering monster 1 in checkpoint 1, meaning that whenever any of those virtual objects is detected encountering monster 1 in checkpoint 1, the display of that expression is triggered.
In practical applications, the specific facial expression is, for example, a signature expression set for the game character, or an expression obtained by adjusting the character's attribute parameters; it may of course also be a player's personalized setting for the character, for example an expression obtained through a face-customization ("face pinching") operation. Expressions obtained by adjusting the character's attribute parameters include, but are not limited to, raised eyebrows, smiles, blinks and puckers. Similarly, the specific action is, for example, a signature action set for the character, or an action obtained by adjusting its attribute parameters; it may also be a player's personalized setting, such as a player-specific action derived from interaction with the player or from parsing player preference data.
Optionally, semantic recognition is performed on the dubbing material, and whether the dubbing material meets a preset condition is judged from the recognition result. If it does, a specific visual element associated with the virtual object, comprising a facial expression and/or action, is added to the mouth shape image. Specifically, the facial expression and/or action associated with the virtual object can be added to the mouth shape image synchronized with the dubbing material, based on the association relation between the virtual object and that expression and/or action. In practice, the preset conditions include, but are not limited to: the dubbing material contains a preset sentence; the dubbing material belongs to a preset virtual object; the dubbing material belongs to a preset game development project or game character series. Through these steps, the mouth shape image of the virtual object can be personalized, and more visual elements related to the object's design or attribute parameters added to it, further improving the visual effect and production efficiency of virtual object mouth shape images.
For example, assume the preset condition is that the dubbing material belongs to a preset virtual object and contains a preset sentence, and assume the preset sentence "why" of virtual object a is associated with a raised eyebrow (a facial expression). On these assumptions, it is first detected whether the dubbing material belongs to the preset virtual object a and whether it contains "why" (the preset sentence). If the dubbing material is detected to belong to virtual object a and to contain "why", then, based on the association between the preset sentence and the raised eyebrow, the raised-eyebrow expression associated with virtual object a is added to the mouth shape image synchronized with the preset sentence "why" in the dubbing material.
Alternatively, the steps may be: suppose virtual object b is associated with a smile (a facial expression). It is then detected whether the dubbing material belongs to the preset virtual object b; if so, based on the association between virtual object b and the smile, the smile expression is added to the mouth shape image synchronized with the dubbing material. The smile expression may be added at any position in virtual object b's mouth shape image, for example at the end or the start of each line of dialogue.
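The association relations of these examples can be sketched as a small rule table plus a lookup; the rule fields and element names are illustrative assumptions.

```python
# Illustrative association table: each rule ties a visual element to a
# virtual object, optionally scoped to a preset sentence or scenario.
VISUAL_ELEMENT_RULES = [
    {"character": "a", "sentence": "why", "scenario": None,
     "element": "raise_eyebrow"},
    {"character": "b", "sentence": None, "scenario": None,
     "element": "smile"},  # no sentence scope: applies to all of b's lines
]

def visual_elements_for(character: str, sentence: str,
                        scenario: str | None = None) -> list[str]:
    """Collect the visual elements triggered for this character and line."""
    elements = []
    for rule in VISUAL_ELEMENT_RULES:
        if rule["character"] != character:
            continue
        if rule["sentence"] is not None and rule["sentence"] not in sentence:
            continue
        if rule["scenario"] is not None and rule["scenario"] != scenario:
            continue
        elements.append(rule["element"])
    return elements
```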
In the above or following embodiments, optionally, after step 104, in response to an instruction to export the mouth shape images, the mouth shape images of the multiple virtual characters are exported according to an animation export scheme corresponding to the project in which the multiple virtual objects are located, so that the images can subsequently be applied to a specific scene. The mouth shape images of the multiple virtual characters can then be applied or adjusted by invoking that animation export scheme, which improves the efficiency of editing those images.
Here, the plurality of virtual objects are a plurality of game characters in the project. Specifically, assuming the mouth shape image is a mouth shape animation, an export scheme suited to that animation can be selected from the animation export schemes built into the application. In this embodiment, export as an AnimSequence-format file supported by Unreal Engine (UE) is taken as the default animation export scheme. Optionally, the animation export schemes supported or disabled on the current device can be viewed in the animation export mode list field, and a detailed explanation of each scheme can be viewed in the mode description field.
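A minimal, non-limiting sketch of this scheme selection follows; the registry and its fields are assumptions, and only the UE AnimSequence default is taken from the description above:

    # Illustrative sketch; the registry shape is an assumption.
    EXPORT_SCHEMES = {
        "UE_AnimSequence": {"engine": "Unreal Engine", "enabled": True},
        # further built-in animation export schemes would be registered here
    }
    DEFAULT_SCHEME = "UE_AnimSequence"  # default per the description above

    def pick_export_scheme(project_scheme=None):
        """Return the project's bound export scheme, falling back to the
        default; schemes disabled on the current device are rejected."""
        name = project_scheme or DEFAULT_SCHEME
        scheme = EXPORT_SCHEMES.get(name)
        if scheme is None or not scheme["enabled"]:
            raise ValueError(f"animation export scheme {name!r} is unavailable")
        return name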
In the execution of the mouth shape image generation method shown in Fig. 1, the text information of each of the multiple virtual objects in the dubbing material is converted into pinyin phonemes conforming to pinyin rules, and those phonemes are converted into the mouth shape images of the multiple virtual objects. Through this conversion, mouth shape images conforming to Chinese pronunciation rules are generated and the conversion from dubbing material to mouth shape images is fully automated, effectively avoiding the poor production efficiency of manually produced mouth shape images in the related art, greatly improving production efficiency, and meeting the demand for mass production of mouth shape images in practical applications. In addition, by converting the pinyin phonemes that form each character's pronunciation into the virtual object's mouth shapes, those mouth shapes more closely resemble actual mouth shapes during Chinese pronunciation and appear more natural and fluid, greatly improving the synchronization and accuracy between mouth shape image and dubbing material and optimizing the audiovisual effect.
A mouth shape image generating device according to one or more embodiments of the present invention is described in detail below. Those skilled in the art will appreciate that each of these mouth shape image generating devices can be constructed from commercially available hardware components configured according to the steps taught in the present solution.
Fig. 3 is a schematic structural diagram of a mouth shape image generating device according to an embodiment of the present invention. The device is applied to a server and, as shown in Fig. 3, comprises: an identifying module 11, a parsing module 12, an obtaining module 13, and a generating module 14. Alternatively, the device is applied to an application program for loading a virtual object.
The identifying module 11 is configured to identify, for a dubbing material to be processed, a plurality of virtual objects and text information corresponding to each of the plurality of virtual objects, where the dubbing material includes audio data and/or text data made for the plurality of virtual objects;
the parsing module 12 is configured to parse the text information corresponding to each virtual object to obtain pinyin phonemes for forming pronunciation of each text in the text information, where the pinyin phonemes include initial phonemes and/or final phonemes;
the obtaining module 13 is configured to obtain a mouth shape corresponding to the pinyin phonemes according to a mapping relationship between the pinyin phonemes and the mouth shape, where the mouth shape includes an initial consonant mouth shape and/or a final mouth shape;
The generating module 14 is configured to map, by a deformer, the mouth shape into the skeletal model of each of the plurality of virtual objects, to generate mouth shape images matching the plurality of virtual objects.
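For orientation only, the four modules above can be read as a linear pipeline. The sketch below passes the module behaviors in as plain functions, because the disclosure does not fix their interfaces; every name here is an assumption:

    # Orientation-only sketch; the four callables stand in for modules 11-14.
    def generate_mouth_images(material, identify, parse, lookup, map_to_skeleton):
        """Wire modules 11-14 of Fig. 3 as a linear pipeline."""
        results = {}
        for obj, text in identify(material).items():      # identifying module 11
            phonemes = parse(text)                        # parsing module 12
            shapes = [lookup(p) for p in phonemes]        # obtaining module 13
            results[obj] = map_to_skeleton(obj, shapes)   # generating module 14
        return results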
Optionally, for the dubbing material to be processed, the identifying module 11 is specifically configured to, when identifying a plurality of virtual objects included in the dubbing material and text information corresponding to each of the plurality of virtual objects:
receiving audio data input by a user; identifying a plurality of virtual objects from input audio data through cloud computing, and extracting text information corresponding to each of the plurality of virtual objects from the audio data; or receiving text data edited and input by a user and corresponding to the plurality of virtual objects, and extracting text information corresponding to the plurality of virtual objects from the text data.
Optionally, the identifying module 11 is further configured to:
acquiring the dubbing speech rate corresponding to the dubbing material; and if the dubbing speech rate is greater than a preset speech rate threshold, selecting keywords from the audio data or the text data and using the selected keywords as the text information.
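A one-function sketch of this speech-rate branch, with the keyword extractor supplied by the caller since the disclosure does not specify one:

    # Sketch under assumptions: extract_keywords is a caller-supplied placeholder.
    def select_drive_text(text, speech_rate, threshold, extract_keywords):
        """Drive mouth shapes from keywords only when dubbing is faster
        than the preset speech rate threshold."""
        if speech_rate > threshold:
            return extract_keywords(text)
        return text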
Optionally, when obtaining the mouth shape corresponding to the pinyin phonemes according to the preset mapping relationship between pinyin phonemes and mouth shapes, the obtaining module 13 is specifically configured to:
judging whether the pinyin phonemes belong to a combination sound; acquiring basic pinyin phonemes consistent with or similar to the pinyin phonemes based on the judging result; and acquiring the basic mouth shape corresponding to the basic pinyin phonemes based on the mapping relationship between basic pinyin phonemes and basic mouth shapes, and taking that basic mouth shape as the mouth shape corresponding to the pinyin phonemes.
Optionally, when obtaining the basic pinyin phonemes consistent with or similar to the pinyin phonemes based on the judging result, the obtaining module 13 is specifically configured to:
if the pinyin phonemes belong to a combination sound, split the pinyin phonemes to obtain a plurality of combined phonemes that form the pronunciation of the pinyin phonemes; search, from the pinyin phoneme groups contained in the basic pinyin configuration table, a target phoneme matched with each of the plurality of combined phonemes, wherein each pinyin phoneme group comprises a basic pinyin phoneme and approximate pinyin phonemes whose mouth shapes during pronunciation are similar to that basic pinyin phoneme; and, for the target phonemes matched with the plurality of combined phonemes, acquire the basic pinyin phoneme corresponding to the pinyin phoneme group in which each target phoneme is located, and combine these basic pinyin phonemes into the basic pinyin phonemes corresponding to the pinyin phonemes.
Optionally, when obtaining the basic pinyin phonemes consistent with or similar to the pinyin phonemes based on the judging result, the obtaining module 13 is specifically configured to:
if the pinyin phonemes do not belong to a combination sound, search, from the pinyin phoneme groups contained in the basic pinyin configuration table, a target phoneme matched with the pinyin phonemes, wherein each pinyin phoneme group comprises a basic pinyin phoneme and approximate pinyin phonemes whose mouth shapes during pronunciation are similar to that basic pinyin phoneme; and take the basic pinyin phoneme corresponding to the pinyin phoneme group in which the target phoneme is located as the basic pinyin phoneme corresponding to the pinyin phonemes.
Optionally, the basic pinyin phonemes in the basic pinyin configuration table include at least one of b, d, g, i, zh, z, and u, with the following pinyin phoneme groups of approximate pinyin phonemes (phonemes whose mouth shapes are similar during pronunciation): b: b, p, m, f; d: d, t, n, l; g: g, k, h; i: j, q, x, y; zh: zh, ch, sh, r; z: z, c, s; u: w.
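The configuration table enumerated above transcribes directly into code. The lookup below covers both branches (combination sound or not); the splitting of a combination sound is left to a caller-supplied function, since the disclosure does not fix the splitting rule:

    # Basic pinyin configuration table, transcribed from the description above.
    # Key: basic pinyin phoneme; value: its group of approximate pinyin phonemes
    # (phonemes whose mouth shapes are similar during pronunciation).
    BASE_PINYIN_TABLE = {
        "b":  ["b", "p", "m", "f"],
        "d":  ["d", "t", "n", "l"],
        "g":  ["g", "k", "h"],
        "i":  ["j", "q", "x", "y"],
        "zh": ["zh", "ch", "sh", "r"],
        "z":  ["z", "c", "s"],
        "u":  ["w"],
    }

    def to_base_phoneme(phoneme):
        """Return the basic pinyin phoneme whose group contains this phoneme."""
        for base, group in BASE_PINYIN_TABLE.items():
            if phoneme == base or phoneme in group:
                return base
        return phoneme  # no matching group: keep the phoneme as-is (an assumption)

    def to_base_phonemes(phoneme, is_combination, split):
        """Both branches; `split` is a hypothetical splitting function,
        e.g. mapping 'zhuang' to ['zh', 'u', 'ang']."""
        if is_combination:
            return [to_base_phoneme(p) for p in split(phoneme)]
        return [to_base_phoneme(phoneme)]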
Optionally, the device further comprises a detection module configured to: detect whether positionally adjacent pinyin phonemes meet a set condition; if they do, judge whether the pronunciation types of those adjacent pinyin phonemes belong to plosives, and judge whether they belong to stops; and if the pronunciation types of the positionally adjacent pinyin phonemes belong to neither plosives nor stops, align the consecutive mouth shapes corresponding to those adjacent pinyin phonemes.
Optionally, the set condition includes at least one of: the positionally adjacent pinyin phonemes form a reduplicated word; or the positionally adjacent pinyin phonemes contain the same final phoneme, or final phonemes with a similar mouth-opening type during pronunciation.
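Taken together, the detection module's logic reduces to a predicate. This sketch is illustrative only; the plosive/stop set and the opening-type field are assumptions the disclosure does not enumerate:

    # Sketch; the plosive/stop set and the record fields are assumptions.
    PLOSIVES_AND_STOPS = {"b", "p", "d", "t", "g", "k"}  # illustrative only

    def should_align(prev, curr):
        """prev/curr: {"char": str, "initial": str, "final": str, "opening": str}.
        True when a set condition holds and neither adjacent phoneme's
        pronunciation type is a plosive or a stop."""
        reduplicated = prev["char"] == curr["char"]        # reduplicated word
        same_final = prev["final"] == curr["final"]
        similar_opening = prev["opening"] == curr["opening"]
        if not (reduplicated or same_final or similar_opening):
            return False
        if prev["initial"] in PLOSIVES_AND_STOPS or curr["initial"] in PLOSIVES_AND_STOPS:
            return False
        return True  # the caller then connects the two mouth shapes smoothly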
Optionally, the device further comprises an export module configured to, in response to an export instruction for the mouth shape images, export the mouth shape images according to the animation export scheme corresponding to the project in which the plurality of virtual objects are located, where the plurality of virtual objects are a plurality of game characters in that project.
Optionally, the device further comprises a pause module configured to: identify pause marks in the dubbing material; segment the text information according to the pause marks to obtain the character immediately preceding each pause mark; and apply delayed-closing processing to the mouth shape corresponding to that preceding character's pinyin phonemes, so that the mouth shape image matches the pause rhythm of the dubbing material.
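A sketch of the delayed-closing step; the pause marks and the delay length are illustrative choices, not values fixed by the disclosure:

    # Sketch; pause marks and delay length are assumptions.
    PAUSE_MARKS = {",", "。", "、", "…", ",", "."}

    def apply_pause_delays(text, close_frames, extra_frames=3):
        """close_frames maps a character index to the frame at which its
        mouth shape closes; delay the character preceding each pause mark."""
        for i, ch in enumerate(text):
            if ch in PAUSE_MARKS and (i - 1) in close_frames:
                close_frames[i - 1] += extra_frames
        return close_frames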
Optionally, the device further comprises an adjustment module configured to: acquire the volume information of the dubbing material; and adjust the mouth shape variation amplitude of the virtual object in the mouth shape image according to the volume information, where the larger the volume, the larger the virtual object's mouth shape variation amplitude.
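Since the disclosure only requires that a larger volume yield a larger mouth shape variation amplitude, any monotone map qualifies; this sketch uses a clamped linear map as one possible choice:

    # Sketch: a clamped linear volume-to-amplitude map (the shape of the map
    # is an assumption; only its monotonicity comes from the description).
    def amplitude_scale(volume_db, min_db=-40.0, max_db=0.0,
                        min_scale=0.4, max_scale=1.0):
        """Monotonically map a frame's volume to a mouth-opening scale."""
        t = (volume_db - min_db) / (max_db - min_db)
        t = max(0.0, min(1.0, t))
        return min_scale + t * (max_scale - min_scale)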
Optionally, the device further comprises a semantic recognition module configured to: perform semantic recognition on the dubbing material; judge, based on the recognition result, whether the dubbing material meets a preset condition; and if it does, add a specific visual element associated with the virtual object to the mouth shape image, the specific visual element comprising a facial expression and/or an action bound to the skeletal model.
Optionally, the association relationship between the virtual object and the specific visual element includes: an association between the virtual object itself and the specific visual element; and/or an association between a preset sentence of the virtual object and the specific visual element; and/or an association between a preset scenario in the dubbing material and the specific visual element.
The mouth shape image generating device shown in Fig. 3 can perform the methods provided in the foregoing embodiments; for the parts of this embodiment not described in detail, reference may be made to the related description of those embodiments, which is not repeated here.
In one possible design, the structure of the mouth shape image generating device shown in Fig. 3 may be implemented as an electronic device.
As shown in fig. 4, the electronic device may include: a processor 21, and a memory 22. Wherein said memory 22 has stored thereon executable code which, when executed by said processor 21, at least enables said processor 21 to implement a mouth shape image generating method as provided in the previous embodiments. The electronic device may further include a communication interface 23 for communicating with other devices or a communication network.
The apparatus embodiments described above are merely illustrative, wherein the various modules illustrated as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The systems, methods and apparatus of embodiments of the present invention may be implemented as pure software (e.g., a software program written in Java), as pure hardware (e.g., a special purpose ASIC chip or FPGA chip), or as a system that combines software and hardware (e.g., a firmware system with fixed code or a system with general purpose memory and a processor), as desired.
Another aspect of the invention is a computer readable medium having stored thereon computer readable instructions which, when executed, may implement the method of generating a mouth shape image according to embodiments of the invention.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The scope of the claimed subject matter is limited only by the following claims.

Claims (10)

1. A method of generating a mouth shape image, comprising:
identifying, for dubbing material to be processed, a plurality of virtual objects contained in the dubbing material and text information corresponding to each of the plurality of virtual objects, wherein the dubbing material comprises audio data and/or text data made for the plurality of virtual objects;
Analyzing the text information corresponding to each virtual object to obtain pinyin phonemes for forming pronunciation of each text in the text information, wherein the pinyin phonemes comprise initial consonant phonemes and/or final phonemes;
obtaining a mouth shape corresponding to the pinyin phonemes according to a preset mapping relation between the pinyin phonemes and the mouth shape, wherein the mouth shape comprises an initial consonant mouth shape and/or a final sound mouth shape;
the mouth shape is mapped into a skeletal model of each of the plurality of virtual objects by a deformer to generate a mouth shape image that matches the plurality of virtual objects.
2. The method according to claim 1, wherein identifying, for the dubbing material to be processed, a plurality of virtual objects contained in the dubbing material and text information corresponding to each of the plurality of virtual objects comprises:
receiving the audio data input by a user, identifying a plurality of virtual objects from the input audio data through cloud computing, and extracting text information corresponding to each of the plurality of virtual objects from the audio data; or
receiving text data edited and input by a user for the plurality of virtual objects, and extracting text information corresponding to each of the plurality of virtual objects from the text data.
3. The method as recited in claim 2, further comprising:
acquiring the dubbing speech rate corresponding to the dubbing material;
and if the dubbing speech rate is greater than a preset speech rate threshold, selecting keywords from the audio data or the text data, and using the selected keywords as the text information.
4. The method of claim 1, wherein obtaining the mouth shape corresponding to the pinyin phonemes according to the preset mapping relationship between the pinyin phonemes and the mouth shape comprises:
judging whether the pinyin phonemes belong to a combination sound or not;
acquiring a basic pinyin phoneme consistent with or similar to the pinyin phoneme based on a judging result;
and acquiring a basic mouth shape corresponding to the basic pinyin phonemes based on the mapping relation between the basic pinyin phonemes and the basic mouth shape, and taking the basic mouth shape corresponding to the basic pinyin phonemes as the mouth shape corresponding to the pinyin phonemes.
5. The method of claim 4, wherein obtaining a basic pinyin phoneme consistent with or similar to the pinyin phonemes based on the judging result comprises:
if the pinyin phonemes belong to a combination sound, splitting the pinyin phonemes to obtain a plurality of combined phonemes that form the pronunciation of the pinyin phonemes;
searching, from the pinyin phoneme groups contained in a basic pinyin configuration table, a target phoneme matched with each of the plurality of combined phonemes, wherein each pinyin phoneme group comprises a basic pinyin phoneme and approximate pinyin phonemes whose mouth shapes during pronunciation are similar to that basic pinyin phoneme;
and, for the target phonemes matched with the plurality of combined phonemes, acquiring the basic pinyin phoneme corresponding to the pinyin phoneme group in which each target phoneme is located, and combining these basic pinyin phonemes into the basic pinyin phonemes corresponding to the pinyin phonemes.
6. The method of claim 4, wherein obtaining a basic pinyin phoneme consistent with or similar to the pinyin phonemes based on the judging result comprises:
if the pinyin phonemes do not belong to a combination sound, searching, from the pinyin phoneme groups contained in a basic pinyin configuration table, a target phoneme matched with the pinyin phonemes, wherein each pinyin phoneme group comprises a basic pinyin phoneme and approximate pinyin phonemes whose mouth shapes during pronunciation are similar to that basic pinyin phoneme;
and taking the basic pinyin phonemes corresponding to the pinyin phoneme group in which the target phonemes are positioned as the basic pinyin phonemes corresponding to the pinyin phonemes.
7. The method of claim 4, wherein the basic pinyin phonemes in the basic pinyin configuration table include at least one of b, d, g, i, zh, z, and u;
wherein the pinyin phoneme group corresponding to b includes the approximate pinyin phonemes b, p, m, f; the group corresponding to d includes d, t, n, l; the group corresponding to g includes g, k, h; the group corresponding to i includes j, q, x, y; the group corresponding to zh includes zh, ch, sh, r; the group corresponding to z includes z, c, s; and the group corresponding to u includes w.
8. The method as recited in claim 1, further comprising:
detecting whether positionally adjacent pinyin phonemes meet a set condition;
if the adjacent pinyin phonemes meet the set condition, judging whether the pronunciation types of the positionally adjacent pinyin phonemes belong to plosives; and
judging whether the pronunciation types of the positionally adjacent pinyin phonemes belong to stops;
and if the pronunciation types of the positionally adjacent pinyin phonemes belong to neither plosives nor stops, aligning the consecutive mouth shapes corresponding to the positionally adjacent pinyin phonemes.
9. The method as recited in claim 1, further comprising:
carrying out semantic recognition on the dubbing material;
judging, based on the recognition result, whether the dubbing material meets a preset condition;
and if the dubbing material meets the preset condition, adding a specific visual element associated with the virtual object to the mouth shape image, wherein the specific visual element comprises a facial expression and/or action bound to a skeletal model.
10. A mouth shape image generating device, characterized in that the device comprises:
the identifying module is configured to identify, for dubbing material to be processed, a plurality of virtual objects contained in the dubbing material and text information corresponding to each of the plurality of virtual objects, wherein the dubbing material comprises audio data and/or text data made for the plurality of virtual objects;
the analysis module is used for analyzing the text information corresponding to each virtual object to obtain pinyin phonemes for forming pronunciation of each text in the text information, wherein the pinyin phonemes comprise initial consonant phonemes and/or final phonemes;
the acquisition module is used for acquiring the mouth shape corresponding to the pinyin phonemes according to the preset mapping relation between the pinyin phonemes and the mouth shape, wherein the mouth shape comprises an initial consonant mouth shape and/or a final sound mouth shape;
And the generating module is used for mapping the mouth shape into a skeleton model of each of the plurality of virtual objects through a deformer so as to generate mouth shape images matched with the plurality of virtual objects.
CN202210770424.0A 2022-06-30 2022-06-30 Method and device for generating mouth image Pending CN117372588A (en)

Priority Applications (1)

Application Number: CN202210770424.0A; Priority date: 2022-06-30; Filing date: 2022-06-30; Title: Method and device for generating mouth image

Publications (1)

Publication Number: CN117372588A; Publication Date: 2024-01-09

Family ID: 89395177

Country Status (1): CN


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination