CN118155603A - Pronunciation correction method, electronic device, apparatus and storage medium for virtual digital person - Google Patents

Pronunciation correction method, electronic device, apparatus and storage medium for virtual digital person

Info

Publication number
CN118155603A
CN118155603A (application CN202211576222.9A)
Authority
CN
China
Prior art keywords
phoneme
pronunciation
lip
target
content
Prior art date
Legal status
Pending
Application number
CN202211576222.9A
Other languages
Chinese (zh)
Inventor
涂勇军
江秀
常晶
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202211576222.9A priority Critical patent/CN118155603A/en
Publication of CN118155603A publication Critical patent/CN118155603A/en

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The embodiments of the present application disclose a pronunciation correction method for a virtual digital person, an electronic device, an apparatus and a storage medium. The pronunciation correction method for the virtual digital person may include the following steps: acquiring initial pronunciation information of the content to be broadcasted, wherein the initial pronunciation information comprises initial speech rate information and/or initial lip shape information; correcting the initial pronunciation information according to a first pronunciation correction library to obtain target pronunciation information; and controlling the virtual digital person to broadcast the content to be broadcasted according to the target pronunciation information. By implementing this method, personalized pronunciation of the virtual digital person can be realized.

Description

Pronunciation correction method, electronic device, apparatus and storage medium for virtual digital person
Technical Field
The present application relates to the field of electronic devices, and in particular, to a pronunciation correction method for a virtual digital person, an electronic device, an apparatus and a storage medium.
Background
With the development of the industry, the virtual digital person, a "digital twin" of a real person constructed by applying advanced technologies, has become the interactive interface of the metaverse user in the metaverse space. It depends on a display device for its existence and has three features: the appearance of a person, the behavior of a person and the thought of a person.
In practice, it has been found that the lip movements of most virtual digital persons when speaking are relatively fixed and often cannot be individually customized for different users.
Disclosure of Invention
The embodiments of the present application provide a pronunciation correction method for a virtual digital person, an electronic device, an apparatus and a storage medium, which can enable the virtual digital person to realize personalized pronunciation.
A first aspect of the embodiment of the present application provides a pronunciation correction method for a virtual digital person, including:
Acquiring initial pronunciation information of the content to be broadcasted, wherein the initial pronunciation information comprises initial speech rate information and/or initial lip shape information;
Correcting the initial pronunciation information according to a first pronunciation correction library to obtain target pronunciation information;
and controlling the virtual digital person to broadcast the content to be broadcasted according to the target pronunciation information.
A second aspect of an embodiment of the present application provides a pronunciation correction device for a virtual digital person, including:
The apparatus comprises an acquisition unit, a correcting unit and a broadcasting unit, wherein the acquisition unit is used for acquiring initial pronunciation information of the content to be broadcasted, and the initial pronunciation information comprises initial speech rate information and/or initial lip shape information;
the correcting unit is used for correcting the initial pronunciation information according to the first pronunciation correcting library to obtain target pronunciation information;
and the broadcasting unit is used for controlling the virtual digital person to broadcast the content to be broadcasted according to the target pronunciation information.
A third aspect of an embodiment of the present application provides an electronic device, including:
a memory storing executable program code;
and a processor coupled to the memory;
The processor invokes the executable program code stored in the memory, which when executed by the processor causes the processor to implement the method according to the first aspect of the embodiment of the present application.
A fourth aspect of the embodiments of the present application provides a computer readable storage medium having stored thereon executable program code which, when executed by a processor, implements a method according to the first aspect of the embodiments of the present application.
A fifth aspect of an embodiment of the application discloses a computer program product which, when run on a computer, causes the computer to perform the method according to the first aspect of the embodiment of the application.
A sixth aspect of the embodiments of the present application discloses an application publishing platform for publishing a computer program product, wherein the computer program product, when run on a computer, causes the computer to perform the method according to the first aspect of the embodiments of the present application.
From the above technical solutions, the embodiment of the present application has the following advantages:
in the embodiments of the present application, initial pronunciation information of the content to be broadcasted is acquired, wherein the initial pronunciation information comprises initial speech rate information and/or initial lip shape information; the initial pronunciation information is corrected according to the first pronunciation correction library to obtain target pronunciation information; and the virtual digital person is controlled to broadcast the content to be broadcasted according to the target pronunciation information.
By implementing the method, before broadcasting the content to be broadcasted, the virtual digital person firstly uses the first pronunciation correction library to correct the initial pronunciation information of the content to be broadcasted, and then broadcasts the content to be broadcasted according to the corrected pronunciation information, so that the virtual digital person can realize personalized pronunciation.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required in the description of the embodiments and the prior art will be briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained from these drawings by a person of ordinary skill in the art without creative effort.
FIG. 1 is a flow chart of a method for correcting pronunciation of a virtual digital person according to an embodiment of the present application;
FIG. 2 is another flow chart of a method of pronunciation correction for a virtual digital person disclosed in an embodiment of the present application;
FIG. 3 is a diagram of key points disclosed in an embodiment of the present application;
FIG. 4 is a schematic diagram of a configuration of a pronunciation correction device for a virtual digital person according to an embodiment of the present application;
Fig. 5 is a structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The embodiments of the present application provide a pronunciation correction method for a virtual digital person, an electronic device, an apparatus and a storage medium, which can enable the virtual digital person to realize personalized pronunciation.
In order that those skilled in the art will better understand the solutions of the present application, the technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
It will be appreciated that the electronic devices involved in the embodiments of the present application may include general hand-held, on-screen electronic user terminals such as cell phones, smart phones, portable terminals, personal digital assistants (Personal Digital Assistant, PDA), portable multimedia player (Personal Media Player, PMP) devices, notebook computers, note pads, wireless broadband (Wireless Broadband, WiBro) terminals, tablet computers (Tablet Personal Computer, Tablet PC), smart PCs, point-of-sale (Point of Sales, POS) terminals, and vehicle computers, among others.
The electronic device may also include a wearable device. The wearable device may be worn directly on the user, or may be a portable electronic device integrated into the user's clothing or accessories. The wearable device is not only a hardware device, but can also realize powerful intelligent functions through software support, data interaction and interaction with cloud servers, such as computing, positioning and alarm functions, and can be connected with mobile phones and various other terminals. Wearable devices may include, but are not limited to, wrist-supported watch types (e.g., watches, wristbands, etc.), foot-supported shoe types (e.g., shoes, socks, or other leg-worn products), head-supported glasses types (e.g., glasses, helmets, headbands, etc.), as well as smart apparel, school bags, crutches, accessories and other non-mainstream product forms.
Since the emergence of virtual reality technology, users typically model the avatar of a virtual digital person on the image of a real person, so that the virtual digital person can better simulate the real person. However, in practice, it has been found that most current modeling of virtual digital persons focuses only on the facial texture of the virtual digital person and neglects modeling of its pronunciation, so that the pronunciation of the virtual digital person often cannot "copy" the pronunciation of the real person in a personalized manner.
In order to solve the problem, in the technical scheme disclosed by the application, before broadcasting the content to be broadcasted, the virtual digital person firstly uses the first pronunciation correction library to correct the initial pronunciation information of the content to be broadcasted, and then broadcasts the content to be broadcasted according to the corrected pronunciation information, so that the virtual digital person can realize personalized pronunciation.
The present technical solution is further described below with reference to specific drawings.
Referring to fig. 1, fig. 1 is a flowchart illustrating a pronunciation correction method for a virtual digital person according to an embodiment of the present application. The pronunciation correction method as shown in fig. 1 may include the steps of:
101. Acquiring initial pronunciation information of the content to be broadcasted, wherein the initial pronunciation information comprises initial speech rate information and/or initial lip shape information.
In the embodiment of the present application, the content to be broadcasted may be a specified content stored in advance on the electronic device, or may be obtained by collecting input voice.
Optionally, if the content to be broadcasted is obtained by collecting the input voice, the content to be broadcasted may be the content of the input voice or the answer content aiming at the input voice.
The initial speech rate information of the content to be broadcasted may include an initial speech rate of each phoneme of the content to be broadcasted, and the initial lip shape information of the content to be broadcasted may include initial lip coordinates of each phoneme of the content to be broadcasted.
Phonemes are the minimum phonetic units divided according to the natural properties of speech; analyzed according to the pronunciation actions in a syllable, each pronunciation action constitutes one phoneme. The initial speech rate of a phoneme indicates the initial pronunciation time of the phoneme, and the initial lip coordinates may include the initial coordinates of a plurality of lip depiction points. The plurality of lip depiction points may include a plurality of upper lip depiction points and a plurality of lower lip depiction points, where the upper lip depiction points are located on the upper lip of the virtual digital person, and the lower lip depiction points are located on the lower lip of the virtual digital person.
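To make the structure of this information concrete, the following Python sketch shows one possible per-phoneme representation of the initial pronunciation information; the class and field names are illustrative assumptions and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) coordinate of a lip depiction point

@dataclass
class PhonemePronunciation:
    """Initial pronunciation information for one phoneme of the content to be broadcasted."""
    phoneme: str                   # phoneme symbol obtained from the broadcast text
    initial_speech_rate: float     # initial pronunciation time of the phoneme (e.g., in seconds)
    upper_lip_points: List[Point]  # initial coordinates of the upper lip depiction points
    lower_lip_points: List[Point]  # initial coordinates of the lower lip depiction points

# The initial pronunciation information of the whole content is then a sequence of such records.
InitialPronunciationInfo = List[PhonemePronunciation]
```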
102. Correcting the initial pronunciation information according to the first pronunciation correction library to obtain target pronunciation information.
In some embodiments, the first pronunciation correction library may be constructed according to first audio-video information and second audio-video information obtained when the target user and the virtual digital person respectively broadcast specified content. Each specified phoneme contained in the specified content is a phoneme corresponding to a character whose frequency of use in the virtual scene is greater than a threshold. The first audio-video information includes a first video and first audio of the target user broadcasting the specified content, and the second audio-video information includes a second video and second audio of the virtual digital person broadcasting the specified content.
In some embodiments, the first video when the target user broadcasts the specified content may be acquired through a camera of the electronic device, and the first audio when the target user broadcasts the specified content may be acquired through a pickup of the electronic device.
In some embodiments, a screen recording program of the electronic device may be used to obtain a second video when the virtual digital person broadcasts the specified content, and record audio information in an audio path corresponding to the virtual digital person, so as to obtain a second audio when the virtual digital person broadcasts the specified content.
It should be noted that, the start time of the target user broadcasting the specified content and the start time of the virtual digital person broadcasting the specified content may be the same or different, which is not limited by the embodiment of the present application.
The target pronunciation information may include a target speech rate and/or target lip coordinates of each phoneme of the content to be broadcasted. The target speech rate of a phoneme indicates the target pronunciation time of the phoneme, and the target lip coordinates may include the target coordinates of the plurality of lip depiction points.
In some embodiments, correcting the initial pronunciation information according to the first pronunciation correction library to obtain the target pronunciation information may include:
determining a speech rate deviation of each phoneme of the content to be broadcasted from the first pronunciation correction library, and determining a target speech rate of each phoneme of the content to be broadcasted according to the speech rate deviation and the initial speech rate of each phoneme of the content to be broadcasted;
and/or,
determining a lip correction function of each phoneme of the content to be broadcasted from the first pronunciation correction library, and determining target lip coordinates of each phoneme of the content to be broadcasted according to the lip correction function and the initial lip coordinates of each phoneme of the content to be broadcasted.
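A minimal sketch of this correction step is given below, under the assumption that the first pronunciation correction library stores a speech rate deviation per phoneme and a lip correction function per phoneme, and that the deviation is added to the initial speech rate; all names are illustrative.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]
LipCorrectionFn = Callable[[Point], Point]  # maps an initial lip coordinate to a target one

def correct_pronunciation(
    phonemes: List[str],
    initial_rates: Dict[str, float],             # initial speech rate per phoneme
    initial_lips: Dict[str, List[Point]],        # initial lip coordinates per phoneme
    rate_deviations: Dict[str, float],           # speech rate deviation per phoneme (from the library)
    lip_corrections: Dict[str, LipCorrectionFn]  # lip correction function per phoneme (from the library)
) -> Tuple[Dict[str, float], Dict[str, List[Point]]]:
    """Return the target speech rate and target lip coordinates for each phoneme."""
    target_rates: Dict[str, float] = {}
    target_lips: Dict[str, List[Point]] = {}
    for p in phonemes:
        # Target speech rate: initial speech rate adjusted by the library's deviation.
        target_rates[p] = initial_rates[p] + rate_deviations.get(p, 0.0)
        # Target lip coordinates: each initial coordinate passed through the correction function.
        correct = lip_corrections.get(p, lambda pt: pt)
        target_lips[p] = [correct(pt) for pt in initial_lips[p]]
    return target_rates, target_lips
```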
103. Controlling the virtual digital person to broadcast the content to be broadcasted according to the target pronunciation information.
By implementing the method, before broadcasting the content to be broadcasted, the virtual digital person firstly corrects the initial pronunciation information of the content to be broadcasted by using the first pronunciation correction library, and then broadcasts the content to be broadcasted according to the corrected pronunciation information, so that the virtual digital person can realize personalized pronunciation.
Referring to fig. 2, fig. 2 is another flowchart of a pronunciation correction method for a virtual digital person according to an embodiment of the present application. The pronunciation correction method as shown in fig. 2 may include the steps of:
201. Acquiring initial pronunciation information of the content to be broadcasted, wherein the initial pronunciation information comprises initial speech rate information and/or initial lip shape information.
202. When the content to be broadcasted comprises the target content, correcting initial pronunciation information of other contents except the target content in the content to be broadcasted according to the first pronunciation correction library to obtain target pronunciation information of the other contents.
In some embodiments, the first pronunciation correction library may include a first lip correction library and a first speech rate correction library. The first lip correction library is used for correcting the initial lip coordinates of each phoneme of the content to be broadcasted, and the first speech rate correction library is used for correcting the initial speech rate of each phoneme of the content to be broadcasted. The construction of the first pronunciation correction library is described as follows:
before step 202, phoneme segmentation may be performed on the first audio and the second audio, so as to obtain a first timestamp and a second timestamp corresponding to each phoneme; wherein the first timestamp is a point in time when each phoneme appears in the first audio and the second timestamp is a point in time when each phoneme appears in the second audio; according to the first timestamp and the second timestamp corresponding to each phoneme, determining a first key frame and a second key frame corresponding to each phoneme from the first video and the second video respectively; constructing a first lip correction library according to the first key frame and the second key frame corresponding to each phoneme; and constructing a first speech rate correction library according to the first timestamp and the second timestamp corresponding to each phoneme.
The first key frame corresponding to each phoneme is a picture frame including lips of the target user, and the second key frame corresponding to each phoneme is a picture frame including lips of the virtual digital person.
The time stamp corresponding to each phoneme may be stored in a list, for example, see table 1 below:
Sequence number | Phoneme   | Timestamp
1               | Phoneme 1 | T1
2               | Phoneme 2 | T2
...             | ...       | ...
n               | Phoneme n | Tn
TABLE 1
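As an illustration of how the timestamp list above and the corresponding key frames could be obtained, the sketch below assumes that phoneme segmentation is provided by an external forced-alignment or speech-recognition tool and that key frames are looked up by timestamp at a known frame rate; these assumptions are not specified by the patent.

```python
from typing import Any, Dict, List

def keyframe_index(timestamp_s: float, fps: float = 30.0) -> int:
    """Map a phoneme timestamp (in seconds) to the index of the video frame shown at that moment."""
    return int(round(timestamp_s * fps))

def build_keyframe_table(
    phoneme_timestamps: Dict[str, float],  # e.g., {"phoneme 1": t1, "phoneme 2": t2, ...}
    video_frames: List[Any],               # decoded frames of the first or second video
    fps: float = 30.0,
) -> Dict[str, Any]:
    """Select, for every phoneme, the frame containing the lips at the moment the phoneme is uttered."""
    return {
        phoneme: video_frames[min(keyframe_index(t, fps), len(video_frames) - 1)]
        for phoneme, t in phoneme_timestamps.items()
    }
```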
In some embodiments, constructing a first lip-correction library from the first keyframe and the second keyframe corresponding to each phoneme may include: determining a first coordinate set and a second coordinate set corresponding to each phoneme from a first key frame and a second key frame corresponding to each phoneme respectively, wherein the first coordinate set and the second coordinate set are related to a plurality of key points, the first coordinate set comprises a first coordinate of each key point, and the second coordinate set comprises a second coordinate of each key point; respectively executing normalization operation on the first coordinate set and the second coordinate set corresponding to each phoneme to obtain a first target coordinate set and a second target coordinate set corresponding to each phoneme; determining a lip correction function corresponding to each phoneme according to the first target coordinate set and the second target coordinate set corresponding to each phoneme; and obtaining a first lip correction library according to the lip correction function corresponding to each phoneme.
Performing the normalization operation on the first coordinate set and the second coordinate set corresponding to each phoneme may mean performing the normalization operation on the coordinates of each key point in the first coordinate set corresponding to each phoneme and performing the normalization operation on the coordinates of each key point in the second coordinate set corresponding to each phoneme.
The plurality of key points may include a plurality of upper lip key points and a plurality of lower lip key points. The positions of the key points P are shown in FIG. 3. The plurality of key points P may include a first mouth angle P0, a second mouth angle P1, a center point P2 of the lower edge of the upper lip, and a center point P3 of the upper edge of the lower lip.
In the embodiment of the present application, it may be assumed that the coordinates of the key points corresponding to the virtual digital person are normalized, and at this time, only the coordinates of the key points corresponding to the target user, that is, the coordinates in the first coordinate set corresponding to each phoneme, need to be normalized.
In the embodiment of the present application, the first coordinate set and the second coordinate set corresponding to each phoneme may be stored in a list manner.
The following description takes the storage of the first coordinate set as an example; please refer to Table 2 below.
TABLE 2
In some embodiments, the normalization operation is performed on the first coordinate set corresponding to each phoneme, so as to obtain a first target coordinate set corresponding to each phoneme, which may include, but is not limited to, the following ways:
Mode 1, respectively obtaining a coordinate of P0 and a coordinate of P1 from a first coordinate set corresponding to each phoneme, and calculating a mouth angle distance value corresponding to each phoneme according to the coordinate of P0 and the coordinate of P1; taking the reciprocal of the maximum mouth angle distance value as a first coefficient; and multiplying each coordinate in the first coordinate set by the first coefficient to obtain a first target coordinate set corresponding to each phoneme.
The maximum mouth angle distance value is calculated as follows:

L1 = max( sqrt( (x_p0 - x_p1)^2 + (y_p0 - y_p1)^2 ) )

wherein the coordinates of P0 are (x_p0, y_p0), the coordinates of P1 are (x_p1, y_p1), and L1 refers to the maximum mouth angle distance value, i.e., the maximum of the mouth angle distance values over all phonemes.
The normalization formula for the coordinates in the first coordinate set is:

x'_i = x_i / L1,  y'_i = y_i / L1

wherein (x_i, y_i) are the coordinates of any key point in the first coordinate set, and (x'_i, y'_i) are the normalized coordinates of that key point.
Mode 2, respectively obtaining the coordinates of P0, the coordinates of P1, the coordinates of P2 and the coordinates of P3 from a first coordinate set corresponding to each phoneme; calculating a mouth angle distance value corresponding to each phoneme according to the coordinates of P0 and the coordinates of P1; calculating an opening and closing distance value corresponding to each phoneme according to the coordinates of P2 and the coordinates of P3; taking the reciprocal of the maximum mouth angle distance value as a first coefficient; and taking the reciprocal of the maximum opening and closing distance value as a second coefficient, multiplying the first coefficient by the abscissa value of each coordinate in the first coordinate set, and multiplying the second coefficient by the ordinate value of each coordinate in the first coordinate set to obtain a first target coordinate set corresponding to each phoneme.
The calculation of the first coefficient may refer to the above description, and will not be repeated here.
The maximum opening and closing distance value is calculated as follows:

L2 = max( sqrt( (x_up - x_lower)^2 + (y_up - y_lower)^2 ) )

wherein the coordinates of P2 are (x_up, y_up), the coordinates of P3 are (x_lower, y_lower), and L2 refers to the maximum opening and closing distance value, i.e., the maximum of the opening and closing distance values over all phonemes.
The normalization formula for the coordinates in the first coordinate set is:

x'_i = x_i / L1,  y'_i = y_i / L2

wherein (x_i, y_i) are the coordinates of any key point in the first coordinate set, and (x'_i, y'_i) are the normalized coordinates of that key point.
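Both normalization modes can be summarized in a short sketch, assuming the coordinate sets are lists of (x, y) tuples and that L1 and L2 have already been computed as the maximum mouth angle distance and the maximum opening and closing distance; the function names are illustrative.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def max_distance(point_pairs: List[Tuple[Point, Point]]) -> float:
    """Maximum Euclidean distance over a set of point pairs, e.g., (P0, P1) of every phoneme."""
    return max(((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 for (ax, ay), (bx, by) in point_pairs)

def normalize_mode1(coords: List[Point], l1: float) -> List[Point]:
    """Mode 1: multiply every coordinate by the first coefficient 1 / L1."""
    k = 1.0 / l1
    return [(x * k, y * k) for x, y in coords]

def normalize_mode2(coords: List[Point], l1: float, l2: float) -> List[Point]:
    """Mode 2: scale abscissas by the first coefficient 1 / L1 and ordinates by the second coefficient 1 / L2."""
    kx, ky = 1.0 / l1, 1.0 / l2
    return [(x * kx, y * ky) for x, y in coords]
```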
In some embodiments, determining the lip correction function corresponding to each phoneme according to the first target coordinate set and the second target coordinate set corresponding to each phoneme may include: acquiring a first upper lip coordinate set and a first lower lip coordinate set corresponding to each phoneme from the first target coordinate set corresponding to each phoneme; acquiring a second upper lip coordinate set and a second lower lip coordinate set corresponding to each phoneme from the second target coordinate set corresponding to each phoneme; determining an upper lip correction function corresponding to each phoneme by using the first upper lip coordinate set and the second upper lip coordinate set corresponding to each phoneme; and determining a lower lip correction function corresponding to each phoneme by using the first lower lip coordinate set and the second lower lip coordinate set corresponding to each phoneme.
The key points may include a plurality of upper lip key points and a plurality of lower lip key points, the first upper lip coordinate set and the second upper lip coordinate set each include coordinates of the plurality of upper lip key points, and the first lower lip coordinate set and the second lower lip coordinate set each include coordinates of the plurality of lower lip key points.
The upper lip correction function may include an upper lip correction function corresponding to each upper lip depiction point, where the upper lip depiction points are used for fitting the upper lip of the virtual digital person, and the number of upper lip depiction points is greater than the number of upper lip key points. The lower lip correction function may include a lower lip correction function corresponding to each lower lip depiction point, where the lower lip depiction points are used for fitting the lower lip of the virtual digital person, and the number of lower lip depiction points is greater than the number of lower lip key points.
Because the upper lip correction function of each upper lip depiction point corresponding to each phoneme is determined in the same manner, the embodiment of the application is described by taking a first upper lip depiction point corresponding to a first phoneme as an example, where the first phoneme is any phoneme contained in the specified content, and the first upper lip depiction point is any upper lip depiction point corresponding to any phoneme.
In some embodiments, determining the upper lip correction function for each phoneme using the first upper lip coordinate set and the second upper lip coordinate set for each phoneme may include:
Determining a second coordinate of a first upper lip key point nearest to the first upper lip depiction point from a second upper lip coordinate set corresponding to the first phoneme;
determining a first coordinate of a first upper lip key point from a first upper lip coordinate set corresponding to a first phoneme, determining a first coordinate of a second upper lip key point nearest to the first upper lip key point, and acquiring a coordinate of a target point between the first upper lip key point and the second upper lip key point on a first key frame;
Calculating an interpolation coefficient corresponding to the target point according to the coordinate of the target point, the first coordinate of the first upper lip key point and the first coordinate of the second upper lip key point;
calculating a coordinate offset parameter corresponding to the first upper lip key point according to the second coordinate of the first upper lip key point and the first coordinate of the first upper lip key point;
And determining an upper lip correction function corresponding to the first upper lip depiction point according to the interpolation coefficient corresponding to the target point, the coordinate offset parameter corresponding to the first upper lip key point and the coordinate of the first upper lip depiction point in the second key frame.
The interpolation coefficients corresponding to the target points comprise interpolation coefficients in the X direction and interpolation coefficients in the Y direction. The coordinate offset parameters corresponding to the first upper lip key point may include an offset parameter in the X direction and an offset parameter in the Y direction.
The specific calculation formulas are as follows:

a_x = (x - x_i) / (x_{i+1} - x_i),  a_y = (y - y_i) / (y_{i+1} - y_i)

wherein a_x refers to the interpolation coefficient in the X direction corresponding to the target point, a_y refers to the interpolation coefficient in the Y direction corresponding to the target point, (x, y) refers to the coordinates of the target point between the first upper lip key point and the second upper lip key point on the first key frame, (x_i, y_i) refers to the first coordinates of the first upper lip key point, and (x_{i+1}, y_{i+1}) refers to the first coordinates of the second upper lip key point.

x'' = x_v + a_x · Δx,  y'' = y_v + a_y · Δy

wherein (x_v, y_v) refers to the coordinates of the first upper lip depiction point in the second key frame, Δx refers to the offset parameter in the X direction corresponding to the first upper lip key point, Δy refers to the offset parameter in the Y direction corresponding to the first upper lip key point, a_x refers to the interpolation coefficient in the X direction corresponding to the target point, a_y refers to the interpolation coefficient in the Y direction corresponding to the target point, and (x'', y'') refers to the corrected coordinates of the first upper lip depiction point.
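The reconstruction above can be sketched as follows for a single upper lip depiction point; the linear form of the interpolation coefficients and of the correction is an assumption consistent with the surrounding description, and all variable names are illustrative.

```python
from typing import Tuple

Point = Tuple[float, float]

def interpolation_coefficients(target: Point, keypoint_1: Point, keypoint_2: Point) -> Tuple[float, float]:
    """Interpolation coefficients of the target point lying between two upper lip key points (first coordinates)."""
    (x, y), (x1, y1), (x2, y2) = target, keypoint_1, keypoint_2
    a_x = (x - x1) / (x2 - x1) if x2 != x1 else 0.0
    a_y = (y - y1) / (y2 - y1) if y2 != y1 else 0.0
    return a_x, a_y

def correct_upper_lip_point(
    depiction_point: Point,             # coordinates of the depiction point in the second key frame
    offset: Point,                      # coordinate offset of the nearest upper lip key point
    coefficients: Tuple[float, float],  # interpolation coefficients of the target point
) -> Point:
    """Shift the virtual digital person's depiction point toward the target user's lip shape."""
    (x_v, y_v), (dx, dy), (a_x, a_y) = depiction_point, offset, coefficients
    return (x_v + a_x * dx, y_v + a_y * dy)
```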
It should be noted that, regarding the determination method of the lower lip correction function corresponding to the lower lip depiction point, reference may be made to the determination method of the upper lip correction function corresponding to the upper lip depiction point, which is not described herein.
It is understood that the first lip-correction library may include an upper lip-correction function and a lower lip-correction function for each phoneme.
In some embodiments, constructing the first speech rate correction library according to the first timestamp and the second timestamp corresponding to each phoneme may include: calculating the speech rate deviation of each phoneme according to the first timestamp and the second timestamp corresponding to each phoneme; and constructing the first speech rate correction library according to the speech rate deviation of each phoneme.
In the case where the target user and the virtual digital person broadcast the specified content at the same time, the speech rate deviation of the first phoneme may be obtained by subtracting the second timestamp from the first timestamp or by subtracting the first timestamp from the second timestamp, which is not limited in the embodiments of the present application.
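A brief sketch of building the first speech rate correction library from the two timestamp lists follows; whether the deviation is stored as the first timestamp minus the second or the reverse is left open by the text, and the first direction is used here as an assumption.

```python
from typing import Dict

def build_speech_rate_correction_library(
    first_timestamps: Dict[str, float],   # time at which each phoneme appears in the target user's audio
    second_timestamps: Dict[str, float],  # time at which each phoneme appears in the virtual digital person's audio
) -> Dict[str, float]:
    """Speech rate deviation per phoneme, assuming both broadcasts start at the same moment."""
    return {
        phoneme: first_timestamps[phoneme] - second_timestamps[phoneme]
        for phoneme in first_timestamps
        if phoneme in second_timestamps
    }
```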
Further, after the first pronunciation correction library is obtained, the validity of the first pronunciation correction library may be verified, which may specifically include:
acquiring third audio-video information and fourth audio-video information obtained when the target user and the virtual digital person respectively broadcast test content, wherein, when the virtual digital person broadcasts the test content, the pronunciation of the virtual digital person is corrected by using the first pronunciation correction library to obtain the fourth audio-video information.
According to the third audio-video information and the fourth audio-video information, calculating pronunciation offset parameters of the virtual digital person relative to the target user; wherein the pronunciation offset parameter may include a lip offset parameter and/or a speech rate offset parameter.
When the pronunciation offset parameter is less than or equal to an offset parameter threshold, it is determined that the first pronunciation correction library is valid and execution continues to step 202; otherwise, it is determined that the first pronunciation correction library is invalid.
When the pronunciation offset parameter includes a lip offset parameter and a speech rate offset parameter, the offset parameter threshold also includes a lip offset parameter threshold and a speech rate offset parameter threshold. In this case, the pronunciation offset parameter being less than or equal to the offset parameter threshold may mean that the lip offset parameter is less than or equal to the lip offset parameter threshold and the speech rate offset parameter is less than or equal to the speech rate offset parameter threshold.
In some embodiments, when the first pronunciation correction library is invalid, the above steps of constructing the first pronunciation correction library may be performed again to construct a new first pronunciation correction library.
In some embodiments, the pronunciation offset parameter comprises a lip offset parameter and a speech rate offset parameter, the third audio-video information comprises third audio and a third video, and the fourth audio-video information comprises fourth audio and a fourth video; calculating the pronunciation offset parameter of the virtual digital person relative to the target user according to the third audio-video information and the fourth audio-video information may include:
respectively carrying out phoneme segmentation on the third audio and the fourth audio to obtain a third timestamp and a fourth timestamp corresponding to each phoneme; wherein the third timestamp is a point in time when each phoneme appears in the third audio and the fourth timestamp is a point in time when each phoneme appears in the fourth audio;
determining a third key frame and a fourth key frame corresponding to each phoneme from the third video and the fourth video respectively according to the third timestamp and the fourth timestamp corresponding to each phoneme; acquiring, from the third key frame and the fourth key frame corresponding to each phoneme respectively, the third coordinate of each key point corresponding to each phoneme and the fourth coordinate of each key point corresponding to each phoneme; performing a normalization operation on the third coordinate of each key point corresponding to each phoneme and on the fourth coordinate of each key point corresponding to each phoneme respectively; and obtaining the lip offset parameter of the virtual digital person relative to the target user according to the normalized third coordinate of each key point corresponding to each phoneme and the normalized fourth coordinate of each key point corresponding to each phoneme;
and calculating the speech rate deviation corresponding to each phoneme according to the third timestamp and the fourth timestamp corresponding to each phoneme, and obtaining the speech rate offset parameter of the virtual digital person relative to the target user according to the speech rate deviation corresponding to each phoneme.
In some embodiments, obtaining the lip offset parameter of the virtual digital person relative to the target user according to the normalized third coordinate of each key point corresponding to each phoneme and the normalized fourth coordinate of each key point corresponding to each phoneme may include, but is not limited to, the following ways:
Mode 1: according to the normalized third coordinate and the normalized fourth coordinate of each key point corresponding to each phoneme, calculate the square of the abscissa difference and the square of the ordinate difference of each key point corresponding to each phoneme, and sum them to obtain the squared deviation of each key point corresponding to each phoneme; accumulate the squared deviations of the key points corresponding to all phonemes to obtain a first sum value; and divide the first sum value by the total number of accumulated key points to obtain the lip offset parameter.
Specifically:

D = (1/n) · Σ [ (x3_i - x4_i)^2 + (y3_i - y4_i)^2 ]

wherein (x3_i, y3_i) refers to the normalized third coordinate of each key point corresponding to each phoneme, (x4_i, y4_i) refers to the normalized fourth coordinate of the corresponding key point, and n refers to the total number of key points of all phonemes.
Mode 2: according to the normalized third coordinate and the normalized fourth coordinate of each key point corresponding to each phoneme, calculate the absolute value of the abscissa difference and the absolute value of the ordinate difference of each key point corresponding to each phoneme; accumulate the absolute values of the ordinate differences of the key points corresponding to all phonemes to obtain a second sum value, and divide the second sum value by the total number of key points of all phonemes to obtain a first lip offset parameter; and accumulate the absolute values of the abscissa differences of the key points corresponding to all phonemes to obtain a third sum value, and divide the third sum value by the total number of key points of all phonemes to obtain a second lip offset parameter.
Specifically:

D_y = (1/n) · Σ | y3_i - y4_i |
D_x = (1/n) · Σ | x3_i - x4_i |

wherein n refers to the total number of key points of all phonemes, y3_i refers to the ordinate of the normalized third coordinate of each key point corresponding to each phoneme, y4_i refers to the ordinate of the normalized fourth coordinate of the corresponding key point, x3_i refers to the abscissa of the normalized third coordinate of each key point corresponding to each phoneme, and x4_i refers to the abscissa of the normalized fourth coordinate of the corresponding key point.
In some embodiments, obtaining the speech rate offset parameter of the virtual digital person relative to the target user according to the speech rate deviation corresponding to each phoneme may include:
calculating the square of the speech rate deviation corresponding to each phoneme, accumulating the squares of the speech rate deviations corresponding to all phonemes to obtain a fourth sum value, and dividing the fourth sum value by the number of phonemes to obtain the speech rate offset parameter.
Specifically:

V = (1/n) · Σ ( t3_i - t4_i )^2

wherein n refers to the number of all phonemes, t3_i refers to the third timestamp corresponding to each phoneme, and t4_i refers to the fourth timestamp of the corresponding phoneme.
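The validity check described above can be sketched by combining the Mode 1 lip offset parameter with the speech rate offset parameter and comparing them with thresholds; the threshold values used below are illustrative assumptions, not values from the patent.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]

def lip_offset_parameter(third_coords: List[Point], fourth_coords: List[Point]) -> float:
    """Mean squared coordinate difference over all key points of all phonemes (Mode 1)."""
    n = len(third_coords)
    return sum((x3 - x4) ** 2 + (y3 - y4) ** 2
               for (x3, y3), (x4, y4) in zip(third_coords, fourth_coords)) / n

def speech_rate_offset_parameter(third_ts: Dict[str, float], fourth_ts: Dict[str, float]) -> float:
    """Mean squared timestamp deviation over all phonemes."""
    n = len(third_ts)
    return sum((third_ts[p] - fourth_ts[p]) ** 2 for p in third_ts) / n

def correction_library_is_valid(
    lip_offset: float,
    rate_offset: float,
    lip_threshold: float = 0.05,   # assumed threshold, not given in the patent
    rate_threshold: float = 0.05,  # assumed threshold, not given in the patent
) -> bool:
    """The first pronunciation correction library is considered valid when both offsets are small enough."""
    return lip_offset <= lip_threshold and rate_offset <= rate_threshold
```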
203. Correcting the initial pronunciation information of the target content according to the second pronunciation correction library to obtain target pronunciation information of the target content; the second pronunciation correction library is constructed according to pronunciation information customized by a target user for target content.
In some embodiments, the second pronunciation correction library may include a second lip correction library and a second speech rate correction library. The construction process of the second pronunciation correction library may include: acquiring the target lip coordinates and the target speech rate of each phoneme in the target content customized by the user; determining lip offset information of each phoneme according to the target lip coordinates and the initial lip coordinates of each phoneme, and obtaining the second lip correction library according to the lip offset information of each phoneme; and determining a speech rate difference value of each phoneme according to the target speech rate and the initial speech rate of each phoneme, and obtaining the second speech rate correction library according to the speech rate difference value of each phoneme. The lip offset information may include a coordinate offset of each upper lip depiction point and a coordinate offset of each lower lip depiction point.
The user can customize both the lip shape and the speech rate. It can be understood that, when the target content is a specific chat sentence, the pronunciation effect of the virtual digital person for that specific chat sentence can be locked, thereby enhancing the interestingness of communication with the virtual digital person.
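A short sketch of how the second pronunciation correction library described above might be assembled from the user-customized values is given below; the dictionary-based structure and the names are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]

def build_second_correction_library(
    custom_lips: Dict[str, List[Point]],   # user-customized target lip coordinates per phoneme
    initial_lips: Dict[str, List[Point]],  # initial lip coordinates per phoneme
    custom_rates: Dict[str, float],        # user-customized target speech rate per phoneme
    initial_rates: Dict[str, float],       # initial speech rate per phoneme
) -> Tuple[Dict[str, List[Point]], Dict[str, float]]:
    """Second lip correction library (per-point coordinate offsets) and second speech rate correction library."""
    lip_library = {
        p: [(tx - ix, ty - iy) for (tx, ty), (ix, iy) in zip(custom_lips[p], initial_lips[p])]
        for p in custom_lips
    }
    rate_library = {p: custom_rates[p] - initial_rates[p] for p in custom_rates}
    return lip_library, rate_library
```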
204. Controlling the virtual digital person to broadcast the content to be broadcasted according to the target pronunciation information of the other contents and the target pronunciation information of the target content.
By implementing the method, before broadcasting the content to be broadcasted, the virtual digital person judges whether the content to be broadcasted has the target content or not, if so, the second pronunciation correction library is utilized to correct the initial pronunciation information of the target content to obtain the target pronunciation information of the target content, the first pronunciation correction library is utilized to correct the initial pronunciation information of other contents to obtain the target pronunciation information of other contents, and finally, the content to be broadcasted is broadcasted according to the target pronunciation information of the target content and the target pronunciation information of other contents. Therefore, the personalized pronunciation of the virtual digital person is realized, and the communication interestingness of the virtual digital person is greatly enhanced.
Referring to fig. 4, fig. 4 is a schematic diagram of a configuration of a pronunciation correcting device for a virtual digital person according to an embodiment of the present application. The pronunciation correcting apparatus as shown in fig. 4 may include an acquisition unit 401, a correction unit 402, and a broadcasting unit 403; wherein:
an obtaining unit 401, configured to obtain initial pronunciation information of the content to be broadcasted, where the initial pronunciation information includes initial speech rate information and/or initial lip shape information;
A correction unit 402, configured to correct the initial pronunciation information according to the first pronunciation correction library to obtain target pronunciation information;
And the broadcasting unit 403 is configured to control the virtual digital person to broadcast the content to be broadcasted according to the target pronunciation information.
In some embodiments, the initial speech rate information includes an initial speech rate of each phoneme of the content to be broadcasted, and the initial lip shape information includes initial lip coordinates of each phoneme of the content to be broadcasted; the manner in which the correcting unit 402 corrects the initial pronunciation information according to the first pronunciation correction library to obtain the target pronunciation information may specifically include: the correcting unit 402 is configured to determine, from the first pronunciation correction library, a speech rate deviation of each phoneme of the content to be broadcasted, and determine, according to the speech rate deviation and the initial speech rate of each phoneme of the content to be broadcasted, a target speech rate of each phoneme of the content to be broadcasted; and/or determine a lip correction function of each phoneme of the content to be broadcasted from the first pronunciation correction library, and determine target lip coordinates of each phoneme of the content to be broadcasted according to the lip correction function and the initial lip coordinates of each phoneme of the content to be broadcasted.
In some embodiments, the manner in which the correcting unit 402 corrects the initial pronunciation information according to the first pronunciation correction library to obtain the target pronunciation information may specifically include: when the content to be broadcasted includes the target content, correcting initial pronunciation information of other contents except the target content in the content to be broadcasted according to the first pronunciation correction library to obtain target pronunciation information of the other contents, where the first pronunciation correction library is constructed according to first audio-video information and second audio-video information obtained when the target user and the virtual digital person respectively broadcast specified content; and correcting the initial pronunciation information of the target content according to the second pronunciation correction library to obtain target pronunciation information of the target content, where the second pronunciation correction library is constructed according to pronunciation information customized by the target user for the target content.
In some embodiments, the first pronunciation correction library includes a first lip correction library and a first speech rate correction library, the first pronunciation correction library is constructed according to first audio-video information and second audio-video information obtained when the target user and the virtual digital person respectively broadcast specified content, the first audio-video information includes first audio and a first video, and the second audio-video information includes second audio and a second video;
Further, the pronunciation correction device shown in fig. 4 may further include a correction library construction unit (not shown in the drawing), where the correction library construction unit is configured to segment the first audio and the second audio into phonemes, so as to obtain a first timestamp and a second timestamp corresponding to each phoneme; wherein the first timestamp is a point in time when each phoneme appears in the first audio and the second timestamp is a point in time when each phoneme appears in the second audio; according to the first timestamp and the second timestamp corresponding to each phoneme, a first key frame and a second key frame corresponding to each phoneme are respectively determined from the first video and the second video; constructing a first lip correction library according to the first key frame and the second key frame corresponding to each phoneme; and constructing a first speech rate correction library according to the first timestamp and the second timestamp corresponding to each phoneme.
In some embodiments, the manner in which the correction library constructing unit is configured to construct the first lip-correction library according to the first keyframe and the second keyframe corresponding to each phoneme may specifically include: the correction library construction unit is used for determining a first coordinate set and a second coordinate set corresponding to each phoneme from a first key frame and a second key frame corresponding to each phoneme respectively, wherein the first coordinate set and the second coordinate set are related to a plurality of key points, the first coordinate set comprises a first coordinate of each key point, and the second coordinate set comprises a second coordinate of each key point; respectively executing normalization operation on the first coordinate set and the second coordinate set corresponding to each phoneme to obtain a first target coordinate set and a second target coordinate set corresponding to each phoneme; determining a lip correction function corresponding to each phoneme according to the first target coordinate set and the second target coordinate set corresponding to each phoneme; and obtaining a first lip correction library according to the lip correction function corresponding to each phoneme.
In some embodiments, the manner in which the correction library construction unit is configured to determine the lip correction function corresponding to each phoneme according to the first target coordinate set and the second target coordinate set corresponding to each phoneme may specifically include: the correction library construction unit is used for acquiring a first upper lip coordinate set and a first lower lip coordinate set corresponding to each phoneme from a first target coordinate set corresponding to each phoneme; and acquiring a second upper lip coordinate set and a second lower lip coordinate set corresponding to each phoneme from a second target coordinate set corresponding to each phoneme; and determining an upper lip correction function corresponding to each phoneme by using the first upper lip coordinate set and the second upper lip coordinate set corresponding to each phoneme; and determining a lower lip correction function corresponding to each phoneme by using the first lower lip coordinate set and the second lower lip coordinate set corresponding to each phoneme.
In some embodiments, the manner in which the correction library construction unit constructs the first speech rate correction library according to the first timestamp and the second timestamp corresponding to each phoneme may specifically include: the correction library construction unit is configured to calculate the speech rate deviation of each phoneme according to the first timestamp and the second timestamp corresponding to each phoneme, and construct the first speech rate correction library according to the speech rate deviation of each phoneme.
Referring to fig. 5, fig. 5 is a schematic diagram of an electronic device according to an embodiment of the application. The electronic device as shown in fig. 5 may include: a processor 501, a memory 502 coupled to the processor 501, wherein the memory 502 may store one or more computer programs.
The processor 501 may include one or more processing cores. The processor 501 uses various interfaces and lines to connect various parts of the overall electronic device, and performs various functions of the electronic device and processes data by running or executing instructions, programs, code sets or instruction sets stored in the memory 502 and invoking data stored in the memory 502. Optionally, the processor 501 may be implemented in at least one hardware form of digital signal processing (Digital Signal Processing, DSP), field-programmable gate array (Field-Programmable Gate Array, FPGA) and programmable logic array (Programmable Logic Array, PLA). The processor 501 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is responsible for rendering and drawing of display content; and the modem is used to handle wireless communications. It will be appreciated that the modem may also not be integrated into the processor 501 and may be implemented by a separate communication chip.
The Memory 502 may include random access Memory (Random Access Memory, RAM) or Read-Only Memory (ROM). Memory 502 may be used to store instructions, programs, code sets, or instruction sets. The memory 502 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like. The storage data area may also store data created by the electronic device in use, etc.
In an embodiment of the present application, the processor 501 also has the following functions:
acquiring initial pronunciation information of the content to be broadcasted, wherein the initial pronunciation information comprises initial speech rate information and/or initial lip shape information;
correcting the initial pronunciation information according to the first pronunciation correction library to obtain target pronunciation information;
and controlling the virtual digital person to broadcast the content to be broadcasted according to the target pronunciation information.
In the embodiments of the present application, the initial speech rate information comprises an initial speech rate of each phoneme of the content to be broadcasted, and the initial lip shape information comprises initial lip coordinates of each phoneme of the content to be broadcasted; the processor 501 also has the following functions:
determining a speech rate deviation of each phoneme of the content to be broadcasted from the first pronunciation correction library, and determining a target speech rate of each phoneme of the content to be broadcasted according to the speech rate deviation and the initial speech rate of each phoneme of the content to be broadcasted;
and/or,
determining a lip correction function of each phoneme of the content to be broadcasted from the first pronunciation correction library, and determining target lip coordinates of each phoneme of the content to be broadcasted according to the lip correction function and the initial lip coordinates of each phoneme of the content to be broadcasted.
In an embodiment of the present application, the processor 501 also has the following functions:
when the content to be broadcasted comprises the target content, correcting initial pronunciation information of other contents except the target content in the content to be broadcasted according to the first pronunciation correction library to obtain target pronunciation information of the other contents; wherein the first pronunciation correction library is constructed according to first audio-video information and second audio-video information obtained when the target user and the virtual digital person respectively broadcast specified content;
Correcting the initial pronunciation information of the target content according to the second pronunciation correction library to obtain target pronunciation information of the target content; the second pronunciation correction library is constructed according to pronunciation information customized by a target user for target content.
In the embodiments of the present application, the first pronunciation correction library comprises a first lip correction library and a first speech rate correction library, the first pronunciation correction library is constructed according to first audio-video information and second audio-video information obtained when the target user and the virtual digital person respectively broadcast specified content, the first audio-video information comprises first audio and a first video, and the second audio-video information comprises second audio and a second video;
the processor 501 also has the following functions:
Respectively carrying out phoneme segmentation on the first audio and the second audio to obtain a first timestamp and a second timestamp corresponding to each phoneme; wherein the first timestamp is a point in time when each phoneme appears in the first audio and the second timestamp is a point in time when each phoneme appears in the second audio;
according to the first timestamp and the second timestamp corresponding to each phoneme, determining a first key frame and a second key frame corresponding to each phoneme from the first video and the second video respectively;
constructing a first lip correction library according to the first key frame and the second key frame corresponding to each phoneme;
and constructing a first speech rate correction library according to the first timestamp and the second timestamp corresponding to each phoneme.
In an embodiment of the present application, the processor 501 also has the following functions:
determining a first coordinate set and a second coordinate set corresponding to each phoneme from the first key frame and the second key frame corresponding to each phoneme respectively, wherein the first coordinate set and the second coordinate set are associated with a plurality of key points, the first coordinate set comprises a first coordinate of each key point, and the second coordinate set comprises a second coordinate of each key point;
performing a normalization operation on the first coordinate set and the second coordinate set corresponding to each phoneme respectively to obtain a first target coordinate set and a second target coordinate set corresponding to each phoneme;
determining a lip correction function corresponding to each phoneme according to the first target coordinate set and the second target coordinate set corresponding to each phoneme;
and obtaining the first lip correction library according to the lip correction function corresponding to each phoneme.
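A sketch of how such a lip correction library could be assembled, assuming min-max normalization of the keypoints and a simple per-keypoint offset as the correction function; the embodiment fixes neither choice:

```python
import numpy as np

# Assumptions: keypoints are normalized into a unit box (min-max) and the lip
# correction function is a per-keypoint offset from the avatar's shape to the user's.

def normalize(coords):
    """Scale a set of lip keypoints into a unit box for cross-face comparison."""
    coords = np.asarray(coords, dtype=float)
    mins, maxs = coords.min(axis=0), coords.max(axis=0)
    return (coords - mins) / np.maximum(maxs - mins, 1e-9)

def make_lip_correction_fn(user_coords, avatar_coords):
    """Return a function that shifts normalized avatar keypoints toward the user's shape."""
    offset = normalize(user_coords) - normalize(avatar_coords)
    return lambda pts: normalize(pts) + offset

def build_first_lip_library(per_phoneme_keypoints):
    """per_phoneme_keypoints: {phoneme: (user_keypoints, avatar_keypoints)} from the key frames."""
    return {p: make_lip_correction_fn(u, a) for p, (u, a) in per_phoneme_keypoints.items()}
```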
In an embodiment of the present application, the processor 501 also has the following functions:
acquiring a first upper lip coordinate set and a first lower lip coordinate set corresponding to each phoneme from the first target coordinate set corresponding to each phoneme;
acquiring a second upper lip coordinate set and a second lower lip coordinate set corresponding to each phoneme from the second target coordinate set corresponding to each phoneme;
determining an upper lip correction function corresponding to each phoneme by using the first upper lip coordinate set and the second upper lip coordinate set corresponding to each phoneme;
and determining a lower lip correction function corresponding to each phoneme by using the first lower lip coordinate set and the second lower lip coordinate set corresponding to each phoneme.
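As a sketch of the split into upper- and lower-lip subsets (the keypoint index ranges and the fitting routine are hypothetical and not taken from the embodiment):

```python
# Hypothetical index ranges for upper and lower lip keypoints; the real layout
# depends on the face landmark model used, which the embodiment does not specify.
UPPER_LIP_IDX = range(0, 6)
LOWER_LIP_IDX = range(6, 12)

def split_lips(target_coords):
    """Split one phoneme's normalized target coordinate set into upper and lower lip subsets."""
    return ([target_coords[i] for i in UPPER_LIP_IDX],
            [target_coords[i] for i in LOWER_LIP_IDX])

def make_lip_correction_pair(first_target_coords, second_target_coords, fit_fn):
    """Fit separate upper- and lower-lip correction functions.
    fit_fn(user_pts, avatar_pts) -> callable is a placeholder for the chosen correction model."""
    user_up, user_low = split_lips(first_target_coords)
    avatar_up, avatar_low = split_lips(second_target_coords)
    return fit_fn(user_up, avatar_up), fit_fn(user_low, avatar_low)
```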
In an embodiment of the present application, the processor 501 also has the following functions:
calculating a speech rate deviation of each phoneme according to the first timestamp and the second timestamp corresponding to each phoneme;
and constructing the first speech rate correction library according to the speech rate deviation of each phoneme.
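A sketch of the speech rate library construction; reading the deviation as the signed difference between the two timestamps is one plausible interpretation, not the embodiment's definitive formula:

```python
# Sketch only: the deviation is taken as first_timestamp - second_timestamp per phoneme,
# which is an assumption about how the speech rate deviation is derived.

def build_first_rate_library(first_timestamps, second_timestamps):
    """Build {phoneme: speech rate deviation} from the two per-phoneme timestamp maps."""
    return {p: first_timestamps[p] - second_timestamps[p]
            for p in first_timestamps.keys() & second_timestamps.keys()}
```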
An embodiment of the present application discloses a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, causes the processor to implement some or all of the steps performed by an electronic device in the above embodiment.
The embodiment of the application discloses a computer program product which, when run on a computer, causes the computer to execute some or all of the steps performed by the electronic device in the above embodiments.
The embodiment of the application discloses an application release platform for releasing a computer program product, wherein the computer program product, when run on a computer, causes the computer to execute some or all of the steps performed by the electronic device in the above embodiments.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available media may be magnetic media (e.g., a floppy disk or magnetic tape), optical media (e.g., a DVD), or semiconductor media (e.g., a solid state disk (SSD)), etc.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of units is merely a logical functional division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, devices, or units, and may be in electrical, mechanical, or other form.
The units described as separate parts may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them; although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and replacements do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A method for correcting pronunciation of a virtual digital person, comprising:
Acquiring initial pronunciation information of a content to be broadcasted, wherein the initial pronunciation information comprises initial speech rate information and/or initial lip shape information;
Correcting the initial pronunciation information according to a first pronunciation correction library to obtain target pronunciation information;
and controlling the virtual digital person to broadcast the content to be broadcasted according to the target pronunciation information.
2. The method of claim 1, wherein the initial speech rate information comprises an initial speech rate of each phoneme of the content to be broadcasted, and the initial lip shape information comprises initial lip coordinates of each phoneme of the content to be broadcasted;
the correcting the initial pronunciation information according to the first pronunciation correction library to obtain target pronunciation information comprises:
determining a speech rate deviation of each phoneme of the content to be broadcasted from the first pronunciation correction library, and determining a target speech rate of each phoneme of the content to be broadcasted according to the speech rate deviation and the initial speech rate of each phoneme of the content to be broadcasted;
and/or,
determining a lip correction function of each phoneme of the content to be broadcasted from the first pronunciation correction library, and determining target lip coordinates of each phoneme of the content to be broadcasted according to the lip correction function and the initial lip coordinates of each phoneme of the content to be broadcasted.
3. The method of claim 1, wherein correcting the initial pronunciation information according to the first pronunciation correction library to obtain target pronunciation information comprises:
When the content to be broadcasted comprises target content, correcting initial pronunciation information of content other than the target content in the content to be broadcasted according to the first pronunciation correction library to obtain target pronunciation information of the other content; wherein the first pronunciation correction library is constructed according to first audio and video information and second audio and video information obtained when a target user and the virtual digital person respectively broadcast specified content;
correcting the initial pronunciation information of the target content according to a second pronunciation correction library to obtain target pronunciation information of the target content; the second pronunciation correction library is constructed according to pronunciation information customized by the target user for the target content.
4. The method of claim 1, wherein the first pronunciation correction library comprises a first lip correction library and a first speech rate correction library, the first pronunciation correction library being constructed according to first audio and video information and second audio and video information obtained when a target user and the virtual digital person respectively broadcast specified content, the first audio and video information comprising a first audio and a first video, and the second audio and video information comprising a second audio and a second video,
Wherein before the initial pronunciation information is corrected according to the first pronunciation correction library to obtain the target pronunciation information, the method further comprises:
Performing phoneme segmentation on the first audio and the second audio respectively to obtain a first timestamp and a second timestamp corresponding to each phoneme; wherein the first timestamp is the point in time at which each phoneme appears in the first audio, and the second timestamp is the point in time at which each phoneme appears in the second audio;
According to the first timestamp and the second timestamp corresponding to each phoneme, determining a first key frame and a second key frame corresponding to each phoneme from the first video and the second video respectively;
Constructing the first lip correction library according to the first key frame and the second key frame corresponding to each phoneme;
and constructing the first speech rate correction library according to the first timestamp and the second timestamp corresponding to each phoneme.
5. The method of claim 4, wherein the constructing the first lip correction library according to the first key frame and the second key frame corresponding to each phoneme comprises:
determining a first coordinate set and a second coordinate set corresponding to each phoneme from the first key frame and the second key frame corresponding to each phoneme, wherein the first coordinate set and the second coordinate set are associated with a plurality of key points, the first coordinate set comprises a first coordinate of each key point, and the second coordinate set comprises a second coordinate of each key point;
performing a normalization operation on the first coordinate set and the second coordinate set corresponding to each phoneme respectively to obtain a first target coordinate set and a second target coordinate set corresponding to each phoneme;
determining a lip correction function corresponding to each phoneme according to the first target coordinate set and the second target coordinate set corresponding to each phoneme;
And obtaining the first lip correction library according to the lip correction function corresponding to each phoneme.
6. The method of claim 5, wherein the determining the lip correction function corresponding to each phoneme according to the first target coordinate set and the second target coordinate set corresponding to each phoneme comprises:
acquiring a first upper lip coordinate set and a first lower lip coordinate set corresponding to each phoneme from the first target coordinate set corresponding to each phoneme;
acquiring a second upper lip coordinate set and a second lower lip coordinate set corresponding to each phoneme from the second target coordinate set corresponding to each phoneme;
determining an upper lip correction function corresponding to each phoneme by using a first upper lip coordinate set and a second upper lip coordinate set corresponding to each phoneme;
And determining a lower lip correction function corresponding to each phoneme by using the first lower lip coordinate set and the second lower lip coordinate set corresponding to each phoneme.
7. The method of any one of claims 4-6, wherein the constructing the first speech rate correction library according to the first timestamp and the second timestamp corresponding to each phoneme comprises:
calculating a speech rate deviation of each phoneme according to the first timestamp and the second timestamp corresponding to each phoneme;
And constructing the first speech rate correction library according to the speech rate deviation of each phoneme.
8. A pronunciation correction device for a virtual digital person, comprising:
The device comprises an acquisition unit, a correction unit and a broadcasting unit, wherein the acquisition unit is used for acquiring initial pronunciation information of a content to be broadcasted, and the initial pronunciation information comprises initial speech rate information and/or initial lip shape information;
the correction unit is used for correcting the initial pronunciation information according to a first pronunciation correction library to obtain target pronunciation information;
and the broadcasting unit is used for controlling the virtual digital person to broadcast the content to be broadcasted according to the target pronunciation information.
9. An electronic device, comprising:
a memory storing executable program code;
and a processor coupled to the memory;
wherein the processor invokes the executable program code stored in the memory, and the executable program code, when executed by the processor, causes the processor to implement the method of any one of claims 1-7.
10. A computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, causes the processor to implement the method of any one of claims 1-7.
CN202211576222.9A 2022-12-07 2022-12-07 Pronunciation correction method, electronic device, apparatus and storage medium for virtual digital person Pending CN118155603A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211576222.9A CN118155603A (en) 2022-12-07 2022-12-07 Pronunciation correction method, electronic device, apparatus and storage medium for virtual digital person

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211576222.9A CN118155603A (en) 2022-12-07 2022-12-07 Pronunciation correction method, electronic device, apparatus and storage medium for virtual digital person

Publications (1)

Publication Number Publication Date
CN118155603A true CN118155603A (en) 2024-06-07

Family

ID=91285783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211576222.9A Pending CN118155603A (en) 2022-12-07 2022-12-07 Pronunciation correction method, electronic device, apparatus and storage medium for virtual digital person

Country Status (1)

Country Link
CN (1) CN118155603A (en)

Similar Documents

Publication Publication Date Title
CN109462776B (en) Video special effect adding method and device, terminal equipment and storage medium
US20210029305A1 (en) Method and apparatus for adding a video special effect, terminal device and storage medium
CN110602554B (en) Cover image determining method, device and equipment
CN109525891B (en) Multi-user video special effect adding method and device, terminal equipment and storage medium
CN109474850B (en) Motion pixel video special effect adding method and device, terminal equipment and storage medium
CN109600559B (en) Video special effect adding method and device, terminal equipment and storage medium
CN107437272B (en) Interactive entertainment method and device based on augmented reality and terminal equipment
US20220156798A1 (en) Method, device, storage medium and terminal for pushing advertisement
CN108900788B (en) Video generation method, video generation device, electronic device, and storage medium
CN109819316B (en) Method and device for processing face sticker in video, storage medium and electronic equipment
CN111738769B (en) Video processing method and device
US20220319549A1 (en) Video automatic editing method and system based on machine learning
CN114073854A (en) Game method and system based on multimedia file
CN112309365A (en) Training method and device of speech synthesis model, storage medium and electronic equipment
CN109120990B (en) Live broadcast method, device and storage medium
CN111968678B (en) Audio data processing method, device, equipment and readable storage medium
CN108874363A (en) Object control method, apparatus, equipment and storage medium for AR scene
CN110377220A (en) A kind of instruction response method, device, storage medium and electronic equipment
CN111104964B (en) Method, equipment and computer storage medium for matching music with action
CN109993450B (en) Movie scoring method, device, equipment and storage medium
CN111259245A (en) Work pushing method and device and storage medium
CN109190116B (en) Semantic analysis method, system, electronic device and storage medium
CN111598924B (en) Target tracking method and device, computer equipment and storage medium
CN110189364B (en) Method and device for generating information, and target tracking method and device
CN117197405A (en) Augmented reality method, system and storage medium for three-dimensional object

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination