CN118230754A - Data enhancement method, device, electronic equipment and storage medium - Google Patents

Data enhancement method, device, electronic equipment and storage medium

Info

Publication number
CN118230754A
CN118230754A (application CN202410146024.1A)
Authority
CN
China
Prior art keywords
audio
frame
training data
processed
mapping
Prior art date
Legal status
Pending
Application number
CN202410146024.1A
Other languages
Chinese (zh)
Inventor
杜宗财
赵亚飞
范锡睿
陈毅
王志强
秦勤
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202410146024.1A
Publication of CN118230754A
Legal status: Pending

Landscapes

  • Stereophonic System (AREA)

Abstract

The present disclosure provides a data enhancement method, apparatus, electronic device, and storage medium, and relates to fields of artificial intelligence such as deep learning, virtual digital humans, and natural language processing. The method may include: for any training data in a training data set, taking the audio therein as source audio and performing audio transformation on the source audio to obtain target audio, wherein the training data consists of the source audio and corresponding face parameters, and the training data set is used for training a digital-human driving model; and obtaining a mapping relationship between the target audio and the source audio, determining the face parameters corresponding to the target audio according to the mapping relationship and the face parameters corresponding to the source audio, and taking training data consisting of the target audio and its corresponding face parameters as newly generated training data. By applying the scheme of the present disclosure, the amount of training data can be increased.

Description

Data enhancement method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to artificial intelligence technology, and in particular to a data enhancement method, apparatus, electronic device, and storage medium in the fields of deep learning, virtual digital humans, natural language processing, and the like.
Background
With the development of technology, audio-driven digital human (typically three-dimensional digital human) face tasks are becoming increasingly popular in different scenarios. Accordingly, a digital-human driving model is trained with collected training data to realize the task of driving a digital human's face from audio, i.e., making the facial expression, mouth shape, and the like of the digital human match the audio. The training data may also be referred to as paired training data, and each piece of training data may include audio and corresponding facial parameters.
Disclosure of Invention
The disclosure provides a data enhancement method, a data enhancement device, electronic equipment and a storage medium.
A method of data enhancement, comprising:
For any training data in a training data set, taking the audio therein as source audio, and performing audio transformation on the source audio to obtain target audio, wherein the training data consists of the source audio and corresponding face parameters, and the training data set is used for training a digital-human driving model;
And obtaining a mapping relationship between the target audio and the source audio, determining the face parameters corresponding to the target audio according to the mapping relationship and the face parameters corresponding to the source audio, and taking training data consisting of the target audio and its corresponding face parameters as newly generated training data.
A data enhancement device, comprising: an audio conversion module and a data generation module;
The audio conversion module is configured to, for any training data in the training data set, take the audio therein as source audio and perform audio transformation on the source audio to obtain target audio, wherein the training data consists of the source audio and corresponding face parameters, and the training data set is used for training a digital-human driving model;
The data generation module is configured to obtain a mapping relationship between the target audio and the source audio, determine a face parameter corresponding to the target audio according to the mapping relationship and the face parameter corresponding to the source audio, and use training data composed of the target audio and the face parameter corresponding to the target audio as newly generated training data.
An electronic device, comprising:
At least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a method as described above.
A computer program product comprising computer programs/instructions which when executed by a processor implement a method as described above.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of an embodiment of a data enhancement method according to the present disclosure;
FIG. 2 is a schematic diagram of an implementation process of the data enhancement method of the present disclosure;
FIG. 3 is a schematic diagram of the structure of a first embodiment 300 of the data enhancement device according to the present disclosure;
FIG. 4 is a schematic diagram of the structure of a second embodiment 400 of the data enhancement device according to the present disclosure;
Fig. 5 shows a schematic block diagram of an electronic device 500 that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In addition, it should be understood that the term "and/or" herein merely describes an association relationship between objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. The character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
Fig. 1 is a flowchart of an embodiment of a data enhancement method according to the present disclosure. As shown in fig. 1, the following detailed implementation is included.
In step 101, for any training data in a training data set, the audio therein is used as source audio (Audio S), and audio transformation is performed on the source audio to obtain target audio (Audio T); the training data is composed of the source audio and corresponding face parameters, and the training data set is used for training a digital-human driving model.
In step 102, a mapping relationship between the target audio and the source audio is obtained, face parameters corresponding to the target audio are determined according to the mapping relationship and the face parameters corresponding to the source audio, and training data composed of the target audio and its corresponding face parameters is used as newly generated training data.
Because training data in such a training data set is often difficult to acquire, the amount of data is usually small, e.g., less than 100 hours in total, which affects the effectiveness of subsequent model training.
By adopting the scheme of this method embodiment, new training data can be generated from the existing training data through operations such as audio transformation, mapping-relationship determination, and facial-parameter determination, thereby expanding the existing training data, i.e., greatly increasing the amount of training data and, in turn, improving the model training effect.
For any training data, the training data includes audio and corresponding facial parameters. The facial parameters may be blend-shape (BlendShapes) parameters, such as the 51 parameters of the ARKit augmented-reality development platform standard, or may be a vertex stream of dimension V×3, where V means that the three-dimensional face structure is approximated with V three-dimensional coordinate points (the specific value may be determined according to actual needs) and 3 corresponds to the three coordinate axes x, y, and z. Since BlendShapes parameters are more compact and can be converted to and from vertex streams, BlendShapes parameters are typically used as the facial parameters.
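For illustration only, one way to hold such a training sample in memory is sketched below; the field names, the 16 kHz sampling rate, and the choice of 51 ARKit-style BlendShapes coefficients per frame are assumptions of the sketch, not requirements of the disclosure.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrainingSample:
    """One piece of training data: audio plus per-frame facial parameters."""
    audio: np.ndarray          # mono waveform samples, e.g. 16 kHz float32
    sample_rate: int           # audio sampling rate in Hz
    face_params: np.ndarray    # shape (num_frames, 51) BlendShapes coefficients,
                               # or (num_frames, V, 3) if a vertex stream is used

# A hypothetical 2-second clip at 16 kHz with 60 facial-parameter frames (30 fps).
sample = TrainingSample(
    audio=np.zeros(32000, dtype=np.float32),
    sample_rate=16000,
    face_params=np.zeros((60, 51), dtype=np.float32),
)
```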
For the audio in any piece of training data, the audio can be used as source audio, and the source audio can be subjected to audio transformation to obtain target audio.
Preferably, the source audio may first be subjected to speech recognition (ASR, Automatic Speech Recognition) to obtain a recognized text result, i.e., the text corresponding to the audio, and the text result may then be subjected to text-to-speech (TTS) conversion to obtain the desired target audio, where at least one of the timbre, volume, speech speed, and intonation of the target audio differs from the source audio.
The timbre, volume, speech speed, intonation, and the like adopted for the target audio can be set according to actual needs.
Through this processing, training data with various timbres, volumes, speech speeds, and intonations can be generated, such as training data with unusual timbres or with faster speech, so as to cover the requirements of real application scenarios as much as possible and improve the diversity of the training data.
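As a non-authoritative sketch of this ASR-then-TTS transformation: the disclosure does not name particular tools, so the example below assumes the open-source whisper ASR model and the pyttsx3 TTS engine, and the rate/volume values are illustrative.

```python
import whisper
import pyttsx3

def transform_audio(source_path: str, target_path: str,
                    rate: int = 220, volume: float = 0.8) -> str:
    """ASR the source audio to text, then re-synthesize it as target audio."""
    # 1) Speech recognition: source audio -> recognized text result.
    asr_model = whisper.load_model("base")
    text = asr_model.transcribe(source_path)["text"]

    # 2) Text-to-speech with an altered speaking rate / volume (and, depending on
    #    the installed voices, a different timbre), yielding the target audio.
    engine = pyttsx3.init()
    engine.setProperty("rate", rate)      # words per minute
    engine.setProperty("volume", volume)  # 0.0 - 1.0
    engine.save_to_file(text, target_path)
    engine.runAndWait()
    return text
```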
Then, the mapping relation between the target audio and the source audio can be determined. Preferably, the audio Feature of the target audio may be obtained as a first audio Feature (Feature T), and the audio Feature of the source audio may be obtained as a second audio Feature (Feature S), and then the mapping relationship between the target audio and the source audio may be determined by a dynamic programming manner according to the first audio Feature and the second audio Feature.
The target audio and the source audio can each be used as input to a pre-trained large audio model, thereby obtaining the output first audio feature and second audio feature, respectively. The pre-trained large audio model may be a Massively Multilingual Speech (MMS) model, a speech pre-training model such as wav2vec 2.0, or the like. On this basis, the mapping relationship between the target audio and the source audio can be further determined by dynamic programming according to the first audio feature and the second audio feature.
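A minimal sketch of extracting the first and second audio features with a pre-trained audio model; it uses torchaudio's wav2vec 2.0 bundle as one possible stand-in for the large audio model mentioned above, and the file names are placeholders.

```python
import torch
import torchaudio

_bundle = torchaudio.pipelines.WAV2VEC2_BASE
_model = _bundle.get_model().eval()

def extract_features(path: str) -> torch.Tensor:
    """Return frame-level features of shape (num_frames, feature_dim)."""
    waveform, sr = torchaudio.load(path)
    waveform = waveform.mean(dim=0, keepdim=True)  # mix down to mono
    if sr != _bundle.sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, _bundle.sample_rate)
    with torch.inference_mode():
        features, _ = _model.extract_features(waveform)
    return features[-1].squeeze(0)  # last-layer features

feature_t = extract_features("target.wav")  # first audio feature (Feature T)
feature_s = extract_features("source.wav")  # second audio feature (Feature S)
```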
Preferably, a two-dimensional matrix M of size T×S can be constructed, wherein T represents the number of frames included in the first audio feature and S represents the number of frames included in the second audio feature. The value of each element M(i, j) in the two-dimensional matrix can be calculated in turn, row by row from top to bottom and, within each row, from left to right, in the following manner: obtain the distance between the i-th frame in the first audio feature and the j-th frame in the second audio feature, determine the minimum of M(i-1, j-1), M(i, j-1) and M(i-1, j), and take the sum of the distance and the minimum as the value of the element M(i, j), where 1 ≤ i ≤ T and 1 ≤ j ≤ S. Accordingly, the shortest path from M(1, 1) to M(T, S) can be determined according to the calculation results, and the mapping relationship between the target audio and the source audio can be determined according to the elements on this shortest path.
Wherein the first frame of the first audio feature corresponds to the first frame of the second audio feature, and the T-th frame of the first audio feature corresponds to the S-th frame of the second audio feature.
The distance Dist(i, j) between the i-th frame in the first audio feature and the j-th frame in the second audio feature may be taken as the cosine distance:
Dist(i, j) = 1 - (F_T(i) ∘ F_S(j)) / (|F_T(i)| · |F_S(j)|); (1)
wherein F_T(i) represents the i-th frame in the first audio feature, F_S(j) represents the j-th frame in the second audio feature, ∘ represents the vector dot product, and |·| represents the vector modulo length.
Accordingly, the value of element M (i, j) may be:
M(i,j)=Dist(i,j)+min(M(i-1,j-1),M(i,j-1),M(i-1,j)); (2)
wherein min represents a minimum value.
In practice, for any M(i, j), all three of M(i-1, j-1), M(i, j-1) and M(i-1, j) may exist, or only some of them may exist; if only some exist, the minimum may be taken over just the existing ones. In particular, for M(1, 1), Dist(1, 1) or a preset initial value may be used directly as its value.
For any M(i, j), the element position corresponding to the minimum of M(i-1, j-1), M(i, j-1) and M(i-1, j) gives the previous position on the shortest path to position (i, j). For example, if i and j are 5 and 8, respectively, and the minimum of the three is M(i-1, j-1), then the previous position on the shortest path to (5, 8) is (4, 7).
Accordingly, by calculating the values of the elements in the two-dimensional matrix one by one, the shortest path from M (1, 1) to M (T, S) can be obtained.
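The following NumPy sketch implements the matrix construction and shortest-path backtracking described above; indices are 0-based internally while the returned pairs are 1-based as in the text, the distance is the cosine distance of formula (1), and the small epsilon in the denominator is an added numerical safeguard rather than part of the disclosure.

```python
import numpy as np

def cosine_dist(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def dtw_path(feat_t: np.ndarray, feat_s: np.ndarray):
    """feat_t: (T, D) first audio feature; feat_s: (S, D) second audio feature.
    Returns the 1-based (i, j) index pairs on the shortest path from (1, 1) to (T, S)."""
    T, S = len(feat_t), len(feat_s)
    M = np.full((T, S), np.inf)
    M[0, 0] = cosine_dist(feat_t[0], feat_s[0])
    for i in range(T):
        for j in range(S):
            if i == 0 and j == 0:
                continue
            prev = []  # existing predecessors among (i-1, j-1), (i, j-1), (i-1, j)
            if i > 0 and j > 0:
                prev.append(M[i - 1, j - 1])
            if j > 0:
                prev.append(M[i, j - 1])
            if i > 0:
                prev.append(M[i - 1, j])
            M[i, j] = cosine_dist(feat_t[i], feat_s[j]) + min(prev)
    # Backtrack from (T, S) to (1, 1), always stepping to the minimal predecessor.
    path = [(T, S)]
    i, j = T - 1, S - 1
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((M[i - 1, j - 1], (i - 1, j - 1)))
        if j > 0:
            candidates.append((M[i, j - 1], (i, j - 1)))
        if i > 0:
            candidates.append((M[i - 1, j], (i - 1, j)))
        _, (i, j) = min(candidates, key=lambda c: c[0])
        path.append((i + 1, j + 1))
    return list(reversed(path))
```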
It can be seen that the required shortest path can be determined efficiently and accurately through the processing, so that a good foundation is laid for subsequent processing.
Preferably, when determining the mapping relationship between the target audio and the source audio according to the elements on the shortest path, for each element on the shortest path, the frame in the first audio feature and the frame in the second audio feature corresponding to that element may be taken as frames having a correspondence. Then, for each frame in the first audio feature, the following processing may be performed: the frame is taken as a frame to be processed; in response to determining that the frame to be processed has a correspondence with only one frame in the second audio feature, and that this frame in the second audio feature has a correspondence only with the frame to be processed, that frame in the second audio feature is taken as the mapping frame of the frame to be processed; in response to determining that at least two frames in the second audio feature have a correspondence with the frame to be processed, the mapping frame of the frame to be processed in the second audio feature can be determined in a first preset manner; in response to determining that at least two frames in the first audio feature, including the frame to be processed, have a correspondence with the same frame in the second audio feature, the mapping frame of the frame to be processed in the second audio feature can be determined in a second preset manner. Accordingly, each frame in the first audio feature together with its mapping frame can be taken as the determined mapping relationship.
For example, the element (5, 8) on the shortest path indicates that the 5 th frame in the first audio feature and the 8 th frame in the second audio feature are frames having a correspondence.
In addition, for convenience of description, any frame in the first audio feature may be referred to as a frame to be processed, and if it is determined that the frame to be processed only has a correspondence with one frame in the second audio feature, and it is determined that the frame in the second audio feature that has a correspondence with the frame to be processed only has a correspondence with the frame to be processed, then the frame in the second audio feature that has a correspondence with the frame to be processed may be directly used as a mapping frame of the frame to be processed. For example, assuming that the frame to be processed is the 5 th frame in the first audio feature, only the corresponding relation exists between the frame to be processed and the 7 th frame in the second audio feature, and only the corresponding relation exists between the 7 th frame in the second audio feature and the 5 th frame in the first audio feature, the 7 th frame in the second audio feature can be directly used as the mapping frame of the 5 th frame in the first audio feature.
If it is determined that the correspondence between the frame to be processed and at least two frames in the second audio feature exists (i.e. there is a one-to-many case), or if it is determined that the correspondence between at least two frames including the frame to be processed in the first audio feature and the same frame in the second audio feature exists (i.e. there is a many-to-one case), the mapping frame of the frame to be processed in the second audio feature may be determined by respective corresponding predetermined manners.
Preferably, in response to determining that a correspondence exists between a frame to be processed and at least two frames in the second audio feature, a mean value of frame numbers of frames in the second audio feature, which have a correspondence with the frame to be processed, may be obtained, the frame numbers of any frame in the second audio feature are integers between 1 and S, respectively, and the frame in the second audio feature corresponding to the mean value may be used as a mapping frame of the frame to be processed.
For example, assuming that the frame to be processed is the 5 th frame in the first audio feature, and there is a correspondence relationship with the 7 th frame, 8 th frame, 9 th frame and 10 th frame in the second audio feature, the required average value can be calculated: (7+8+9+10)/4=8.5, and accordingly, the 8.5 th frame in the second audio feature may be taken as a mapped frame of the 5 th frame in the first audio feature.
It can be seen that the mapping frame determined in this way may be a frame that actually exists or a virtual frame that does not actually exist, such as the 8.5th frame described above.
In addition, preferably, in response to determining that at least two frames in the first audio feature, including the frame to be processed, have a correspondence with the same frame in the second audio feature, the number of those frames, the position of the frame to be processed among them, the frame number of the frame in the second audio feature that they correspond to, and the number of frames in the second audio feature may be combined to determine, by interpolation, the mapping frame of the frame to be processed in the second audio feature, where the frame number of any frame in the second audio feature is an integer between 1 and S.
For example, assuming that the i-th, (i+1)-th, ..., (i+k)-th frames in the first audio feature all have a correspondence with the j-th frame in the second audio feature, the frame number of the mapping frame corresponding to the (i+r)-th frame (r = 0, 1, ..., k) may be computed according to formula (3), wherein min represents taking a minimum value and max represents taking a maximum value.
Specifically, assuming that the 5th, 6th, and 7th frames in the first audio feature all have a correspondence with the 9th frame in the second audio feature, then when calculating according to formula (3), the value of j is 9 and the value of k is 2, and the value of r is 0 for the 5th frame, 1 for the 6th frame, and 2 for the 7th frame.
Through the above processing, the mapping frame in the second audio feature can be accurately determined for every frame in the first audio feature, whether the correspondence between frames of the two features is one-to-many, many-to-one, or neither; and handling the one-to-many and many-to-one cases in the above manner improves the temporal smoothness of the face parameters subsequently generated for the target audio.
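A sketch of turning the shortest path into per-frame mapping frames. The one-to-one case and the one-to-many case (mean of the matched frame numbers) follow the description above; because formula (3) is not reproduced in this text, the many-to-one branch below simply spreads the k+1 frames evenly over the interval [j-0.5, j+0.5] as an assumed stand-in.

```python
from collections import defaultdict

def mapping_frames(path):
    """path: 1-based (i, j) pairs on the shortest path.
    Returns {frame of Feature T: mapped frame number in Feature S (may be non-integer)}."""
    t_to_s = defaultdict(list)   # frames of Feature S matched to each frame of Feature T
    s_to_t = defaultdict(list)   # frames of Feature T matched to each frame of Feature S
    for i, j in path:
        t_to_s[i].append(j)
        s_to_t[j].append(i)

    mapping = {}
    for i, js in t_to_s.items():
        if len(js) > 1:
            # One-to-many: use the mean of the matched frame numbers (e.g. 8.5).
            mapping[i] = sum(js) / len(js)
        else:
            j = js[0]
            group = s_to_t[j]
            if len(group) == 1:
                # One-to-one: map directly.
                mapping[i] = float(j)
            else:
                # Many-to-one: assumed stand-in for formula (3), spreading the
                # k+1 frames evenly over [j - 0.5, j + 0.5].
                k = len(group) - 1
                r = sorted(group).index(i)
                mapping[i] = j - 0.5 + (r + 0.5) / (k + 1)
    return mapping
```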
Accordingly, after the mapping relation between the target audio and the source audio is obtained, the face parameter corresponding to the target audio can be determined according to the mapping relation and the face parameter corresponding to the source audio.
Preferably, the following processing may be performed for each frame in the first audio feature: the frame is taken as a frame to be indexed and its mapping frame is determined; in response to determining that the frame number of the mapping frame is an integer, the face parameters corresponding to the mapping frame are taken as the face parameters corresponding to the frame to be indexed; in response to determining that the frame number of the mapping frame is a non-integer, the frame number is rounded up and rounded down respectively, the face parameters corresponding to the frame given by the rounded-up result and those corresponding to the frame given by the rounded-down result are obtained, linear interpolation is performed between the two, and the interpolation result is taken as the face parameters corresponding to the frame to be indexed. The face parameters corresponding to the frames in the first audio feature are then used to form the face parameters corresponding to the target audio.
After the previous processing, each frame in the first audio feature corresponds to a unique mapping frame in the second audio feature, and the frame number of the mapping frame may be an integer or a non-integer, such as 8.5.
For example, assuming that the frame to be indexed is the 5th frame in the first audio feature and the mapped frame is the 8 th frame in the second audio feature, the face parameter corresponding to the 8 th frame in the second audio feature may be directly used as the face parameter corresponding to the 5th frame in the first audio feature.
For another example, assuming that the frame to be indexed is the 6th frame in the first audio feature and the mapping frame is the 8.5th frame in the second audio feature, the face parameters corresponding to the 8th frame and to the 9th frame in the second audio feature may be obtained respectively, linear interpolation may be performed between them (i.e., linear interpolation of the face parameters of adjacent frames), and the interpolation result may be used as the face parameters corresponding to the 6th frame in the first audio feature.
After the face parameters corresponding to the frames in the first audio feature are respectively obtained in the above manner, the face parameters corresponding to the frames in the first audio feature can be utilized to form the face parameters corresponding to the target audio, and further, training data formed by the target audio and the face parameters corresponding to the target audio can be used as newly generated training data, so that the expansion of the training data is realized, and the data quantity, the diversity and the like of the training data are improved.
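A sketch of indexing the source face parameters through the mapping frames: integer frame numbers copy the corresponding parameters, non-integer ones linearly interpolate between the rounded-down and rounded-up frames; the 1-based indexing convention and the clipping to the valid frame range are assumptions of the sketch.

```python
import math
import numpy as np

def transfer_face_params(mapping: dict, face_params_s: np.ndarray) -> np.ndarray:
    """mapping: {frame of Feature T (1-based): mapped frame number in Feature S};
    face_params_s: (S, P) face parameters of the source audio, row 0 = frame 1.
    Returns (T, P) face parameters for the target audio."""
    T = max(mapping)
    out = np.zeros((T, face_params_s.shape[1]), dtype=face_params_s.dtype)
    for i in range(1, T + 1):
        m = mapping[i]
        m = min(max(m, 1), len(face_params_s))  # clip to the valid frame range
        lo, hi = math.floor(m), math.ceil(m)
        if lo == hi:
            # Integer frame number: copy the corresponding face parameters.
            out[i - 1] = face_params_s[lo - 1]
        else:
            # Non-integer frame number (e.g. 8.5): linearly interpolate between
            # the rounded-down and rounded-up frames.
            w = m - lo
            out[i - 1] = (1 - w) * face_params_s[lo - 1] + w * face_params_s[hi - 1]
    return out
```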
In connection with the foregoing description, fig. 2 is a schematic diagram of an implementation process of the data enhancement method described in the present disclosure.
As shown in fig. 2, based on the source audio, the target audio may be obtained through audio conversion, wherein the source audio may be first subjected to speech recognition to obtain a recognized text result, and then the text result may be subjected to text-to-speech conversion, thereby obtaining the target audio, and at least one of tone, volume, speed and intonation of the target audio is different from the source audio.
As shown in fig. 2, after the target audio is obtained, the mapping relationship between the source audio and the target audio can be determined by a dynamic programming manner, that is, the audio feature of the target audio can be obtained and used as the first audio feature, the audio feature of the source audio can be obtained and used as the second audio feature, and further, the mapping relationship between the target audio and the source audio can be determined by a dynamic programming manner according to the first audio feature and the second audio feature.
As shown in fig. 2, the face parameters corresponding to the target audio may then be determined (i.e., generated) according to the mapping relationship and the face parameters corresponding to the source audio; that is, for each frame in the first audio feature: the frame is taken as a frame to be indexed and its mapping frame is determined; if the frame number of the mapping frame is an integer, the face parameters corresponding to the mapping frame are taken as the face parameters corresponding to the frame to be indexed; if it is a non-integer, the frame number is rounded up and rounded down respectively, the face parameters corresponding to the two resulting frames are obtained, linear interpolation is performed between them, and the interpolation result is taken as the face parameters corresponding to the frame to be indexed. The face parameters corresponding to the frames in the first audio feature then form the face parameters corresponding to the target audio. Accordingly, training data composed of the target audio and its corresponding face parameters can be used as newly generated training data.
In addition, preferably, noise may be added to the audio in any training data in the training data set. Such training data may be either training data newly generated in the manner described in the present disclosure or training data that was originally in the training data set.
Preferably, the means for adding noise to the training data may include: the random music is fused into the audio in the training data, and/or the audio in the other training data is fused into the audio in the training data after the volume is reduced, and/or the audio in the training data is firstly downsampled and then upsampled, and/or the noise in the preset noise data set is fused into the audio in the training data.
Fusing random music into the audio in the training data means fusing a randomly acquired piece of music into the audio, thereby obtaining music noise. Fusing audio from other training data at reduced volume means fusing, into the audio, the audio of any other piece of training data after its volume has been reduced by a predetermined amount, thereby obtaining multi-speaker noise. Downsampling and then upsampling the audio in the training data means first downsampling the audio and then upsampling the downsampled result, thereby obtaining digital noise. In addition, the predetermined noise data set may be an existing noise data set, for example the OpenSLR (Open Speech and Language Resources) noise data set; with such a noise data set, noises such as white noise, machine noise, and spurious radio-frequency interference can be fused in.
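An illustrative sketch of the four noise-adding options, operating on mono waveforms at a common sampling rate; the mixing gains, the 4 kHz intermediate rate of the downsample-then-upsample step, and the assumption that music/speech/noise clips have been loaded elsewhere (e.g. from an OpenSLR-style noise set) are choices of the sketch, not of the disclosure.

```python
import random
import numpy as np
import torch
import torchaudio

def _mix(audio: np.ndarray, noise: np.ndarray, gain: float) -> np.ndarray:
    noise = np.resize(noise, audio.shape)  # loop/trim the noise to the audio length
    return audio + gain * noise

def add_music_noise(audio, music_clips, gain=0.3):
    """Fuse a randomly chosen music clip into the audio (music noise)."""
    return _mix(audio, random.choice(music_clips), gain)

def add_babble_noise(audio, other_audios, gain=0.2):
    """Fuse audio from other training data at reduced volume (multi-speaker noise)."""
    return _mix(audio, random.choice(other_audios), gain)

def add_resample_noise(audio, sr=16000, low_sr=4000):
    """Downsample then upsample the audio to introduce digital noise."""
    x = torch.from_numpy(audio).float().unsqueeze(0)
    x = torchaudio.functional.resample(x, sr, low_sr)
    x = torchaudio.functional.resample(x, low_sr, sr)
    return np.resize(x.squeeze(0).numpy(), audio.shape)

def add_dataset_noise(audio, noise_clips, gain=0.3):
    """Fuse a clip from a noise data set (e.g. OpenSLR) into the audio."""
    return _mix(audio, random.choice(noise_clips), gain)
```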
In a real application scene, various background noise may be included in audio driving the face of a digital person, such as voice uttered by other persons or background music, and the existing training data is insufficient to cover the requirements of the real application scene.
With this processing, noise can be added to the audio in the training data; for example, each kind of noise described above can be added to the audio of each piece of training data with a certain probability, so that some pieces of training data may receive several kinds of noise while others receive none. This further improves the robustness of subsequent model training.
In addition, noise training data may preferably be constructed, where each piece of noise training data may include noise and corresponding facial parameters, with the values of the facial parameters corresponding to the noise all being 0; the constructed noise training data can then be added to the training data set.
Independent noise training data can thus be constructed in addition to the original training data in the training data set and added to the training data set, so that when a model trained with this training data set later receives pure noise as input, the face is not disturbed. This further improves the diversity of the training data and, in turn, the robustness of model training.
The noise in the noise training data may be the aforementioned music noise, white noise, machine noise, spurious radio-frequency interference, and the like, and the values of the corresponding face parameters are all 0.
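A small sketch of constructing a noise-only training sample with all-zero facial parameters, reusing the TrainingSample structure assumed earlier; the 30 fps facial-parameter rate and 51 coefficients are illustrative.

```python
import numpy as np

def make_noise_sample(noise: np.ndarray, sample_rate: int = 16000,
                      fps: int = 30, num_params: int = 51) -> "TrainingSample":
    """Build a noise-only training sample whose facial parameters are all zero,
    so that pure noise input should leave the digital human's face undisturbed."""
    num_frames = int(len(noise) / sample_rate * fps)
    zero_face = np.zeros((num_frames, num_params), dtype=np.float32)
    return TrainingSample(audio=noise.astype(np.float32),   # TrainingSample as sketched earlier
                          sample_rate=sample_rate,
                          face_params=zero_face)

# training_set.append(make_noise_sample(white_noise_clip))  # add to the training data set
```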
It should be noted that, for the sake of simplicity of description, the foregoing method embodiments are expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the present disclosure is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present disclosure. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all of the preferred embodiments, and that the acts and modules referred to are not necessarily required by the present disclosure.
The foregoing is a description of embodiments of the method, and the following further describes embodiments of the present disclosure through examples of apparatus.
Fig. 3 is a schematic structural diagram of a first embodiment 300 of the data enhancement device according to the present disclosure. As shown in fig. 3, includes: an audio conversion module 301 and a data generation module 302.
The audio conversion module 301 is configured to, for any training data in a training data set, take the audio therein as source audio and perform audio transformation on the source audio to obtain target audio, where the training data is composed of the source audio and corresponding face parameters, and the training data set is used for training a digital-human driving model.
The data generating module 302 is configured to obtain a mapping relationship between the target audio and the source audio, determine the face parameters corresponding to the target audio according to the mapping relationship and the face parameters corresponding to the source audio, and use training data composed of the target audio and its corresponding face parameters as newly generated training data.
By adopting the scheme of this device embodiment, new training data can be generated from the existing training data through operations such as audio transformation, mapping-relationship determination, and facial-parameter determination, thereby expanding the existing training data, i.e., greatly increasing the amount of training data and, in turn, improving the model training effect.
For the audio in any piece of training data, the audio can be used as source audio, and the source audio can be subjected to audio transformation to obtain target audio.
Preferably, the audio conversion module 301 may perform speech recognition on the source audio first to obtain a recognized text result, i.e. a text result corresponding to the audio, and then may perform conversion from text to speech on the text result, so as to obtain the desired target audio, where at least one of the tone, volume, speech speed and intonation of the target audio is different from that of the source audio.
Then, the mapping relation between the target audio and the source audio can be determined. Preferably, the data generating module 302 may obtain the audio feature of the target audio as the first audio feature, and may obtain the audio feature of the source audio as the second audio feature, and then determine the mapping relationship between the target audio and the source audio by a dynamic programming manner according to the first audio feature and the second audio feature.
The target audio and the source audio can be respectively used as the input of the pre-training audio large model, so that the output first audio feature and the output second audio feature are respectively obtained.
Preferably, the data generation module 302 may construct a two-dimensional matrix M of size T×S, wherein T represents the number of frames included in the first audio feature and S represents the number of frames included in the second audio feature, and may calculate the value of each element M(i, j) in the two-dimensional matrix in turn, row by row from top to bottom and, within each row, from left to right, in the following manner: obtain the distance between the i-th frame in the first audio feature and the j-th frame in the second audio feature, determine the minimum of M(i-1, j-1), M(i, j-1) and M(i-1, j), and take the sum of the distance and the minimum as the value of the element M(i, j), where 1 ≤ i ≤ T and 1 ≤ j ≤ S; accordingly, the shortest path from M(1, 1) to M(T, S) can be determined according to the calculation results, and the mapping relationship between the target audio and the source audio can be determined according to the elements on this shortest path.
Preferably, when determining the mapping relationship between the target audio and the source audio according to the element on the shortest path, the data generating module 302 may respectively use, for each element on the shortest path, a frame in the first audio feature and a frame in the second audio feature corresponding to the element as a frame having a correspondence relationship, and may respectively perform, for each frame in the first audio feature, the following processing: taking the frame as a frame to be processed; in response to determining that the frame to be processed has a corresponding relation with only one frame in the second audio feature, and determining that the frame in the second audio feature, which has a corresponding relation with the frame to be processed, has a corresponding relation with only the frame to be processed, the frame in the second audio feature, which has a corresponding relation with the frame to be processed, is used as a mapping frame of the frame to be processed; in response to determining that a correspondence exists between the frame to be processed and at least two frames in the second audio feature, a mapping frame of the frame to be processed in the second audio feature can be determined in a first predetermined manner; in response to determining that at least two frames including the frame to be processed in the first audio feature have a corresponding relationship with the same frame in the second audio feature, determining a mapping frame of the frame to be processed in the second audio feature in a second predetermined manner; accordingly, each frame in the first audio feature and the corresponding mapping frame can be used as the determined mapping relation.
Preferably, in response to determining that there is a correspondence between the frame to be processed and at least two frames in the second audio feature, the data generating module 302 may obtain an average value of frame numbers of frames in the second audio feature, where the frame numbers of any frame in the second audio feature are integers between 1 and S, and may use a frame in the second audio feature corresponding to the average value as a mapping frame of the frame to be processed.
In addition, preferably, in response to determining that at least two frames including a frame to be processed in the first audio feature have a correspondence with the same frame in the second audio feature, the data generating module 302 may determine, by interpolation, a mapping frame of the frame to be processed in the second audio feature, where a frame number of any frame in the second audio feature is an integer between 1 and S, in combination with a number of the at least two frames, a position of the frame to be processed in the at least two frames, a frame number of a frame in the second audio feature having a correspondence with the frame to be processed, and a frame number in the second audio feature.
Further, after the mapping relationship between the target audio and the source audio is obtained, the face parameter corresponding to the target audio can be determined according to the mapping relationship and the face parameter corresponding to the source audio.
Preferably, the data generation module 302 may perform the following processing for each frame in the first audio feature: taking the frame as a frame to be indexed, and determining a mapping frame of the frame to be indexed; responding to the fact that the frame sequence number of the mapping frame is an integer, and taking the face parameter corresponding to the mapping frame as the face parameter corresponding to the frame to be indexed; in response to determining that the frame number of the mapping frame is a non-integer, respectively rounding up and rounding down the frame number of the mapping frame, respectively obtaining face parameters corresponding to the frame corresponding to the rounding up result and face parameters corresponding to the frame corresponding to the rounding down result, performing linear interpolation on the face parameters obtained in two times, and taking the linear interpolation result as the face parameters corresponding to the frame to be indexed; and utilizing the facial parameters corresponding to each frame in the first audio feature to form the facial parameters corresponding to the target audio.
Fig. 4 is a schematic structural diagram of a second embodiment 400 of the data enhancement device according to the present disclosure. As shown in fig. 4, includes: an audio conversion module 301, a data generation module 302 and a noise processing module 303.
The audio conversion module 301 and the data generation module 302 are the same as those in the embodiment shown in fig. 3, and will not be described again.
The noise processing module 303 is configured to add noise to audio in any training data in the training data set, respectively. The training data may refer to training data newly generated in a manner described in the present disclosure, or may refer to training data that is original in the training data set.
Preferably, the noise processing module 303 adds noise to the training data in a manner that includes: the random music is fused into the audio in the training data, and/or the audio in the other training data is fused into the audio in the training data after the volume is reduced, and/or the audio in the training data is firstly downsampled and then upsampled, and/or the noise in the preset noise data set is fused into the audio in the training data.
In addition, the noise processing module 303 may further construct noise training data, where each piece of noise training data may include: noise and corresponding facial parameters, wherein the values of the facial parameters corresponding to the noise are all 0, and the constructed noise training data can be added into the training data set.
The specific workflow of the embodiment of the apparatus shown in fig. 3 and fig. 4 may refer to the related description in the foregoing method embodiment, and will not be repeated.
In short, by adopting the scheme of the present disclosure, the data volume and diversity of the training data can be greatly increased, thereby improving the model training effect; in addition, adding noise to the audio in the training data and constructing noise training data can markedly enhance the robustness of the model. The scheme of the present disclosure can be applied to various virtual-digital-human related products and has wide applicability.
The scheme disclosed by the disclosure can be applied to the field of artificial intelligence, and particularly relates to the fields of deep learning, virtual digital people, natural language processing and the like. Artificial intelligence is the subject of studying certain thinking processes and intelligent behaviors (such as learning, reasoning, thinking, planning, etc.) that make a computer simulate a person, and has technology at both hardware and software levels, and artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, etc., and artificial intelligence software technologies mainly include computer vision technologies, speech recognition technologies, natural language processing technologies, machine learning/deep learning, big data processing technologies, knowledge graph technologies, etc.
The audio and facial parameters, etc. in the embodiments of the present disclosure are not specific to a particular user and do not reflect personal information of a particular user. In addition, in the technical scheme of the disclosure, the related processes of collecting, storing, using, processing, transmitting, providing, disclosing and the like of the personal information of the user all accord with the regulations of related laws and regulations, and the public order is not violated.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 5 shows a schematic block diagram of an electronic device 500 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the apparatus 500 includes a computing unit 501 that can perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, ROM 502, and RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Various components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, etc.; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508 such as a magnetic disk, an optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 501 performs the various methods and processes described above, such as the methods described in this disclosure. For example, in some embodiments, the methods described in the present disclosure may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into RAM 503 and executed by computing unit 501, one or more steps of the methods described in the present disclosure may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the methods described in the present disclosure by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (25)

1. A method of data enhancement, comprising:
For any training data in a training data set, taking the audio therein as source audio, and performing audio transformation on the source audio to obtain target audio, wherein the training data consists of the source audio and corresponding face parameters, and the training data set is used for training a digital-human driving model;
And acquiring a mapping relation between the target audio and the source audio, determining the face parameter corresponding to the target audio according to the mapping relation and the face parameter corresponding to the source audio, and taking training data consisting of the target audio and the face parameter corresponding to the target audio as newly generated training data.
2. The method of claim 1, wherein the audio transforming the source audio to obtain target audio comprises:
Performing voice recognition on the source audio to obtain a recognized text result;
and converting the text result from text to voice to obtain the target audio, wherein at least one of tone, volume, speech speed and intonation of the target audio is different from that of the source audio.
3. The method of claim 1, wherein the obtaining the mapping between the target audio and the source audio comprises:
Acquiring the audio characteristics of the target audio as a first audio characteristic, and acquiring the audio characteristics of the source audio as a second audio characteristic;
And determining the mapping relation between the target audio and the source audio in a dynamic programming mode according to the first audio characteristics and the second audio characteristics.
4. The method of claim 3, wherein the determining, by dynamic programming, a mapping relationship between the target audio and the source audio comprises:
Constructing a two-dimensional matrix M of size T×S, wherein T represents the number of frames included in the first audio feature, and S represents the number of frames included in the second audio feature;
Calculating the value of each element M(i, j) in the two-dimensional matrix in turn, row by row from top to bottom and, within each row, from left to right, in the following manner: acquiring the distance between the i-th frame in the first audio feature and the j-th frame in the second audio feature, determining the minimum value of M(i-1, j-1), M(i, j-1) and M(i-1, j), and taking the sum of the distance and the minimum value as the value of the element M(i, j), wherein 1 ≤ i ≤ T and 1 ≤ j ≤ S;
determining the shortest path from M (1, 1) to M (T, S) according to the calculation result;
and determining the mapping relation between the target audio and the source audio according to the elements on the shortest path.
5. The method of claim 4, wherein determining the mapping relation between the target audio and the source audio according to the elements on the shortest path comprises:
for each element on the shortest path, taking the frame in the first audio feature and the frame in the second audio feature that correspond to the element as a pair of frames having a correspondence;
performing the following processing for each frame in the first audio feature:
taking the frame as a frame to be processed;
in response to determining that the frame to be processed corresponds to only one frame in the second audio feature, and that this frame in the second audio feature corresponds only to the frame to be processed, taking that frame in the second audio feature as the mapped frame of the frame to be processed;
in response to determining that the frame to be processed corresponds to at least two frames in the second audio feature, determining the mapped frame of the frame to be processed in the second audio feature in a first predetermined manner;
in response to determining that at least two frames in the first audio feature, including the frame to be processed, correspond to the same frame in the second audio feature, determining the mapped frame of the frame to be processed in the second audio feature in a second predetermined manner;
and taking each frame in the first audio feature together with its mapped frame as the determined mapping relation.
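A small sketch of how the path elements in claim 5 can be grouped into correspondences (illustrative only; the variable names are assumptions). Frames of the first audio feature with exactly one partner that is itself matched only once become direct mappings; the one-to-many and many-to-one groups are handed off to the first and second predetermined manners of claims 6 and 7.

```python
from collections import defaultdict

def collect_correspondences(path: list[tuple[int, int]]):
    """Group shortest-path elements (i, j) by frame of each audio feature."""
    a_to_b = defaultdict(list)  # frame i of the first feature -> matched frames j
    b_to_a = defaultdict(list)  # frame j of the second feature -> matched frames i
    for i, j in path:
        a_to_b[i].append(j)
        b_to_a[j].append(i)
    return a_to_b, b_to_a
```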
6. The method of claim 5, wherein determining the mapped frame of the frame to be processed in the second audio feature in the first predetermined manner comprises:
in response to determining that the frame to be processed corresponds to at least two frames in the second audio feature, acquiring the mean of the frame numbers of the frames in the second audio feature that correspond to the frame to be processed, wherein the frame number of any frame in the second audio feature is an integer between 1 and S, and taking the frame in the second audio feature corresponding to the mean as the mapped frame of the frame to be processed.
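Claim 6 resolves a one-to-many match by averaging the matched frame numbers; the mean may be fractional, which claim 8 later handles by interpolation. A one-line sketch:

```python
def resolve_one_to_many(matched_frames: list[int]) -> float:
    """Mean of the frame numbers (1..S) matched to the frame being processed."""
    return sum(matched_frames) / len(matched_frames)
```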
7. The method of claim 5, wherein determining the mapped frame of the frame to be processed in the second audio feature in the second predetermined manner comprises:
in response to determining that at least two frames in the first audio feature, including the frame to be processed, correspond to the same frame in the second audio feature, determining the mapped frame of the frame to be processed in the second audio feature by interpolation, based on the number of the at least two frames, the position of the frame to be processed among the at least two frames, the frame number of the frame in the second audio feature that corresponds to the frame to be processed, and the frame numbers in the second audio feature, wherein the frame number of any frame in the second audio feature is an integer between 1 and S.
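Claim 7 states the inputs of the interpolation but not the exact formula, so the sketch below is only one plausible reading (an assumption): the k-th of n consecutive frames that all hit frame j is spread linearly from j towards the next mapped frame number j_next, clamped to the valid 1..S range.

```python
def resolve_many_to_one(k: int, n: int, j: float, j_next: float, S: int) -> float:
    """Assumed interpolation: k is the 0-based position of the frame to be
    processed among the n frames sharing the same matched frame j; j_next is
    the frame number matched by the following group (or j if there is none)."""
    mapped = j + (j_next - j) * k / n
    return min(max(mapped, 1.0), float(S))  # keep within valid frame numbers
```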
8. The method according to claim 6 or 7, wherein determining the face parameters corresponding to the target audio according to the mapping relation and the face parameters corresponding to the source audio comprises:
performing the following processing for each frame in the first audio feature:
taking the frame as a frame to be indexed, and determining the mapped frame of the frame to be indexed;
in response to determining that the frame number of the mapped frame is an integer, taking the face parameters corresponding to the mapped frame as the face parameters corresponding to the frame to be indexed;
in response to determining that the frame number of the mapped frame is not an integer, rounding the frame number of the mapped frame up and down respectively, acquiring the face parameters corresponding to the frame given by the rounded-up result and the face parameters corresponding to the frame given by the rounded-down result, performing linear interpolation on the two sets of face parameters thus obtained, and taking the linear interpolation result as the face parameters corresponding to the frame to be indexed;
and using the face parameters corresponding to each frame in the first audio feature to form the face parameters corresponding to the target audio.
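A sketch of the per-frame lookup in claim 8, assuming the source face parameters are stored as an (S, P) array indexed by 1-based frame number. For a fractional mapped frame number the two neighbouring frames are blended, here weighted by the fractional part (one natural choice, since the claim only requires linear interpolation).

```python
import math
import numpy as np

def face_params_for_frame(mapped: float, src_params: np.ndarray) -> np.ndarray:
    """src_params: (S, P) face parameters of the source audio, frame numbers 1..S."""
    if float(mapped).is_integer():
        return src_params[int(mapped) - 1]
    lo, hi = math.floor(mapped), math.ceil(mapped)
    w = mapped - lo  # fractional position between the rounded-down and rounded-up frames
    return (1.0 - w) * src_params[lo - 1] + w * src_params[hi - 1]
```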
9. The method of any one of claims 1-7, further comprising:
for any training data in the training data set, adding noise to the audio in the training data.
10. The method of claim 9, wherein said adding noise to the audio in the training data comprises:
fusing random music into the audio in the training data;
and/or reducing the volume of the audio of another piece of training data and fusing it into the audio in the training data;
and/or downsampling and then upsampling the audio in the training data;
and/or fusing noise from a predetermined noise data set into the audio in the training data.
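Two of the noise variants in claim 10, sketched with NumPy and SciPy (illustrative; the gain and downsampling factor are assumed values): mixing an attenuated second signal, which covers the random-music, other-sample and noise-dataset cases, and downsampling followed by upsampling, which discards high-frequency detail.

```python
import numpy as np
from scipy.signal import resample

def mix_into(audio: np.ndarray, other: np.ndarray, gain: float = 0.1) -> np.ndarray:
    """Fuse another signal (music, another sample's audio, or dataset noise)
    into the audio at reduced volume; lengths are matched crudely by np.resize."""
    other = np.resize(other, audio.shape)
    return audio + gain * other

def down_up_sample(audio: np.ndarray, factor: int = 2) -> np.ndarray:
    """Downsample the audio and then upsample it back to its original length."""
    low = resample(audio, len(audio) // factor)
    return resample(low, len(audio))
```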
11. The method of any one of claims 1-7, further comprising:
constructing noise training data, each piece of which consists of noise and corresponding face parameters, wherein the values of the face parameters corresponding to the noise are all 0;
adding the noise training data to the training data set.
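A sketch of claim 11's noise training data, assuming face parameters are stored one vector per frame; the all-zero parameters teach the driven face to stay neutral when the audio carries no speech. The frame count and parameter dimension below are assumptions.

```python
import numpy as np

def make_noise_training_item(noise_audio: np.ndarray, num_frames: int, num_params: int):
    """Pair a noise clip with all-zero face parameters (values all 0 per claim 11)."""
    face_params = np.zeros((num_frames, num_params), dtype=np.float32)
    return noise_audio, face_params
```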
12. A data enhancement apparatus, comprising: an audio conversion module and a data generation module;
the audio conversion module is configured to, for any training data in a training data set, take the audio in the training data as source audio and perform audio transformation on the source audio to obtain target audio, wherein the training data consists of the source audio and corresponding face parameters, and the training data set is used for training a digital-human-driving model;
the data generation module is configured to acquire a mapping relation between the target audio and the source audio, determine the face parameters corresponding to the target audio according to the mapping relation and the face parameters corresponding to the source audio, and take training data consisting of the target audio and its corresponding face parameters as newly generated training data.
13. The apparatus of claim 12, wherein,
the audio conversion module performs speech recognition on the source audio to obtain a recognized text result, and performs text-to-speech conversion on the text result to obtain the target audio, wherein at least one of the tone, volume, speech speed and intonation of the target audio differs from that of the source audio.
14. The apparatus of claim 12, wherein,
the data generation module acquires the audio feature of the target audio as a first audio feature, acquires the audio feature of the source audio as a second audio feature, and determines the mapping relation between the target audio and the source audio by dynamic programming according to the first audio feature and the second audio feature.
15. The apparatus of claim 14, wherein,
the data generation module constructs a two-dimensional matrix of size T*S, wherein T represents the number of frames included in the first audio feature and S represents the number of frames included in the second audio feature; calculates the value of each element M(i, j) in the two-dimensional matrix row by row from top to bottom, and from left to right within each row, in the following manner: acquiring the distance between the ith frame in the first audio feature and the jth frame in the second audio feature, determining the minimum of M(i-1, j-1), M(i, j-1) and M(i-1, j), and taking the sum of the distance and the minimum as the value of the element M(i, j), wherein i is greater than or equal to 1 and less than or equal to T, and j is greater than or equal to 1 and less than or equal to S; determines the shortest path from M(1, 1) to M(T, S) according to the calculation result; and determines the mapping relation between the target audio and the source audio according to the elements on the shortest path.
16. The apparatus of claim 15, wherein,
the data generation module takes, for each element on the shortest path, the frame in the first audio feature and the frame in the second audio feature that correspond to the element as a pair of frames having a correspondence, and performs the following processing for each frame in the first audio feature: taking the frame as a frame to be processed; in response to determining that the frame to be processed corresponds to only one frame in the second audio feature, and that this frame in the second audio feature corresponds only to the frame to be processed, taking that frame in the second audio feature as the mapped frame of the frame to be processed; in response to determining that the frame to be processed corresponds to at least two frames in the second audio feature, determining the mapped frame of the frame to be processed in the second audio feature in a first predetermined manner; in response to determining that at least two frames in the first audio feature, including the frame to be processed, correspond to the same frame in the second audio feature, determining the mapped frame of the frame to be processed in the second audio feature in a second predetermined manner; and taking each frame in the first audio feature together with its mapped frame as the determined mapping relation.
17. The apparatus of claim 16, wherein,
the data generation module, in response to determining that the frame to be processed corresponds to at least two frames in the second audio feature, acquires the mean of the frame numbers of the frames in the second audio feature that correspond to the frame to be processed, wherein the frame number of any frame in the second audio feature is an integer between 1 and S, and takes the frame in the second audio feature corresponding to the mean as the mapped frame of the frame to be processed.
18. The apparatus of claim 16, wherein,
the data generation module, in response to determining that at least two frames in the first audio feature, including the frame to be processed, correspond to the same frame in the second audio feature, determines the mapped frame of the frame to be processed in the second audio feature by interpolation, based on the number of the at least two frames, the position of the frame to be processed among the at least two frames, the frame number of the frame in the second audio feature that corresponds to the frame to be processed, and the frame numbers in the second audio feature, wherein the frame number of any frame in the second audio feature is an integer between 1 and S.
19. The apparatus according to claim 17 or 18, wherein,
the data generation module performs the following processing for each frame in the first audio feature: taking the frame as a frame to be indexed, and determining the mapped frame of the frame to be indexed; in response to determining that the frame number of the mapped frame is an integer, taking the face parameters corresponding to the mapped frame as the face parameters corresponding to the frame to be indexed; in response to determining that the frame number of the mapped frame is not an integer, rounding the frame number of the mapped frame up and down respectively, acquiring the face parameters corresponding to the frame given by the rounded-up result and the face parameters corresponding to the frame given by the rounded-down result, performing linear interpolation on the two sets of face parameters thus obtained, and taking the linear interpolation result as the face parameters corresponding to the frame to be indexed; and using the face parameters corresponding to each frame in the first audio feature to form the face parameters corresponding to the target audio.
20. The apparatus of any one of claims 12-18, further comprising:
a noise processing module configured to add noise, for any training data in the training data set, to the audio in the training data.
21. The apparatus of claim 20, wherein,
the noise processing module fuses random music into the audio in the training data, and/or reduces the volume of the audio of another piece of training data and fuses it into the audio in the training data, and/or downsamples and then upsamples the audio in the training data, and/or fuses noise from a predetermined noise data set into the audio in the training data.
22. The apparatus of claim 20, wherein,
the noise processing module is further configured to construct noise training data, each piece of which consists of noise and corresponding face parameters, wherein the values of the face parameters corresponding to the noise are all 0, and to add the noise training data to the training data set.
23. An electronic device, comprising:
At least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-11.
24. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-11.
25. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the method of any of claims 1-11.
CN202410146024.1A 2024-02-01 2024-02-01 Data enhancement method, device, electronic equipment and storage medium Pending CN118230754A (en)

Priority Applications (1)

Application Number: CN202410146024.1A (publication CN118230754A) | Priority Date: 2024-02-01 | Filing Date: 2024-02-01 | Title: Data enhancement method, device, electronic equipment and storage medium


Publications (1)

Publication Number: CN118230754A | Publication Date: 2024-06-21

Family

ID=91501518

Family Applications (1)

Application Number: CN202410146024.1A | Title: Data enhancement method, device, electronic equipment and storage medium | Priority Date: 2024-02-01 | Filing Date: 2024-02-01 | Status: Pending | Publication: CN118230754A

Country Status (1)

Country: CN | Publication: CN118230754A


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination