CN114299909A - Audio data processing method, device, equipment and storage medium - Google Patents

Audio data processing method, device, equipment and storage medium

Info

Publication number
CN114299909A
Authority
CN
China
Prior art keywords: audio data, sample, sample audio, characteristic information, information
Prior art date
Legal status
Pending
Application number
CN202110949163.4A
Other languages
Chinese (zh)
Inventor
张泽旺
李新辉
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110949163.4A
Publication of CN114299909A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses an audio data processing method, apparatus, device and storage medium, relating to machine learning technology in artificial intelligence. The method includes: obtaining at least two sample audio data and object information associated with sample audio data Y_i among the at least two sample audio data; performing feature extraction on the sample audio data Y_i to obtain text feature information and score feature information of the sample audio data Y_i; normalizing the text feature information and the score feature information of the sample audio data Y_i to obtain sample audio feature information of the sample audio data Y_i; and adjusting a candidate audio synthesis model by using the sample audio feature information of the sample audio data Y_i and the object information associated with the sample audio data Y_i to obtain a target audio synthesis model. With this method and apparatus, the pronunciation stability of audio data and the quality of audio data can be effectively improved.

Description

Audio data processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of machine learning technology in artificial intelligence, and in particular, to an audio data processing method, apparatus, device, and storage medium.
Background
With the development of internet technology, the demand for audio data is increasing day by day. Traditionally, text information and score information must be manually converted into audio data, which is inefficient; audio synthesis technology was therefore developed. Audio synthesis technology learns the text feature information and score feature information of the audio data to be synthesized through an audio synthesis model to automatically generate audio data, improving the efficiency of audio data generation, and is widely applied in fields such as intelligent dubbing, virtual anchors, smart homes, and intelligent robots. However, the distributions of the text feature information and the score feature information of the audio data to be synthesized differ greatly, so the pronunciation stability of the generated audio data is poor, and the quality of the audio data is therefore poor.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present application is to provide an audio data processing method, an apparatus, a device and a storage medium, which can effectively improve the pronunciation stability of audio data and the quality of audio data.
An aspect of the present embodiment provides an audio data processing method, including:
obtaining at least two sample audio data and object information associated with sample audio data Y_i among the at least two sample audio data; the object information associated with the sample audio data Y_i is used for indicating the sample object P_n to which the sample audio data Y_i belongs; the at least two sample audio data belong to at least two sample objects; i is a positive integer less than or equal to M, M is the number of sample audio data in the at least two sample audio data, n is a positive integer less than or equal to Q, and Q is the number of objects in the at least two sample objects;
performing feature extraction on the sample audio data Y_i to obtain text feature information and score feature information of the sample audio data Y_i;
normalizing the text feature information and the score feature information of the sample audio data Y_i to obtain sample audio feature information of the sample audio data Y_i;
adjusting a candidate audio synthesis model by using the sample audio feature information of the sample audio data Y_i and the object information associated with the sample audio data Y_i to obtain a target audio synthesis model; the target audio synthesis model is used for synthesizing target audio data of a target object.
An aspect of an embodiment of the present application provides an audio data processing apparatus, including:
an obtaining module, configured to obtain at least two sample audio data and object information associated with sample audio data Y_i among the at least two sample audio data; the object information associated with the sample audio data Y_i is used for indicating the sample object P_n to which the sample audio data Y_i belongs; the at least two sample audio data belong to at least two sample objects; i is a positive integer less than or equal to M, M is the number of sample audio data in the at least two sample audio data, n is a positive integer less than or equal to Q, and Q is the number of objects in the at least two sample objects;
an extraction module, configured to perform feature extraction on the sample audio data Y_i to obtain text feature information and score feature information of the sample audio data Y_i;
a processing module, configured to normalize the text feature information and the score feature information of the sample audio data Y_i to obtain sample audio feature information of the sample audio data Y_i;
an adjustment module, configured to adjust a candidate audio synthesis model by using the sample audio feature information of the sample audio data Y_i and the object information associated with the sample audio data Y_i to obtain a target audio synthesis model; the target audio synthesis model is used for synthesizing target audio data of a target object.
One aspect of the present application provides a computer device, comprising: a processor and a memory;
wherein the memory is used for storing a computer program, and the processor is used for calling the computer program to perform the following steps:
obtaining at least two sample audio data and object information associated with sample audio data Y_i among the at least two sample audio data; the object information associated with the sample audio data Y_i is used for indicating the sample object P_n to which the sample audio data Y_i belongs; the at least two sample audio data belong to at least two sample objects; i is a positive integer less than or equal to M, M is the number of sample audio data in the at least two sample audio data, n is a positive integer less than or equal to Q, and Q is the number of objects in the at least two sample objects;
performing feature extraction on the sample audio data Y_i to obtain text feature information and score feature information of the sample audio data Y_i;
normalizing the text feature information and the score feature information of the sample audio data Y_i to obtain sample audio feature information of the sample audio data Y_i;
adjusting a candidate audio synthesis model by using the sample audio feature information of the sample audio data Y_i and the object information associated with the sample audio data Y_i to obtain a target audio synthesis model; the target audio synthesis model is used for synthesizing target audio data of a target object.
An aspect of the embodiments of the present application provides a computer-readable storage medium, where a computer program is stored, where the computer program includes program instructions, and the program instructions, when executed by a processor, perform the following steps:
obtaining at least two sample audio data and object information associated with sample audio data Y_i among the at least two sample audio data; the object information associated with the sample audio data Y_i is used for indicating the sample object P_n to which the sample audio data Y_i belongs; the at least two sample audio data belong to at least two sample objects; i is a positive integer less than or equal to M, M is the number of sample audio data in the at least two sample audio data, n is a positive integer less than or equal to Q, and Q is the number of objects in the at least two sample objects;
performing feature extraction on the sample audio data Y_i to obtain text feature information and score feature information of the sample audio data Y_i;
normalizing the text feature information and the score feature information of the sample audio data Y_i to obtain sample audio feature information of the sample audio data Y_i;
adjusting a candidate audio synthesis model by using the sample audio feature information of the sample audio data Y_i and the object information associated with the sample audio data Y_i to obtain a target audio synthesis model; the target audio synthesis model is used for synthesizing target audio data of a target object.
In the application, at least two sample audio data are acquired, and text feature information and score feature information are extracted from each of the at least two sample audio data; because the distributions of the text feature information and the score feature information of the sample audio data differ, the pronunciation of the synthesized audio data is prone to being unstable. Therefore, the text feature information and the score feature information of each sample audio data are normalized to obtain the audio feature information of each sample audio data, which helps reduce the distribution difference between the text features and the score features of the sample audio data and thereby improves the pronunciation stability of the synthesized audio data. Further, the sample audio feature information of the sample audio data Y_i and the object information associated with the sample audio data Y_i may be used to adjust the candidate audio synthesis model to obtain the target audio synthesis model. The at least two sample audio data belong to at least two sample objects; that is, the candidate audio synthesis model is trained with the sample audio data of a plurality of sample objects, which improves the diversity of the training corpus, improves the robustness of the target audio synthesis model, and avoids instability at high pitches, low pitches, and sustained notes.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained from these drawings by those skilled in the art without creative effort.
FIG. 1 is a block diagram of an audio data processing system according to the present application;
FIG. 2 is a schematic diagram of a scenario in which data interaction is performed between devices in an audio data processing system provided in the present application;
FIG. 3 is a schematic diagram of a scenario of data interaction between devices in an audio data processing system according to the present application;
FIG. 4 is a flow diagram of an audio data processing method provided herein;
FIG. 5 is a schematic diagram of the acoustic model, duration model and candidate audio synthesis model provided in the present application;
FIG. 6 is a schematic diagram of a scene of frequency domain transformation of candidate audio feature information according to the present application;
FIG. 7 is a flow diagram of an audio data processing method provided herein;
FIG. 8 is a schematic diagram of a scenario for adjusting a candidate audio synthesis model according to the present application;
fig. 9 is a schematic structural diagram of an audio data processing apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, in the process of synthesizing audio data, an audio synthesis model is directly adopted to learn the text feature information and score feature information of the audio data to be synthesized so as to generate the audio data. The distributions of the text feature information and the score feature information of the audio data to be synthesized differ greatly, so the pronunciation stability of the generated audio data is poor, and the quality of the audio data is therefore poor. Based on this, the present application uses machine learning technology in artificial intelligence to perform feature extraction on at least two sample audio data to obtain text feature information and score feature information of each of the at least two sample audio data, and normalizes the text feature information and the score feature information of each sample audio data to obtain the audio feature information of each sample audio data, which reduces the distribution difference between the text features and the score features of each sample audio data. Then, the sample audio feature information of each sample audio data and the object information associated with each sample audio data can be used to adjust the candidate audio synthesis model to obtain the target audio synthesis model. Because the distribution difference between the text features and the score features of each sample audio data is reduced, the pronunciation stability of the audio data generated by the target audio synthesis model is improved; meanwhile, adjusting the candidate audio synthesis model with the sample audio data of a plurality of sample objects improves the robustness of the target audio synthesis model and avoids instability at high pitches, low pitches, and sustained notes.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation, and the like.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It specializes in studying how a computer simulates or realizes human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
In order to facilitate a clearer understanding of the present application, the audio data processing system implementing the audio data processing method of the present application is first introduced. As shown in fig. 1, the audio data processing system includes a server and a terminal.
The terminal may refer to a user-oriented device, and the terminal may include an audio data application platform (i.e., an audio data application program) for playing audio data; the audio data platform may refer to an audio website platform (such as a forum or post bar), a social application platform, a shopping application platform, a content interaction platform (such as an audio playing application platform), and the like. The server may be a device for providing an audio data background service; specifically, the server may be configured to generate audio data according to the text feature information and the score feature information and upload the audio data to the audio data platform, so that a user may play the audio data in the audio data platform.
The server may be an independent physical server, a server cluster or a distributed system formed by at least two physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), big data, and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a sound box with a screen, a smart watch, a smart television, and the like. Each terminal and each server may be directly or indirectly connected through wired or wireless communication, and the number of terminals and servers may be one or at least two, which is not limited herein. The service scenarios applicable to the audio data processing system may specifically include: intelligent robots, intelligent dubbing, song singing, virtual anchors, virtual education, AI customer service, TTS (Text To Speech) cloud services, and the like; the service scenarios suitable for the audio data processing system are not listed one by one here.
The audio data referred to in this application may refer to digitized sound. The audio data includes text feature information and score feature information: the text feature information of the audio data refers to the specific content of the sound, and the score feature information of the audio data is pronunciation information indicating the specific content of the sound. When the audio data is singing data of a song, the pronunciation information may include phonemes, phoneme types, notes, note values, and the like; when the audio data is not singing data of a song, the pronunciation information may include phonemes, phoneme types, and the like. It can be understood that the specific content of the audio data is related to the service scenario; for example, in a song-singing scenario, the audio data refers to the singing data of the song sung by the object; in a virtual teaching assistant scenario, the audio data may be speech data of the content taught by the object; in an intelligent dubbing scenario, the audio data may refer to the object's dubbing of a film or television work, and so on.
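To make this structure concrete, the sketch below models the text and score feature information as a small Python data container; the field names (phoneme, phoneme type, note, note value) follow the listing above, while the concrete schema and the example values are purely illustrative assumptions rather than a format prescribed by the present application.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ScoreFeature:
    """Pronunciation information for one unit of sound (hypothetical schema)."""
    phoneme: str                        # e.g. "n"
    phoneme_type: str                   # e.g. "initial" or "final"
    note: Optional[str] = None          # e.g. "C4"; only present for singing data
    note_value: Optional[float] = None  # note duration in beats; singing data only

@dataclass
class AudioSample:
    """One piece of audio data with its text and score feature information."""
    text: str                                        # specific content of the sound
    score: List[ScoreFeature] = field(default_factory=list)

# Singing data carries notes and note values; plain speech data carries only phonemes.
sung = AudioSample(text="你好", score=[ScoreFeature("n", "initial", "C4", 0.5),
                                        ScoreFeature("i", "final", "C4", 0.5)])
spoken = AudioSample(text="你好", score=[ScoreFeature("n", "initial"),
                                          ScoreFeature("i", "final")])
```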
For ease of understanding, please refer to fig. 2 and fig. 3, which are schematic diagrams of data interaction scenarios provided in an embodiment of the present application. As shown in fig. 2 and fig. 3, the data interaction process is described by taking a song-singing business scenario as an example, in which the audio data may be referred to as singing data; the data interaction process includes a training process for the candidate audio synthesis model and a process of generating the target singing data.
As shown in fig. 2, the training process for the candidate audio synthesis model includes the following steps 1-3:
1. The server obtains text feature information and score feature information of each sample singing data. The server may obtain at least two sample singing data and the object information associated with each sample singing data from the terminal. Further, the server may perform language conversion on each sample singing data to obtain the lyric information of each sample singing data, and determine the lyric information of each sample singing data as the text feature information of that sample singing data; then, acoustic analysis is performed on each sample singing data to obtain the score feature information of each sample singing data.
Understandably, the object information associated with sample singing data Y_i among the at least two sample singing data indicates the sample object P_n to which the sample singing data Y_i belongs, and the at least two sample singing data belong to at least two sample objects; i is less than or equal to M, M is the number of sample singing data in the at least two sample singing data, n is less than or equal to Q, and Q is the number of objects in the at least two sample objects. That is, the at least two sample singing data are obtained by a plurality of sample objects singing; a sample object may refer to any user who has published singing data in the audio data application platform, and the at least two sample singing data may specifically refer to singing data of songs sung by users in a historical time period (e.g., the last week or the last month). In one embodiment, the songs corresponding to the respective sample singing data may be the same; for example, the at least two sample singing data are singing data obtained by different user objects singing song A. In another embodiment, the songs corresponding to the sample singing data may be different; for example, the sample singing data Y_i is singing data obtained by sample object P_n singing song A, and the sample singing data Y_(i+1) is singing data obtained by sample object P_(n+1) singing song B. This improves the diversity of the sample singing data and, further, the robustness of the audio synthesis model.
2. The server generates the audio feature information of each sample singing data. In one embodiment, the server may first splice the feature information and then normalize it to obtain the audio feature information of each sample singing data; specifically, the server may splice the text feature information and the score feature information of the sample singing data Y_i to obtain spliced feature information, and normalize the spliced feature information to obtain the audio feature information of the sample singing data Y_i; by analogy, the audio feature information of each sample singing data can be obtained. In another embodiment, the server may first normalize the feature information and then splice it to obtain the audio feature information of each sample singing data; specifically, the server may normalize the text feature information of the sample singing data Y_i to obtain normalized text feature information, and normalize the score feature information of the sample singing data Y_i to obtain normalized score feature information; then, the normalized text feature information and the normalized score feature information are spliced to obtain the audio feature information of the sample singing data Y_i; by analogy, the audio feature information of each sample singing data can be obtained. It should be noted that the normalization processing refers to adjusting the text feature information and the score feature information so as to reduce the difference between the distributions of the adjusted text feature information and score feature information; for example, the text feature information and the score feature information are adjusted to the same data interval, e.g., the interval [0, 1].
3. The server may train the candidate audio synthesis model using the audio feature information of each sample singing data and the object information of each sample object. The server may obtain the timbre feature information of the sample object P_n according to the sample singing data Y_i; specifically, according to the sample singing data Y_i, the server queries whether the timbre database includes candidate timbre feature information of the sample object P_n, and if the candidate timbre feature information of the sample object P_n is found in the timbre database, it is taken as the timbre feature information of the sample object P_n; the timbre feature information of the sample object P_n is used to reflect the singing style of the sample object. If the candidate timbre feature information of the sample object P_n is not found in the timbre database, the server may, according to the object information associated with the sample singing data Y_i, determine from the at least two sample singing data the sample singing data belonging to the sample object P_n, intercept a sample singing data segment from the sample singing data belonging to the sample object P_n, and perform timbre feature extraction on the sample singing data segment to obtain the timbre feature information of the sample object P_n; the timbre feature information that can represent the sample object P_n is then added to the timbre database, so that the timbre feature information of the sample object P_n can be acquired quickly later, which improves the efficiency of obtaining the timbre feature information of sample objects (a minimal sketch of this lookup-or-extract caching is given after step 3 below).
Further, the server may use the candidate audio synthesis model to perform prediction on the timbre feature information of the sample object P_n and the audio feature information of the sample singing data Y_i to obtain predicted singing data of the sample object P_n, the predicted singing data being synthesized singing data. The difference between the predicted singing data of the sample object P_n and the sample singing data Y_i is computed, and the prediction error of the candidate audio synthesis model can be determined from it; the prediction error reflects the accuracy of the audio synthesis of the candidate audio synthesis model. If the prediction error of the candidate audio synthesis model is in a convergence state, the candidate audio synthesis model can be taken as the target audio synthesis model; if the prediction error of the candidate audio synthesis model is not in a convergence state, the candidate audio synthesis model is adjusted according to the prediction error to obtain an adjusted candidate audio synthesis model, and the adjusted candidate audio synthesis model is determined as the target audio synthesis model.
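A minimal sketch of the timbre-database lookup described in step 3, assuming Python, a plain dictionary in place of the timbre database, and a toy spectral statistic in place of real timbre feature extraction; none of these stand-ins are prescribed by the present application.

```python
import numpy as np

timbre_database = {}  # object id -> timbre feature vector (stands in for the timbre database)

def extract_timbre(singing_segment: np.ndarray) -> np.ndarray:
    # Placeholder timbre feature extraction: per-band energy statistics of the segment;
    # a real system would use a trained timbre encoder instead.
    spectrum = np.abs(np.fft.rfft(singing_segment))
    bands = np.array_split(spectrum, 8)
    return np.array([band.mean() for band in bands])

def get_timbre_features(object_id: str, sample_segments: list) -> np.ndarray:
    """Return cached timbre features for a sample object, extracting and caching on a miss."""
    if object_id in timbre_database:          # hit: candidate timbre features already stored
        return timbre_database[object_id]
    segment = sample_segments[0]              # intercept a segment of the object's sample singing data
    features = extract_timbre(segment)
    timbre_database[object_id] = features     # cache so later look-ups are fast
    return features

# Example: two queries for the same object; only the first one extracts features.
rng = np.random.default_rng(0)
segments = [rng.standard_normal(16000)]       # 1 s of toy audio at 16 kHz
f1 = get_timbre_features("P_n", segments)
f2 = get_timbre_features("P_n", segments)
assert np.allclose(f1, f2)
```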
As shown in fig. 3, the process of generating the target singing data includes the following steps 4-6:
4. the server can obtain the reference singing data segment of the target object, and the text characteristic information and the music score characteristic information of the target singing data to be synthesized. The server can obtain historical singing data belonging to a target object, intercept a singing data segment from the historical singing data, and determine the intercepted singing data segment as a reference singing data segment; the target object may refer to a user or a virtual user (e.g., a robot, a smart speaker, etc.). The target singing data refers to singing data to be synthesized, the target singing data specifically refers to singing data of a certain song, lyrics of the song can be called as text characteristic information of the target singing data, and a music score of the song can be called as music score characteristic information of the target singing data.
5. The server can obtain the audio characteristic information of the target singing data. In one embodiment, the server may perform splicing processing on the feature information, and then perform normalization processing on the feature information to obtain audio feature information of the target singing data; specifically, the server may perform splicing processing on the text characteristic information and the music score characteristic information of the target singing data to obtain spliced characteristic information, and perform normalization processing on the spliced characteristic information to obtain audio characteristic information of the target singing data. In another embodiment, the server may perform normalization processing on the feature information, and then perform splicing processing on the feature information to obtain audio feature information of the target singing data; specifically, the server may perform normalization processing on the text characteristic information of the target singing data to obtain normalized text characteristic information, and perform normalization processing on the score characteristic information of the target singing data to obtain normalized score characteristic information; and then, splicing the text characteristic information after the normalization processing and the music score characteristic information after the normalization processing to obtain the audio characteristic information of the target singing data.
6. The server generates the target singing data using the target audio synthesis model. The server may synthesize the timbre feature information of the target object and the audio feature information of the target singing data using the target audio synthesis model to obtain target singing data belonging to the target object. After obtaining the target singing data, the server may send it to the terminal belonging to the target object, and that terminal may play the audio data.
In summary, in the training process of the candidate audio synthesis model, normalizing the text feature information and the score feature information of the sample singing data reduces the distribution difference between the text feature information and the score feature information of the sample singing data, which improves the pronunciation stability of the singing data generated by the target audio synthesis model. Training the candidate audio synthesis model with the sample singing data of a plurality of sample objects improves the robustness of the target audio synthesis model and avoids instability at high pitches, low pitches, and sustained notes. In the process of generating the target singing data, normalizing the text feature information and the score feature information of the target singing data reduces the distribution difference between them, which improves the pronunciation stability of the synthesized target singing data. Moreover, synthesizing the target singing data from the reference singing data segment of the target object means that the timbre of the target object can be customized with a small amount of reference singing data, which reduces the cost of synthesizing singing data and improves the quality of the synthesized singing data. That is, the target audio synthesis model can customize the timbre of the target object based on the target object's existing singing data segment and automatically synthesize singing data close to the singing style of the target object, giving the target object a more comprehensive singing capability. Meanwhile, the target audio synthesis model can be used to cultivate virtual idols, providing fans with entertainment and appreciation value anytime and anywhere.
It should be noted that the training process of the candidate audio synthesis model and the process of generating the target singing data may be executed by the server in fig. 1, by any terminal in fig. 1, or jointly by the server and the terminal; the present application is not limited in this respect. When the terminal executes the training process of the candidate audio synthesis model and the process of generating the target singing data, it may do so with reference to the description of the server in fig. 2; repeated details are not described again.
Particularly, when the terminal and the server jointly execute the training process of the candidate audio synthesis model and the process of generating the target singing data, the server and the terminal respectively execute different steps in the training process of the candidate audio synthesis model and the process of generating the target singing data, and the distributed system consisting of the terminal and the server synthesizes the singing data, so that the singing data processing pressure of each device can be effectively reduced, and the efficiency of generating the singing data is improved. For example, the server may perform a training process on the candidate audio synthesis model, and when the server completes the training process on the candidate audio synthesis model, the target audio synthesis model is obtained and sent to any one of the terminals in fig. 1. The terminal can execute the process of generating the target singing data according to the target audio synthesis model.
Further, please refer to fig. 4, which is a flowchart illustrating an audio data processing method according to an embodiment of the present application. As shown in fig. 4, the method may be performed by a computer device, which may refer to the terminal in fig. 1, or the computer device may refer to the server in fig. 1, or the computer device includes the terminal and the server in fig. 1, that is, the method may be performed by both the terminal and the server in fig. 1. The audio data processing method may include the following steps S101 to S104:
s101, obtaining at least two sample audio data and sample audio data Y in the at least two sample audio dataiAssociated object information; the above sample audio data YiThe associated object information is used to indicate the sample audio data YiBelonging to a sample object PnThe at least two sample audio data belong to at least two sample objects, i is a positive integer less than or equal to M, M is the number of sample audio data in the at least two sample audio data, n is a positive integer less than or equal to Q, and Q is the number of objects in the at least two sample objects.
In one embodiment, the computer device may obtain, from the terminal, at least two sample audio data and the object information associated with sample audio data Y_i among the at least two sample audio data. In another embodiment, the computer device may obtain, from the internet, at least two sample audio data and the object information associated with sample audio data Y_i among the at least two sample audio data. The plurality of sample audio data belong to at least two sample objects; that is, obtaining the sample audio data of a plurality of sample objects effectively improves the diversity of the sample audio data (i.e., the diversity of the training corpus), which in turn helps improve the robustness of the target audio synthesis model.
S102, performing feature extraction on the sample audio data Y_i to obtain text feature information and score feature information of the sample audio data Y_i.
In the present application, the computer device may perform language conversion processing on the sample audio data Y_i to obtain the text feature information of the sample audio data Y_i; for example, the computer device may employ a language model to perform language identification on the sample audio data Y_i to obtain its text feature information. Further, the computer device may perform acoustic analysis on the sample audio data Y_i to obtain the score feature information of the sample audio data Y_i.
S103, normalizing the text feature information and the score feature information of the sample audio data Y_i to obtain the sample audio feature information of the sample audio data Y_i.
In the present application, the distributions of the score feature information and the text feature information of the sample audio data Y_i differ, which easily makes the pronunciation of the synthesized audio data unstable. Based on this, the computer device may normalize the text feature information and the score feature information of the sample audio data Y_i to obtain the sample audio feature information of the sample audio data Y_i; this reduces the distribution difference between the score feature information and the text feature information of the sample audio data Y_i and helps improve the pronunciation stability of the synthesized audio data.
It should be noted that the computer device may adopt the following mode a or mode b to normalize the text feature information and the score feature information of the sample audio data Y_i and obtain the sample audio feature information of the sample audio data Y_i. Mode a: the computer device may first normalize the feature information and then splice the feature information to obtain the sample audio feature information of the sample audio data Y_i. Mode b: the computer device may first splice the feature information and then normalize the feature information to obtain the sample audio feature information of the sample audio data Y_i.
Optionally, when the computer device adopts mode a for the sample audio data Y_i, the computer device may normalize the text feature information and the score feature information separately; that is, the text feature information and the score feature information of the sample audio data Y_i are normalized to obtain normalized text feature information and normalized score feature information, and the normalized text feature information and the normalized score feature information are then spliced to obtain the sample audio feature information of the sample audio data Y_i.
Optionally, when the computer device adopts mode b to normalize the text feature information and the score feature information of the sample audio data Y_i, step S103 may include the following steps s11 and s12:
s11, splicing the text feature information and the score feature information of the sample audio data Y_i to obtain spliced feature information.
s12, normalizing the spliced feature information to obtain the sample audio feature information of the sample audio data Y_i.
In steps s11 to s12, the computer device may splice the text feature information and the score feature information of the sample audio data Y_i according to the correspondence between the text feature information and the score feature information to obtain spliced feature information; for example, the text feature information includes a plurality of words, and the score feature information corresponding to each word is spliced to a position adjacent to that word to obtain the spliced feature information. Further, the computer device may normalize the spliced feature information to obtain the sample audio feature information of the sample audio data Y_i; this reduces the distribution difference between the score feature information and the text feature information of the sample audio data Y_i and helps improve the pronunciation stability of the synthesized audio data.
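A minimal sketch of mode b under simplifying assumptions: the text and score feature information are taken to be numeric vectors of equal length, splicing interleaves each score value next to its corresponding text value, and normalization is a min-max rescaling onto the interval [0, 1]. The concrete feature representation and normalization formula are illustrative assumptions, not the only possible implementation.

```python
import numpy as np

def splice_then_normalize(text_features: np.ndarray, score_features: np.ndarray) -> np.ndarray:
    """Mode b: splice text and score features by correspondence, then normalize the result."""
    assert text_features.shape == score_features.shape
    # Interleave so each score feature sits adjacent to the text feature it corresponds to.
    spliced = np.empty(text_features.size * 2, dtype=np.float64)
    spliced[0::2] = text_features
    spliced[1::2] = score_features
    # Min-max normalization onto [0, 1] reduces the distribution difference
    # between the text part and the score part of the spliced features.
    lo, hi = spliced.min(), spliced.max()
    return (spliced - lo) / (hi - lo) if hi > lo else np.zeros_like(spliced)

text = np.array([12.0, 7.0, 30.0])    # toy per-unit text feature values
score = np.array([0.2, 0.9, 0.4])     # toy per-unit score feature values
sample_audio_features = splice_then_normalize(text, score)
print(sample_audio_features)
```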
It should be noted that the present application includes an acoustic model, a duration model, and a candidate audio synthesis model, and each step in the present application may be performed by the corresponding model. Specifically, as shown in fig. 5, the acoustic model is a model for generating the audio feature information of audio data; the acoustic model may include a normalization layer, a residual encoder, an attention encoder, a gradient inversion layer, an object classification layer, an attention decoder, a smoothing layer, and so forth. The normalization layer is used for normalizing the text feature information and the score feature information of the audio data; the residual encoder is used to obtain a timbre feature coding value of the object, i.e., a characterization of the identity (i.e., the singing style) of the object; and the attention encoder is used to obtain a relation feature coding value reflecting the context of the audio data. The gradient inversion layer is used for multiplying the gradient error of the object classification layer by an inverse coefficient (namely a negative number) so as to apply a reverse action to the encoder in the acoustic model; the object classification layer is used for determining the category of the sample object; and the smoothing layer is used for smoothing the energy feature information of the sample audio data. The duration model is used for predicting the frame length of a pronunciation unit of the audio data. The candidate audio synthesis model is used to synthesize audio data and may specifically be a vocoder model. Optionally, after the computer device normalizes the feature information, the computer device may encode the feature information to obtain the key feature information of the sample audio data Y_i; specifically, step s12 may include the following steps s21 to s23:
s21, normalizing the spliced feature information to obtain candidate sample audio feature information of the sample audio data Y_i.
s22, encoding the candidate sample audio feature information of the sample audio data Y_i to obtain an audio coding value of the sample audio data Y_i.
s23, determining the audio coding value of the sample audio data Y_i as the sample audio feature information of the sample audio data Y_i.
In steps s21 to s23, the computer device may normalize the spliced feature information to obtain the candidate sample audio feature information of the sample audio data Y_i, and then encode the candidate sample audio feature information of the sample audio data Y_i to obtain the audio coding value of the sample audio data Y_i. The audio coding value of the sample audio data Y_i may refer to a value reflecting the key feature information of the sample audio data Y_i; the key feature information may include a coding value reflecting the context of the sample audio data Y_i and the timbre feature information of the sample object. Further, the audio coding value of the sample audio data Y_i may be determined as the sample audio feature information of the sample audio data Y_i. Encoding the candidate sample audio feature information of the sample audio data Y_i helps extract the key feature information of the sample audio data Y_i.
Optionally, the candidate sample audio feature information of the sample audio data Y_i belongs to a time-domain signal; because the time-domain signal is relatively complex, it is difficult to obtain the key feature information of the sample audio data Y_i from it, whereas the frequency-domain signal is simpler, can provide richer information, and makes the key feature information of the sample audio data Y_i easier to obtain. Therefore, the computer device may perform frequency-domain transformation on the candidate sample audio feature information of the sample audio data Y_i to obtain the key feature information of the sample audio data Y_i. Specifically, step s22 may include the following steps s31 to s33:
s31, performing frequency-domain transformation on the candidate sample audio feature information of the sample audio data Y_i to obtain frequency-domain feature information of the sample audio data Y_i.
s32, generating energy feature information of the sample audio data Y_i based on the frequency-domain feature information of the sample audio data Y_i.
s33, encoding the energy feature information of the sample audio data Y_i to obtain the audio coding value of the sample audio data Y_i.
In steps s31 to s33, as shown in FIG. 6, the computer device may perform frequency-domain transformation on the candidate sample audio feature information of the sample audio data Y_i to obtain frequency-domain feature information of the sample audio data Y_i. The frequency-domain feature information of the sample audio data Y_i includes an amplitude parameter reflecting the loudness of the sample audio data Y_i and a frequency parameter reflecting its pitch; loudness is a parameter reflecting the intensity of sound. That is, the larger the vibration amplitude of the sample audio data Y_i, the larger the loudness and the stronger the sound; conversely, the smaller the vibration amplitude of the sample audio data Y_i, the smaller the loudness and the weaker the sound. The frequency parameter reflects the pitch of the sample audio data Y_i, which is a parameter reflecting how high or low the sound is: if the frequency of the sample audio data Y_i is high, the pitch is high; otherwise, the pitch is low. Further, the computer device may generate an energy spectrum curve of the sample audio data Y_i based on the frequency-domain feature information of the sample audio data Y_i; the energy spectrum curve reflects the relationship between the frequency parameter and the energy parameter of the sample audio data Y_i, and the energy feature information of the sample audio data Y_i can be obtained from the energy spectrum curve. The energy feature information of the sample audio data Y_i is then encoded to obtain the audio coding value of the sample audio data Y_i. Performing frequency-domain transformation on the candidate sample audio feature information of the sample audio data Y_i reduces the difficulty of obtaining the key feature information of the sample audio data Y_i. Optionally, since the frequencies that the human ear can perceive are limited, and audio data at frequencies that the human ear cannot perceive is usually called noise, the computer device may filter the energy feature information of the sample audio data Y_i; specifically, step s33 may include the following steps s41 to s43:
s41, filtering the energy feature information of the sample audio data Y_i to obtain effective energy feature information of the sample audio data Y_i.
s42, discretizing the effective energy feature information of the sample audio data Y_i to obtain discrete energy feature information of the sample audio data Y_i.
s43, encoding the discrete energy feature information of the sample audio data Y_i to obtain the audio coding value of the sample audio data Y_i.
In steps s41 to s43, the computer device may generate a filter according to the auditory characteristics of the human ear and filter the energy feature information of the sample audio data Y_i with the filter to obtain the effective energy feature information of the sample audio data Y_i; that is, the energy feature information whose frequency parameter falls within the filter is taken as effective energy feature information, and the energy feature information whose frequency parameter falls outside the filter is taken as invalid energy feature information. Further, the effective energy feature information of the sample audio data Y_i may be discretized to obtain the discrete energy feature information of the sample audio data Y_i, and the discrete energy feature information of the sample audio data Y_i may then be encoded to obtain the audio coding value of the sample audio data Y_i. Filtering the energy feature information of the sample audio data Y_i avoids noise interference, improves the accuracy of the obtained sample audio feature information of the sample audio data Y_i, avoids subsequent processing of invalid noise, and saves processing resources of the computer device. Meanwhile, by performing frequency-domain transformation, filtering, and discretization on the candidate sample audio feature information of the sample audio data Y_i, the finally obtained audio coding value of the sample audio data Y_i may specifically be a coding value reflecting the Bark-Frequency Cepstral Coefficient (BFCC) features of the sample audio data Y_i; BFCC refers to cepstral coefficients on the Bark scale, and dividing the audio frequency band by the Bark scale extracts a set of cepstral features closer to the hearing sensitivity of the human ear. Compared with a coding value based on the Mel spectrum, a coding value based on BFCC features has explicit energy, which makes the energy difference between sentences less conspicuous when single sentences are synthesized, so that the synthesized audio data sounds more natural and pleasant.
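A rough NumPy sketch of the chain in steps s31 to s43: frequency-domain transformation of framed audio, Bark-spaced band filtering that discards energy outside the chosen bands, log compression as the discretization step, and a DCT that yields BFCC-like coefficients for downstream encoding. The frame length, band edges, and use of a log/DCT step are illustrative assumptions rather than the exact procedure of the present application.

```python
import numpy as np

def frame_signal(x, frame_len=1024, hop=256):
    # Split the time-domain signal into overlapping frames.
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def bark_band_edges(num_bands, sr, n_fft):
    # Approximate Bark band edges (Traunmüller formula), mapped to FFT bin indices.
    def hz_to_bark(f): return 26.81 * f / (1960.0 + f) - 0.53
    def bark_to_hz(b): return 1960.0 * (b + 0.53) / (26.28 - b)
    barks = np.linspace(hz_to_bark(20.0), hz_to_bark(sr / 2), num_bands + 1)
    hz = bark_to_hz(barks)
    return np.clip((hz / (sr / 2) * (n_fft // 2)).astype(int), 0, n_fft // 2)

def bfcc_like(x, sr=16000, num_bands=18, num_ceps=13):
    frames = frame_signal(x) * np.hanning(1024)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2            # s31: frequency-domain energy
    edges = bark_band_edges(num_bands, sr, 1024)
    # s41: keep only energy inside the Bark-spaced bands (roughly the audible range).
    band_energy = np.stack([power[:, edges[b]:max(edges[b] + 1, edges[b + 1])].sum(axis=1)
                            for b in range(num_bands)], axis=1)
    log_energy = np.log(band_energy + 1e-10)                     # s42: compress per band
    # s43: DCT over the band axis yields cepstral coefficients for downstream encoding.
    k = np.arange(num_ceps)[:, None]
    n = np.arange(num_bands)[None, :]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * num_bands))
    return log_energy @ dct_basis.T

rng = np.random.default_rng(0)
features = bfcc_like(rng.standard_normal(16000))
print(features.shape)   # (frames, num_ceps)
```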
Optionally, the computer device may encode the discrete energy feature information of the sample audio data Y_i in an attention coding mode and a residual coding mode; specifically, step s43 may include the following steps s51 to s53:
s51, performing residual coding on the discrete energy feature information of the sample audio data Y_i to obtain a timbre feature coding value of the sample audio data Y_i.
s52, performing attention coding on the discrete energy feature information of the sample audio data Y_i to obtain a relation feature coding value reflecting the context of the sample audio data Y_i.
s53, splicing the timbre feature coding value and the relation feature coding value of the sample audio data Y_i to obtain the audio coding value of the sample audio data Y_i.
In steps s51 to s53, as shown in FIG. 5, the computer device may employ a residual encoder to perform residual coding on the discrete energy feature information of the sample audio data Y_i to obtain the timbre feature coding value of the sample audio data Y_i; that is, the timbre feature coding value of the sample audio data Y_i is used to reflect the timbre feature information of the sample object P_n. Then, a self-attention encoder is used to perform attention coding on the discrete energy feature information of the sample audio data Y_i to obtain a relation feature coding value reflecting the context of the sample audio data Y_i; here, the attention encoder may refer to an encoder composed of the self-attention mechanism in the translation model Transformer. Further, the timbre feature coding value and the relation feature coding value of the sample audio data Y_i may be spliced to obtain the audio coding value of the sample audio data Y_i. Obtaining the relation feature coding value and the timbre feature coding value of the sample audio data Y_i helps provide more effective training corpora for training the candidate audio synthesis model and helps ensure the robustness of the target audio synthesis model.
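A compact PyTorch sketch of steps s51 to s53, with toy stand-ins for the residual encoder and the self-attention encoder; the layer sizes, the mean-pooling choice, and the broadcast-and-concatenate splicing are illustrative assumptions, not the architecture of the present application.

```python
import torch
import torch.nn as nn

class ResidualEncoder(nn.Module):
    """Pools a frame sequence into one timbre code; layout is illustrative only."""
    def __init__(self, feat_dim=13, code_dim=64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, code_dim), nn.ReLU(),
                                  nn.Linear(code_dim, code_dim))
    def forward(self, frames):                      # frames: (batch, time, feat_dim)
        return self.proj(frames).mean(dim=1)        # average over time -> (batch, code_dim)

class AttentionEncoder(nn.Module):
    """Self-attention over frames yields per-frame context (relation) codes."""
    def __init__(self, feat_dim=13, code_dim=64, heads=4):
        super().__init__()
        self.inp = nn.Linear(feat_dim, code_dim)
        layer = nn.TransformerEncoderLayer(d_model=code_dim, nhead=heads,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
    def forward(self, frames):                      # (batch, time, feat_dim)
        return self.encoder(self.inp(frames))       # (batch, time, code_dim)

def audio_coding_value(frames, residual_enc, attention_enc):
    timbre_code = residual_enc(frames)                          # (batch, code_dim)
    relation_code = attention_enc(frames)                       # (batch, time, code_dim)
    # Splice: broadcast the timbre code along time and concatenate with the relation code.
    timbre_expanded = timbre_code.unsqueeze(1).expand(-1, relation_code.size(1), -1)
    return torch.cat([relation_code, timbre_expanded], dim=-1)  # (batch, time, 2*code_dim)

frames = torch.randn(2, 59, 13)   # e.g. BFCC-like frames from the previous step
coding = audio_coding_value(frames, ResidualEncoder(), AttentionEncoder())
print(coding.shape)               # torch.Size([2, 59, 128])
```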
Optionally, the computer device may perform compensation processing on the candidate timbre feature coding value of the sample audio data Y_i; specifically, step s51 may include the following steps s61 to s63:
s61, performing residual coding on the discrete energy feature information of the sample audio data Y_i to obtain a candidate timbre feature coding value of the sample audio data Y_i.
s62, compensating the candidate timbre feature coding value of the sample audio data Y_i based on the object information associated with the sample audio data Y_i to obtain a compensated timbre feature coding value of the sample audio data Y_i.
s63, determining the compensated timbre feature coding value of the sample audio data Y_i as the timbre feature coding value of the sample audio data Y_i.
In steps s61 to s63, the computer device may perform residual coding on the discrete energy feature information of the sample audio data Y_i to obtain the candidate timbre feature coding value of the sample audio data Y_i; the candidate timbre feature coding value of the sample audio data Y_i may refer to an acoustically related dynamic feature reflecting the sample object, that is, the candidate timbre feature coding value changes as the sample audio data changes. To realize multi-object audio synthesis, the candidate timbre feature coding value of the sample audio data Y_i may be compensated based on the object information associated with the sample audio data Y_i to obtain the compensated timbre feature coding value of the sample audio data Y_i; the object information associated with the sample audio data Y_i refers to a static feature reflecting the sample object. Further, the compensated timbre feature coding value of the sample audio data Y_i may be determined as the timbre feature coding value of the sample audio data Y_i. The timbre feature coding value of the sample audio data Y_i can thus reflect both the dynamic timbre features and the static features of the sample object, which helps realize multi-object audio synthesis and provides a richer training corpus for the subsequent training process of the candidate audio synthesis model.
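A short PyTorch sketch of the compensation in steps s61 to s63, assuming the object information associated with the sample audio data is represented by a learned per-object embedding and that compensation is a simple addition; both are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CompensatedTimbreCoder(nn.Module):
    """Adds a static, object-level embedding to the dynamic candidate timbre code.

    The lookup-table embedding standing in for the associated object information
    and the additive compensation are assumptions for illustration.
    """
    def __init__(self, num_objects, code_dim=64):
        super().__init__()
        self.object_embedding = nn.Embedding(num_objects, code_dim)  # static per-object feature

    def forward(self, candidate_timbre_code, object_ids):
        # candidate_timbre_code: (batch, code_dim) dynamic feature from the residual encoder
        # object_ids:            (batch,)          index of the sample object P_n
        static = self.object_embedding(object_ids)
        return candidate_timbre_code + static       # compensated timbre feature coding value

coder = CompensatedTimbreCoder(num_objects=10)
candidate = torch.randn(4, 64)                       # candidate timbre codes for 4 samples
object_ids = torch.tensor([0, 3, 3, 7])              # sample objects the 4 samples belong to
compensated = coder(candidate, object_ids)
print(compensated.shape)                             # torch.Size([4, 64])
```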
Optionally, the relation feature encoding value adopts an attention encoder to encode the sample audio data YiThe method further comprises the following steps s 71-s 74:
s71, passing the relation feature encoding value of the sample audio data Yi to an object classification layer through a gradient inversion layer.
s72, identifying the relation feature encoding value of the sample audio data Yi with the object classification layer to obtain the category of the sample object Pn.
s73, determining the gradient error of the attention encoder according to the category of the sample object Pn.
s74, weighting the gradient error of the attention encoder with an inverse coefficient to obtain the encoding error of the attention encoder, and adjusting the attention encoder according to the encoding error.
In steps s71 to s74, as shown in fig. 5, the self-attention encoder and the residual encoder both belong to the encoder of an acoustic model. A gradient inversion layer is added at the output of the encoder of the acoustic model so that the encoder becomes independent of the sample object, and an object classifier (i.e., an object classification layer) is connected after the gradient inversion layer. The classification capability of the object classifier is optimized by minimizing a cross-entropy object classification loss function, while the gradient inversion layer simultaneously drives the encoder of the acoustic model to be independent of the sample object. Making the encoder of the acoustic model independent of the sample object greatly increases the diversity of the training corpus, thereby improving the stability and robustness of multi-speaker audio synthesis. Specifically, the computer device may pass the relation feature encoding value of the sample audio data Yi to the object classification layer through the gradient inversion layer, and then identify the relation feature encoding value of the sample audio data Yi with the object classification layer to obtain the category of the sample object Pn; the category of the sample object Pn may refer to a category reflecting the timbre characteristics of the sample object Pn. Further, the gradient error of the attention encoder is determined according to the category of the sample object Pn, that is, the categories of the sample object Pn and of neighboring sample objects are obtained and the gradient error of the attention encoder is determined from them. Furthermore, the gradient error of the attention encoder is weighted by an inverse coefficient (i.e., a negative number) to obtain the encoding error of the attention encoder, and the attention encoder is adjusted according to the encoding error. By adding the gradient inversion layer and the object classification layer, a reverse effect on the encoder is realized: the encoder becomes independent of the sample object and the diversity of the training corpus is improved, thereby improving the stability and robustness of multi-speaker audio synthesis.
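The gradient inversion layer itself can be written compactly. The sketch below is a common formulation (with assumed dimensions, not the reference implementation of this application): the layer is the identity in the forward pass and multiplies the gradient by a negative coefficient in the backward pass, before the object classification layer.

```python
import torch
import torch.nn as nn

class GradientInversion(torch.autograd.Function):
    """Identity in the forward pass; scales the gradient by a negative coefficient
    in the backward pass, so minimizing the object classification loss pushes the
    encoder toward object-independent encodings."""
    @staticmethod
    def forward(ctx, x, coeff):
        ctx.coeff = coeff
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.coeff * grad_output, None

class ObjectClassificationHead(nn.Module):
    def __init__(self, feat_dim=192, num_objects=8, coeff=1.0):
        super().__init__()
        self.coeff = coeff
        self.classifier = nn.Linear(feat_dim, num_objects)  # object classification layer

    def forward(self, relation_code):
        reversed_code = GradientInversion.apply(relation_code, self.coeff)
        return self.classifier(reversed_code)  # logits for the cross-entropy object loss
```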
S104, adjusting a candidate audio synthesis model by using the sample audio feature information of the sample audio data Yi and the object information associated with the sample audio data Yi, to obtain a target audio synthesis model; the target audio synthesis model is used to synthesize target audio data of a target object.
In the present application, the candidate audio synthesis model may refer to an audio synthesis model whose audio synthesis accuracy is relatively low; to improve the audio synthesis accuracy of the candidate audio synthesis model, the computer device may train it. Specifically, the computer device may take the sample audio feature information of the sample audio data Yi and the object information associated with the sample audio data Yi as a training corpus, train (i.e., adjust) the candidate audio synthesis model, and determine the trained candidate audio synthesis model as the target audio synthesis model. Training the candidate audio synthesis model with a diversified training corpus improves the robustness of the trained target audio synthesis model and the stability of the synthesized audio.
In the present application, at least two sample audio data are obtained, and the text characteristic information and the music score characteristic information of each of the at least two sample audio data are extracted; because the distributions of the text characteristic information and the music score characteristic information of the sample audio data differ, the pronunciation of the synthesized audio data is prone to being unstable. Therefore, the text characteristic information and the music score characteristic information of each sample audio data are normalized to obtain the audio feature information of each sample audio data, which helps reduce the distribution difference between the text characteristic information and the music score characteristic information of the sample audio data and thereby improves the pronunciation stability of the synthesized audio data. Further, the candidate audio synthesis model may be adjusted by using the sample audio feature information of the sample audio data Yi and the object information associated with the sample audio data Yi, to obtain the target audio synthesis model; the at least two sample audio data belong to at least two sample objects, that is, the candidate audio model is trained with sample audio data of a plurality of sample objects, which improves the diversity of the training corpus, improves the robustness of the target audio synthesis model, and avoids instability at high pitches, low pitches and sustained (lingering) notes.
Further, please refer to fig. 7, which is a schematic flowchart of an audio data processing method according to an embodiment of the present application. As shown in fig. 7, the method may be performed by a computer device, where the computer device may refer to the terminal in fig. 1, or to the server in fig. 1, or may include both the terminal and the server in fig. 1, that is, the method may be performed jointly by the terminal and the server in fig. 1. The method may include at least the following steps S201 to S208:
S201, obtaining at least two sample audio data, and object information associated with sample audio data Yi of the at least two sample audio data; the object information associated with the sample audio data Yi is used to indicate the sample object Pn to which the sample audio data Yi belongs, the at least two sample audio data belong to at least two sample objects, i is a positive integer less than or equal to M, M is the number of sample audio data in the at least two sample audio data, n is a positive integer less than or equal to Q, and Q is the number of objects in the at least two sample objects.
S202, performing feature extraction on the sample audio data Yi to obtain the text characteristic information and the music score characteristic information of the sample audio data Yi.
S203, normalizing the text characteristic information and the music score characteristic information of the sample audio data Yi to obtain the sample audio feature information of the sample audio data Yi.
S204, adjusting a candidate audio synthesis model by using the sample audio feature information of the sample audio data Yi and the object information associated with the sample audio data Yi, to obtain a target audio synthesis model; the target audio synthesis model is used to synthesize target audio data of a target object.
It should be noted that, in the present application, the explanation of step S201 may refer to the explanation of step S101 in fig. 3, the explanation of step S202 may refer to the explanation of step S102 in fig. 3, the explanation of step S203 may refer to the explanation of step S103 in fig. 3, and the explanation of step S204 may refer to the explanation of step S104 in fig. 3; repeated parts are not described again.
Optionally, the step S204 may include the following steps S81 to S84:
s81, obtaining the timbre characteristic information of the sample object Pn according to the object information associated with the sample audio data Yi.
s82, predicting, with the candidate audio synthesis model, the sample audio feature information of the sample audio data Yi and the timbre characteristic information of each sample object, to obtain the predicted audio data of the sample object Pn.
s83, adjusting the candidate audio synthesis model according to the predicted audio data of the sample object Pn and the sample audio data Yi, to obtain an adjusted candidate audio synthesis model.
s84, determining the adjusted candidate audio synthesis model as the target audio synthesis model.
In steps s81 to s84, in order to achieve a customized timbre for the sample object, the computer device may query a timbre database, according to the object information associated with the sample audio data Yi, for whether candidate timbre characteristic information of the sample object Pn is included. If the candidate timbre characteristic information of the sample object Pn is found in the timbre database, the candidate timbre characteristic information is determined as the timbre characteristic information of the sample object Pn. If no candidate timbre characteristic information of the sample object Pn is found in the timbre database, the timbre characteristic information of the sample object Pn may be generated from the sample audio data of the sample object Pn, and the timbre characteristic information of the sample object Pn is added to the timbre database; this avoids regenerating the timbre characteristic information of the sample object Pn the next time and thus saves resources. Further, the candidate audio synthesis model may predict the sample audio feature information of the sample audio data Yi and the timbre characteristic information of each sample object, to obtain the predicted audio data of the sample object Pn. If the difference between the predicted audio data of the sample object Pn and the sample audio data Yi is large, the audio synthesis accuracy of the candidate audio synthesis model is low; if the difference between the predicted audio data of the sample object Pn and the sample audio data Yi is small, the audio synthesis accuracy of the candidate audio synthesis model is high. Thus, the candidate audio synthesis model can be adjusted according to the predicted audio data of the sample object Pn and the sample audio data Yi to obtain an adjusted candidate audio synthesis model, and the adjusted candidate audio synthesis model is determined as the target audio synthesis model. Training the candidate audio synthesis model with the sample audio feature information of the sample audio data Yi and the timbre characteristic information of each sample object facilitates timbre customization for the object, allows audio data to be synthesized for multiple speakers, and improves the robustness of the candidate audio synthesis model.
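The query-then-cache behaviour of the timbre database can be sketched as follows; the dictionary-backed "database" and the callable names are assumptions used only for illustration:

```python
timbre_database = {}  # assumed in-memory stand-in for the timbre database, keyed by object id

def get_timbre_characteristic(object_id, sample_audio_of_object, extract_timbre):
    """Return cached timbre characteristic information for object_id if present;
    otherwise generate it from the object's sample audio data and cache it so it
    need not be regenerated next time."""
    if object_id in timbre_database:
        return timbre_database[object_id]
    timbre = extract_timbre(sample_audio_of_object)
    timbre_database[object_id] = timbre
    return timbre
```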
Optionally, if no candidate timbre characteristic information of the sample object Pn is found in the timbre database, the computer device may generate the candidate timbre characteristic information of the sample object Pn; specifically, the step s81 may include the following steps s91 to s93:
s91, determining the sample audio data belonging to the sample object Pn according to the object information associated with the sample audio data Yi.
s92, extracting a sample audio clip from the sample audio data belonging to the sample object Pn.
s93, performing timbre feature extraction on the sample audio clip to obtain the timbre characteristic information of the sample object Pn.
In steps s91 to s93, if no candidate timbre characteristic information of the sample object Pn is found in the timbre database, the computer device may determine, according to the object information associated with the sample audio data Yi, the sample audio data belonging to the sample object Pn from the at least two sample audio data. Then, a sample audio clip is randomly extracted from the sample audio data belonging to the sample object Pn, and timbre feature extraction is performed on the sample audio clip to obtain the timbre characteristic information of the sample object Pn. Determining the timbre characteristic information of the sample object from a small amount of sample audio data customizes the timbre of the sample object, saves cost, and improves the efficiency of generating the timbre characteristic information of the sample object.
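A small sketch of extracting a timbre summary from a randomly chosen clip of one sample audio waveform. The mean log-mel vector here stands in for the timbre characteristic information and is an assumption (a real system might use a dedicated speaker encoder); librosa is assumed to be available:

```python
import numpy as np
import librosa  # assumed available for mel-spectrogram computation

def timbre_from_random_clip(waveform, sr=24000, clip_seconds=3.0, seed=None):
    """Randomly cut a short clip from the sample audio waveform and summarise it
    as a fixed-size vector used as a stand-in for timbre characteristic information."""
    rng = np.random.default_rng(seed)
    clip_len = int(clip_seconds * sr)
    if len(waveform) > clip_len:
        start = int(rng.integers(0, len(waveform) - clip_len))
        waveform = waveform[start:start + clip_len]
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=80)
    return np.log(mel + 1e-6).mean(axis=1)  # shape (80,)
```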
Optionally, the step s82 may include the following steps s94 to s 95:
s94, determining the frame lengths of the pronunciation units of the sample audio data Yi according to the sample audio feature information of the sample audio data Yi.
s95, predicting, with the candidate audio synthesis model, the sample audio feature information of the sample audio data Yi, the frame lengths of the pronunciation units of the sample audio data Yi, and the timbre characteristic information of the sample object Pn, to obtain the predicted audio data of the sample object Pn.
In steps s94 to s95, the computer device may use a duration model to predict, from the sample audio feature information of the sample audio data Yi, the frame length of each pronunciation unit of the sample audio data Yi; that is, the frame length of a pronunciation unit of the sample audio data Yi indicates the duration of that pronunciation unit (the pronunciation duration). Further, the candidate audio synthesis model may predict the sample audio feature information of the sample audio data Yi, the frame lengths of the pronunciation units of the sample audio data Yi, and the timbre characteristic information of the sample object Pn, to obtain the predicted audio data of the sample object Pn. Obtaining the frame lengths of the pronunciation units of the sample audio data Yi helps make the audio data synthesized for the sample object (i.e., the predicted audio data) closer to the singing style of the sample object.
It should be noted that the sample audio feature information of the sample audio data Yi here may refer to the feature information obtained after the text characteristic information and the music score characteristic information have undergone normalization, residual encoding and other processing; this helps improve the accuracy with which the duration model predicts the frame lengths of the pronunciation units of the sample audio data Yi, so that the audio data synthesized by the candidate audio synthesis model is closer to the singing style of the sample object and personalized customization is achieved.
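For illustration, once the duration model has predicted a frame length for each pronunciation unit, the unit-level features can be expanded to frame level by a length-regulator-style repetition (a common formulation in duration-based synthesis; the duration model itself is not shown, and the sizes below are assumptions):

```python
import torch

def expand_by_frame_length(unit_features, frame_lengths):
    """Repeat each pronunciation unit's feature vector for its predicted number
    of frames, yielding frame-level features for the synthesis model."""
    # unit_features: (num_units, feat_dim); frame_lengths: (num_units,) integer tensor
    return torch.repeat_interleave(unit_features, frame_lengths, dim=0)

units = torch.randn(5, 256)                       # 5 pronunciation units (assumed feature size)
frame_lengths = torch.tensor([3, 7, 2, 5, 4])     # predicted frame length per unit
frames = expand_by_frame_length(units, frame_lengths)  # shape (21, 256)
```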
It should be noted that the candidate audio synthesis model may use a vocoder such as a 24 kHz LPCNet (LPC, Linear Predictive Coding, network) or WaveRNN (a recurrent waveform neural network), which is not limited in this application. The 24 kHz LPCNet is used to synthesize 24 kHz audio data: the 24 kHz audio data is divided into 24 frequency bands, and 24th-order LPC coefficients are used to predict the audio waveform. Under the same amount of computation, the 24 kHz LPCNet vocoder is more stable and produces clearer sound quality than WaveRNN.
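As a very rough illustration of the linear prediction an LPC-based vocoder relies on (not the LPCNet implementation itself), 24th-order LPC coefficients can be estimated from a signal's autocorrelation and used to predict each sample from the previous 24 samples; scipy is assumed to be available and the toy signal is an assumption:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(x, order=24):
    """Estimate LPC coefficients a[1..order] via the autocorrelation method,
    so that x[t] is approximated by sum_k a[k] * x[t-k]."""
    x = np.asarray(x, dtype=float)
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    return solve_toeplitz((r[:order], r[:order]), r[1:order + 1])

rng = np.random.default_rng(0)
t = np.arange(2400) / 24000.0
signal = np.sin(2 * np.pi * 440.0 * t) + 0.01 * rng.standard_normal(t.size)  # toy 440 Hz tone
a = lpc_coefficients(signal, order=24)
# One-step linear prediction of the last sample from the 24 samples before it.
predicted = np.dot(a, signal[-2:-26:-1])
print(predicted, signal[-1])
```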
Optionally, the step s83 may include the following steps s96 to s 98:
s96, obtaining the similarity between the predicted audio data of the sample object Pn and the sample audio data Yi.
s97, determining the prediction error of the candidate audio synthesis model according to the similarity.
s98, if the prediction error of the candidate audio synthesis model is not in a convergence state, adjusting the candidate audio synthesis model according to the prediction error of the candidate audio synthesis model to obtain the adjusted candidate audio synthesis model.
In steps s96 to s98, the computer device may compare the predicted audio data of the sample object Pn with the sample audio data Yi to determine the similarity between the predicted audio data of the sample object Pn and the sample audio data Yi. A larger similarity indicates a smaller difference between the synthesized predicted audio data and the sample audio data Yi, and thus a lower prediction error of the candidate audio synthesis model; conversely, a smaller similarity indicates a larger difference between the synthesized predicted audio data and the sample audio data Yi, and thus a higher prediction error of the candidate audio synthesis model. The computer device may therefore determine the prediction error of the candidate audio synthesis model according to the similarity. If the prediction error of the candidate audio synthesis model is in a convergence state, the prediction error is small and the candidate audio synthesis model can be determined as the target audio synthesis model. If the prediction error of the candidate audio synthesis model is not in a convergence state, the prediction error is relatively large, so the candidate audio synthesis model is adjusted according to its prediction error to obtain the adjusted candidate audio synthesis model. Adjusting the candidate audio synthesis model according to its prediction error improves the accuracy with which the target audio synthesis model synthesizes audio, so that the synthesized audio data is closer to the singing style of the target object.
In one embodiment, determining the prediction error of the candidate audio synthesis model according to the similarity may specifically include: accumulating the similarity corresponding to each sample audio data of the at least two sample audio data to obtain a total similarity, and determining the prediction error of the candidate audio synthesis model according to the total similarity. As shown in fig. 8, the computer device may input the audio feature information of the first sample audio data and the timbre characteristic information of the first sample object to which the first sample audio data belongs into the candidate audio synthesis model, generate the predicted audio data (i.e., synthesized audio data) of the first sample object, obtain the similarity between the first sample audio data and the predicted audio data of the first sample object, and likewise compute the similarity corresponding to each of the other sample audio data in the at least two sample audio data. The similarities of all sample audio data are then accumulated to obtain the total similarity, and the prediction error of the candidate audio synthesis model is determined according to the total similarity. If the prediction error of the candidate audio synthesis model is smaller than an error threshold, the prediction error is determined to be in a convergence state, and the candidate audio synthesis model is determined as the target audio synthesis model. If the prediction error of the candidate audio synthesis model is greater than or equal to the error threshold, the prediction error is determined not to be in a convergence state, the candidate audio synthesis model is adjusted according to the prediction error to obtain an adjusted candidate audio synthesis model, and the adjusted candidate audio synthesis model is determined as the target audio synthesis model.
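A simplified sketch of one adjustment round under these conventions; the model interface, the use of cosine similarity as the per-sample similarity, and the error threshold are all assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def adjust_candidate_model(model, optimizer, samples, error_threshold=1e-2):
    """Accumulate a per-sample error (1 - similarity) over all sample audio data;
    if the accumulated prediction error is below the threshold the model is kept
    as the target model, otherwise it is adjusted by one gradient step."""
    prediction_error = torch.zeros(())
    for audio_feat, timbre_feat, target_audio in samples:
        predicted_audio = model(audio_feat, timbre_feat)
        similarity = F.cosine_similarity(predicted_audio.flatten(),
                                         target_audio.flatten(), dim=0)
        prediction_error = prediction_error + (1.0 - similarity)
    if prediction_error.item() < error_threshold:
        return True   # converged: candidate model becomes the target model
    optimizer.zero_grad()
    prediction_error.backward()
    optimizer.step()
    return False      # not converged: keep adjusting
```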
S205, obtaining a reference audio clip of the target object, and the text characteristic information and the music score characteristic information of the target audio data to be synthesized.
S206, generating the timbre characteristic information of the target object according to the reference audio clip of the target object.
S207, normalizing the text characteristic information and the music score characteristic information of the target audio data to obtain the audio feature information of the target audio data.
S208, synthesizing the audio feature information of the target audio data and the timbre characteristic information of the target object with the target audio synthesis model, to obtain the target audio data belonging to the target object.
In steps S205 to S208, after obtaining the target audio synthesis model, the computer device may use it to synthesize audio data for the target object. Specifically, the computer device may obtain a reference audio clip of the target object together with the text characteristic information and the music score characteristic information of the target audio data to be synthesized, and perform timbre extraction on the reference audio clip of the target object to obtain the timbre characteristic information of the target object; the timbre characteristic information of the target object refers to information reflecting the pronunciation style of the target object, and the target object may refer to a user or a virtual user. Further, the text characteristic information and the music score characteristic information of the target audio data are normalized to obtain the audio feature information of the target audio data, and the target audio synthesis model synthesizes the audio feature information of the target audio data and the timbre characteristic information of the target object to obtain the target audio data belonging to the target object. Normalizing the text characteristic information and the music score characteristic information of the target audio data reduces their distribution difference and improves the pronunciation stability of the synthesized target audio data. Moreover, because the target audio data is synthesized from a reference audio clip of the target object, timbre customization for the target object can be achieved with a small amount of reference audio data, which reduces the cost of synthesizing audio data and improves its quality.
In summary, normalizing the text characteristic information and the music score characteristic information of the target audio data reduces the distribution difference between the text characteristic information and the music score characteristic information of the target audio data and improves the pronunciation stability of the synthesized target audio data; and synthesizing the target audio data from a reference audio clip of the target object allows timbre customization for the target object with a small amount of reference audio data, which reduces the cost of synthesizing audio data and improves the quality of the synthesized audio data.
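Putting the inference steps S205 to S208 together, a minimal sketch might look as follows; every callable here (timbre extraction, normalization, the trained model) is assumed rather than specified by this application:

```python
def synthesize_target_audio(target_model, extract_timbre, normalize_features,
                            reference_clip, text_features, score_features):
    """Derive the target object's timbre from a short reference audio clip,
    normalize the text and music score features of the audio to be synthesized
    into audio feature information, and let the target audio synthesis model
    produce the target audio data."""
    timbre_features = extract_timbre(reference_clip)                     # step S206
    audio_features = normalize_features(text_features, score_features)  # step S207
    return target_model(audio_features, timbre_features)                # step S208
```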
Fig. 9 is a schematic structural diagram of an audio data processing apparatus 1 according to an embodiment of the present application. The audio data processing apparatus 1 may be a computer program (including program code) running in a computer device; for example, the audio data processing apparatus 1 is application software. The apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application. As shown in fig. 9, the audio data processing apparatus 1 may include: an obtaining module 901, an extraction module 902, a processing module 903, an adjustment module 904, and a synthesis module 905.
the obtaining module is configured to obtain at least two sample audio data and object information associated with sample audio data Yi of the at least two sample audio data; the object information associated with the sample audio data Yi is used to indicate the sample object Pn to which the sample audio data Yi belongs, the at least two sample audio data belong to at least two sample objects, i is less than or equal to M, M is the number of sample audio data in the at least two sample audio data, n is less than or equal to Q, and Q is the number of objects in the at least two sample objects;
the extraction module is configured to perform feature extraction on the sample audio data Yi to obtain the text characteristic information and the music score characteristic information of the sample audio data Yi;
the processing module is configured to normalize the text characteristic information and the music score characteristic information of the sample audio data Yi to obtain the sample audio feature information of the sample audio data Yi;
the adjustment module is configured to adjust a candidate audio synthesis model by using the sample audio feature information of the sample audio data Yi and the object information associated with the sample audio data Yi, to obtain a target audio synthesis model; the target audio synthesis model is used to synthesize target audio data of a target object.
Optionally, the processing module normalizes the text characteristic information and the music score characteristic information of the sample audio data Yi to obtain the sample audio feature information of the sample audio data Yi by:
splicing the text characteristic information and the music score characteristic information of the sample audio data Yi to obtain spliced feature information;
normalizing the spliced feature information to obtain the sample audio feature information of the sample audio data Yi.
Optionally, the processing module normalizes the spliced feature information to obtain the sample audio feature information of the sample audio data Yi by:
normalizing the spliced feature information to obtain candidate sample audio feature information of the sample audio data Yi;
encoding the candidate sample audio feature information of the sample audio data Yi to obtain an audio encoding value of the sample audio data Yi;
determining the audio encoding value of the sample audio data Yi as the sample audio feature information of the sample audio data Yi.
Optionally, the processing module encodes the candidate sample audio feature information of the sample audio data Yi to obtain the audio encoding value of the sample audio data Yi by:
performing frequency-domain transformation on the candidate sample audio feature information of the sample audio data Yi to obtain frequency-domain feature information of the sample audio data Yi;
generating energy characteristic information of the sample audio data Yi according to the frequency-domain feature information of the sample audio data Yi;
encoding the energy characteristic information of the sample audio data Yi to obtain the audio encoding value of the sample audio data Yi.
Optionally, the processing module encodes the energy characteristic information of the sample audio data Yi to obtain the audio encoding value of the sample audio data Yi by:
filtering the energy characteristic information of the sample audio data Yi to obtain effective energy characteristic information of the sample audio data Yi;
discretizing the effective energy characteristic information of the sample audio data Yi to obtain discrete energy characteristic information of the sample audio data Yi;
encoding the discrete energy characteristic information of the sample audio data Yi to obtain the audio encoding value of the sample audio data Yi.
Optionally, the processing module encodes the discrete energy characteristic information of the sample audio data Yi to obtain the audio encoding value of the sample audio data Yi by:
performing residual encoding on the discrete energy characteristic information of the sample audio data Yi to obtain the timbre feature encoding value of the sample audio data Yi;
performing attention encoding on the discrete energy characteristic information of the sample audio data Yi to obtain the context feature encoding value reflecting the sample audio data Yi;
splicing the timbre feature encoding value and the relation feature encoding value of the sample audio data Yi to obtain the audio encoding value of the sample audio data Yi.
Optionally, the processing module performs residual encoding on the discrete energy characteristic information of the sample audio data Yi to obtain the timbre feature encoding value of the sample audio data Yi by:
performing residual encoding on the discrete energy characteristic information of the sample audio data Yi to obtain the candidate timbre feature encoding value of the sample audio data Yi;
compensating the candidate timbre feature encoding value of the sample audio data Yi according to the object information associated with the sample audio data Yi to obtain the compensated timbre feature encoding value of the sample audio data Yi;
determining the compensated timbre feature encoding value of the sample audio data Yi as the timbre feature encoding value of the sample audio data Yi.
Optionally, the relation feature encoding value is obtained by the processing module by performing attention encoding on the discrete energy characteristic information of the sample audio data Yi with an attention encoder, and the processing module is further configured to:
pass the relation feature encoding value of the sample audio data Yi to an object classification layer through a gradient inversion layer;
identify the relation feature encoding value of the sample audio data Yi with the object classification layer to obtain the category of the sample object Pn;
determine the gradient error of the attention encoder according to the category of the sample object Pn;
weight the gradient error of the attention encoder with an inverse coefficient to obtain the encoding error of the attention encoder, and adjust the attention encoder according to the encoding error.
Optionally, the adjustment module adjusts the candidate audio synthesis model by using the sample audio feature information of the sample audio data Yi and the object information associated with the sample audio data Yi, to obtain the target audio synthesis model, by:
obtaining the timbre characteristic information of the sample object Pn according to the object information associated with the sample audio data Yi;
predicting, with the candidate audio synthesis model, the sample audio feature information of the sample audio data Yi and the timbre characteristic information of each sample object, to obtain the predicted audio data of the sample object Pn;
adjusting the candidate audio synthesis model according to the predicted audio data of the sample object Pn and the sample audio data Yi, to obtain an adjusted candidate audio synthesis model;
determining the adjusted candidate audio synthesis model as the target audio synthesis model.
Optionally, the adjustment module obtains the timbre characteristic information of the sample object Pn according to the object information associated with the sample audio data Yi by:
determining the sample audio data belonging to the sample object Pn according to the object information associated with the sample audio data Yi;
extracting a sample audio clip from the sample audio data belonging to the sample object Pn;
performing timbre feature extraction on the sample audio clip to obtain the timbre characteristic information of the sample object Pn.
Optionally, the adjustment module predicts, with the candidate audio synthesis model, the sample audio feature information of the sample audio data Yi and the timbre characteristic information of each sample object, to obtain the predicted audio data of the sample object Pn, by:
determining the frame lengths of the pronunciation units of the sample audio data Yi according to the sample audio data Yi;
predicting, with the candidate audio synthesis model, the sample audio feature information of the sample audio data Yi, the frame lengths of the pronunciation units of the sample audio data Yi, and the timbre characteristic information of the sample object Pn, to obtain the predicted audio data of the sample object Pn.
Optionally, the adjustment module adjusts the candidate audio synthesis model according to the predicted audio data of the sample object Pn and the sample audio data Yi, to obtain the adjusted candidate audio synthesis model, by:
obtaining the similarity between the predicted audio data of the sample object Pn and the sample audio data Yi;
determining the prediction error of the candidate audio synthesis model according to the similarity;
if the prediction error of the candidate audio synthesis model is not in a convergence state, adjusting the candidate audio synthesis model according to the prediction error of the candidate audio synthesis model to obtain the adjusted candidate audio synthesis model.
Optionally, the obtaining module is further configured to obtain a reference audio fragment of the target object, and text characteristic information and score feature information of target audio data to be synthesized;
optionally, the extracting module is further configured to generate tone characteristic information of the target object according to the reference audio segment of the target object;
optionally, the processing module is further configured to perform normalization processing on the text characteristic information and the score characteristic information of the target audio data to obtain audio characteristic information of the target audio data;
optionally, the apparatus further comprises:
and the synthesis module is used for synthesizing the audio characteristic information of the target audio data and the tone characteristic information of the target object by adopting the target audio synthesis model to obtain the target audio data belonging to the target object.
According to an embodiment of the present application, the steps involved in the audio data processing method shown in fig. 4 may be performed by the corresponding modules in the audio data processing apparatus shown in fig. 9. For example, step S101 shown in fig. 4 may be performed by the obtaining module 901 in fig. 9, step S102 shown in fig. 4 may be performed by the extraction module 902 in fig. 9, step S103 shown in fig. 4 may be performed by the processing module 903 in fig. 9, and step S104 shown in fig. 4 may be performed by the adjustment module 904 in fig. 9.
According to an embodiment of the present application, the modules of the audio data processing apparatus shown in fig. 9 may be combined, individually or entirely, into one or several units, or some unit(s) may be further split into at least two functionally smaller sub-units; either way, the same operations can be implemented without affecting the technical effects of the embodiments of the present application. The modules are divided based on logical functions; in practical applications, the function of one module may also be implemented by at least two units, or the functions of at least two modules may be implemented by one unit. In other embodiments of the present application, the audio data processing apparatus may also include other units; in practical applications, these functions may also be implemented with the assistance of other units and may be implemented cooperatively by at least two units.
According to an embodiment of the present application, the audio data processing apparatus as shown in fig. 9 may be constructed by running a computer program (including program codes) capable of executing the steps involved in the corresponding method as shown in fig. 4 on a general-purpose computer device such as a computer including a processing element such as a Central Processing Unit (CPU), a random access storage medium (RAM), a read-only storage medium (ROM), and a storage element, and the audio data processing method of the embodiment of the present application may be implemented. The computer program may be recorded on a computer-readable recording medium, for example, and loaded into and executed by the computing apparatus via the computer-readable recording medium.
In the present application, at least two sample audio data are obtained, and the text characteristic information and the music score characteristic information of each of the at least two sample audio data are extracted; because the distributions of the text characteristic information and the music score characteristic information of the sample audio data differ, the pronunciation of the synthesized audio data is prone to being unstable. Therefore, the text characteristic information and the music score characteristic information of each sample audio data are normalized to obtain the audio feature information of each sample audio data, which helps reduce the distribution difference between the text characteristic information and the music score characteristic information of the sample audio data and thereby improves the pronunciation stability of the synthesized audio data. Further, the candidate audio synthesis model may be adjusted by using the sample audio feature information of the sample audio data Yi and the object information associated with the sample audio data Yi, to obtain the target audio synthesis model; the at least two sample audio data belong to at least two sample objects, that is, the candidate audio model is trained with sample audio data of a plurality of sample objects, which improves the diversity of the training corpus, improves the robustness of the target audio synthesis model, and avoids instability at high pitches, low pitches and sustained (lingering) notes.
Fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 10, the computer device 1000 may include: a processor 1001, a network interface 1004 and a memory 1005; in addition, the computer device 1000 may further include an object interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The object interface 1003 may include a display (Display) and a keyboard (Keyboard), and optionally may also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. The memory 1005 may optionally also be at least one storage device located remotely from the processor 1001. As shown in fig. 10, the memory 1005, as a computer-readable storage medium, may include an operating system, a network communication module, an object interface module and a device control application program.
In the computer device 1000 shown in fig. 10, the network interface 1004 may provide a network communication function; the object interface 1003 is an interface for providing input to an object; and the processor 1001 may be used to invoke a device control application stored in the memory 1005 to implement:
obtaining at least two sample audio data, and object information associated with sample audio data Yi of the at least two sample audio data; the object information associated with the sample audio data Yi is used to indicate the sample object Pn to which the sample audio data Yi belongs, the at least two sample audio data belong to at least two sample objects, i is a positive integer less than or equal to M, M is the number of sample audio data in the at least two sample audio data, n is a positive integer less than or equal to Q, and Q is the number of objects in the at least two sample objects;
performing feature extraction on the sample audio data Yi to obtain the text characteristic information and the music score characteristic information of the sample audio data Yi;
normalizing the text characteristic information and the music score characteristic information of the sample audio data Yi to obtain the sample audio feature information of the sample audio data Yi;
adjusting a candidate audio synthesis model by using the sample audio feature information of the sample audio data Yi and the object information associated with the sample audio data Yi, to obtain a target audio synthesis model; the target audio synthesis model is used to synthesize target audio data of a target object.
Optionally, the processor 1001 may be configured to call the device control application program stored in the memory 1005 to normalize the text characteristic information and the music score characteristic information of the sample audio data Yi to obtain the sample audio feature information of the sample audio data Yi by:
for the sample audio data YiPerforming splicing processing on the text characteristic information and the music score characteristic information to obtain spliced characteristic information;
normalizing the spliced characteristic information to obtain the sample audio data YiThe sample audio feature information of (1).
Optionally, the processor 1001 may be configured to call a device control application program stored in the memory 1005, so as to perform normalization processing on the spliced feature information to obtain the sample audio data YiThe sample audio feature information of (1), comprising:
normalizing the spliced characteristic information to obtain the sample audio data YiThe candidate sample audio feature information of (1);
for the sample audio data YiThe candidate sample audio characteristic information is coded to obtain the sample audio data YiThe audio coding value of (a);
sample audio data YiIs determined as the sample audio data YiThe sample audio feature information of (1).
Optionally, the processor 1001 may be configured to call a device control application stored in the memory 1005 to implement the sample audio data YiThe candidate sample audio characteristic information is coded to obtain the sample audio data YiComprises:
for the sample audio data YiThe candidate sample audio characteristic information is subjected to frequency domain transformation to obtain the sample audio data YiFrequency domain feature information of (1);
according to the sample audio data YiGenerates the sample audio data YiEnergy characteristic information of (a);
for the sample audio data YiThe energy characteristic information of the audio signal is encoded to obtain the sample audio data YiThe audio coding value of (1).
Optionally, the processor 1001 may be configured to call a device control application stored in the memory 1005 to implement the sample audio data YiThe energy characteristic information of the audio signal is encoded to obtain the sample audio data YiComprises:
for the sample audio data YiFiltering the energy characteristic information to obtain the sample audio data YiEffective energy characteristic information of (1);
for the sample audio data YiThe effective energy characteristic information is subjected to discretization processing to obtain the sample audio data YiDiscrete energy characteristic information of (a);
for the sample audio data YiThe discrete energy characteristic information of the audio signal is encoded to obtain the sample audio data YiThe audio coding value of (1).
Optionally, the processor 1001 may be configured to call a device control application stored in the memory 1005 to implement the sample audio data YiThe discrete energy characteristic information of the audio signal is encoded to obtain the sample audio data YiComprises:
for the sample audio data YiResidual coding is carried out on the discrete energy characteristic information to obtain the sample audio data YiThe timbre characteristic coding value of;
for the sample audio data YiThe discrete energy characteristic information is subjected to attention coding to obtain the audio data Y for reflecting the sampleiThe context feature encoding value of (1);
the sample audio data YiThe timbre characteristic coding value and the relation characteristic coding value are spliced to obtain the sample audio data YiThe audio coding value of (1).
Optionally, the processor 1001 may be configured to call the device control application program stored in the memory 1005 to perform residual encoding on the discrete energy characteristic information of the sample audio data Yi to obtain the timbre feature encoding value of the sample audio data Yi by:
for the sample audio data YiResidual coding is carried out on the discrete energy characteristic information to obtain the sample audio data YiThe candidate timbre feature coding value of (1);
according to the sample audio data YiAssociated object information for the sample audio data YiThe candidate tone characteristic coding value is compensated to obtain the sample audio data YiCompensating tone color feature coding values;
the sample audio data YiIs determined as the sample audio data YiThe timbre characteristic encoding values of (1).
Optionally, the relation feature encoding value is obtained by performing attention encoding on the discrete energy characteristic information of the sample audio data Yi with an attention encoder, and the processor 1001 may be configured to call the device control application program stored in the memory 1005 to implement:
applying a gradient inversion layer to the sample audio data YiThe relation characteristic coding value is transmitted to the object classification layer;
applying the object classifier to the sample audio data YiThe relation characteristic coding value of (2) is identified to obtain the sample object PnA category of (1);
according to the sample object PnDetermines a gradient error of the attention encoder;
weighting the gradient error of the attention encoder by adopting a reverse coefficient to obtain the coding error of the attention encoder, and adjusting the attention encoder according to the coding error.
Optionally, the processor 1001 may be configured to call the device control application program stored in the memory 1005 to adjust the candidate audio synthesis model by using the sample audio feature information of the sample audio data Yi and the object information associated with the sample audio data Yi, to obtain the target audio synthesis model, by:
according to the sample audio data YiAssociated object information, obtaining the sample object PnTone color feature information of;
applying a candidate audio synthesis model to the sample audio data YiPredicting the sample audio characteristic information and the tone characteristic information of each sample object to obtain the sample object PnThe predicted audio data of (1);
according to the sample object PnAnd the sample audio data YiAdjusting the candidate audio synthesis model to obtain an adjusted candidate audio synthesis model;
and determining the adjusted candidate audio synthesis model as a target audio synthesis model.
Optionally, the processor 1001 may be configured to call a device control application stored in the memory 1005 to implement the method according to the sample audio data YiAssociated object information, obtaining the sample object PnThe tone color feature information of (1), comprising:
according to the sample audio data YiAssociated object information determining the object P belonging to the samplenThe sample audio data of (1);
subject to the sample object PnExtracting a sample audio clip from the sample audio data;
extracting tone characteristic of the sample audio clip to obtain the sample object PnThe tone color characteristic information of (1).
Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement the candidate audio synthesis model for the sample audio data YiPredicting the sample audio characteristic information and the tone characteristic information of each sample object to obtain the sample object PnThe predicted audio data of (1), comprising:
according to the sample audio data YiDetermining the sample audio data YiThe frame length of the pronunciation unit;
applying a candidate audio synthesis model to the sample audio data YiSample audio feature information of (1), the sample audio data YiAnd the frame length of the pronunciation unit and the sample object PnThe timbre characteristic information of the sample object P is predicted to obtain the sample object PnThe predicted audio data of (1).
Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement the method according to the sample object PnAnd the sample audio data YiAdjusting the candidate audio synthesis model to obtain an adjusted candidate audio synthesis model, including:
obtaining the sample object PnThe predicted audio data and the sample audio data YiThe similarity between them;
determining a prediction error of the candidate audio synthesis model according to the similarity;
and if the prediction error of the candidate audio synthesis model is not in a convergence state, adjusting the candidate audio synthesis model according to the prediction error of the candidate audio synthesis model to obtain the adjusted candidate audio synthesis model.
Optionally, the processor 1001 may be configured to invoke a device control application stored in the memory 1005 to implement:
acquiring a reference audio fragment of a target object, and text characteristic information and music score characteristic information of target audio data to be synthesized;
generating tone characteristic information of the target object according to the reference audio segment of the target object;
performing normalization processing on the text characteristic information and the music score characteristic information of the target audio data to obtain audio characteristic information of the target audio data;
and synthesizing the audio characteristic information of the target audio data and the tone characteristic information of the target object by adopting the target audio synthesis model to obtain the target audio data belonging to the target object.
In the present application, at least two sample audio data are obtained, and the text characteristic information and the music score characteristic information of each of the at least two sample audio data are extracted; because the distributions of the text characteristic information and the music score characteristic information of the sample audio data differ, the pronunciation of the synthesized audio data is prone to being unstable. Therefore, the text characteristic information and the music score characteristic information of each sample audio data are normalized to obtain the audio feature information of each sample audio data, which helps reduce the distribution difference between the text characteristic information and the music score characteristic information of the sample audio data and thereby improves the pronunciation stability of the synthesized audio data. Further, the candidate audio synthesis model may be adjusted by using the sample audio feature information of the sample audio data Yi and the object information associated with the sample audio data Yi, to obtain the target audio synthesis model; the at least two sample audio data belong to at least two sample objects, that is, the candidate audio model is trained with sample audio data of a plurality of sample objects, which improves the diversity of the training corpus, improves the robustness of the target audio synthesis model, and avoids instability at high pitches, low pitches and sustained (lingering) notes.
It should be understood that the computer device 1000 described in this embodiment of the present application may perform the description of the audio data processing method in the embodiment corresponding to fig. 4 and fig. 7, and may also perform the description of the audio data processing apparatus in the embodiment corresponding to fig. 9, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
Further, here, it is to be noted that: an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program executed by the aforementioned audio data processing apparatus, and the computer program includes program instructions, and when the processor executes the program instructions, the descriptions of the audio data processing method in the embodiments corresponding to fig. 4 and fig. 7 can be executed, so that the descriptions will not be repeated here. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in embodiments of the computer-readable storage medium referred to in the present application, reference is made to the description of embodiments of the method of the present application.
As an example, the program instructions described above may be executed on one computer device, or on at least two computer devices located at one site, or on at least two computer devices distributed over at least two sites and interconnected by a communication network, and the at least two computer devices distributed over at least two sites and interconnected by the communication network may constitute a blockchain network.
The computer readable storage medium may be the data processing apparatus provided in any of the foregoing embodiments or an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Memory Card (SMC), a Secure Digital (SD) card, a flash card (flash card), and the like, provided on the computer device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the computer device. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
The terms "first," "second," and the like in the description and in the claims and drawings of the embodiments of the present application are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprises" and any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, product, or apparatus that comprises a list of steps or elements is not limited to the listed steps or modules, but may alternatively include other steps or modules not listed or inherent to such process, method, apparatus, product, or apparatus.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described in a functional general in the foregoing description for the purpose of illustrating clearly the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The method and the related apparatus provided by the embodiments of the present application are described with reference to the flowchart and/or the structural diagram of the method provided by the embodiments of the present application, and each flow and/or block of the flowchart and/or the structural diagram of the method, and the combination of the flow and/or block in the flowchart and/or the block diagram can be specifically implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block or blocks.
The above disclosure describes only preferred embodiments of the present application and is not to be construed as limiting the scope of the present application; the present application is therefore not limited thereto, and equivalent variations and modifications still fall within the scope of the present application.

Claims (16)

1. A method of audio data processing, comprising:
obtaining at least two sample audio data and object information associated with sample audio data Y_i of the at least two sample audio data; the object information associated with the sample audio data Y_i is used for indicating a sample object P_n to which the sample audio data Y_i belongs; the at least two sample audio data belong to at least two sample objects, i is a positive integer less than or equal to M, M is the number of sample audio data in the at least two sample audio data, n is a positive integer less than or equal to Q, and Q is the number of objects in the at least two sample objects;
performing feature extraction on the sample audio data Y_i to obtain text characteristic information and music score characteristic information of the sample audio data Y_i;
performing normalization processing on the text characteristic information and the music score characteristic information of the sample audio data Y_i to obtain sample audio feature information of the sample audio data Y_i;
adjusting a candidate audio synthesis model by using the sample audio feature information of the sample audio data Y_i and the object information associated with the sample audio data Y_i to obtain a target audio synthesis model; the target audio synthesis model is used for synthesizing target audio data of a target object.
2. The method of claim 1, wherein performing normalization processing on the text characteristic information and the music score characteristic information of the sample audio data Y_i to obtain the sample audio feature information of the sample audio data Y_i comprises:
performing splicing processing on the text characteristic information and the music score characteristic information of the sample audio data Y_i to obtain spliced characteristic information;
performing normalization processing on the spliced characteristic information to obtain the sample audio feature information of the sample audio data Y_i.
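As an illustrative sketch of the splicing and normalization processing in claim 2, assuming the text characteristic information and music score characteristic information of one sample have already been extracted as per-pronunciation-unit NumPy arrays; the array names and dimensions below are hypothetical, not taken from the patent:

```python
import numpy as np

def splice_and_normalize(text_feat: np.ndarray, score_feat: np.ndarray) -> np.ndarray:
    """Concatenate (splice) text and music-score features per pronunciation unit,
    then z-score normalize each feature dimension so the two feature families
    share a comparable distribution."""
    # text_feat: (num_units, text_dim), score_feat: (num_units, score_dim)
    spliced = np.concatenate([text_feat, score_feat], axis=-1)
    mean = spliced.mean(axis=0, keepdims=True)
    std = spliced.std(axis=0, keepdims=True) + 1e-8  # avoid division by zero
    return (spliced - mean) / std

# Example with made-up dimensions: 42 pronunciation units,
# 128-dim text features and 16-dim music-score features.
text_feat = np.random.randn(42, 128)
score_feat = np.random.randn(42, 16)
sample_audio_feature_info = splice_and_normalize(text_feat, score_feat)
print(sample_audio_feature_info.shape)  # (42, 144)
```

Per-dimension z-score normalization is one common way to reduce the distribution gap between the two feature families; the patent does not fix a particular normalization scheme.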
3. The method of claim 2, wherein performing normalization processing on the spliced characteristic information to obtain the sample audio feature information of the sample audio data Y_i comprises:
performing normalization processing on the spliced characteristic information to obtain candidate sample audio feature information of the sample audio data Y_i;
encoding the candidate sample audio feature information of the sample audio data Y_i to obtain an audio coding value of the sample audio data Y_i;
determining the audio coding value of the sample audio data Y_i as the sample audio feature information of the sample audio data Y_i.
4. The method of claim 3, wherein encoding the candidate sample audio feature information of the sample audio data Y_i to obtain the audio coding value of the sample audio data Y_i comprises:
performing frequency domain transformation on the candidate sample audio feature information of the sample audio data Y_i to obtain frequency domain feature information of the sample audio data Y_i;
generating energy characteristic information of the sample audio data Y_i according to the frequency domain feature information of the sample audio data Y_i;
encoding the energy characteristic information of the sample audio data Y_i to obtain the audio coding value of the sample audio data Y_i.
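A minimal sketch of one plausible reading of claim 4's frequency domain transformation followed by energy characteristic generation, using a short-time Fourier transform and per-frame log energy; the frame length, hop size, and input waveform are illustrative assumptions:

```python
import numpy as np

def frame_energy_features(waveform: np.ndarray, frame_len: int = 1024, hop: int = 256) -> np.ndarray:
    """Short-time frequency-domain transform of a mono waveform followed by
    per-frame log energy (one stand-in for 'energy characteristic information')."""
    window = np.hanning(frame_len)
    num_frames = 1 + max(0, (len(waveform) - frame_len) // hop)
    spectra = []
    for k in range(num_frames):
        frame = waveform[k * hop: k * hop + frame_len] * window
        spectra.append(np.abs(np.fft.rfft(frame)))   # frequency-domain feature information
    spectra = np.stack(spectra)                       # (num_frames, frame_len // 2 + 1)
    energy = (spectra ** 2).sum(axis=-1)              # per-frame energy
    return np.log(energy + 1e-10)                     # log-energy is numerically friendlier

waveform = np.random.randn(16000)   # one second of fake 16 kHz audio
log_energy = frame_energy_features(waveform)
print(log_energy.shape)
```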
5. The method of claim 4, wherein encoding the energy characteristic information of the sample audio data Y_i to obtain the audio coding value of the sample audio data Y_i comprises:
filtering the energy characteristic information of the sample audio data Y_i to obtain effective energy characteristic information of the sample audio data Y_i;
performing discretization processing on the effective energy characteristic information of the sample audio data Y_i to obtain discrete energy characteristic information of the sample audio data Y_i;
encoding the discrete energy characteristic information of the sample audio data Y_i to obtain the audio coding value of the sample audio data Y_i.
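One possible realization of the filtering and discretization steps of claim 5: low-energy frames are dropped as non-effective, and the remaining energy values are quantized into integer bins. The threshold and bin count are placeholders, not values given in the patent:

```python
import numpy as np

def filter_and_discretize(energy: np.ndarray, threshold: float = -18.0, num_bins: int = 256):
    """Keep only frames whose energy exceeds a threshold (effective energy),
    then map each remaining value to one of num_bins discrete levels."""
    effective = energy[energy > threshold]            # effective energy characteristic information
    lo, hi = effective.min(), effective.max()
    discrete = np.floor((effective - lo) / (hi - lo + 1e-8) * (num_bins - 1)).astype(np.int64)
    return effective, discrete

log_energy = np.random.uniform(-30.0, 0.0, size=240)  # fake per-frame log-energy values
effective, discrete = filter_and_discretize(log_energy)
print(effective.shape, int(discrete.min()), int(discrete.max()))
```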
6. The method of claim 5, wherein encoding the discrete energy characteristic information of the sample audio data Y_i to obtain the audio coding value of the sample audio data Y_i comprises:
performing residual coding on the discrete energy characteristic information of the sample audio data Y_i to obtain a timbre feature coding value of the sample audio data Y_i;
performing attention coding on the discrete energy characteristic information of the sample audio data Y_i to obtain a relation feature coding value for reflecting context features of the sample audio data Y_i;
splicing the timbre feature coding value and the relation feature coding value of the sample audio data Y_i to obtain the audio coding value of the sample audio data Y_i.
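A hedged PyTorch sketch of claim 6's two coding branches, residual coding for a timbre feature coding value and attention coding for a relation (context) feature coding value, followed by splicing; the layer sizes and block counts are assumptions rather than the patent's architecture:

```python
import torch
import torch.nn as nn

class ResidualEncoder(nn.Module):
    """Stack of linear blocks with skip connections: one plausible form of
    the residual coding that yields a timbre feature coding value."""
    def __init__(self, dim: int, num_blocks: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(num_blocks)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = x + block(x)          # residual connection
        return x

class AttentionEncoder(nn.Module):
    """Self-attention over the frame axis: one plausible form of the
    attention coding that yields a relation (context) feature coding value."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x)
        return out

dim = 64
x = torch.randn(2, 120, dim)          # (batch, frames, embedded discrete-energy features)
timbre_code = ResidualEncoder(dim)(x)
relation_code = AttentionEncoder(dim)(x)
audio_coding_value = torch.cat([timbre_code, relation_code], dim=-1)  # splicing of the two values
print(audio_coding_value.shape)       # torch.Size([2, 120, 128])
```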
7. The method of claim 6, wherein performing residual coding on the discrete energy characteristic information of the sample audio data Y_i to obtain the timbre feature coding value of the sample audio data Y_i comprises:
performing residual coding on the discrete energy characteristic information of the sample audio data Y_i to obtain a candidate timbre feature coding value of the sample audio data Y_i;
compensating the candidate timbre feature coding value of the sample audio data Y_i according to the object information associated with the sample audio data Y_i to obtain a compensated timbre feature coding value of the sample audio data Y_i;
determining the compensated timbre feature coding value of the sample audio data Y_i as the timbre feature coding value of the sample audio data Y_i.
8. The method of claim 7, wherein the relation feature coding value is obtained by performing attention coding on the discrete energy characteristic information of the sample audio data Y_i with an attention encoder, and the method further comprises:
transmitting the relation feature coding value of the sample audio data Y_i to an object classification layer through a gradient inversion layer;
identifying the relation feature coding value of the sample audio data Y_i by using the object classification layer to obtain a category of the sample object P_n;
determining a gradient error of the attention encoder according to the category of the sample object P_n;
weighting the gradient error of the attention encoder by using a reverse coefficient to obtain a coding error of the attention encoder, and adjusting the attention encoder according to the coding error.
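A minimal PyTorch sketch of the gradient inversion layer and object classification layer of claim 8: the forward pass is the identity, while the backward pass multiplies the gradient by a negative reverse coefficient, so the attention encoder is discouraged from encoding object (speaker) identity into the relation features. Dimensions, class count, and the reverse coefficient are hypothetical:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by a negative
    reverse coefficient in the backward pass."""
    @staticmethod
    def forward(ctx, x, coeff: float):
        ctx.coeff = coeff
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.coeff * grad_output, None

class ObjectClassifier(nn.Module):
    """Classifies which sample object a relation feature coding value came from;
    its input gradient is reversed before reaching the attention encoder."""
    def __init__(self, dim: int, num_objects: int):
        super().__init__()
        self.net = nn.Linear(dim, num_objects)

    def forward(self, relation_code: torch.Tensor, coeff: float = 1.0) -> torch.Tensor:
        reversed_code = GradReverse.apply(relation_code, coeff)
        return self.net(reversed_code)

relation_code = torch.randn(8, 64, requires_grad=True)   # per-utterance relation features
labels = torch.randint(0, 4, (8,))                        # which of 4 sample objects
logits = ObjectClassifier(64, 4)(relation_code, coeff=0.5)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()   # the gradient reaching relation_code is scaled by -0.5
```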
9. The method of any one of claims 1-8, wherein adjusting the candidate audio synthesis model by using the sample audio feature information of the sample audio data Y_i and the object information associated with the sample audio data Y_i to obtain the target audio synthesis model comprises:
obtaining timbre feature information of the sample object P_n according to the object information associated with the sample audio data Y_i;
predicting, by using the candidate audio synthesis model, the sample audio feature information of the sample audio data Y_i and the timbre feature information of each sample object to obtain predicted audio data of the sample object P_n;
adjusting the candidate audio synthesis model according to the predicted audio data of the sample object P_n and the sample audio data Y_i to obtain an adjusted candidate audio synthesis model;
determining the adjusted candidate audio synthesis model as the target audio synthesis model.
10. The method of claim 9, wherein obtaining the timbre feature information of the sample object P_n according to the object information associated with the sample audio data Y_i comprises:
determining, according to the object information associated with the sample audio data Y_i, the sample audio data belonging to the sample object P_n;
extracting a sample audio clip from the sample audio data belonging to the sample object P_n;
performing timbre feature extraction on the sample audio clip to obtain the timbre feature information of the sample object P_n.
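An illustrative sketch of claim 10's clip extraction and timbre feature extraction, using a time-averaged log-magnitude spectrum as a stand-in timbre vector; a real system would more plausibly use a learned timbre or speaker encoder, and all sizes here are invented:

```python
import numpy as np

def extract_clip(waveform: np.ndarray, clip_len: int, rng: np.random.Generator) -> np.ndarray:
    """Pick a random fixed-length clip from one of the object's samples."""
    start = rng.integers(0, max(1, len(waveform) - clip_len))
    return waveform[start: start + clip_len]

def timbre_embedding(clip: np.ndarray, frame_len: int = 1024, hop: int = 256) -> np.ndarray:
    """Time-averaged log-magnitude spectrum of the clip as a fixed-length
    stand-in for the object's timbre feature information."""
    window = np.hanning(frame_len)
    frames = [clip[k: k + frame_len] * window
              for k in range(0, len(clip) - frame_len + 1, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=-1))
    return np.log(spec + 1e-10).mean(axis=0)

rng = np.random.default_rng(0)
sample_audio = np.random.randn(48000)             # a fake 3 s recording of object P_n
clip = extract_clip(sample_audio, clip_len=16000, rng=rng)
print(timbre_embedding(clip).shape)               # (513,)
```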
11. The method of claim 9, wherein predicting, by using the candidate audio synthesis model, the sample audio feature information of the sample audio data Y_i and the timbre feature information of each sample object to obtain the predicted audio data of the sample object P_n comprises:
determining a frame length of a pronunciation unit in the sample audio data Y_i according to the sample audio data Y_i;
predicting, by using the candidate audio synthesis model, the sample audio feature information of the sample audio data Y_i, the frame length of the pronunciation unit in the sample audio data Y_i, and the timbre feature information of the sample object P_n to obtain the predicted audio data of the sample object P_n.
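A small sketch of how the frame length of each pronunciation unit can condition prediction, as one plausible reading of claim 11: per-unit features are repeated for as many frames as the unit lasts before being fed to the synthesis model. The durations and dimensions are invented:

```python
import numpy as np

def expand_by_frame_length(unit_features: np.ndarray, frame_lengths: np.ndarray) -> np.ndarray:
    """Repeat each pronunciation unit's feature vector for as many frames as
    that unit lasts, yielding a frame-aligned conditioning sequence."""
    return np.repeat(unit_features, frame_lengths, axis=0)

unit_features = np.random.randn(5, 144)        # 5 pronunciation units, 144-dim features
frame_lengths = np.array([12, 7, 20, 9, 15])   # hypothetical frames per unit
frame_level = expand_by_frame_length(unit_features, frame_lengths)
print(frame_level.shape)                       # (63, 144)
```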
12. The method of claim 9, wherein adjusting the candidate audio synthesis model according to the predicted audio data of the sample object P_n and the sample audio data Y_i to obtain the adjusted candidate audio synthesis model comprises:
obtaining a similarity between the predicted audio data of the sample object P_n and the sample audio data Y_i;
determining a prediction error of the candidate audio synthesis model according to the similarity;
if the prediction error of the candidate audio synthesis model is not in a convergence state, adjusting the candidate audio synthesis model according to the prediction error of the candidate audio synthesis model to obtain the adjusted candidate audio synthesis model.
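A hedged sketch of claim 12's adjustment loop: a similarity-based prediction error between the predicted audio data and the sample audio data is minimized until it stops improving, which stands in for the "convergence state" test. The metric, model, optimizer, and tolerance are assumptions, not specified by the patent:

```python
import torch
import torch.nn as nn

def adjust_until_converged(model: nn.Module, features: torch.Tensor, target_audio_feats: torch.Tensor,
                           lr: float = 1e-3, tol: float = 1e-4, max_steps: int = 1000) -> nn.Module:
    """Repeatedly measure a prediction error between predicted and sample
    audio features and stop adjusting once the error no longer improves."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    prev = float("inf")
    for _ in range(max_steps):
        pred = model(features)
        # L1 distance as the (inverse) similarity measure; the patent does not fix a metric
        error = nn.functional.l1_loss(pred, target_audio_feats)
        if prev - error.item() < tol:            # treated as the convergence state
            break
        optimizer.zero_grad()
        error.backward()
        optimizer.step()
        prev = error.item()
    return model

candidate_model = nn.Sequential(nn.Linear(144, 256), nn.ReLU(), nn.Linear(256, 80))
features = torch.randn(63, 144)                  # frame-level conditioning features
target = torch.randn(63, 80)                     # e.g. acoustic frames of sample audio data Y_i
target_model = adjust_until_converged(candidate_model, features, target)
```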
13. The method of claim 1, wherein the method further comprises:
acquiring a reference audio fragment of a target object, and text characteristic information and music score characteristic information of target audio data to be synthesized;
generating tone characteristic information of the target object according to the reference audio segment of the target object;
performing normalization processing on the text characteristic information and the music score characteristic information of the target audio data to obtain audio characteristic information of the target audio data;
and synthesizing the audio characteristic information of the target audio data and the tone characteristic information of the target object by adopting the target audio synthesis model to obtain the target audio data belonging to the target object.
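A self-contained sketch of the inference flow of claim 13 with placeholder data and a dummy stand-in for the target audio synthesis model; every name and shape below is hypothetical:

```python
import numpy as np

class DummyTargetModel:
    """Stand-in for the trained target audio synthesis model; it maps
    conditioning features and a timbre vector to fake acoustic frames
    so the pipeline can run end to end."""
    def synthesize(self, audio_feature_info: np.ndarray, timbre: np.ndarray) -> np.ndarray:
        num_frames = audio_feature_info.shape[0]
        return np.tanh(np.random.randn(num_frames, 80) + timbre.mean())

def normalize(spliced: np.ndarray) -> np.ndarray:
    return (spliced - spliced.mean(axis=0)) / (spliced.std(axis=0) + 1e-8)

# Inference-time flow of claim 13 with placeholder data:
reference_clip_timbre = np.random.randn(513)   # timbre features assumed already generated from the reference clip
text_feat = np.random.randn(42, 128)           # text characteristic info of the audio to synthesize
score_feat = np.random.randn(42, 16)           # music-score characteristic info of the audio to synthesize
audio_feature_info = normalize(np.concatenate([text_feat, score_feat], axis=-1))
target_audio = DummyTargetModel().synthesize(audio_feature_info, reference_clip_timbre)
print(target_audio.shape)                      # (42, 80) fake acoustic frames
```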
14. An audio data processing apparatus, comprising:
an obtaining module, configured to obtain at least two sample audio data and object information associated with sample audio data Y_i of the at least two sample audio data; the object information associated with the sample audio data Y_i is used for indicating a sample object P_n to which the sample audio data Y_i belongs; the at least two sample audio data belong to at least two sample objects, i is less than or equal to M, M is the number of sample audio data in the at least two sample audio data, n is less than or equal to Q, and Q is the number of objects in the at least two sample objects;
an extraction module, configured to perform feature extraction on the sample audio data Y_i to obtain text characteristic information and music score characteristic information of the sample audio data Y_i;
a processing module, configured to perform normalization processing on the text characteristic information and the music score characteristic information of the sample audio data Y_i to obtain sample audio feature information of the sample audio data Y_i;
an adjustment module, configured to adjust a candidate audio synthesis model by using the sample audio feature information of the sample audio data Y_i and the object information associated with the sample audio data Y_i to obtain a target audio synthesis model; the target audio synthesis model is used for synthesizing target audio data of a target object.
15. A computer device, comprising: a processor and a memory;
the processor is connected to the memory; the memory is configured to store program code, and the processor is configured to call the program code to perform the method of any one of claims 1 to 13.
16. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1-13.
CN202110949163.4A 2021-08-18 2021-08-18 Audio data processing method, device, equipment and storage medium Pending CN114299909A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110949163.4A CN114299909A (en) 2021-08-18 2021-08-18 Audio data processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110949163.4A CN114299909A (en) 2021-08-18 2021-08-18 Audio data processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114299909A true CN114299909A (en) 2022-04-08

Family

ID=80964057

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110949163.4A Pending CN114299909A (en) 2021-08-18 2021-08-18 Audio data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114299909A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
REG: Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40071963; Country of ref document: HK)