CN112735454A - Audio processing method and device, electronic equipment and readable storage medium
- Publication number
- CN112735454A (application number CN202011613263.1A)
- Authority
- CN
- China
- Prior art keywords
- audio
- pronunciation
- information compensation
- data
- sample
- Prior art date: 2020-12-30
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
- G10L13/10 - Prosody rules derived from text; stress or intonation determination
- G10L21/0208 - Noise filtering
- G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks
- G10L25/90 - Pitch determination of speech signals
- G10L2013/105 - Duration

(All classes fall under G10L - Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding.)
Abstract
The embodiment of the invention provides an audio processing method and apparatus, an electronic device, and a readable storage medium, relating to the field of computer technology. Because the information compensation model is trained against original audio samples, it acquires a strong information compensation capability. When the trained information compensation model performs information compensation on audio to be processed, the compensated portions of the target audio can closely resemble real sound, so the target audio sounds more authentic; in other words, the trained information compensation model has higher up-sampling accuracy.
Description
Technical Field
The present application relates to the field of computer technology, and in particular to an audio processing method and apparatus, an electronic device, and a readable storage medium.
Background
Currently, audio processing is applied in a variety of scenarios, for example to machine-synthesized speech (online education, video dubbing, commentary, and the like). In practical applications, common audio processing includes audio data compression and audio data restoration.
However, compression and restoration often lose part of the audio data, which reduces the accuracy of audio data restoration.
Disclosure of Invention
In view of this, embodiments of the present invention provide an audio processing method, an audio processing apparatus, an electronic device, and a readable storage medium, so that an information compensation model has a better information compensation capability and a higher up-sampling accuracy.
In a first aspect, an audio processing method is provided, where the method is applied to an electronic device, and the method includes:
acquiring audio to be processed; and
inputting the audio to be processed into a pre-trained information compensation model for processing, to obtain a target audio;
wherein the information compensation model is trained based on the following steps:
obtaining a training set, where the training set includes a plurality of sample groups, each sample group including a first audio sample subjected to dimensionality-reduction processing and the original audio sample corresponding to the first audio sample; and
training the information compensation model according to the training set.
Optionally, the acquiring of the audio to be processed includes:
acquiring original audio data; and
performing down-sampling processing on the original audio data to obtain the audio to be processed.
Optionally, the first audio sample includes preset noise data.
Optionally, the noise data comprises white noise and/or pink noise.
Optionally, the acquiring of the training set includes:
obtaining a plurality of original audio samples;
for an original audio sample, performing down-sampling processing on the original audio sample to obtain first audio data; and
combining each of a plurality of preset noise data with the first audio data, and determining a plurality of corresponding first audio samples, to obtain a plurality of sample groups corresponding to the original audio sample.
Optionally, the inputting of the audio to be processed into a pre-trained information compensation model for processing to obtain a target audio includes:
inputting the audio to be processed into the pre-trained information compensation model for up-sampling processing, to determine the target audio.
Optionally, the acquiring of the original audio data includes:
acquiring an input text;
determining a pronunciation vector of at least one word in the input text, where the pronunciation vector at least includes prosodic information of the corresponding word;
determining a pronunciation duration and a pronunciation tone corresponding to each pronunciation vector, where the pronunciation duration represents the duration of the pronunciation and the pronunciation tone represents the pitch of the pronunciation; and
synthesizing the original audio data corresponding to the input text based on the pronunciation vector, the pronunciation duration, and the pronunciation tone.
Optionally, the pronunciation tones are dialect tones, and the dialect tones are used for representing the pitch of the dialect pronunciation.
Optionally, the information compensation model is constructed based on an autoregressive neural network or a generative adversarial network.
In a second aspect, an audio processing apparatus is provided, where the apparatus is applied to an electronic device, and the apparatus includes:
a first acquisition module, configured to acquire audio to be processed; and
an information compensation module, configured to input the audio to be processed into a pre-trained information compensation model for processing, to obtain a target audio;
wherein the information compensation model is trained based on the following modules:
a second acquisition module, configured to obtain a training set, where the training set includes a plurality of sample groups, each sample group including a first audio sample subjected to dimensionality-reduction processing and the original audio sample corresponding to the first audio sample; and
a training module, configured to train the information compensation model according to the training set.
Optionally, the first acquisition module is specifically configured to:
acquire original audio data; and
perform down-sampling processing on the original audio data to obtain the audio to be processed.
Optionally, the first audio sample includes preset noise data.
Optionally, the noise data comprises white noise and/or pink noise.
Optionally, the second acquisition module is specifically configured to:
obtain a plurality of original audio samples;
for an original audio sample, perform down-sampling processing on the original audio sample to obtain first audio data; and
combine each of a plurality of preset noise data with the first audio data, and determine a plurality of corresponding first audio samples, to obtain a plurality of sample groups corresponding to the original audio sample.
Optionally, the information compensation module is specifically configured to:
input the audio to be processed into the pre-trained information compensation model for up-sampling processing, to determine the target audio.
Optionally, the first acquisition module is further configured to:
acquire an input text;
determine a pronunciation vector of at least one word in the input text, where the pronunciation vector at least includes prosodic information of the corresponding word;
determine a pronunciation duration and a pronunciation tone corresponding to each pronunciation vector, where the pronunciation duration represents the duration of the pronunciation and the pronunciation tone represents the pitch of the pronunciation; and
synthesize the original audio data corresponding to the input text based on the pronunciation vector, the pronunciation duration, and the pronunciation tone.
Optionally, the pronunciation tones are dialect tones, and the dialect tones are used for representing the pitch of the dialect pronunciation.
Optionally, the information compensation model is constructed based on an autoregressive neural network or a generative adversarial network.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor, where the memory is used to store one or more computer program instructions, where the one or more computer program instructions are executed by the processor to implement the method according to the first aspect.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium on which computer program instructions are stored, which when executed by a processor implement the method according to the first aspect.
According to the embodiment of the invention, because the information compensation model is trained against original audio samples, it acquires a strong information compensation capability. When the trained information compensation model performs information compensation on the audio to be processed, the compensated portions of the target audio can closely resemble real sound, so the target audio sounds more authentic; in other words, the trained information compensation model has higher up-sampling accuracy.
Drawings
The above and other objects, features and advantages of the embodiments of the present invention will become more apparent from the following description of the embodiments of the present invention with reference to the accompanying drawings, in which:
Fig. 1 is a schematic diagram of a dimension-reduction process in the related art according to an embodiment of the present invention;
Fig. 2 is a flowchart of an audio processing method according to an embodiment of the present invention;
Fig. 3 is a flowchart of another audio processing method according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of an audio processing method according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a process for determining a first audio sample according to an embodiment of the present invention;
Fig. 6 is a flowchart of another audio processing method according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of another audio processing apparatus according to an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention is described below based on examples, but it is not limited to these examples. In the following detailed description, certain specific details are set forth; it will be apparent to one skilled in the art that the invention may be practiced without them. Well-known methods, procedures, components, and circuits are not described in detail so as not to obscure the present invention.
Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, throughout the description, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".
In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
At present, audio processing is applied in a variety of scenarios, for example to machine-synthesized speech (online education, video dubbing, commentary, and the like). After original audio data is obtained, it is often first down-sampled to reduce the amount of computation in subsequent processing; the later stages then produce target audio data that can be played.
In this process, down-sampling the original audio data is a dimension-reduction step that reduces the data volume of the original audio data.
For example, fig. 1 is a schematic diagram of a dimension-reduction process in the related art according to an embodiment of the present invention; the diagram includes: the original audio 11, the intermediate audio 12, the audio to be played 13, and the playback device 14.
Specifically, after the electronic device acquires the original audio 11, the original audio 11 may be down-sampled to determine the intermediate audio 12.
The electronic device may be a terminal device or a server; the terminal device may be a smartphone, a tablet computer, a personal computer (PC), or the like, and the server may be a single server, a server cluster configured in a distributed manner, or a cloud server.
In this process, to keep the audio processing tractable, the dimensionality of the original audio 11 needs to be reduced to a lower value. For example, the original audio 11 may be audio data with a 22 kHz sampling rate and 16-bit samples; to let the electronic device process it more efficiently, the original audio 11 is down-sampled to determine the intermediate audio 12, which may be an 80-dimensional mel spectrogram.
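For illustration only, such an 80-dimensional mel representation can be computed with the librosa library; this is a minimal sketch in which the file name, n_fft, and hop_length are illustrative assumptions rather than values taken from the patent:

```python
import librosa
import numpy as np

# Load the source audio at the 22.05 kHz sampling rate mentioned above;
# "original_audio.wav" is an illustrative file name.
y, sr = librosa.load("original_audio.wav", sr=22050)

# 80 mel bands: the kind of low-dimensional "intermediate audio" of fig. 1.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, 1e-5, None))  # log compression for stability
```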
As shown in fig. 1, after the original audio 11 is down-sampled to obtain the intermediate audio 12, because the data of the original audio 11 has been compressed, the intermediate audio 12 may lose part of the data of the original audio 11 (for example, in fig. 1 the intermediate audio 12 loses the high-frequency part of the original audio 11).
After determining the intermediate audio 12, the electronic device may input it into a vocoder; the vocoder up-samples the intermediate audio 12 to obtain the audio to be played 13, which the electronic device then plays through the playback device 14.
The playing device 14 may be an audio playing device installed in the electronic device, or an audio playing device externally connected to the electronic device, which is not limited in the embodiment of the present invention.
When the vocoder up-samples the intermediate audio 12, the intermediate audio 12 has already lost a large amount of data relative to the original audio 11, so the audio to be played 13 obtained after up-sampling differs considerably from the original audio 11; that is, the accuracy of the restored audio data is low.
In order to improve the accuracy of up-sampling in the audio data processing process, an embodiment of the present invention provides an audio processing method, which is applied to an electronic device, and as shown in fig. 2, the method includes the following steps:
at step 21, the audio to be processed is acquired.
In the embodiment of the present invention, the audio to be processed may be audio data obtained after down-sampling processing, or may be audio data that is not subjected to down-sampling processing.
If the audio to be processed is audio data obtained after down-sampling, the down-sampling may be performed as follows: original audio data is acquired and then down-sampled to obtain the audio to be processed.
In practical applications, the original audio data may be down-sampled with an existing tool, for example FFmpeg (Fast Forward MPEG). FFmpeg is a set of open-source computer programs that can record and convert digital audio and video and turn them into streams; down-sampling of the original audio data can be implemented based on its functions. Of course, down-sampling may also be implemented with other suitable tools, algorithms, models, and the like, which the embodiment of the present invention does not limit.
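As a concrete illustration, down-sampling with FFmpeg can be scripted as follows; this is a sketch in which the file names and the 8 kHz target rate are illustrative assumptions:

```python
import subprocess

def downsample_with_ffmpeg(src: str, dst: str, rate: int) -> None:
    """Down-sample `src` to `rate` Hz by invoking FFmpeg's -ar option."""
    subprocess.run(["ffmpeg", "-y", "-i", src, "-ar", str(rate), dst],
                   check=True)

# e.g. turn 22.05 kHz original audio into 8 kHz "audio to be processed";
# the file names are illustrative.
downsample_with_ffmpeg("original.wav", "to_be_processed.wav", rate=8000)
```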
Because the audio to be processed may be either audio data obtained after down-sampling or audio data that has not been down-sampled, the embodiment of the invention is broadly applicable; that is, a piece of audio data can be supplemented with information by the embodiment of the invention regardless of whether it has been down-sampled.
In step 22, the audio to be processed is input to the pre-trained information compensation model for processing, so as to obtain the target audio.
In this case, the information compensation model may perform data compensation on the audio to be processed, raising the data dimension of the compensated audio. That is, in an optional implementation, step 22 may be implemented as: inputting the audio to be processed into the pre-trained information compensation model for up-sampling processing, so as to determine the target audio.
To ensure the compensation capability of the information compensation model used in step 22, the model needs to be trained in the embodiment of the present invention. As shown in fig. 3, the information compensation model is trained based on the following steps:
in step 31, a training set is obtained.
The training set comprises a plurality of sample groups, and the sample groups comprise the first audio samples after the dimension reduction processing and original audio samples corresponding to the first audio samples.
In the embodiment of the invention, because the information compensation model needs to compensate audio data accurately, accurate positive samples must be obtained during training. By using the original audio samples as positive samples, the training of the information compensation model is well supervised, and the resulting model can perform information compensation accurately.
In step 32, the information compensation model is trained according to the training set.
According to the embodiment of the invention, because the information compensation model is trained against original audio samples, it acquires a strong information compensation capability. When the trained information compensation model performs information compensation on the audio to be processed, the compensated portions of the target audio can closely resemble real sound, so the target audio sounds more authentic; in other words, the trained information compensation model has higher up-sampling accuracy.
For better explanation, the embodiment of the present invention provides a schematic diagram of the above audio processing method. As shown in fig. 4, the diagram includes: the audio to be processed 41, the target audio 42, the information compensation model 43, and the loss function 44.
As can be seen from fig. 4, audio data of a high-frequency portion is missing in the audio to be processed 41, where the audio to be processed 41 may be audio data obtained after downsampling or audio data without downsampling.
After the electronic device acquires the audio to be processed 41, the information compensation model 43 may perform information compensation on it to determine the target audio 42. Specifically, in fig. 4 the information compensation model 43 generates a target audio 42 containing both a low-frequency part and a high-frequency part based on the low-frequency part of the audio to be processed 41; in other applicable cases, the model may likewise generate such a target audio based on the high-frequency part of the audio to be processed 41, which the embodiment of the present invention does not limit.
That is, in the embodiment of the present invention, the information compensation model may be a generative model of audio data; specifically, it may be constructed based on an autoregressive neural network (e.g., WaveNet) or a generative adversarial network (e.g., MelGAN).
WaveNet is a probabilistic autoregressive model that predicts the probability distribution of the current audio sample from all previously generated samples; that is, WaveNet can predict the probability distribution of the target audio based on the audio to be processed.
MelGAN is a GAN-based generative adversarial model comprising a generator and a discriminator: the generator produces audio data, while during training the discriminator judges whether the generated audio data is real, and the model parameters are then adjusted according to that judgment.
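For illustration of the generator idea only, a minimal MelGAN-flavored upsampling generator can be sketched in PyTorch; the layer sizes below are illustrative assumptions, not the patent's architecture:

```python
import torch
import torch.nn as nn

class TinyUpsampleGenerator(nn.Module):
    """Minimal MelGAN-style generator: transposed convolutions raise the
    temporal resolution of 80-dim mel frames toward a waveform. All layer
    sizes are illustrative."""

    def __init__(self, mel_dim: int = 80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(mel_dim, 256, kernel_size=7, padding=3),
            nn.LeakyReLU(0.2),
            # each transposed convolution up-samples time by its stride (8x)
            nn.ConvTranspose1d(256, 128, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(128, 64, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.2),
            nn.Conv1d(64, 1, kernel_size=7, padding=3),
            nn.Tanh(),  # waveform samples in [-1, 1]
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        return self.net(mel)

mel = torch.randn(1, 80, 100)        # a batch of 100 mel frames
wave = TinyUpsampleGenerator()(mel)  # -> (1, 1, 6400) waveform samples
```

In a full MelGAN-style setup, a discriminator network of comparable size would score `wave` against real audio during training.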
In addition, in fig. 4, when the information compensation model 43 is trained (the data flow indicated by the dashed arrows in fig. 4), the loss between the target audio 42 and the positive sample may be calculated by the loss function 44, and the information compensation model 43 is then adjusted according to that loss; the loss function 44 may be, for example, a cross-entropy loss, with the model parameters updated by back-propagation. The embodiment of the present invention does not limit the choice of the loss function 44.
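A single training update of this setup might be sketched as follows, assuming PyTorch; the L1 reconstruction loss is an illustrative stand-in for the loss function 44:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, first_audio_sample, original_audio_sample):
    """One supervised update of the information compensation model: the
    dimension-reduced first audio sample is the input, and the original
    audio sample is the positive target (the dashed-arrow flow of fig. 4)."""
    optimizer.zero_grad()
    target_audio = model(first_audio_sample)        # compensated output
    # Illustrative stand-in for loss function 44 (L1 reconstruction loss).
    loss = F.l1_loss(target_audio, original_audio_sample)
    loss.backward()                                 # back-propagate the loss
    optimizer.step()                                # adjust model parameters
    return loss.item()
```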
Training the information compensation model through steps 31 to 32 gives it good information compensation capability. However, some noise interference inevitably exists in actual audio processing, and in the related art the noise robustness of a common vocoder is poor; that is, audio data up-sampled by a related-art vocoder has poor quality.
To improve the noise robustness of the information compensation model in the embodiment of the invention, preset noise data can therefore be added to the training samples.
In an implementation manner, the first audio sample in the training set may include preset noise data, where the noise data may be white noise and/or pink noise, or other noise, which is not described in detail in this embodiment of the present invention.
White noise is noise whose power spectral density is constant over the entire frequency domain. Pink noise is noise of equal intensity per octave, i.e., pink noise has the same (or similar) energy within each octave band.
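Both noise types can be synthesized for illustration with NumPy; the 1/f spectral shaping used for pink noise below is one common construction, assumed here rather than taken from the patent:

```python
import numpy as np

def white_noise(n: int, rng=None) -> np.ndarray:
    """White noise: flat power spectral density over all frequencies."""
    rng = rng or np.random.default_rng()
    return rng.standard_normal(n)

def pink_noise(n: int, rng=None) -> np.ndarray:
    """Pink noise: equal energy per octave, obtained by shaping the spectrum
    of white noise with a 1/sqrt(f) amplitude roll-off (1/f power)."""
    rng = rng or np.random.default_rng()
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n)
    freqs[0] = freqs[1]            # avoid dividing by zero at DC
    spectrum /= np.sqrt(freqs)     # 1/f power spectrum -> 1/sqrt(f) amplitude
    noise = np.fft.irfft(spectrum, n)
    return noise / np.max(np.abs(noise))  # normalize to [-1, 1]
```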
Specifically, in one implementation, step 31 may be implemented as: obtaining a plurality of original audio samples; down-sampling each original audio sample to obtain first audio data; and combining each of a plurality of preset noise data with the first audio data and determining a plurality of corresponding first audio samples, so as to obtain a plurality of sample groups corresponding to the original audio sample.
For example, as shown in fig. 5, fig. 5 is a schematic diagram of a process for determining a first audio sample according to an embodiment of the present invention, where the schematic diagram includes: noise data a, noise data B, noise data C, first audio data 51 and a plurality of first audio samples 52.
After the electronic device acquires an original audio sample, it can determine the corresponding first audio data 51 through down-sampling; it can then combine preset noise data with the first audio data 51 to determine a plurality of first audio samples 52.
Specifically, as shown in the figure, after the electronic device determines the first audio data 51, the preset noise data A, noise data B, and noise data C may be combined with the first audio data 51 to determine a plurality of first audio samples 52. The first audio data 51 may be combined with one noise data or with several; that is, combining the first audio data 51 with at least one of the noise data in fig. 5 yields 2^3 - 1 = 7 first audio samples 52, specifically: 51+A+B+C, 51+A+B, 51+A+C, 51+B+C, 51+A, 51+B, and 51+C, where "51" denotes the first audio data 51, "A" denotes noise data A, "B" denotes noise data B, and "C" denotes noise data C.
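The seven combinations can be enumerated mechanically, as in the following sketch using itertools; the array names and amplitudes are illustrative:

```python
from itertools import combinations

import numpy as np

def build_first_audio_samples(first_audio, noises):
    """Mix every non-empty subset of the noise signals into the first audio
    data: three noise sources A, B, C give 2**3 - 1 = 7 first audio samples."""
    samples = []
    for r in range(1, len(noises) + 1):
        for subset in combinations(noises, r):
            samples.append(first_audio + sum(subset))
    return samples

# Illustrative signals standing in for first audio data 51 and noise A, B, C.
first_audio_51 = np.zeros(16000)
noise_a, noise_b, noise_c = (0.01 * np.random.randn(16000) for _ in range(3))
groups = build_first_audio_samples(first_audio_51, [noise_a, noise_b, noise_c])
assert len(groups) == 7
```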
After determining the plurality of first audio samples 52, the electronic device may determine a plurality of sample groups corresponding to the original audio samples based on the plurality of first audio samples 52, and then train the information compensation model according to the plurality of sample groups.
By adding preset noise data to the training samples, the information compensation model acquires a good denoising capability during training; once trained, it can effectively remove noise from the audio to be processed, so the target audio has higher sound quality. In other words, the trained information compensation model has good noise robustness.
In the embodiment of the invention, the trained information compensation model can perform high-quality information compensation on the audio to be processed to acquire the target audio, wherein the audio to be processed can be directly acquired audio data or audio data obtained by performing down-sampling processing on original audio data.
Specifically, as shown in fig. 6, the original audio data may be obtained by performing speech synthesis through an electronic device, and may be determined through the following steps:
at step 61, input text is obtained.
In practical applications, speech synthesis determines the pronunciation of each character in at least one segment of the input text and then synthesizes a continuous speech segment from those pronunciations. The input text includes at least one word.
At step 62, a pronunciation vector for at least one word of the input text is determined.
The pronunciation vector at least comprises the prosodic information of the corresponding word and can represent the embedding of at least one word in the input text; the prosodic information can represent the pause duration after the corresponding word. Embedding is a feature-extraction technique commonly used in deep learning: high-dimensional raw data (images, words, and the like) is mapped onto a low-dimensional manifold so that it becomes separable there, and this mapping process is called embedding. For example, word embedding maps a sentence composed of words to a representation vector. In the embodiment of the invention, the objects of the embedding are the words in the input text.
In one possible implementation, step 62 may be performed as: determining pinyin information of at least one character in the input text based on a preset correspondence between characters and pinyin, and vectorizing the pinyin information to determine its pronunciation vector.
Specifically, in the embodiment of the present invention, the correspondence between characters and pinyin may be preset based on a tool such as a dictionary. After the input text is received, the pinyin corresponding to each character is determined, each pinyin is subjected to embedding processing to determine its feature vector, and that feature vector is used as the pronunciation vector of the corresponding character.
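This character-to-pinyin-to-vector step might be sketched as follows, assuming the third-party pypinyin package for the dictionary lookup; the tiny vocabulary and the 64-dimensional embedding are illustrative assumptions:

```python
import torch
import torch.nn as nn
from pypinyin import lazy_pinyin  # third-party character-to-pinyin tool

# Illustrative pinyin vocabulary; a real system would cover all syllables.
pinyin_vocab = {"ni": 0, "hao": 1, "shi": 2, "jie": 3}
embedding = nn.Embedding(num_embeddings=len(pinyin_vocab), embedding_dim=64)

text = "你好世界"
pinyins = lazy_pinyin(text)                 # ["ni", "hao", "shi", "jie"]
ids = torch.tensor([pinyin_vocab[p] for p in pinyins])
pronunciation_vectors = embedding(ids)      # shape (4, 64)
```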
In step 63, the pronunciation duration and pronunciation pitch corresponding to each pronunciation vector are determined.
Wherein, the pronunciation duration is used for representing the duration of pronunciation, and the pronunciation tone is used for representing the pitch of pronunciation.
In the embodiment of the present invention, the pronunciation duration may be predicted by a duration prediction model with a length regulator (Length Regulator). The length regulator resolves the length mismatch between the phoneme sequence and the spectrogram sequence; with it, the model can accurately predict the duration corresponding to each phoneme.
The pronunciation tone can be predicted by a tone prediction model with a pitch predictor, which determines the pitch corresponding to each pronunciation vector through the convolution operations of a convolutional network and a fully connected layer. In addition, if the tone prediction model is used to predict the dialect tones of the pronunciation vectors, the pitch output by its pitch predictor is the dialect pitch corresponding to each pronunciation vector.
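The length-regulator mechanism itself is compact enough to sketch (in the style of FastSpeech-like models; the sizes and duration values below are illustrative):

```python
import torch

def length_regulate(pronunciation_vectors, durations):
    """Repeat each pronunciation vector durations[i] times along the time
    axis so the sequence length matches the predicted spectrogram frames,
    resolving the phoneme/spectrogram length mismatch."""
    return torch.repeat_interleave(pronunciation_vectors, durations, dim=0)

vectors = torch.randn(4, 64)               # four pronunciation vectors
durations = torch.tensor([3, 5, 2, 6])     # predicted frames per phoneme
frames = length_regulate(vectors, durations)
assert frames.shape == (16, 64)            # 3 + 5 + 2 + 6 = 16 frames
```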
That is, in a preferred embodiment, the pronunciation tones may be dialect tones, which characterize the pitch of dialect pronunciation. Using dialect tones as the pronunciation tones during synthesis gives the original audio data the pitch characteristic of the dialect (i.e., the dialect's distinctive way of pronouncing), bringing it closer to human speech.
In step 64, the original audio data corresponding to the input text is synthesized based on the pronunciation vector, the pronunciation duration and the pronunciation pitch.
According to the embodiment of the invention, the pronunciation vectors and their corresponding pronunciation durations give the original audio data common human speech patterns such as pauses and lengthened sounds; the pronunciation tones then add pitch, bringing the original audio data closer to the way humans speak. The original audio data determined from the pronunciation vectors, pronunciation durations, and pronunciation tones can therefore have high similarity to a human voice.
Based on the same technical concept, an embodiment of the present invention further provides an audio processing apparatus. As shown in fig. 7, the apparatus includes a first acquisition module and an information compensation module.
A first acquisition module 71, configured to acquire the audio to be processed.
An information compensation module 72, configured to input the audio to be processed into a pre-trained information compensation model for processing, to obtain the target audio.
As shown in fig. 8, the information compensation model is trained based on the following modules: a second acquisition module 81 and a training module 82.
A second acquisition module 81, configured to obtain a training set, where the training set includes a plurality of sample groups, each sample group including a first audio sample subjected to dimensionality-reduction processing and the original audio sample corresponding to the first audio sample.
A training module 82, configured to train the information compensation model according to the training set.
Optionally, the first acquisition module 71 is specifically configured to:
acquire original audio data; and
perform down-sampling processing on the original audio data to obtain the audio to be processed.
Optionally, the first audio sample includes preset noise data.
Optionally, the noise data comprises white noise and/or pink noise.
Optionally, the second acquisition module 81 is specifically configured to:
obtain a plurality of original audio samples;
for an original audio sample, perform down-sampling processing on the original audio sample to obtain first audio data; and
combine each of a plurality of preset noise data with the first audio data, and determine a plurality of corresponding first audio samples, to obtain a plurality of sample groups corresponding to the original audio sample.
Optionally, the information compensation module 72 is specifically configured to:
input the audio to be processed into the pre-trained information compensation model for up-sampling processing, to determine the target audio.
Optionally, the first acquisition module 71 is further configured to:
acquire an input text;
determine a pronunciation vector of at least one word in the input text, where the pronunciation vector at least includes prosodic information of the corresponding word;
determine a pronunciation duration and a pronunciation tone corresponding to each pronunciation vector, where the pronunciation duration represents the duration of the pronunciation and the pronunciation tone represents the pitch of the pronunciation; and
synthesize the original audio data corresponding to the input text based on the pronunciation vector, the pronunciation duration, and the pronunciation tone.
Optionally, the pronunciation tones are dialect tones, and the dialect tones are used for representing the pitch of the dialect pronunciation.
Optionally, the information compensation model is constructed based on an autoregressive neural network or a generative countermeasure network.
According to the embodiment of the invention, because the information compensation model is trained against original audio samples, it acquires a strong information compensation capability. When the trained information compensation model performs information compensation on the audio to be processed, the compensated portions of the target audio can closely resemble real sound, so the target audio sounds more authentic; in other words, the trained information compensation model has higher up-sampling accuracy.
Fig. 9 is a schematic diagram of an electronic device of an embodiment of the invention. As shown in fig. 9, the electronic device is a general-purpose computing device with a general computer hardware structure that includes at least a processor 91 and a memory 92, connected by a bus 93. The memory 92 is adapted to store instructions or programs executable by the processor 91. The processor 91 may be a stand-alone microprocessor or a collection of one or more microprocessors; it implements the processing of data and the control of other devices by executing the instructions stored in the memory 92, thereby performing the method flows of the embodiments of the present invention described above. The bus 93 connects the above components together and also connects them to a display controller 94, a display device, and an input/output (I/O) device 95. The input/output (I/O) device 95 may be a mouse, keyboard, modem, network interface, touch input device, motion-sensing input device, printer, or other device known in the art. Typically, the input/output device 95 is coupled to the system through an input/output (I/O) controller 96.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus (device) or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations of methods, apparatus (devices) and computer program products according to embodiments of the invention. It will be understood that each flow in the flow diagrams can be implemented by computer program instructions.
These computer program instructions may be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.
These computer program instructions may also be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.
Another embodiment of the invention is directed to a non-transitory storage medium storing a computer-readable program for causing a computer to perform some or all of the above-described method embodiments.
That is, as can be understood by those skilled in the art, all or part of the steps in the methods of the above embodiments may be accomplished by a program instructing related hardware; the program is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (20)
1. A method of audio processing, the method comprising:
acquiring audio to be processed; and
inputting the audio to be processed into a pre-trained information compensation model for processing so as to obtain a target audio;
wherein the information compensation model is trained based on the following steps:
obtaining a training set, wherein the training set comprises a plurality of sample groups, and the sample groups comprise first audio samples subjected to dimensionality reduction and original audio samples corresponding to the first audio samples; and
training the information compensation model according to the training set.
2. The method of claim 1, wherein the acquiring of the audio to be processed comprises:
acquiring original audio data; and
performing down-sampling processing on the original audio data to obtain the audio to be processed.
3. The method of claim 1, wherein the first audio sample comprises predetermined noise data.
4. The method of claim 3, wherein the noise data comprises white noise and/or pink noise.
5. The method of claim 3 or 4, wherein the obtaining of the training set comprises:
obtaining a plurality of original audio samples;
for an original audio sample, performing down-sampling processing on the original audio sample to obtain first audio data; and
combining each of a plurality of preset noise data with the first audio data, and determining a plurality of corresponding first audio samples, to obtain a plurality of sample groups corresponding to the original audio sample.
6. The method of claim 1, wherein the inputting of the audio to be processed into a pre-trained information compensation model for processing to obtain a target audio comprises:
inputting the audio to be processed into the pre-trained information compensation model for up-sampling processing, to determine the target audio.
7. The method of claim 2, wherein the acquiring of the original audio data comprises:
acquiring an input text;
determining a pronunciation vector of at least one word in the input text, wherein the pronunciation vector at least comprises prosodic information of the corresponding word;
determining a pronunciation duration and a pronunciation tone corresponding to each pronunciation vector, wherein the pronunciation duration represents the duration of the pronunciation and the pronunciation tone represents the pitch of the pronunciation; and
synthesizing the original audio data corresponding to the input text based on the pronunciation vector, the pronunciation duration, and the pronunciation tone.
8. The method of claim 7, wherein the pronunciation tones are dialect tones, the dialect tones being used to characterize a pitch of a dialect pronunciation.
9. The method of claim 1, wherein the information compensation model is constructed based on an autoregressive neural network or a generative adversarial network.
10. An audio processing apparatus, wherein the apparatus comprises:
a first acquisition module configured to acquire audio to be processed; and
an information compensation module configured to input the audio to be processed into a pre-trained information compensation model for processing, to obtain a target audio;
wherein the information compensation model is trained based on the following modules:
a second acquisition module configured to obtain a training set, wherein the training set comprises a plurality of sample groups, each sample group comprising a first audio sample subjected to dimensionality-reduction processing and an original audio sample corresponding to the first audio sample; and
a training module configured to train the information compensation model according to the training set.
11. The apparatus of claim 10, wherein the first acquisition module is specifically configured to:
acquire original audio data; and
perform down-sampling processing on the original audio data to obtain the audio to be processed.
12. The apparatus of claim 10, wherein the first audio sample comprises predetermined noise data.
13. The apparatus of claim 12, wherein the noise data comprises white noise and/or pink noise.
14. The apparatus according to claim 12 or 13, wherein the second acquisition module is specifically configured to:
obtain a plurality of original audio samples;
for an original audio sample, perform down-sampling processing on the original audio sample to obtain first audio data; and
combine each of a plurality of preset noise data with the first audio data, and determine a plurality of corresponding first audio samples, to obtain a plurality of sample groups corresponding to the original audio sample.
15. The apparatus of claim 10, wherein the information compensation module is specifically configured to:
input the audio to be processed into the pre-trained information compensation model for up-sampling processing, to determine the target audio.
16. The apparatus of claim 11, wherein the first acquisition module is further configured to:
acquire an input text;
determine a pronunciation vector of at least one word in the input text, wherein the pronunciation vector at least comprises prosodic information of the corresponding word;
determine a pronunciation duration and a pronunciation tone corresponding to each pronunciation vector, wherein the pronunciation duration represents the duration of the pronunciation and the pronunciation tone represents the pitch of the pronunciation; and
synthesize the original audio data corresponding to the input text based on the pronunciation vector, the pronunciation duration, and the pronunciation tone.
17. The apparatus of claim 16, wherein the pronunciation tones are dialect tones, the dialect tones being used to characterize a pitch of a dialect pronunciation.
18. The apparatus of claim 10, wherein the information compensation model is constructed based on an autoregressive neural network or a generative adversarial network.
19. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-9.
20. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011613263.1A CN112735454A (en) | 2020-12-30 | 2020-12-30 | Audio processing method and device, electronic equipment and readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112735454A (en) | 2021-04-30 |
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011613263.1A Pending CN112735454A (en) | 2020-12-30 | 2020-12-30 | Audio processing method and device, electronic equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112735454A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH08160983A (en) * | 1994-12-08 | 1996-06-21 | Sony Corp | Speech synthesizing device |
CN1622195A (en) * | 2003-11-28 | 2005-06-01 | 株式会社东芝 | Speech synthesis method and speech synthesis system |
CN101379549A (en) * | 2006-02-08 | 2009-03-04 | 日本电气株式会社 | Speech synthesizing device, speech synthesizing method, and program |
CN104823237A (en) * | 2012-11-26 | 2015-08-05 | 哈曼国际工业有限公司 | System, computer-readable storage medium and method for repair of compressed audio signals |
CN109872730A (en) * | 2019-03-14 | 2019-06-11 | 广州飞傲电子科技有限公司 | Distortion compensating method, method for establishing model and the audio output apparatus of audio data |
CN110782870A (en) * | 2019-09-06 | 2020-02-11 | 腾讯科技(深圳)有限公司 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium |
CN111312231A (en) * | 2020-05-14 | 2020-06-19 | 腾讯科技(深圳)有限公司 | Audio detection method and device, electronic equipment and readable storage medium |
CN111653271A (en) * | 2020-05-26 | 2020-09-11 | 大众问问(北京)信息科技有限公司 | Sample data acquisition method, sample data acquisition device, model training method, model training device and computer equipment |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113241054A (en) * | 2021-05-10 | 2021-08-10 | 北京声智科技有限公司 | Speech smoothing model generation method, speech smoothing method and device |
CN113569196A (en) * | 2021-07-15 | 2021-10-29 | 苏州仰思坪半导体有限公司 | Data processing method, device, medium and equipment |
CN114615610A (en) * | 2022-03-23 | 2022-06-10 | 东莞市晨新电子科技有限公司 | Audio compensation method and system of audio compensation type earphone and electronic equipment |
CN114900779A (en) * | 2022-04-12 | 2022-08-12 | 东莞市晨新电子科技有限公司 | Audio compensation method and system and electronic equipment |
CN115831147A (en) * | 2022-10-20 | 2023-03-21 | 广州优谷信息技术有限公司 | Method, system, device and medium for reading detection based on audio compensation |
CN115831147B (en) * | 2022-10-20 | 2024-02-02 | 广州优谷信息技术有限公司 | Audio compensation-based reading detection method, system, device and medium |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 2021-04-30 |