WO2022105693A1 - Sample generation method and device - Google Patents

Sample generation method and device

Info

Publication number
WO2022105693A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
text
pair
detected
segment
Prior art date
Application number
PCT/CN2021/130459
Other languages
English (en)
French (fr)
Inventor
王冬晓
杨明祺
马楠
夏龙
郭常圳
Original Assignee
北京猿力未来科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京猿力未来科技有限公司
Priority to KR1020237017827A (publication KR20230079503A)
Priority to US 18/253,717 (publication US 11810546 B2)
Publication of WO2022105693A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/06 Elementary speech units used in speech synthesisers; Concatenation rules
    • G10L 13/07 Concatenation rules
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10 Prosody rules derived from text; Stress or intonation
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Definitions

  • This specification relates to the technical field of data processing, and in particular, to a method and device for generating samples.
  • Speech synthesis, also known as text-to-speech (TTS) technology, converts text into speech.
  • The waveform concatenation method requires a large amount of recorded training data to complete speech synthesis.
  • Although the parameter-based synthesis method can also complete speech synthesis, it takes few factors into account, so the final synthesis result is unsatisfactory; the end-to-end neural-network synthesis method, the most widely used in the prior art, in turn requires a large amount of training data.
  • the embodiments of the present specification provide a sample generation method.
  • This specification also relates to a sample generating apparatus, a computing device, and a computer-readable storage medium, so as to solve the technical defects existing in the prior art.
  • a sample generation method including:
  • each text-audio pair contains a text segment and an audio segment
  • If the text-audio pair to be detected meets the preset detection condition, the text-audio pair to be detected is written into the training database.
  • the acquiring multiple text-audio pairs includes:
  • the audio is preprocessed to obtain target audio, and the target text is converted into a phoneme sequence;
  • the phoneme sequence is aligned with the target audio, and the plurality of text-audio pairs are generated according to the alignment result.
  • the generating the multiple text-audio pairs according to the alignment processing result includes:
  • Segment the phoneme audio file according to the segmentation position to obtain a plurality of phoneme audio pairs, wherein each phoneme audio pair includes a phoneme segment and an audio segment;
  • the plurality of text-audio pairs are generated according to the text segments corresponding to the phoneme segments in each phoneme-audio pair and the audio segments in each phoneme-audio pair.
  • the calculating the audio feature of the audio segment of each text-audio pair in the plurality of text-audio pairs includes:
  • the audio feature of the audio segment of each text-audio pair is determined according to the pitch frequency feature and the audio frame feature of the audio segment of each text-audio pair.
  • Filtering out the target text-audio pair and the spliced text-audio pair corresponding to the target text-audio pair from the multiple text-audio pairs according to the audio features includes:
  • Integrating the audio clip, text clip and audio feature of each text-audio pair in the multiple text-audio pairs to obtain a text-audio package corresponding to each text-audio pair, and writing it into the fragment database;
  • In the fragment database, selecting any text-audio package as the target text-audio package, and determining the text-audio pair in the target text-audio package as the target text-audio pair;
  • a spliced text-audio package is determined based on the text-audio package other than the target text-audio package and the audio feature in the segment database, and the text-audio pair in the spliced-text-audio package is used as the spliced-text-audio pair.
  • Determining the spliced text-audio package based on the text-audio packages other than the target text-audio package and the audio features in the fragment database includes:
  • the spliced text-audio package is filtered out of the to-be-screened text-audio package set.
  • Screening the spliced text-audio package out of the set of to-be-screened text-audio packages includes:
  • The to-be-screened text-audio package to which a to-be-screened text-audio pair whose feature distance is smaller than the preset distance threshold belongs is determined as the spliced text-audio package.
  • Before the step of splicing the target text-audio pair and the spliced text-audio pair into a to-be-detected text-audio pair and detecting the to-be-detected text-audio pair, the method further includes:
  • Splicing the target text-audio pair and the spliced text-audio pair into a to-be-detected text-audio pair includes:
  • the target text fragment and the spliced text fragment are spliced into a text fragment to be detected, and the target audio fragment and the spliced audio fragment are spliced into an audio fragment to be detected;
  • the to-be-detected text-audio pair is formed based on the to-be-detected text segment and the to-be-detected audio segment.
  • the detecting the to-be-detected text-audio pair includes:
  • writing the text-audio pair to be detected into the training database includes:
  • the text-audio pair to be detected is written into the training database.
  • the method further includes:
  • the method further includes:
  • a speech synthesis model is trained based on the sample text segment and the sample audio segment to obtain a target speech synthesis model.
  • a sample generating apparatus including:
  • an acquisition module configured to acquire a plurality of text-audio pairs, wherein each text-audio pair includes a text segment and an audio segment;
  • a calculation module configured to calculate the audio feature of the audio segment of each text-audio pair in the plurality of text-audio pairs, and to filter out, from the plurality of text-audio pairs according to the audio features, the target text-audio pair and the spliced text-audio pair corresponding to the target text-audio pair;
  • a splicing module configured to splice the target text-audio pair and the spliced text-audio pair into a to-be-detected text-audio pair, and to detect the to-be-detected text-audio pair;
  • a writing module configured to write the to-be-detected text-audio pair into a training database when the to-be-detected text-audio pair meets a preset detection condition (a minimal code sketch of these modules follows).
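  • For orientation, the following is a minimal code sketch of how the four modules of the sample generating apparatus (acquisition, calculation, splicing, writing) could be organized; the class and method names are illustrative assumptions, not part of the specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextAudioPair:
    """A text segment paired with its corresponding audio segment (sample values)."""
    text: str
    audio: List[float]


@dataclass
class SampleGenerator:
    """Illustrative skeleton of the apparatus: the four modules become methods."""
    training_database: List[TextAudioPair] = field(default_factory=list)

    def acquire(self, target_text: str, audio: List[float]) -> List[TextAudioPair]:
        """Acquisition module: align the text with the audio and segment it into pairs."""
        raise NotImplementedError("alignment / segmentation goes here")

    def compute_features(self, pair: TextAudioPair) -> dict:
        """Calculation module: pitch-frequency and frame-energy features of a pair."""
        raise NotImplementedError("feature extraction goes here")

    def splice(self, target: TextAudioPair, candidate: TextAudioPair) -> TextAudioPair:
        """Splicing module: concatenate the text and audio of two pairs."""
        return TextAudioPair(target.text + candidate.text, target.audio + candidate.audio)

    def write_if_valid(self, pair: TextAudioPair, detect) -> bool:
        """Writing module: write the pair only if the detection condition holds."""
        if detect(pair):
            self.training_database.append(pair)
            return True
        return False
```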
  • a computing device including:
  • the memory is used to store computer-executable instructions
  • the processor is used to execute the computer-executable instructions:
  • each text-audio pair contains a text segment and an audio segment
  • the text-audio pair to be detected satisfies the preset detection condition
  • the text-audio pair to be detected is written into a training database.
  • a computer-readable storage medium which stores computer-executable instructions, which implement the steps of the sample generation method when the instructions are executed by a processor.
  • This specification provides a sample generation method: after acquiring multiple text-audio pairs, the audio feature of the audio segment of each text-audio pair is calculated, the target text-audio pair and the spliced text-audio pair corresponding to it are screened out from the multiple text-audio pairs according to the audio features, the target text-audio pair and the spliced text-audio pair are then spliced into a to-be-detected text-audio pair and detected, and the to-be-detected text-audio pair is written into the training database when it satisfies a preset detection condition. In this way, in the sample data preparation stage, high-quality sample data that meets the needs of downstream business use can be obtained by splicing.
  • FIG. 1 is a flowchart of a sample generation method provided by an embodiment of the present specification
  • FIG. 2 is a schematic diagram of an alignment processing result in a sample generation method provided by an embodiment of the present specification
  • FIG. 3 is a schematic diagram of a segmentation processing result in a sample generation method provided by an embodiment of the present specification
  • FIG. 5 is a flowchart of a sample generation method applied in a speech synthesis scenario provided by an embodiment of this specification
  • FIG. 6 is a schematic structural diagram of a sample generating apparatus provided by an embodiment of the present specification.
  • FIG. 7 is a structural block diagram of a computing device provided by an embodiment of the present specification.
  • F0 (fundamental frequency): a sound is generally composed of a series of vibrations with different frequencies and amplitudes emitted by the sounding body; among these vibrations, the one with the lowest frequency produces the fundamental tone, and its frequency is the fundamental frequency.
  • Forced alignment: a technique for obtaining the temporal correspondence between a given phoneme sequence and speech. It can be performed with forced-alignment tools such as kaldi (an open-source speech recognition toolkit that uses WFST to implement its decoding algorithm) or HTK (the HMM Toolkit, a speech processing tool based on hidden Markov models) to align phoneme sequences with audio.
  • Phoneme: the smallest phonetic unit divided according to the natural properties of speech. It is analyzed according to the articulatory actions within a syllable, and one action constitutes one phoneme. Phonemes are divided into vowels and consonants. For example, the Chinese syllable "a" (ā) has only one phoneme, "ai" (ài) has two phonemes, and "dai" (dài) has three phonemes. In Chinese, phonemes correspond to pinyin; in English, phonemes correspond to phonetic symbols.
  • a sample generation method is provided, and the specification also relates to a sample generation apparatus, a computing device, and a computer-readable storage medium, which will be described in detail in the following embodiments.
  • This specification provides a sample generation method: after acquiring multiple text-audio pairs, the audio feature of the audio segment of each text-audio pair is calculated, the target text-audio pair and the spliced text-audio pair corresponding to it are screened out from the multiple text-audio pairs according to the audio features, the target text-audio pair and the spliced text-audio pair are then spliced into a to-be-detected text-audio pair and detected, and the to-be-detected text-audio pair is written into the training database when it satisfies a preset detection condition. In the sample data preparation stage, high-quality sample data that meets the needs of downstream business use can thus be obtained by splicing, which saves resource consumption in the data preparation stage; moreover, the amount of sample data written into the training database after splicing is large, which effectively alleviates the problem that a large amount of training data is required.
  • FIG. 1 shows a flowchart of a sample generation method according to an embodiment of the present specification, which specifically includes the following steps:
  • Step S102 acquiring a plurality of text-audio pairs, wherein each text-audio pair includes a text segment and an audio segment.
  • the text-audio pair specifically refers to a queue composed of text segments and audio segments that have a corresponding relationship.
  • The text segments include, but are not limited to, character units, word units, or sentence units.
  • The audio segments include, but are not limited to, speech that matches those character units, word units, or sentence units.
  • sample data is realized by splicing, so that a large amount of sample data can be spliced for downstream business, and in order to ensure the quality requirements of sample data, the splicing process will be completed in combination with audio features, so as to complete the preparation of sample data.
  • In this embodiment, any one text out of a small number of texts is taken as an example to describe the sample generation method.
  • The process of acquiring the multiple text-audio pairs in this embodiment is specifically implemented as follows:
  • the audio is preprocessed to obtain target audio, and the target text is converted into a phoneme sequence;
  • the phoneme sequence is aligned with the target audio, and the plurality of text-audio pairs are generated according to the alignment result.
  • the target text includes but is not limited to an article or a sentence, etc.
  • The audio specifically refers to the speech generated for the target text; the audio corresponding to the target text can be recorded or generated by speech synthesis, and this embodiment does not impose any limitation here.
  • the matching degree of the audio and the target text is relatively high, so as to ensure that more data can be written into the training database during subsequent splicing.
  • the target audio specifically refers to the audio obtained by standardizing the audio
  • the phoneme sequence specifically refers to a sequence composed of the smallest units that constitute the target text
  • The alignment process specifically refers to finding the time interval in the audio that corresponds to the text.
  • the alignment process will be completed starting from the smallest unit of text when the text-audio pair is generated, that is, when the target text is obtained.
  • The audio is preprocessed to obtain the target audio, removing the parts of the audio that would interfere with subsequent processing, such as blank (unvoiced) audio segments at the beginning and/or end of the audio, or excessively loud audio segments at the beginning and/or end whose pronunciation content cannot be distinguished, etc.;
  • The target text is converted into a phoneme sequence, so that text and audio are aligned at the level of the smallest unit to improve alignment accuracy; finally, the phoneme sequence is aligned with the target audio, and the multiple text-audio pairs are obtained according to the alignment processing result.
  • The alignment can be completed with the kaldi alignment tool or the HTK alignment tool; other alignment tools can also be selected according to actual needs to align the phoneme sequence with the target audio, which is not limited in this embodiment.
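  • As an illustration of the phoneme-conversion step, the sketch below uses the third-party pypinyin package to turn the target text into a pinyin (phoneme) sequence and represents a forced-alignment result as per-phoneme time intervals. The Chinese target text is reconstructed from the pinyin given in the later example, the interval values are placeholders, and running kaldi or HTK itself is outside the scope of this sketch.

```python
# pip install pypinyin   (grapheme-to-pinyin package, assumed available)
from pypinyin import lazy_pinyin

target_text = "我看了一场精彩的足球比赛"
phoneme_sequence = lazy_pinyin(target_text)
# ['wo', 'kan', 'le', 'yi', 'chang', 'jing', 'cai', 'de', 'zu', 'qiu', 'bi', 'sai']

# A forced-alignment tool (kaldi, HTK, ...) would return, for every phoneme,
# the time interval it occupies in the target audio; placeholder values:
alignment = [
    ("wo", 0.0, 1.0), ("kan", 1.0, 2.0), ("le", 2.0, 3.0),
    ("yi", 3.0, 3.5), ("chang", 3.5, 4.0),
    ("jing", 4.0, 4.7), ("cai", 4.7, 5.4), ("de", 5.4, 6.0),
    ("zu", 6.0, 7.0), ("qiu", 7.0, 8.0),
    ("bi", 8.0, 9.0), ("sai", 9.0, 10.0),
]
```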
  • the specific implementation method is as follows:
  • Segment the phoneme audio file according to the segmentation position to obtain a plurality of phoneme audio pairs, wherein each phoneme audio pair includes a phoneme segment and an audio segment;
  • the plurality of text-audio pairs are generated according to the text segments corresponding to the phoneme segments in each phoneme-audio pair and the audio segments in each phoneme-audio pair.
  • the phoneme audio file specifically refers to a file obtained by aligning the phoneme sequence and the target audio;
  • The segmentation position may be the position of a sentence break in the target audio, or a position where the pause in pronunciation exceeds a set time threshold;
  • the phoneme-audio pair specifically refers to a queue composed of phoneme segments and audio segments having a corresponding relationship.
  • The phoneme audio file is segmented to obtain a plurality of phoneme-audio pairs, wherein each phoneme-audio pair includes a phoneme fragment and its corresponding audio fragment; then, based on the target text, the phoneme segment of each phoneme-audio pair is converted into a text segment, so that the text-audio pair is formed from the text segment and the audio segment corresponding to the phoneme segment in each phoneme-audio pair, and each text-audio pair includes a text segment and its corresponding audio clip.
  • the multiple text-audio pairs formed at this time can realize the splicing of sample data for writing into the training database in subsequent processing, and complete the preparation of the sample data.
  • In this way, it can be ensured that the phoneme segments and audio segments included in each segmented phoneme-audio pair correspond to each other; and, since the segmentation follows the characteristics of the speaker's speech, each phoneme segment in a segmented phoneme-audio pair can be matched to a corresponding text segment in the target text, without phoneme fragments being cut off incompletely.
  • the target text is "I watched a wonderful football game", and 12s of audio is generated for the target text.
  • the blank audio segments at the beginning and end of the audio will be deleted.
  • The target audio with a length of 10s is obtained. In order to improve the alignment accuracy, the target text "I watched a wonderful football game" can be converted into the phoneme sequence (wo kan le yi chang jing cai de zu qiu bi sai), and the kaldi alignment tool is used to align the phoneme sequence with the target audio, obtaining the alignment result shown in FIG. 2, i.e. the phoneme audio file composed of the phoneme sequence and the target audio.
  • The phoneme audio file is segmented according to the segmentation positions to obtain five phoneme-audio pairs: the first phoneme-audio pair P1 consists of the first phoneme segment (wo kan le) and the first audio segment (0s~3s); the second phoneme-audio pair P2 consists of the second phoneme segment (yi chang) and the second audio segment (3s~4s); the third phoneme-audio pair P3 consists of the third phoneme segment (jing cai de) and the third audio segment (4s~6s); the fourth phoneme-audio pair P4 consists of the fourth phoneme segment (zu qiu) and the fourth audio segment (6s~8s); the fifth phoneme-audio pair P5 consists of the fifth phoneme segment (bi sai) and the fifth audio segment (8s~10s).
  • After the phoneme-audio pairs P1 to P5 are obtained, it is also necessary to convert the phoneme segments in each phoneme-audio pair into text segments, so as to obtain text-audio pairs that can be used in the subsequent splicing process.
  • the target text "I watched a wonderful football game” can determine the text segment corresponding to the phoneme segment in each phoneme audio pair, that is, the first phoneme segment (wo kan le) corresponding to the first phoneme segment (wo kan le) included in the first phoneme audio pair P1.
  • the text segment is (I watched); the second text segment corresponding to the second phoneme segment (yi chang) contained in the second phoneme audio pair P2 is (one field); the third phoneme audio pair P3 contains the The third text fragment corresponding to the triphone fragment (jing cai de) is (wonderful); the fourth text fragment corresponding to the fourth phoneme fragment (zu qiu) contained in the fourth phoneme audio pair P4 is (football); The fifth text segment corresponding to the fifth phoneme segment (bi sai) contained in the pentaphone audio pair P5 is (match).
  • Afterwards, a plurality of text-audio pairs corresponding to the target text and the target audio can be generated from the obtained text segments and audio segments, as shown in the segmentation result of FIG. 3: the first text-audio pair TA1 consists of the first text segment (I watched) and the first audio segment (0s~3s);
  • the second text-audio pair TA2 consists of the second text segment (one field) and the second audio segment (3s~4s);
  • the third text-audio pair TA3 consists of the third text segment (wonderful) and the third audio segment (4s~6s);
  • the fourth text-audio pair TA4 consists of the fourth text segment (football) and the fourth audio segment (6s~8s);
  • the fifth text-audio pair TA5 consists of the fifth text segment (match) and the fifth audio segment (8s~10s). These pairs are used to subsequently splice out sample data that meets the requirements for writing into the training database and is used when training a speech synthesis model (a segmentation sketch follows).
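  • A minimal sketch of the segmentation just described: given the aligned phoneme audio file and the segmentation positions, the file is cut into phoneme-audio pairs, and each phoneme segment is mapped back to its text segment. The cut positions and text segments are taken from the worked example (the Chinese text is reconstructed from its pinyin); in practice they come from the alignment tool and the target text.

```python
from typing import List, Tuple

# Alignment result: (phoneme, start_s, end_s), as produced by a forced-alignment tool.
alignment = [
    ("wo", 0.0, 1.0), ("kan", 1.0, 2.0), ("le", 2.0, 3.0),
    ("yi", 3.0, 3.5), ("chang", 3.5, 4.0),
    ("jing", 4.0, 4.7), ("cai", 4.7, 5.4), ("de", 5.4, 6.0),
    ("zu", 6.0, 7.0), ("qiu", 7.0, 8.0),
    ("bi", 8.0, 9.0), ("sai", 9.0, 10.0),
]
cut_points = [3.0, 4.0, 6.0, 8.0, 10.0]                        # sentence pauses / long breaks
text_segments = ["我看了", "一场", "精彩的", "足球", "比赛"]     # looked up in the target text


def segment(alignment, cut_points) -> List[Tuple[List[str], Tuple[float, float]]]:
    """Cut the aligned phoneme sequence at the given segmentation positions."""
    pairs, current, start = [], [], 0.0
    cuts = iter(cut_points)
    boundary = next(cuts)
    for phoneme, _, end in alignment:
        current.append(phoneme)
        if end >= boundary:                        # reached a segmentation position
            pairs.append((current, (start, end)))
            current, start = [], end
            boundary = next(cuts, float("inf"))
    return pairs


phoneme_audio_pairs = segment(alignment, cut_points)
text_audio_pairs = [(text, interval)
                    for text, (_, interval) in zip(text_segments, phoneme_audio_pairs)]
# [('我看了', (0.0, 3.0)), ('一场', (3.0, 4.0)), ('精彩的', (4.0, 6.0)),
#  ('足球', (6.0, 8.0)), ('比赛', (8.0, 10.0))]
```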
  • Step S104 calculating the audio feature of the audio segment of each text-audio pair in the multiple text-audio pairs, and filtering out, from the multiple text-audio pairs according to the audio features, the target text-audio pair and the spliced text-audio pair corresponding to the target text-audio pair.
  • Since the text-audio pairs written into the training database are used to train the model, in order to improve the prediction accuracy of the trained model it is also necessary to ensure the quality of the sample data used to train the model; that is, when splicing a text-audio pair that can be written into the training database, the timbre and rhythm of the text-audio pairs before splicing also need to be considered.
  • If two text-audio pairs differ in timbre and rhythm, or their pitch fluctuations are inconsistent, the spliced text-audio pair will suffer from mismatched audio fragments and semantically inconsistent text fragments before and after the splice, and cannot be used to train the model.
  • Therefore, before splicing, the present application calculates the audio feature of each text-audio pair and then, based on the audio features, selects text-audio pairs that can be spliced from the multiple text-audio pairs, so that text-audio pairs with similar pitch, rhythm and other attributes are spliced together; this yields text-audio pairs whose audio is continuous and whose text segments are semantically consistent, i.e. high-quality text-audio pairs for use in subsequent model training.
  • the audio features include but are not limited to features that characterize the pitch frequency of audio clips, features of audio frames and/or features of audio frame energy, etc.
  • Through the audio features of the audio clips in the text-audio pairs, it is possible to analyze whether the text-audio pairs that need to be spliced are suitable for splicing; that is, through the fundamental frequency feature, the audio frame feature and/or the audio frame energy feature, it is determined whether the pitch, rhythm and other attributes of the text-audio pairs to be spliced are similar or the same.
  • The spliced text-audio pair is filtered out of the plurality of text-audio pairs according to the audio features; the target text-audio pair specifically refers to a reference text-audio pair, and the spliced text-audio pair is a text-audio pair that satisfies the splicing conditions with respect to that reference text-audio pair.
  • Based on this, in order to obtain text-audio pairs that can be spliced with each other (i.e. whose timbre and rhythm are similar or the same) and thus generate more sample data, the audio feature of the audio clip in each text-audio pair is calculated; after the target text-audio pair is determined, the spliced text-audio pairs corresponding to the target text-audio pair are screened out of the plurality of text-audio pairs according to the audio features of the audio clip in the target text-audio pair and the audio features of the audio clips of the respective text-audio pairs, for subsequent generation of sample data. In this way, when a large amount of sample data is spliced, not only is the quantity requirement met, but the audio features also ensure the similarity between the text-audio pairs being spliced, thereby improving the quality of the spliced text-audio pairs.
  • When calculating the audio features of each text-audio pair, the audio clip can be processed frame by frame, and the audio features are analyzed through the audio frames.
  • the specific implementation is as follows:
  • the audio feature of the audio segment of each text-audio pair is determined according to the pitch frequency feature and the audio frame feature of the audio segment of each text-audio pair.
  • The fundamental frequency feature specifically refers to the frequency value corresponding to the vibration with the lowest frequency among the series of vibrations with different frequencies and amplitudes emitted by the sounding body in the audio segment.
  • The audio frame feature specifically refers to the frame energy value obtained by Fourier-transforming the audio frames in the clip and computing over the points on the spectrum. Correspondingly, the pitch frequency feature can be used to analyze whether the vibration amplitudes of the pronunciations of the text-audio pairs are similar when spliced,
  • and the audio frame feature can be used to analyze whether the energy distributions of the text-audio pairs are similar or the same when spliced; through pitch frequency and frame energy, text-audio pairs that splice together well can be selected, so that the splicing yields sample data that meets usage needs.
  • Based on the audio frames, the fundamental frequency feature and audio frame feature of the audio segment in each text-audio pair are calculated, and finally the audio feature of the audio segment of each text-audio pair is determined according to its fundamental frequency feature and audio frame feature.
  • Further, when calculating the audio features, the start audio features (start pitch frequency feature and start audio frame feature) and the end audio features (end pitch frequency feature and end audio frame feature) of the audio segment of each text-audio pair can be calculated. After the target text-audio pair is filtered out of the plurality of text-audio pairs, the spliced text-audio pair is screened out based on these start and end audio features; when the target text-audio pair and the spliced text-audio pair are then spliced, the target text-audio pair can be used as the starting text-audio pair and the spliced text-audio pair as the ending text-audio pair, and the two are spliced in order to obtain the to-be-detected text-audio pair that needs to be detected subsequently.
  • Alternatively, the target text-audio pair can be used as the ending text-audio pair: the spliced text-audio pair is screened out based on the start audio feature of the audio clip in the target text-audio pair and the end audio features of the audio clips in the other text-audio pairs; when the target text-audio pair and the spliced text-audio pair are then spliced, the target text-audio pair is used as the ending text-audio pair and the spliced text-audio pair as the starting text-audio pair.
  • The two are spliced in order to obtain the to-be-detected text-audio pair that needs to be detected later; in this process, the possibility of splicing the target text-audio pair with the other text-audio pairs has already been considered with the target as both the starting and the ending text-audio pair, so repeated splicing with the target text-audio pair can be omitted, improving the processing efficiency of the subsequent splicing process.
  • The pitch frequency feature can be calculated by a time-domain estimation method, i.e. the pitch frequency is estimated directly from the audio waveform, for example the autocorrelation method, the parallel processing method, the average magnitude difference method or the data reduction method. It can also be calculated by a transform method, i.e. the audio speech signal is transformed into the frequency domain or cepstral domain to estimate the pitch frequency: the influence of the vocal tract is first removed by homomorphic analysis to obtain the information belonging to the excitation part, and the pitch frequency is then calculated, for example with the cepstrum method. It can further be calculated by a hybrid method, i.e. the signal's vocal-tract model parameters are first extracted and used to filter the signal to obtain a sound source sequence, and the autocorrelation method or the average magnitude difference method is then used to calculate the pitch frequency. An appropriate method for calculating the pitch frequency feature of the audio segment can be chosen according to the actual application scenario, which is not limited in this embodiment.
  • the calculation of the audio frame feature of the audio segment may be implemented by selecting an appropriate method according to an actual application scenario, which is not limited in this embodiment.
  • When framing the audio clips in each text-audio pair, a fixed frame length can be used, such as 32ms or 64ms; the specific frame length can be set according to actual needs and is not limited in this embodiment.
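  • The sketch below shows one concrete way to obtain such per-segment features with numpy: the audio segment is cut into fixed 32 ms frames, the fundamental frequency of a frame is estimated with a simple autocorrelation method, the frame energy is the sum of squared spectral magnitudes after an FFT, and the start/end features are read from the first and last frames. This is an illustrative implementation under those assumptions, not the one mandated by the specification.

```python
import numpy as np


def frame_signal(x: np.ndarray, sr: int, frame_ms: int = 32) -> np.ndarray:
    """Split a mono signal into non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(x) // frame_len
    return x[: n_frames * frame_len].reshape(n_frames, frame_len)


def pitch_autocorr(frame: np.ndarray, sr: int, fmin=60.0, fmax=400.0) -> float:
    """Estimate the fundamental frequency of one frame by autocorrelation."""
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), min(int(sr / fmin), len(corr) - 1)
    if hi <= lo:
        return 0.0
    lag = lo + int(np.argmax(corr[lo:hi]))
    return sr / lag if corr[lag] > 0 else 0.0


def frame_energy(frame: np.ndarray) -> float:
    """Frame energy: sum of squared magnitudes of the frame's spectrum."""
    return float(np.sum(np.abs(np.fft.rfft(frame)) ** 2))


def segment_features(x: np.ndarray, sr: int) -> dict:
    """Start/end pitch-frequency and frame-energy features of one audio segment."""
    frames = frame_signal(x, sr)
    f0 = [pitch_autocorr(f, sr) for f in frames]
    energy = [frame_energy(f) for f in frames]
    return {"f0_start": f0[0], "f0_end": f0[-1],
            "energy_start": energy[0], "energy_end": energy[-1]}


# Example: a synthetic 200 Hz tone standing in for a 1 s audio segment.
sr = 16000
t = np.arange(sr) / sr
print(segment_features(np.sin(2 * np.pi * 200 * t), sr))
```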
  • Following the above example, the start pitch frequency, start frame energy, end pitch frequency and end frame energy are to be calculated; based on this, the audio clips in each text-audio pair are first extracted, and each audio clip is framed to obtain five sets of audio frames, corresponding to the five text-audio pairs respectively.
  • the audio features of the audio clips in each text-audio pair will be calculated in advance, and the attribute information of the audio clips in each text-audio pair will be analyzed in the attribute dimension.
  • a target text-audio pair corresponds to a spliced text-audio pair
  • a text-audio pair with a better effect after splicing can be selected in combination with the audio features as the splicing text-audio pair, so as to improve the quality of the sample data.
  • The target text-audio pair and the spliced text-audio pair are screened out of the text-audio pairs according to the audio features, and the subsequent splicing process is then performed to obtain sample data that meets the writing requirements.
  • the specific implementation methods are steps S1042 to S1052 as shown in FIG. 4 .
  • Step S1042 integrating the audio segment, text segment and audio feature of each text-audio pair in the plurality of text-audio pairs, obtaining a text-audio package corresponding to each text-audio pair, and writing it into the segment database;
  • Step S1044 selecting any text audio package as the target text audio package in the fragment database, and determining the text audio pair in the target text audio package as the target text audio pair;
  • Step S1046 selecting text audio packages other than the target text audio package in the fragment database to form a set of text audio packages to be screened;
  • Step S1048 determining the text-audio pair of each to-be-screened text-audio package included in the to-be-screened text-audio package set as the to-be-screened text-audio pair;
  • Step S1050 based on the audio features of the audio clips of the target text-audio pair and the audio features of the audio clips of the to-be-screened text-audio pair, screen out the spliced text-audio package from the text-audio package set to be screened;
  • Step S1052 taking the text-audio pair in the spliced text-audio package as the spliced text-audio pair (an illustrative code sketch of these steps follows).
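  • A minimal sketch of steps S1042 to S1052 with an in-memory stand-in for the segment database: every text-audio pair is bundled with its audio features into a package, one package is picked as the target, and the remaining packages form the to-be-screened set. The structures, field names and feature values are illustrative assumptions (the Chinese text segments are reconstructed from the example's pinyin).

```python
from dataclasses import dataclass


@dataclass
class TextAudioPackage:
    """A text-audio pair bundled with the audio features of its audio segment."""
    text: str
    audio: tuple       # (start_s, end_s) of the segment within the source audio
    features: dict     # e.g. f0_start, f0_end, energy_start, energy_end


# Step S1042: integrate segment, text and features into packages and write them
# into the segment database (here simply a list); feature values are placeholders.
segment_database = [
    TextAudioPackage("我看了", (0.0, 3.0), {"f0_end": 210.0, "energy_end": 1.1}),
    TextAudioPackage("一场", (3.0, 4.0), {"f0_start": 205.0, "energy_start": 1.0}),
    TextAudioPackage("精彩的", (4.0, 6.0), {"f0_start": 320.0, "energy_start": 2.4}),
]

# Steps S1044 to S1048: pick any package as the target; the rest are to be screened.
target_package = segment_database[0]
to_screen = [p for p in segment_database if p is not target_package]
```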
  • The text-audio package specifically refers to a set composed of a text-audio pair written into the fragment database and its corresponding audio features.
  • The fragment database specifically refers to a database that temporarily stores the text segments, audio segments and corresponding audio features of the text-audio pairs. After the plurality of text-audio pairs are obtained, since screening the associated spliced text-audio pairs for a target text-audio pair takes a certain amount of time, the text-audio packages can first be written into the segment database; when splicing is required, text-audio pairs can then be extracted from the segment database for the subsequent splicing process.
  • the to-be-screened text-audio packages contained in the to-be-screened text-audio package set specifically refer to other text-audio packages in the fragment database except the target text-audio package.
  • The to-be-screened text-audio pairs are the text-audio pairs contained in the to-be-screened text-audio packages; the spliced text-audio package specifically refers to the text-audio package to which a text-audio pair that can be spliced with the target text-audio pair belongs.
  • any text audio package is selected from the fragment database as the target text audio package, and the text audio pair contained in the target text audio package is extracted as the target text audio pair.
  • other text and audio packets except the target text and audio packets are selected as the text and audio packets to be screened, and a set of text and audio packets to be screened is formed.
  • Afterwards, the matching degree between the target text-audio pair and each to-be-screened text-audio pair can be calculated, and the text-audio package to which a to-be-screened text-audio pair with a higher matching degree belongs is taken as the spliced text-audio package; that is, the text-audio pair in the spliced text-audio package is used as the spliced text-audio pair corresponding to the target text-audio pair, for subsequently splicing the two to obtain sample data that meets the requirements for writing into the training database.
  • The spliced text-audio package can be obtained in the following manner, so that the text-audio pair in the spliced text-audio package is used as the spliced text-audio pair for subsequent splicing with the target text-audio pair to obtain sample data that satisfies the requirements for writing into the training database.
  • the specific implementation method is as follows:
  • The to-be-screened text-audio package to which a to-be-screened text-audio pair whose feature distance is smaller than the preset distance threshold belongs is determined as the spliced text-audio package.
  • the first audio feature is the audio feature of the audio segment in the target text-audio pair
  • the second audio feature is the audio feature of the audio segment in the text-audio pair to be screened.
  • The feature distance specifically refers to a numerical value for evaluating the degree of matching between text-audio pairs: the larger the feature distance, the lower the matching degree between the text-audio pairs; conversely, the smaller the feature distance, the higher the matching degree between the text-audio pairs.
  • Based on this, the feature distance between the target text-audio pair and each to-be-screened text-audio pair can be calculated according to the first audio feature and the second audio feature, and the to-be-screened text-audio pairs whose feature distance is less than the preset distance threshold are selected as spliced text-audio pairs for the subsequent splicing process.
  • In the feature distance calculation, L represents the feature distance;
  • F0e represents the end pitch frequency feature of the audio segment in the target text-audio pair;
  • F0s represents the start pitch frequency feature of the audio segment in the to-be-screened text-audio pair;
  • Ee represents the end audio frame feature of the audio segment in the target text-audio pair;
  • Es represents the start audio frame feature of the audio segment in the to-be-screened text-audio pair; these quantities enter the feature distance calculation, illustrated in the sketch below.
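  • The exact distance formula is not reproduced in this excerpt, so the sketch below simply combines the absolute differences between the target pair's end features (F0e, Ee) and a candidate pair's start features (F0s, Es); treat this definition of L, the weights, and the reuse of the package structure from the previous sketch as assumptions.

```python
def feature_distance(target_features: dict, candidate_features: dict,
                     f0_weight: float = 1.0, energy_weight: float = 1.0) -> float:
    """Assumed form of L: weighted absolute differences between the target pair's
    end features (F0e, Ee) and the candidate pair's start features (F0s, Es)."""
    d_f0 = abs(target_features["f0_end"] - candidate_features["f0_start"])
    d_energy = abs(target_features["energy_end"] - candidate_features["energy_start"])
    return f0_weight * d_f0 + energy_weight * d_energy


def screen_spliced_pairs(target_package, to_screen, distance_threshold: float):
    """Steps S1050/S1052: keep the to-be-screened packages whose feature distance to
    the target package is below the preset threshold (their pairs become spliced pairs)."""
    return [p for p in to_screen
            if feature_distance(target_package.features, p.features) < distance_threshold]
```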
  • Following the above example, the text-audio pairs and their corresponding audio features can be integrated into text-audio packages (TP1 to TP5) and written into the segment database D, so that text-audio packages can be selected from it for splicing during subsequent processing.
  • In the segment database D, the text-audio package TP1 is selected as the target text-audio package, and the text-audio pair TA1 contained in TP1 is determined as the target text-audio pair; meanwhile, the text-audio packages TP2, TP3, TP4 and TP5 are selected as the to-be-screened text-audio packages, and the text-audio pairs TA2, TA3, TA4 and TA5 in each to-be-screened text-audio package are taken as the to-be-screened text-audio pairs.
  • The feature distance between the target text-audio pair TA1 and each to-be-screened text-audio pair is then calculated, and any to-be-screened text-audio pair whose feature distance is less than the distance threshold LT is regarded as a spliced text-audio pair that can be spliced with the target text-audio pair TA1. The comparison shows that the feature distances L1, L3 and L4 are less than the distance threshold LT, which indicates that when the target text-audio pair TA1 is spliced with the to-be-screened text-audio pairs TA2, TA4 and TA5, their timbre and rhythm are relatively close, so higher-quality sample data can be spliced later.
  • That is, the to-be-screened text-audio pairs TA2, TA4 and TA5 can be spliced with the target text-audio pair TA1, and they are therefore determined as the spliced text-audio pairs of the target text-audio pair TA1.
  • Alternatively, the target text-audio pair can be used as the backward text-audio pair and the to-be-screened text-audio pairs as forward text-audio pairs, and the feature distances between them are then calculated.
  • In summary, the spliced text-audio pairs are screened in combination with the audio features, so that the screened text-audio pairs and the target text-audio pair are close to each other in attributes such as timbre and rhythm; to-be-detected text-audio pairs that meet the usage requirements can then be spliced, expanding the training database for use by the downstream business.
  • Step S106 splicing the target text-audio pair and the spliced text-audio pair into a to-be-detected text-audio pair, and detecting the to-be-detected text-audio pair.
  • After the spliced text-audio pair corresponding to the target text-audio pair is screened out based on the audio features, the target text-audio pair and the spliced text-audio pair are further spliced to obtain the to-be-detected text-audio pair. Because the to-be-detected text-audio pair is spliced from two text-audio pairs, in order to further ensure that the text-audio pairs written into the training database have quality assurance,
  • the to-be-detected text-audio pair can also be detected before being written into the training database, for example to check whether its audio segment is clear and whether the length of its text segment is appropriate, so that text-audio pairs of better quality are written into the training database.
  • The target text-audio pair may itself meet the requirements for writing into the training database, that is, the target text-audio pair can be written into the training database without being spliced with other text-audio pairs; so, in order to improve the richness of the training database, the target text-audio pair can also be detected.
  • the specific implementation is as follows:
  • the target sampling information specifically refers to the number of sampling bits and the sampling frequency when randomly sampling the audio clips in the target text-audio pair.
  • The sampling frequency refers to the number of samples taken from the audio clip per second; the higher the sampling frequency, the more realistic and natural the restored audio clip, and conversely, the lower the sampling frequency, the less real and natural the restored audio clip.
  • The target text information specifically refers to the length information, character count information, etc. of the text segment in the target text-audio pair.
  • The preset detection condition specifically refers to detecting whether the audio segment and the text segment meet the conditions for writing into the training database: the text-audio pair can be written into the training database when both its audio segment and its text segment meet the preset detection condition, or when either the audio segment or the text segment meets the preset detection condition.
  • Step S106 may then be performed, and the spliced to-be-detected text-audio pairs are detected, so as to achieve balanced writing of the audio clips and text clips of the text-audio pairs in the training database and ensure that the text and audio segments in the training database are similar or identical in form (audio length, text length, audio energy, etc.).
  • Following the above example, the text-audio pair TA1 is selected from the text-audio pairs TA1 to TA5 as the target text-audio pair; at this time, the first audio segment (0s~3s) in the text-audio pair TA1 can be detected.
  • the target text-audio pair can be detected, so as to avoid omission of the text-audio pair that meets the requirements for writing into the training database, thereby improving the richness of the training database.
  • Since each text-audio pair contains a text segment and an audio segment, it is necessary to splice the audio segments while splicing the text segments in order to generate the to-be-detected text-audio pair.
  • In this embodiment, the specific implementation of generating the to-be-detected text-audio pair is as follows:
  • the to-be-detected text-audio pair is formed based on the to-be-detected text segment and the to-be-detected audio segment.
  • Specifically, the target text segment and the target audio segment are first extracted from the target text-audio pair, and the spliced text segment and the spliced audio segment are extracted from the spliced text-audio pair; then the target text segment and the spliced text segment are spliced into the to-be-detected text segment, and the target audio segment and the spliced audio segment are spliced into the to-be-detected audio segment; finally, the to-be-detected text segment and the to-be-detected audio segment form the to-be-detected text-audio pair (sketched in code below).
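  • A minimal sketch of this splicing step, assuming the text segments are plain strings and the audio segments are numpy sample arrays: the texts are concatenated as strings and the audio as arrays, yielding the to-be-detected text-audio pair.

```python
import numpy as np


def splice_pair(target_text: str, target_audio: np.ndarray,
                splice_text: str, splice_audio: np.ndarray):
    """Concatenate the target and splicing segments into a to-be-detected pair."""
    text_to_detect = target_text + splice_text
    audio_to_detect = np.concatenate([target_audio, splice_audio])
    return text_to_detect, audio_to_detect


# Example with placeholder waveforms: 3 s and 1 s of silence at 16 kHz.
sr = 16000
text, audio = splice_pair("我看了", np.zeros(3 * sr), "一场", np.zeros(1 * sr))
print(text, audio.shape[0] / sr)   # 我看了一场 4.0
```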
  • The specific implementation is as follows:
  • the text-audio pair to be detected is written into the training database.
  • Following the above example, the target text-audio pair and the spliced text-audio pairs are spliced: the first audio segment (0s~3s) and the first text segment (I watched) are extracted from the target text-audio pair TA1, the second audio segment (3s~4s) and the second text segment (one field) are extracted from the spliced text-audio pair TA2, the fourth audio segment (6s~8s) and the fourth text segment (football) are extracted from the spliced text-audio pair TA4, and the fifth audio segment (8s~10s) and the fifth text segment (match) are extracted from the spliced text-audio pair TA5.
  • The first audio segment and the second audio segment are spliced to obtain the first to-be-detected audio segment (length 4s), the first audio segment and the fourth audio segment are spliced to obtain the second to-be-detected audio segment (length 5s), and the first audio segment and the fifth audio segment are spliced to obtain the third to-be-detected audio segment (length 5s); at the same time, the first text segment and the second text segment are spliced to obtain the first to-be-detected text segment, the first text segment and the fourth text segment are spliced to obtain the second to-be-detected text segment, and the first text segment and the fifth text segment are spliced to obtain the third to-be-detected text segment.
  • Finally, the first to-be-detected audio segment and the first to-be-detected text segment are combined into the first to-be-detected text-audio pair, the second to-be-detected audio segment and the second to-be-detected text segment into the second to-be-detected text-audio pair, and the third to-be-detected audio segment and the third to-be-detected text segment into the third to-be-detected text-audio pair.
  • the above-mentioned three text-audio pairs to be detected are further detected.
  • The text-audio pairs written into the training database are used as sample data. Based on this, random sampling in [0, 1] is performed on the to-be-detected audio segment of each to-be-detected text-audio pair: the sampling result of the first to-be-detected audio segment in the first to-be-detected text-audio pair is determined to be U1, the sampling result of the second to-be-detected audio segment in the second to-be-detected text-audio pair to be U2, and the sampling result of the third to-be-detected audio segment in the third to-be-detected text-audio pair to be U3. At the same time, the text length of the first to-be-detected text segment in the first to-be-detected text-audio pair is determined to be X1, the text length of the second to-be-detected text segment in the second to-be-detected text-audio pair to be X2, and the text length of the third to-be-detected text segment in the third to-be-detected text-audio pair to be X3.
  • It is then judged whether the sampling results U1, U2 and U3 are greater than the preset sampling result Ut, and whether the text lengths X1, X2 and X3 are less than the preset text length Xt. According to the judgment results, the sampling result U2 is greater than the preset sampling result Ut and the text length X2 is less than the preset text length Xt, and the sampling result U3 is greater than the preset sampling result Ut and the text length X3 is less than the preset text length Xt; that is, the second to-be-detected text-audio pair and the third to-be-detected text-audio pair can be written into the training database T. The second to-be-detected text-audio pair (5s of audio, text "I watched football") and the third to-be-detected text-audio pair (5s of audio, text "I watched the game") are therefore written into the training database T as sample data for subsequent training of the speech synthesis model.
  • The to-be-detected text-audio pairs are detected in both the audio dimension and the text dimension, so that the text-audio pairs written into the training database all meet the writing requirements, which effectively improves the quality of the sample data in the training database.
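  • A sketch of the detection step just described, under the assumption that the random sampling of the to-be-detected audio segment is summarized as a single value U in [0, 1] and that the thresholds Ut and Xt take the placeholder values below: the pair is written into the training database only when U exceeds Ut and the text length stays below Xt.

```python
import random

UT = 0.5    # preset sampling result Ut (assumed value)
XT = 10     # preset text length Xt (assumed value)


def detect(pair) -> bool:
    """Return True if the to-be-detected pair meets the (assumed) detection condition."""
    text, _audio = pair
    sampling_result = random.random()   # stand-in for the audio sampling result U in [0, 1]
    return sampling_result > UT and len(text) < XT


def write_if_valid(pair, training_database) -> bool:
    """Write the pair into the training database only if detection passes."""
    if detect(pair):
        training_database.append(pair)
        return True
    return False


training_database = []
write_if_valid(("我看了一场", [0.0] * 64000), training_database)
```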
  • Step S108 in the case that the to-be-detected text-audio pair satisfies a preset detection condition, write the to-be-detected text-audio pair into a training database.
  • If the to-be-detected text-audio pair meets the preset detection condition, it meets the requirements for writing into the training database, and the to-be-detected text-audio pair can be written into the training database as sample data; when the speech synthesis model is trained later, sample data that meets the training requirements can then be extracted from the training database, improving the prediction accuracy of the trained speech synthesis model.
  • In addition, the number of data entries written into the training database can be limited.
  • For to-be-detected text-audio pairs that meet the preset detection condition, it is first checked whether the number of text-audio pairs in the training database is less than or equal to the preset data volume threshold.
  • If so, the text-audio pairs that meet the preset detection condition can be written into the training database; if the number is greater than the threshold, the training database cannot accept further text-audio pairs, and the subsequent splicing of text-audio pairs can be stopped.
  • the text-audio pairs in the training database can be used as sample data (sample text-audio pairs) to train the speech synthesis model in the downstream business.
  • a speech synthesis model is trained based on the sample text segment and the sample audio segment to obtain a target speech synthesis model.
  • Specifically, a large number of sample text-audio pairs can be extracted from the training database, and the speech synthesis model is then trained based on the sample text segments and sample audio segments in the sample text-audio pairs until a speech synthesis model that satisfies the training stop condition is obtained; it is stored as the target speech synthesis model so that text can be converted into audio in a speech synthesis scenario. For example, if the text is "I like to watch football games", the text is input into the speech synthesis model for processing, and the audio corresponding to the text is obtained, realizing the conversion of text into speech.
  • Furthermore, if the to-be-detected text-audio pair does not meet the writing requirements, multi-degree spliced text-audio pairs corresponding to the spliced text-audio pair can be selected from the plurality of text-audio pairs according to the audio features, and splicing and detection are performed again, until a text-audio pair that meets the requirements for writing into the training database is obtained.
  • Even if the splicing and detection process is performed repeatedly, the obtained to-be-detected text-audio pairs may still fail to meet the requirements for writing into the training database.
  • Therefore, a stop condition can be set: when the number of splicing operations reaches a certain limit, processing of that text-audio pair can be stopped and it can be discarded.
  • the specific implementation method is as follows:
  • The multi-degree spliced text-audio pair specifically refers to a text-audio pair that can be spliced with the spliced text-audio pair. Based on this, if the to-be-detected text-audio pair does not meet the preset detection condition, it means that the to-be-detected text-audio pair obtained by splicing the target text-audio pair and the spliced text-audio pair does not meet the requirements for writing into the training database.
  • In that case, the multi-degree to-be-detected text-audio pair can be used as the to-be-detected text-audio pair, the multi-degree spliced text-audio pair as the spliced text-audio pair, and the process returns to the screening of multi-degree spliced text-audio pairs, until a text-audio pair that meets the requirements for writing into the training database is obtained, or the text-audio pair is discarded once the stop-splicing condition is reached.
  • Following the above example, the obtained first to-be-detected text-audio pair does not meet the requirements for writing into the training database T. Its first to-be-detected audio segment is composed of the first audio segment and the second audio segment, and its first to-be-detected text segment is composed of the first text segment and the second text segment, so the third text-audio pair TA3, which has the possibility of being spliced after the second text-audio pair TA2, can be selected as the multi-degree spliced text-audio pair. The multi-degree to-be-detected audio segment in the resulting multi-degree to-be-detected text-audio pair then consists of the first audio segment, the second audio segment and the third audio segment, and the multi-degree to-be-detected text segment consists of the first text segment, the second text segment and the third text segment.
  • That is, the multi-degree to-be-detected text-audio pair is (a 6s audio segment, and the text segment formed by "I watched", "one field" and "wonderful"). This multi-degree to-be-detected text-audio pair is then detected: if it satisfies the preset detection condition, it can be written into the training database T; if it does not, a text-audio pair that has the possibility of being spliced after the third text-audio pair TA3 can be selected for further splicing and detection, or the multi-degree to-be-detected text-audio pair can be discarded and other text-audio pairs processed as described above, until sample data that satisfies writing into the training database T is obtained.
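  • The multi-degree splicing logic can be summarized by the loop sketched below: while the spliced pair fails detection and candidates that can extend it remain, the next spliced pair is appended; once a maximum splice count (the assumed stop condition) is reached, the pair is discarded. The detect function can be the illustrative one from the earlier sketch.

```python
def build_sample(target, candidates, detect, max_splices: int = 3):
    """Iteratively splice candidate pairs onto the target until detection passes or
    the stop condition (max_splices) is reached; return the accepted pair or None."""
    text, audio = target
    for count, (cand_text, cand_audio) in enumerate(candidates, start=1):
        text = text + cand_text
        audio = audio + cand_audio        # list concatenation standing in for audio
        if detect((text, audio)):
            return text, audio            # meets the requirement for the training database
        if count >= max_splices:
            break                         # stop condition reached: discard
    return None
```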
  • This specification provides a sample generation method: after acquiring multiple text-audio pairs, the audio feature of the audio segment of each text-audio pair is calculated, the target text-audio pair and the spliced text-audio pair corresponding to it are screened out from the multiple text-audio pairs according to the audio features, the target text-audio pair and the spliced text-audio pair are then spliced into a to-be-detected text-audio pair and detected, and the to-be-detected text-audio pair is written into the training database when it satisfies a preset detection condition. In the sample data preparation stage, high-quality sample data that meets the needs of downstream business use can thus be obtained by splicing, which saves resource consumption in the data preparation stage; moreover, the amount of sample data written into the training database after splicing is large, which effectively alleviates the problem that a large amount of training data is required.
  • the sample generation method is further described below by taking the application of the sample generation method provided in this specification in a speech synthesis scenario as an example with reference to FIG. 5 .
  • FIG. 5 shows a processing flow chart of a sample generation method applied in a speech synthesis scenario provided by an embodiment of this specification; the method specifically includes the following steps:
  • Step S502 acquiring the target text and the audio corresponding to the target text.
  • This embodiment provides a sample generation method applied in a speech synthesis scenario to solve the above problem.
  • Step S504 preprocess the audio to obtain target audio, and convert the target text into a phoneme sequence.
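The preprocessing in step S504 mainly removes the blank audio at the start and end of the recording, as described earlier in this specification. The sketch below shows one simple amplitude-threshold way to do that; the threshold value and the use of NumPy are assumptions for illustration only.

```python
import numpy as np

def trim_silence(samples: np.ndarray, threshold: float = 1e-3) -> np.ndarray:
    """Drop leading/trailing samples whose absolute amplitude stays below threshold."""
    voiced = np.where(np.abs(samples) > threshold)[0]
    if voiced.size == 0:
        return samples[:0]                 # entirely silent segment
    return samples[voiced[0]:voiced[-1] + 1]
```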
  • Step S506 aligning the phoneme sequence with the target audio, obtaining a phoneme audio file according to the alignment processing result, and determining the segmentation position of the phoneme audio file.
  • Step S508 segment the phoneme audio file according to the segmentation positions to obtain multiple phoneme audio pairs, and determine the text segment corresponding to the phoneme segment of each phoneme audio pair in the multiple phoneme audio pairs based on the target text.
  • Step S510 Generate a plurality of text-audio pairs according to the text segments corresponding to the phoneme segments in each phoneme-audio pair and the audio segments in each phoneme-audio pair.
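For illustration, the sketch below cuts an aligned phoneme-audio file at the detected pause positions and pairs each audio slice with its text, roughly following steps S506-S510. The alignment tuple format, the character-level phoneme-to-text mapping and the function name build_text_audio_pairs are simplifying assumptions; real alignments produced by tools such as kaldi or HTK carry more structure.

```python
from typing import List, Tuple

def build_text_audio_pairs(alignment: List[Tuple[str, str, float, float]],
                           split_times: List[float],
                           audio: List[float],
                           sample_rate: int) -> List[Tuple[str, List[float]]]:
    """alignment: one (phoneme, text_char, start_sec, end_sec) tuple per aligned unit.
    Cuts the aligned file at the given pause positions and returns
    (text_segment, audio_segment) pairs."""
    if not alignment:
        return []
    pairs = []
    boundaries = [0.0] + sorted(split_times) + [alignment[-1][3]]
    for lo, hi in zip(boundaries[:-1], boundaries[1:]):
        text = "".join(ch for _, ch, s, e in alignment if s >= lo and e <= hi)
        segment = audio[int(lo * sample_rate):int(hi * sample_rate)]
        if text and segment:
            pairs.append((text, segment))
    return pairs
```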
  • Step S512 extracting the audio segment of each text-audio pair in the plurality of text-audio pairs, and performing frame segmentation processing on the audio segment of each text-audio pair to obtain an audio frame set of each text-audio pair.
  • Step S514 based on the audio frames included in the audio frame set of each text-audio pair in the plurality of text-audio pairs, calculate the fundamental frequency feature and the audio frame feature of the audio segment of each text-audio pair.
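One possible realization of steps S512-S514 is sketched below: each audio segment is framed, a pitch-frequency (F0) feature is estimated per frame, and an average frame-energy feature is computed. The autocorrelation pitch estimator, the 32 ms frame length and the frequency search range are assumptions; the specification only requires some pitch-frequency and audio-frame feature and leaves the estimation method open.

```python
import numpy as np

def frame_signal(samples: np.ndarray, sample_rate: int, frame_ms: int = 32) -> np.ndarray:
    """Split the signal into non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    return samples[:n_frames * frame_len].reshape(n_frames, frame_len)

def estimate_f0(frame: np.ndarray, sample_rate: int,
                fmin: float = 60.0, fmax: float = 400.0) -> float:
    """Crude autocorrelation-based pitch estimate for a single frame."""
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sample_rate / fmax), min(int(sample_rate / fmin), len(corr) - 1)
    if hi <= lo:
        return 0.0
    lag = lo + int(np.argmax(corr[lo:hi]))
    return sample_rate / lag if lag > 0 else 0.0

def segment_features(samples: np.ndarray, sample_rate: int) -> tuple:
    """Average F0 (pitch-frequency feature) and average frame energy (audio-frame feature)."""
    frames = frame_signal(samples.astype(np.float64), sample_rate)
    if frames.shape[0] == 0:
        return 0.0, 0.0
    f0 = float(np.mean([estimate_f0(f, sample_rate) for f in frames]))
    energy = float(np.mean(np.sum(frames ** 2, axis=1)))
    return f0, energy
```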
  • Step S516 Integrate the audio segment, text segment, fundamental frequency feature and audio frame feature of each text-audio pair in the multiple text-audio pairs, obtain a text-audio package corresponding to each text-audio pair, and write the segment database.
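The "text-audio package" of step S516 can be thought of as the segment data plus its boundary features. A minimal in-memory representation, assuming start/end pitch and energy values as the stored audio features, might look like the following; the dataclass fields and the list standing in for the segment database are illustrative assumptions rather than a prescribed storage format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TextAudioPackage:
    text_segment: str
    audio_segment: List[float]
    start_f0: float        # pitch-frequency feature at the start of the segment
    end_f0: float          # pitch-frequency feature at the end of the segment
    start_energy: float    # audio-frame (energy) feature at the start
    end_energy: float      # audio-frame (energy) feature at the end

segment_database: List[TextAudioPackage] = []   # stand-in for the segment database
```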
  • Step S518, select any text-audio package in the segment database as the target text-audio package, and determine the text-audio pair in the target text-audio package as the target text-audio pair.
  • Step S520 selecting text audio packages other than the target text audio package in the segment database to form a set of text audio packages to be screened.
  • Step S522 Determine the text-audio pair of each to-be-screened text-audio package included in the to-be-screened text-audio package set as the to-be-screened text-audio pair.
  • Step S524 Determine the pitch frequency feature and audio frame feature of the audio clip of the target text-audio pair according to the target text-audio package, and determine the pitch frequency feature and audio frame feature of the audio clip of the text-audio pair to be screened according to the text-audio package to be screened.
  • Step S526 Calculate the feature distance based on the pitch frequency feature and audio frame feature of the audio segment of the target text-audio pair, and the pitch frequency feature and audio frame feature of the audio segment of the text-audio pair to be screened.
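Formula (1) of the description defines the feature distance as the squared difference of the pitch-frequency features plus the squared difference of the audio-frame features, taken between the end of the target segment and the start of the candidate. A direct sketch, reusing the package fields assumed above and an assumed distance threshold for the selection performed in the next step, is:

```python
def feature_distance(end_f0: float, end_energy: float,
                     start_f0: float, start_energy: float) -> float:
    """Formula (1): squared pitch difference plus squared frame-energy difference."""
    return (end_f0 - start_f0) ** 2 + (end_energy - start_energy) ** 2

def screen_spliced_packages(target, candidates, threshold: float):
    """Keep candidate packages whose distance from the target's ending features
    to their own starting features is below the preset distance threshold."""
    return [c for c in candidates
            if feature_distance(target.end_f0, target.end_energy,
                                c.start_f0, c.start_energy) < threshold]
```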
  • Step S528 determine the to-be-screened text-audio package to which a to-be-screened text-audio pair whose feature distance is less than the preset distance threshold belongs as the spliced text-audio package.
  • Step S530 take the text-audio pair in the spliced text-audio package as the spliced text-audio pair.
  • Step S532 extracting the target text segment and the target audio segment in the target text-audio pair, and the spliced text segment and the spliced audio segment in the spliced text-audio pair.
  • Step S534 splicing the target text segment and the spliced text segment into a text segment to be detected, and splicing the target audio segment and the spliced audio segment into an audio segment to be detected.
  • Step S536 compose a text-audio pair to be detected based on the text segment to be detected and the audio segment to be detected.
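A trivial sketch of steps S532-S536 follows, concatenating the target segments with the spliced segments (target first). Plain string/list concatenation is an assumption, since real audio joining may require matching sample rates or smoothing at the boundary.

```python
def build_candidate_pair(target_text: str, target_audio: list,
                         splice_text: str, splice_audio: list) -> tuple:
    """Concatenate the target segments with the spliced segments, target first."""
    return target_text + splice_text, target_audio + splice_audio
```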
  • Step S538 Perform sampling processing on the to-be-detected audio segment in the to-be-detected text-audio pair to obtain the to-be-detected sampling information, and determine to-be-detected text information of the to-be-detected text segment in the to-be-detected text-audio pair.
  • Step S540 in the case that the sample information to be detected and the text information to be detected both meet the preset detection conditions, write the text-audio pair to be detected into the training database.
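Steps S538-S540 can be summarized as: sample the candidate audio, measure the candidate text, and write the pair only if both checks pass. The sketch below uses a random value in [0, 1] as the sampling information and a text-length limit as the text information; both thresholds are illustrative stand-ins for the preset detection condition.

```python
import random

def detect_and_write(candidate_text: str, candidate_audio: list,
                     training_db: list,
                     sample_threshold: float = 0.5, max_text_len: int = 20) -> bool:
    """Write the candidate pair to the training database only if both the audio
    sampling check and the text-length check pass."""
    sample_value = random.uniform(0.0, 1.0)      # random sampling in [0, 1]
    if sample_value > sample_threshold and len(candidate_text) < max_text_len:
        training_db.append((candidate_text, candidate_audio))
        return True
    return False
```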
  • This specification provides a sample generation method which, in the sample data preparation stage, obtains high-quality sample data that meets the needs of downstream services by splicing, saving resource costs in the data preparation stage; the sample data written into the training database after splicing is large in quantity, which effectively alleviates the problem of poor speech synthesis caused by the small amount of downstream sample data and the uneven distribution of audio lengths in the sample data, thereby improving the processing efficiency of downstream services.
  • FIG. 6 shows a schematic structural diagram of a sample generating apparatus provided by an embodiment of the present specification.
  • the device includes:
  • an acquisition module 602 configured to acquire a plurality of text-audio pairs, wherein each text-audio pair includes a text segment and an audio segment;
  • the calculation module 604 is configured to calculate the audio feature of the audio segment of each text-audio pair in the plurality of text-audio pairs, and to screen out, according to the audio features, the target text-audio pair and the spliced text-audio pair corresponding to the target text-audio pair from the plurality of text-audio pairs;
  • the splicing module 606 is configured to splice the target text-audio pair and the spliced text-audio pair into a to-be-detected text-audio pair, and to detect the to-be-detected text-audio pair;
  • the writing module 608 is configured to write the to-be-detected text-audio pair into a training database when the to-be-detected text-audio pair meets a preset detection condition.
  • the obtaining module 602 is further configured to:
  • acquire the target text and the audio corresponding to the target text; preprocess the audio to obtain the target audio, and convert the target text into a phoneme sequence; align the phoneme sequence with the target audio, and generate the plurality of text-audio pairs according to the alignment result.
  • the obtaining module 602 is further configured to:
  • obtain a phoneme audio file according to the alignment result and determine the segmentation positions of the phoneme audio file; segment the phoneme audio file according to the segmentation positions to obtain multiple phoneme-audio pairs, each containing a phoneme segment and an audio segment; determine, based on the target text, the text segment corresponding to the phoneme segment of each phoneme-audio pair; and generate the plurality of text-audio pairs from the text segments corresponding to the phoneme segments and the audio segments of the phoneme-audio pairs.
  • the computing module 604 is further configured to:
  • extract the audio segment of each text-audio pair in the plurality of text-audio pairs and perform frame division on it to obtain an audio frame set for each text-audio pair; calculate the pitch-frequency feature and the audio-frame feature of each audio segment based on the audio frames contained in its frame set; and determine the audio feature of each audio segment from its pitch-frequency feature and audio-frame feature.
  • the computing module 604 is further configured to:
  • integrate the audio segment, text segment and audio features of each text-audio pair in the plurality of text-audio pairs to obtain a text-audio package corresponding to each text-audio pair, and write it into the segment database; select any text-audio package in the segment database as the target text-audio package, and determine the text-audio pair in the target text-audio package as the target text-audio pair; determine the spliced text-audio package based on the audio features and the text-audio packages in the segment database other than the target text-audio package, and take the text-audio pair in the spliced text-audio package as the spliced text-audio pair.
  • the computing module 604 is further configured to:
  • select the text-audio packages other than the target text-audio package in the segment database to form a set of to-be-screened text-audio packages; determine the text-audio pair of each to-be-screened text-audio package in the set as a to-be-screened text-audio pair; and, based on the audio features of the audio segment of the target text-audio pair and of the audio segments of the to-be-screened text-audio pairs, screen the spliced text-audio package out of the set of to-be-screened text-audio packages.
  • the computing module 604 is further configured to:
  • determine a first audio feature of the audio segment of the target text-audio pair from the target text-audio package, and a second audio feature of the audio segment of the to-be-screened text-audio pair from the to-be-screened text-audio package; calculate the feature distance between the first audio feature and the second audio feature; and determine the to-be-screened text-audio package to which a to-be-screened text-audio pair whose feature distance is less than the preset distance threshold belongs as the spliced text-audio package.
  • the sample generating device further includes:
  • a sampling module configured to perform sampling processing on the audio segment in the target text-audio pair to obtain target sampling information, determine the target text information of the text segment in the target text-audio pair, and judge whether the target sampling information and the target text information satisfy the preset detection condition;
  • if not, the splicing module 606 is run;
  • if so, the target text-audio pair is written into the training database.
  • the splicing module 606 is further configured to:
  • extract the target text segment and the target audio segment from the target text-audio pair, and the spliced text segment and the spliced audio segment from the spliced text-audio pair; splice the target text segment and the spliced text segment into the to-be-detected text segment, and the target audio segment and the spliced audio segment into the to-be-detected audio segment; and compose the to-be-detected text-audio pair from the to-be-detected text segment and the to-be-detected audio segment.
  • the splicing module 606 is further configured to:
  • perform sampling processing on the to-be-detected audio segment to obtain to-be-detected sampling information, determine the to-be-detected text information of the to-be-detected text segment, and detect the to-be-detected sampling information and the to-be-detected text information against the preset detection condition;
  • the writing module 608 is further configured to:
  • write the to-be-detected text-audio pair into the training database when both the to-be-detected sampling information and the to-be-detected text information satisfy the preset detection condition.
  • the sample generating device further includes:
  • a screening module configured to, when the to-be-detected text-audio pair does not satisfy the preset detection condition, screen out from the plurality of text-audio pairs, according to the audio features, a multi-degree spliced text-audio pair corresponding to the spliced text-audio pair; splice the multi-degree spliced text-audio pair and the to-be-detected text-audio pair into a multi-degree to-be-detected text-audio pair, and judge whether the multi-degree to-be-detected text-audio pair satisfies the preset detection condition;
  • if so, the multi-degree to-be-detected text-audio pair is written into the training database;
  • if not, the multi-degree spliced text-audio pair is taken as the spliced text-audio pair, the multi-degree to-be-detected text-audio pair is taken as the to-be-detected text-audio pair, and the screening module is run again.
  • the sample generating device further includes:
  • a training module configured to extract sample text-audio pairs in the training database, the sample text-audio pairs including sample text fragments and sample audio fragments; based on the sample text fragments and the sample audio fragments, a speech synthesis model Perform training to obtain the target speech synthesis model.
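Purely as a hint of how the training module might consume the training database, the sketch below extracts the sample text and audio segments and hands them to a placeholder model; SpeechSynthesisModel and its fit method are invented for this example and do not correspond to any model architecture or framework prescribed by the specification.

```python
from typing import List, Tuple

class SpeechSynthesisModel:
    """Placeholder; a real implementation would be a neural text-to-speech model."""
    def fit(self, texts: List[str], audios: List[list]) -> "SpeechSynthesisModel":
        self.examples = list(zip(texts, audios))   # stores pairs instead of training
        return self

def train_from_database(training_db: List[Tuple[str, list]]) -> SpeechSynthesisModel:
    """Extract sample text/audio segments from the training database and fit a model."""
    texts = [text for text, _ in training_db]
    audios = [audio for _, audio in training_db]
    return SpeechSynthesisModel().fit(texts, audios)
```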
  • With the sample generating device, after multiple text-audio pairs are acquired, the audio feature of the audio segment of each text-audio pair is calculated, the target text-audio pair and its corresponding spliced text-audio pair are screened out from the multiple text-audio pairs according to the audio features, the two are spliced into a to-be-detected text-audio pair, the to-be-detected text-audio pair is detected, and it is written into the training database when it satisfies the preset detection condition. In the sample data preparation stage, high-quality sample data that meets the needs of downstream services can thus be obtained by splicing, which saves resource costs in the data preparation stage; the amount of sample data written into the training database after splicing is large, which effectively alleviates the problem of poor speech synthesis caused by the small amount of downstream sample data and the uneven distribution of audio lengths in the sample data, thereby improving the processing efficiency of downstream services.
  • The above is a schematic solution of the sample generating apparatus of this embodiment. It should be noted that the technical solution of the sample generating apparatus and the technical solution of the above-mentioned sample generation method belong to the same concept; for details not described in the technical solution of the sample generating apparatus, refer to the description of the technical solution of the sample generation method.
  • FIG. 7 shows a structural block diagram of a computing device 700 according to an embodiment of the present specification.
  • Components of the computing device 700 include, but are not limited to, memory 710 and processor 720 .
  • the processor 720 is connected to the memory 710 through the bus 730, and the database 750 is used for storing data.
  • Computing device 700 also includes access device 740 that enables computing device 700 to communicate via one or more networks 760 .
  • networks include a public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet.
  • Access device 740 may include one or more of any type of network interface (e.g., a network interface card (NIC)), wired or wireless, such as an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and the like.
  • computing device 700 and other components not shown in FIG. 7 may also be connected to each other, such as through a bus. It should be understood that the structural block diagram of the computing device shown in FIG. 7 is only for the purpose of example, rather than limiting the scope of the present specification. Those skilled in the art can add or replace other components as required.
  • Computing device 700 may be any type of stationary or mobile computing device, including mobile computers or mobile computing devices (eg, tablet computers, personal digital assistants, laptop computers, notebook computers, netbooks, etc.), mobile phones (eg, smart phones) ), wearable computing devices (eg, smart watches, smart glasses, etc.) or other types of mobile devices, or stationary computing devices such as desktop computers or PCs.
  • Computing device 700 may also be a mobile or stationary server.
  • the processor 720 is configured to execute the following computer-executable instructions:
  • acquire a plurality of text-audio pairs, wherein each text-audio pair contains a text segment and an audio segment;
  • calculate the audio feature of the audio segment of each text-audio pair in the plurality of text-audio pairs, and screen out, according to the audio features, a target text-audio pair and a spliced text-audio pair corresponding to the target text-audio pair from the plurality of text-audio pairs;
  • splice the target text-audio pair and the spliced text-audio pair into a to-be-detected text-audio pair, and detect the to-be-detected text-audio pair;
  • write the to-be-detected text-audio pair into a training database when the to-be-detected text-audio pair satisfies a preset detection condition.
  • the above is a schematic solution of a computing device according to this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above-mentioned sample generation method belong to the same concept, and the details not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the above-mentioned sample generation method.
  • An embodiment of the present specification further provides a computer-readable storage medium which stores computer instructions that, when executed by a processor, are used to:
  • acquire a plurality of text-audio pairs, wherein each text-audio pair contains a text segment and an audio segment;
  • calculate the audio feature of the audio segment of each text-audio pair in the plurality of text-audio pairs, and screen out, according to the audio features, a target text-audio pair and a spliced text-audio pair corresponding to the target text-audio pair from the plurality of text-audio pairs;
  • splice the target text-audio pair and the spliced text-audio pair into a to-be-detected text-audio pair, and detect the to-be-detected text-audio pair;
  • write the to-be-detected text-audio pair into a training database when the to-be-detected text-audio pair satisfies a preset detection condition.
  • the above is a schematic solution of a computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the above-mentioned sample generation method belong to the same concept, and the details not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the above-mentioned sample generation method.
  • the computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, a software distribution medium, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

This specification provides a sample generation method and apparatus, wherein the sample generation method includes: acquiring a plurality of text-audio pairs, wherein each text-audio pair contains a text segment and an audio segment; calculating the audio feature of the audio segment of each text-audio pair in the plurality of text-audio pairs, and screening out, according to the audio features, a target text-audio pair and a spliced text-audio pair corresponding to the target text-audio pair from the plurality of text-audio pairs; splicing the target text-audio pair and the spliced text-audio pair into a to-be-detected text-audio pair, and detecting the to-be-detected text-audio pair; and writing the to-be-detected text-audio pair into a training database when the to-be-detected text-audio pair satisfies a preset detection condition.

Description

样本生成方法及装置
本申请要求于2020年11月20日提交中国专利局、申请号为202011309190.7、发明名称为“样本生成方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本说明书涉及数据处理技术领域,特别涉及样本生成方法及装置。
背景技术
随着互联网技术的发展,语音合成被越来越多的场景所应用;语音合成(Text ToSpeech,TTS)又称为文语转换技术,是一种通过软、硬件结合的方式将文本转换为自然语音的技术,该技术可以通过波形拼接、基于参数的合成或使用神经网络的端到端合成方法实现;其中波形拼接方法需要较长时间的训练数据才能够完成语音合成;而基于参数的合成方法虽然可以完成语音合成,但是所参考的因素较少,导致最终的合成结果并不理想;现有技术中应用较为广泛的即为基于神经网络的端到端合成方法,该方法所需要的数据量小,无需人工调整大量参数即可实现语音合成;虽然端到端的语音合成方法在数据量的需求上要小于其他方法,但是结合基于神经网络的端到端合成方法的特性,对语音数据的质量要求要远高于其他方法,前期准备语音数据的成本将大大的增加,并且准备好的语音数据还可能存在不完善的问题,严重影响基于神经网络的端到端合成方法的实现,因此亟需一种有效的方案以解决上述问题。
发明内容
有鉴于此,本说明书实施例提供了一种样本生成方法。本说明书同时涉及一种样本生成装置,一种计算设备,以及一种计算机可读存储介质,以解决现有技术中存在的技术缺陷。
根据本说明书实施例的第一方面,提供了一种样本生成方法,包括:
获取多个文本音频对,其中每个文本音频对中包含文本片段和音频片段;
计算所述多个文本音频对中每个文本音频对的音频片段的音频特征,并根据所述音频特征在所述多个文本音频对中筛选出目标文本音频对和所述目标文本音频对对应的拼接文本音频对;
将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,并对所述待检测文本音频对进行检测;
在所述待检测文本音频对满足预设检测条件的情况下,将所述待检测文本 音频对写入训练数据库。
可选地,所述获取多个文本音频对,包括:
获取目标文本以及所述目标文本对应的音频;
对所述音频进行预处理获得目标音频,并将所述目标文本转换为音素序列;
将所述音素序列与所述目标音频进行对齐处理,并根据对齐处理结果生成所述多个文本音频对。
可选地,所述根据对齐处理结果生成所述多个文本音频对,包括:
根据对齐处理结果得到音素音频文件,并确定所述音素音频文件的切分位置;
按照所述切分位置对所述音素音频文件进行切分,获得多个音素音频对,其中每个音素音频对中包含音素片段和音频片段;
基于所述目标文本确定所述多个音素音频对中的每个音素音频对的音素片段对应的文本片段;
根据每个音素音频对中音素片段对应的文本片段,以及每个音素音频对中的音频片段生成所述多个文本音频对。
可选地,所述计算所述多个文本音频对中每个文本音频对的音频片段的音频特征,包括:
提取所述多个文本音频对中每个文本音频对的音频片段,并对每个文本音频对的音频片段进行分帧处理,获得每个文本音频对的音频帧集合;
基于所述多个文本音频对中每个文本音频对的音频帧集合包含的音频帧,计算每个文本音频对的音频片段的基音频率特征和音频帧特征;
根据每个文本音频对的音频片段的所述基音频率特征和所述音频帧特征,确定每个文本音频对的音频片段的所述音频特征。
可选地,所述根据所述音频特征在所述多个文本音频对中筛选出目标文本音频对和所述目标文本音频对对应的拼接文本音频对,包括:
将所述多个文本音频对中每个文本音频对的音频片段、文本片段和音频特征进行整合,获得每个文本音频对对应的文本音频包,并写入片段数据库;
在所述片段数据库中选择任意一个文本音频包作为目标文本音频包,并将所述目标文本音频包中的文本音频对确定为所述目标文本音频对;
基于所述片段数据库中除所述目标文本音频包外的文本音频包和所述音频特征确定拼接文本音频包,并将所述拼接文本音频包中的文本音频对作为所述拼接文本音频对。
可选地,所述基于所述片段数据库中除所述目标文本音频包外的文本音频包和所述音频特征确定拼接文本音频包,包括:
在所述片段数据库中选择除所述目标文本音频包外的文本音频包组成待筛选文本音频包集合;
将所述待筛选文本音频包集合中包含的各个待筛选文本音频包的文本音频对确定为待筛选文本音频对;
基于所述目标文本音频对的音频片段的音频特征和所述待筛选文本音频对的音频片段的音频特征,在所述待筛选文本音频包集合中筛选出所述拼接文本音频包。
可选地,所述基于所述目标文本音频对的音频片段的音频特征和所述待筛选文本音频对的音频片段的音频特征,在所述待筛选文本音频包集合中筛选出所述拼接文本音频包,包括:
根据所述目标文本音频包确定所述目标文本音频对的音频片段的第一音频特征,以及根据所述待筛选文本音频包确定所述待筛选文本音频对的音频片段的第二音频特征;
计算所述第一音频特征和所述第二音频特征之间的特征距离;
将所述特征距离小于预设距离阈值的待筛选文本音频对所属的待筛选文本音频包确定为所述拼接文本音频包。
可选地,所述将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,并对所述待检测文本音频对进行检测步骤执行之前,还包括:
对所述目标文本音频对中的音频片段进行采样处理获得目标采样信息,以及确定所述目标文本音频对中的文本片段的目标文本信息;
判断所述目标采样信息和所述目标文本信息是否满足所述预设检测条件;
若否,则执行将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,并对所述待检测文本音频对进行检测的步骤。
可选地,若所述判断所述采样信息和所述文本信息是否满足所述预设检测条件的判断结果为是,则将所述目标文本音频对写入所述训练数据库。
可选地,所述将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,包括:
提取所述目标文本音频对中的目标文本片段和目标音频片段,以及所述拼接文本音频对中的拼接文本片段和拼接音频片段;
将所述目标文本片段和所述拼接文本片段拼接为待检测文本片段,以及将 所述目标音频片段和所述拼接音频片段拼接为待检测音频片段;
基于所述待检测文本片段和所述待检测音频片段组成所述待检测文本音频对。
可选地,所述对所述待检测文本音频对进行检测,包括:
对所述待检测音频片段进行采样处理获得待检测采样信息,以及确定所述待检测文本片段的待检测文本信息;
基于所述预设检测条件对所述待检测采样信息和所述待检测文本信息进行检测;
相应的,所述在所述待检测文本音频对满足预设检测条件的情况下,将所述待检测文本音频对写入训练数据库,包括:
在所述待检测采样信息和所述待检测文本信息均满足所述预设检测条件的情况下,将所述待检测文本音频对写入所述训练数据库。
可选地,所述将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,并对所述待检测文本音频对进行检测步骤执行之后,还包括:
在所述待检测文本音频对未满足预设检测条件的情况下,根据所述音频特征在所述多个文本音频对中筛选出所述拼接文本音频对对应的多度拼接文本音频对;
将所述多度拼接文本音频对和所述待检测文本音频对拼接为多度待检测文本音频对,并判断所述多度待检测文本音频对是否满足所述预设检测条件;
若是,则将所述多度待检测文本音频对写入所述训练数据库;
若否,则将所述多度拼接文本音频对作为所述拼接文本音频对,以及将所述多度待检测文本音频对作为所述待检测文本音频对,并执行所述根据所述音频特征在所述多个文本音频对中筛选出所述拼接文本音频对对应的多度拼接文本音频对步骤。
可选地,所述将所述待检测文本音频对写入训练数据库步骤执行之后,还包括:
在所述训练数据库中提取样本文本音频对,所述样本文本音频对中包含样本文本片段和样本音频片段;
基于所述样本文本片段和所述样本音频片段对语音合成模型进行训练,获得目标语音合成模型。
根据本说明书实施例的第二方面,提供了一种样本生成装置,包括:
获取模块,被配置为获取多个文本音频对,其中每个文本音频对中包含文 本片段和音频片段;
计算模块,被配置为计算所述多个文本音频对中每个文本音频对的音频片段的音频特征,并根据所述音频特征在所述多个文本音频对中筛选出目标文本音频对和所述目标文本音频对对应的拼接文本音频对;
拼接模块,被配置为将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,并对所述待检测文本音频对进行检测;
写入模块,被配置为在所述待检测文本音频对满足预设检测条件的情况下,将所述待检测文本音频对写入训练数据库。
根据本说明书实施例的第三方面,提供了一种计算设备,包括:
存储器和处理器;
所述存储器用于存储计算机可执行指令,所述处理器用于执行所述计算机可执行指令:
获取多个文本音频对,其中每个文本音频对中包含文本片段和音频片段;
计算所述多个文本音频对中每个文本音频对的音频片段的音频特征,并根据所述音频特征在所述多个文本音频对中筛选出目标文本音频对和所述目标文本音频对对应的拼接文本音频对;
将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,并对所述待检测文本音频对进行检测;
在所述待检测文本音频对满足预设检测条件的情况下,将所述待检测文本音频对写入训练数据库。
根据本说明书实施例的第四方面,提供了一种计算机可读存储介质,其存储有计算机可执行指令,该指令被处理器执行时实现所述样本生成方法的步骤。本说明书提供一种样本生成方法,在获取到多个文本音频对后,计算所述多个文本音频对中每个文本音频对的音频片段的音频特征,并根据所述音频特征在所述多个文本音频对中筛选出目标文本音频对,以及所述目标文本音频对对应的拼接文本音频对,之后将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,并对所述待检测文本音频对进行检测,在所述待检测文本音频对满足预设检测条件的情况下,将所述待检测文本音频对写入所述训练数据库,实现在样本数据准备阶段,可以通过拼接的方式得到质量高且满足下游业务使用需求的样本数据,节省数据准备阶段的资源消耗成本,并且拼接后写入所述训练数据库中的样本数据的数据量较多,有效的解决了下游业务样本数据量少、样本数据中音频长短分布不均匀导致语音合成效果不佳的问题,从而提高下游业务的业务处理效率。
附图说明
图1是本说明书一实施例提供的一种样本生成方法的流程图;
图2是本说明书一实施例提供的一种样本生成方法中对齐处理结果的示意图;
图3是本说明书一实施例提供的一种样本生成方法中切分处理结果的示意图;
图4是本说明书一实施例提供的一种样本生成方法中筛选拼接文本音频对的流程图;
图5是本说明书一实施例提供的一种应用于语音合成场景中的样本生成方法的流程图;
图6是本说明书一实施例提供的一种样本生成装置的结构示意图;
图7是本说明书一实施例提供的一种计算设备的结构框图。
具体实施方式
在下面的描述中阐述了很多具体细节以便于充分理解本说明书。但是本说明书能够以很多不同于在此描述的其它方式来实施,本领域技术人员可以在不违背本说明书内涵的情况下做类似推广,因此本说明书不受下面公开的具体实施的限制。
在本说明书一个或多个实施例中使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本说明书一个或多个实施例。在本说明书一个或多个实施例和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本说明书一个或多个实施例中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。
应当理解,尽管在本说明书一个或多个实施例中可能采用术语第一、第二等来描述各种信息,但这些信息不应限于这些术语。这些术语仅用来将同一类型的信息彼此区分开。例如,在不脱离本说明书一个或多个实施例范围的情况下,第一也可以被称为第二,类似地,第二也可以被称为第一。取决于语境,如在此所使用的词语“如果”可以被解释成为“在……时”或“当……时”或“响应于确定”。
首先,对本说明书一个或多个实施例涉及的名词术语进行解释。
F0(基音频率):一般的声音都是由发音体发出的一系列频率、振幅各不相同的振动复合而成的;这些振动中有一个频率最低的振动,由它发出的音就是基音,其对应的频率即为基音频率。
强制对齐:一种得到给定音素的序列和语音在时间上的对应关系的技术,可以通过强制对齐工具实现,如通过kaldi(一种开源语音识别工具(Toolkit),它使用WFST来实现解码算法)或HTK(HMM Toolkit,一款基于hmm模型的语音处理工具)等即可实现音素序列和音频的对齐。
音素,是根据语音的自然属性划分出来的最小语音单位,依据音节里的发音动作来分析,一个动作构成一个音素。音素分为元音与辅音两大类。如汉语音节啊(ā)只有一个音素,爱(ài)有两个音素,代(dài)有三个音素等;在汉语中,音素即为拼音;在英语中,音素即为音标。
在本说明书中,提供了一种样本生成方法,本说明书同时涉及一种样本生成装置,一种计算设备,以及一种计算机可读存储介质,在下面的实施例中逐一进行详细说明。
实际应用中,在基于神经网络的端到端语音合成方法中,由于该方法的特性,需要在模型训练前准备部分质量较高的样本数据,以实现训练出满足使用需求的语音合成模型;而这部分样本数据的往往需要在专业的录音棚中进行录制后,再进行修剪和整理后才能够用于训练模型,不仅需要消耗较多的时间才能够完成数据的准备,而且成本消耗上也是一笔较大的开销;同时由于样本数据的要求较为严格,导致最终能够用于训练模型的数据少之更少,进而无法得到覆盖长度和韵律较为全面的样本数据,从而影响语音合成时的音色不像、韵律(音调起伏)不自然等问题。因此在样本数据的准备阶段如何生成质量高且属性较为丰富的样本数据是亟需解决的问题。
本说明书提供一种样本生成方法,在获取到多个文本音频对后,计算所述多个文本音频对中每个文本音频对的音频片段的音频特征,并根据所述音频特征在所述多个文本音频对中筛选出目标文本音频对,以及所述目标文本音频对对应的拼接文本音频对,之后将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,并对所述待检测文本音频对进行检测,在所述待检测文本音频对满足预设检测条件的情况下,将所述待检测文本音频对写入所述训练数据库,实现在样本数据准备阶段,可以通过拼接的方式得到质量高且满足下游业务使用需求的样本数据,节省数据准备阶段的资源消耗成本,并且拼接后写入所述训练数据库中的样本数据的数据量较多,有效的解决了下游业务样本数据量少、样本数据中音频长短分布不均匀导致语音合成效果不佳的问题,从而提高下游业务的业务处理效率。
图1示出了根据本说明书一实施例提供的一种样本生成方法的流程图,具体包括以下步骤:
步骤S102,获取多个文本音频对,其中每个文本音频对中包含文本片段 和音频片段。
具体的,所述文本音频对具体是指具有对应关系的文本片段和音频片段组成的队列,所述文本片段包括但不限于字单元、词单元或句单元等,所述音频片段包括但不限于与字单元,词单元或句单元匹配的语音。
基于此,由于下游业务的处理过程为训练语音合成模型,即训练出能够将文本转换为音频的模型,因此需要在样本数据的准备阶段准备出大量满足下游业务使用需求的样本数据,并且为了能够训练出预测精准度较高的语音合成模型,将在数据准备阶段结合音频特征进行样本数据的准备,从而实现在较低的消耗成本下完成高质量、高数量的样本数据准备工作。
进一步的,由于下游业务对样本数据的需求量较大且质量要求较高,如果只是通过人工录制音频的方式构建样本数据将消耗较多的时间,并且属性覆盖范围较小,本申请为了解决该问题,采用拼接的方式实现样本数据的准备工作,实现可以拼接出大量的样本数据用于下游业务,并且为了保证样本数据的质量要求,将结合音频特征完成拼接处理,从而完成样本数据的准备工作;基于此,在样本数据的准备阶段,可以准备长短不一的少量文本,并生成这部分文本对应的音频,之后基于少量的文本和其对应的少量的音频进行写入训练数据库中样本数据的构建。
本实施例将以少量文本中的任意一个文本作为目标文本为例,对所述样本生成方法进行描述,生成满足需求的样本数据过程均可参见本实施例相应的描述内容,在此不作过多赘述。
在生成所述样本数据的过程中,为了能够拼接出大量可以写入所述训练数据库的文本音频对,将对获取到的目标文本及其对应的音频进行切分后对齐,从而得到所述多个文本音频对,本实施例中,具体实现方式如下所述:
获取目标文本以及所述目标文本对应的音频;
对所述音频进行预处理获得目标音频,并将所述目标文本转换为音素序列;
将所述音素序列与所述目标音频进行对齐处理,并根据对齐处理结果生成所述多个文本音频对。
具体的,所述目标文本包括但不限于一篇文章或一段语句等,相应的,所述音频具体是指针对所述目标文本生成的语音,所述目标文本对应的音频可以采用录制的方式或语音合成的方式生成,本实施例在此不作任何限定,需要说明的是,所述音频与所述目标文本的匹配度较高,以保证后续拼接时可以得到较多能够写入所述训练数据库的样本数据;所述目标音频具体是指对所述音频进行标准化处理后得到的音频,所述音素序列具体是指构成所述目标文本的最 小单位组成的序列,所述对齐处理具体是指在音频中找到文本对应的时间区间。
基于此,为了能够保证文本音频对中的文本片段和音频片段相近程度较高,将在生成所述文本音频对时,从文本的最小单位出发完成对齐处理过程,即在获取到所述目标文本以及目标文本对应的音频之后,首先对所述音频进行预处理得到目标音频,实现去掉所述音频中对后续处理过程会造成干扰的部分,如音频起始和/或末尾处的空白音频片段(未发声的音频片段)或音频起始和/或末尾处的噪声较大的音频片段(无法分辨该音频片段的发音内容)等;其次将所述目标文本转换为音素序列,实现通过最小单位完成对齐文本和音频,提高对齐精准度;最后将所述音素序列和所述目标音频进行对齐处理,根据对齐处理结果即可得到所述多个文本音频对。
实际应用中,在将所述音素序列和所述目标音频进行对齐处理的过程中,可以使用kaldi对齐工具或者HTK对齐工具完成;此外还可以根据实际需求选择其他对齐工具完成所述音素序列与所述目标音频的对齐,本实施例在此不作任何限定。
进一步的,在根据对齐处理结果生成所述多个文本音频对的过程中,由于是从文本的最小单位完成对齐处理过程的,因此在完成音素与音频的对齐处理后,还需要将音素转换为文本,从而实现文本和音频精准的对齐,并经过切分以得到满足后续使用需求的多个文本音频对,本实施例中,具体实现方式如下所述:
根据对齐处理结果得到音素音频文件,并确定所述音素音频文件的切分位置;
按照所述切分位置对所述音素音频文件进行切分,获得多个音素音频对,其中每个音素音频对中包含音素片段和音频片段;
基于所述目标文本确定所述多个音素音频对中的每个音素音频对的音素片段对应的文本片段;
根据每个音素音频对中音素片段对应的文本片段,以及每个音素音频对中的音频片段生成所述多个文本音频对。
具体的,所述音素音频文件具体是指音素序列和目标音频经过对齐后得到的文件;所述切分位置可以是所述目标音频中的断句位置或发音中断时间超过设定时间阈值的位置;所述音素音频对具体是指由具有对应关系的音素片段和音频片段组成的队列。
基于此,在完成所述音素序列和所述目标音频的对齐之后,将得到所述音素音频文件,为了能够在后续拼接出大量的满足写入训练数据库的样本数据, 将根据所述音素音频文件中目标音频的切分位置对所述音素音频文件进行切分处理,获得多个音素音频对,其中每个音素音频对中包含音素片段和其对应的音频片段,之后再基于所述目标文本将每个音素音频对的音素片段转换为文本片段,从而实现根据每个音素音频对中音素片段对应的文本片段和音频片段组成所述文本音频对,所述文本音频对中包含文本片段和其对应的音频片段。此时形成的多个文本音频对,即可实现在后续处理时拼接出用于写入训练数据库的样本数据,完成样本数据的准备工作。
具体实施时,由于所述目标音频已经与所述音频序列完成精准的对齐,因此在按照所述目标音频的切分位置对所述音素音频文件进行切分时,可以实现被切分出的音素音频对中包含的音素片段和音频片段也是相对应的;并且根据用户讲话的特性,被切分后的音素音频对中所包含的音素片段可以保证在目标文本中找到相应的文本片段,不会出现音素片段被切分后不完整的问题。
例如,目标文本为“我看了一场精彩的足球比赛”,并针对该目标文本生成了12s的音频,为了能够满足后续对齐处理,将删除该音频起始处和末尾处的空白音频片段,得到时间长度为10s的目标音频,同时为了提高对齐精准度,可以转换出目标文本“我看了一场精彩的足球比赛”对应的音素序列(wo kan le yi chang jing cai de zu qiu bi sai),并通过kaldi对齐工具对音素序列和目标音频进行对齐处理,得到如图2所示的对齐处理结果,即由音素序列和目标音频组成的音素音频文件。
进一步的,通过对音素音频文件中目标音频的检测,确定目标音频中讲话的用户在录制这段语音时断句四次,第一次断句在目标音频的3s处,第二次断句在目标音频的4s处,第三次断句在目标音频的6s处,第四次断句在目标音频的8s处,此时即可确定对音素音频文件的切分位置分别为T 1=3,T 2=4,T 3=6和T 4=8处,按照切分位置对音素音频文件进行切分,得到五个音素音频对,第一音素音频对P 1由第一音素片段(wo kan le)和第一音频片段(0s~3s)组成;第二音素音频对P 2由第二音素片段(yi chang)和第二音频片段(3s~4s)组成;第三音素音频对P 3由第三音素片段(jing cai de)和第三音频片段(4s~6s)组成;第四音素音频对P 4由第四音素片段(zu qiu)和第四音频片段(6s~8s)组成;第五音素音频对P 5由第五音素片段(bi sai)和第五音频片段(8s~10s)组成。
更进一步的,在得到音素音频对P 1~P 5之后,还需要将各个音素音频对中的音素片段转换为文本片段,从而得到能够用于后续拼接处理的文本音频对,此时根据目标文本“我看了一场精彩的足球比赛”即可确定各个音素音频对中音素片段对应的文本片段,即第一音素音频对P 1中包含的第一音素片段(wo  kan le)对应的第一文本片段为(我看了);第二音素音频对P 2中包含的第二音素片段(yi chang)对应的第二文本片段为(一场);第三音素音频对P 3中包含的第三音素片段(jing cai de)对应的第三文本片段为(精彩的);第四音素音频对P 4中包含的第四音素片段(zu qiu)对应的第四文本片段为(足球);第五音素音频对P 5中包含的第五音素片段(bi sai)对应的第五文本片段为(比赛)。
最后根据上述得到的文本片段和音频片段即可生成目标文本和目标音频对应的多个文本音频对,如图3所示的切分结果,其中,第一文本音频对TA 1由第一文本片段(我看了)和第一音频片段(0s~3s)组成;第二文本音频对TA 2由第二文本片段(一场)和第二音频片段(3s~4s)组成;第三文本音频对TA 3由第三文本片段(精彩的)和第三音频片段(4s~6s)组成;第四文本音频对TA 4由第四文本片段(足球)和第四音频片段(6s~8s)组成;第五文本音频对TA 5由第五文本片段(比赛)和第五音频片段(8s~10s)组成;以用于后续拼接出满足写入训练数据库的样本数据,实现在训练语音合成模型时使用。
综上,在构建所述多个文本音频对时,通过从最小单位音素完成与目标音频的对齐,不仅可以提高文本和音频对齐的精准度,还能够在后续进行切分时,保证被切分后的音素片段与音频片段的匹配程度是较高的,从而实现切分出的多个文本音频对都可以用于后续的样本生成过程,为后续生成样本数据提供充足的数量保障。
步骤S104,计算所述多个文本音频对中每个文本音频对的音频片段的音频特征,并根据所述音频特征在所述多个文本音频对中筛选出目标文本音频对和所述目标文本音频对对应的拼接文本音频对。
具体的,在上述获取到所述多个文本音频对的基础上,进一步的,由于写入所述训练数据库中的文本音频对是为了训练模型使用,因此为了提高训练出的模型的预测精准度,还需要保证训练模型所使用的样本数据的质量,即在拼接出能够写入训练数据库的文本音频对时,还需要考虑拼接前的文本音频对之间的音色和韵律等问题;如果拼接前的两个文本音频对在音色和韵律等方面都不相同或相近,或音调起伏不一致,那么拼接出的文本音频对就存在着音频片段不匹配的问题,以及文本片段前后文语义不一致的问题,就无法用于训练模型。
基于此,为了能够拼接出高质量的文本音频对(能够写入训练数据库的样本数据),以训练出预测精准度较高的模型,本申请在拼接文本音频对前,将计算各个文本音频对中的音频片段的音频特征,之后基于该音频特征在多个文本音频对中选择出能够拼接的文本音频对,实现将音调、韵律等属性相近的文 本音频对进行拼接,以得到音频片段连续、文本片段语义一致的文本音频对,从而得到高质量的文本音频对,以用于后续训练模型使用。
其中,所述音频特征包括但不限于表征音频片段的基音频率的特征,音频帧的特征和/或音频帧能量的特征等,通过文本音频对中音频片段的音频特征即可分析出各个需要拼接的文本音频对是否适合拼接,即通过基音频率特征、音频帧的特征和/或音频帧能量特征,确定需要拼接的文本音频对之间的音调、韵律等属性是否相近或相同,实现通过所述音频特征在所述多个文本音频对中筛选出所述拼接文本音频对;所述目标文本音频对具体是指基准文本音频对,所述拼接文本音频对即为满足与所述基准文本音频对进行拼接条件的文本音频对。
基于此,在得到所述多个文本音频对之后,为了能够得到能够相互拼接的文本音频对(即文本音频对之间音色、韵律相近或相同),以生成较多的样本数据,将计算各个文本音频对中音频片段的音频特征,并且在确定所述目标文本音频对之后,将基于目标文本音频对中音频片段的音频特征和多个文本音频对中各个文本音频对的音频片段的音频特征,在所述多个文本音频对中筛选出与所述目标文本音频对对应的拼接文本音频对,用于后续生成样本数据,实现在拼接出大量的样本数据时,不仅满足了样本数据的数量要求,还能够结合音频特征保证拼接前文本音频对之间的相近程度,进而提高拼接后的文本音频对的质量。
进一步的,在计算每个文本音频对的音频片段的音频特征的过程中,为了能够通过音频特征充分的反映出每个文本音频对中音频片段的属性特征,可以对每个文本音频对中音频片段进行分帧处理,通过音频帧分析出所述音频特征,本实施例中,具体实现方式如下所述:
提取所述多个文本音频对中每个文本音频对的音频片段,并对每个文本音频对的音频片段进行分帧处理,获得每个文本音频对的音频帧集合;
基于所述多个文本音频对中每个文本音频对的音频帧集合包含的音频帧,计算每个文本音频对的音频片段的基音频率特征和音频帧特征;
根据每个文本音频对的音频片段的所述基音频率特征和所述音频帧特征,确定每个文本音频对的音频片段的所述音频特征。
具体的,所述基音频率特征具体是指音频片段中发音体发出一系列频率、振幅各不相同的振动中,频率最低的振动所对应的频率数值;所述音频帧特征具体是指所述音频片段中音频帧经过傅里叶变换后在频谱上的点经过计算得到帧能量数值;相应的,所述基音频率特征可以用于分析文本音频对在进行拼 接时,彼此之间的发音振动幅度是否相近或相同;所述音频帧特征可以用于分析文本音频对在进行拼接时,彼此之间的能量分布是否相近或相同;以实现通过基音频率和帧能量选择拼接后效果较好的文本音频对进行拼接,以得到满足使用需求的样本数据。
基于此,首先提取所述每个文本音频对的音频片段,并对每个文本音频对的音频片段进行分帧处理,获得每个文本音频对的音频帧集合,其次基于音频帧集合中包含的音频帧计算每个文本音频对中音频片段的基音频率特征和音频帧特征,最后根据每个文本音频对的音频片段的基音频率特征和音频帧特征,即可确定每个文本音频对的音频片段的音频特征。
并且,由于任意两个文本音频对都存在拼接的可能,为了能够挖掘出更多可以写入训练数据库中的文本音频对,可以在计算所述音频特征时,计算每个文本音频对的音频片段的起始音频特征(起始基音频率特征和起始音频帧特征)和结束音频特征(结束基音频率特征和结束音频帧特征),之后在从所述多个文本音频对中筛选与所述目标文本音频对对应的拼接文本音频对时,可以将所述目标文本音频对作为起始文本音频对,之后基于目标文本音频对中音频片段的结束音频特征和每个文本音频对中音频片段的起始音频特征,经过计算后筛选出所述拼接文本音频对,此后再对目标文本音频对和所述拼接文本音频对进行拼接时,则可以将所述目标文本音频对作为起始文本音频对,所述拼接文本音频对作为结束文本音频对,将二者按照先后顺序进行拼接,以得到后续需要进行检测的待检测文本音频对。
还可以将所述目标文本音频对作为结束文本音频对,之后基于目标文本音频对中音频片段的起始音频特征和每个文本音频对中音频片段的结束音频特征,经过计算后筛选出所述拼接文本音频对,此后再对目标文本音频对和所述拼接文本音频对进行拼接时,则可以将所述目标文本音频对作为结束文本音频对,所述拼接文本音频对作为起始文本音频对,将二者按照先后顺序进行拼接,以得到后续需要进行检测的待检测文本音频对;并且在此过程中,已经将目标文本音频对作为起始文本音频对和结束文本音频对与其他文本音频进行了可能性的拼接,当对其他文本音频对进行拼接时,就可以省略掉与目标文本音频对拼接的处理过程,以提高后续拼接处理过程中的处理效率。
具体实施时,计算所述基音频率特征可以通过时域估计法实现,即直接由音频波形来估计基音频率,如自相关法、并行处理法、平均幅度差法或数据减少法;还可以通过变换法实现,即将音频的语音信号变换到频域或者时域来估计基音频率,首先利用同态分析方法将声道的影响消除,得到属于激励部分的信息,然后再进行基音频率的计算,如倒谱法;再或者可以通过混合方法实现, 即先提取信号声道模型参数,然后利用它对信号进行滤波,得到音源序列,最后再利用自相关法或者平均幅度差法计算所述基音频率;计算所述音频片段的基音频率特征可以根据实际应用场景选择合适的方法实现,本实施例在此不作任何限定。
计算所述音频帧特征可以通过对每帧音频进行傅里叶变换,得到音频片段对应的频谱,之后统计频谱中各点的取值,并进行平方后相加即可得到各个音频帧的能量,最后再取平均值即可得到所述音频帧特征;或者将音频帧在复数域上的长度相加,得到每帧音频对应的帧能量,最后再取平均值即可得到所述音频帧特征;计算所述音频片段的音频帧特征可以根据实际应用场景选择合适的方法实现,本实施例在此不作任何限定。
此外,在对所述每个文本音频对中音频片段进行分帧时,可以按照固定的长度帧长进行分帧处理,如按照32ms或64ms进行分帧,具体帧长可以根据实际需求进行设定,本实施在此不作任何限定。
沿用上例,在得到文本音频对TA 1~TA 5的基础上,进一步的,为了后续能够拼接出大量满足质量要求的样本数据,可以提前计算各个文本音频对中音频片段的起始基音频率和起始帧能量,以及结束基音频率和结束帧能量;基于此,首先提起各个文本音频对中的音频片段,并对各个音频片段进行分帧处理,得到五个音频帧集合,分别对应文本音频对TA 1~TA 5;之后按照变换法和五个音频帧集合,计算出第一音频片段的起始基音频率F0 s1=N s1,结束基音频率F0 e1=N e1;第二音频片段的起始基音频率F0 s2=N s2,结束基音频率F0 e2=N e2;第三音频片段的起始基音频率F0 s3=N s3,结束基音频率F0 e3=N e3;第四音频片段的起始基音频率F0 s4=N s4,结束基音频率F0 e4=N e4;第五音频片段的起始基音频率F0 s5=N s5,结束基音频率F0 e5=N e5;同时按照傅里叶变换后统计频谱中点取值平方和的方法和五个音频帧集合,计算出第一音频片段的起始帧能量E s1=M s1,结束帧能量E e1=M e1;第二音频片段的起始帧能量E s2=M s2,结束帧能量E e2=M e2;第三音频片段的起始帧能量E s3=M s3,结束帧能量E e3=M e3;第四音频片段的起始帧能量E s4=M s4,结束帧能量E e4=M e4;第五音频片段的起始帧能量E s5=M s5,结束帧能量E e5=M e5
进一步的,根据各个音频片段的起始/结束基音频率和起始/结束帧能量即可确定各个音频片段对应的音频特征,得出第一音频片段的起始音频特征为(F0 s1=N s1,E s1=M s1),结束音频特征为(F0 e1=N e1,E e1=M e1);第二音频片段的起始音频特征为(F0 s2=N s2,E s2=M s2),结束音频特征为(F0 e2=N e2,E e2=M e2);第三音频片段的起始音频特征为(F0 s3=N s3,E s3=M s3),结束音频特征为(F0 e3=N e3,E e3=M e3);第四音频片段的起始音频特征为(F0 s4=N s4,E s4=M s4), 结束音频特征为(F0 e4=N e4,E e4=M e4);第五音频片段的起始音频特征为(F0 s5=N s5,E s5=M s5),结束音频特征为(F0 e5=N e5,E e5=M e5);以用于后续在筛选拼接文本音频对时,可以根据音频特征筛选出拼接效果较高的文本音频对进行拼接处理。
除此之外,在计算每个文本音频对中音频片段的音频特征时,为了能够提高计算效率,还可以按照前向连接和后向连接的方式实现高效计算,即在确定需要计算任意一个音频片段的音频特征时,可以选择该音频片段的前面连接的音频片段和后面连接的音频片段组成相邻音频片段,对任意一个音频片段和相邻音频片段中的两个音频片段同时计算各自的音频特征,以实现节省计算音频特征所消耗的时间,提供训练数据库更新的效率。
综上,为了能够在后续拼接出高质量的样本数据,将提前计算出各个文本音频对中音频片段的音频特征,实现在属性维度分析各文本音频对中音频片段的属性信息,从而在筛选与目标文本音频对对应的拼接文本音频对时,可以结合音频特征筛选出拼接后效果较好的文本音频对作为所述拼接文本音频对,以提高样本数据的质量。
更进一步的,在计算完成所述文本音频对中音频片段的音频特征之后,将根据音频特征在所述文本音频对中筛选出目标文本音频对和所述拼接文本音频对,以用于后续的拼接处理,以得到满足写入需求的样本数据,本实施例中,具体实现方式如图4所示的步骤S1042~步骤S1052。
步骤S1042,将所述多个文本音频对中每个文本音频对的音频片段、文本片段和音频特征进行整合,获得每个文本音频对对应的文本音频包,并写入片段数据库;
步骤S1044,在所述片段数据库中选择任意一个文本音频包作为目标文本音频包,并将所述目标文本音频包中的文本音频对确定为所述目标文本音频对;
步骤S1046,在所述片段数据库中选择除所述目标文本音频包外的文本音频包组成待筛选文本音频包集合;
步骤S1048,将所述待筛选文本音频包集合中包含的各个待筛选文本音频包的文本音频对确定为待筛选文本音频对;
步骤S1050,基于所述目标文本音频对的音频片段的音频特征和所述待筛选文本音频对的音频片段的音频特征,在所述待筛选文本音频包集合中筛选出所述拼接文本音频包;
步骤S1052,将所述拼接文本音频包中的文本音频对作为所述拼接文本音频对。
具体的,所述文本音频包具体是指写入片段数据库的文本音频对和其对应的文本特征组成的集合,所述片段数据库具体是指临时存储文本音频对中文本片段、音频片段和其对应的音频特征的数据库,当获得所述多个文本音频对之后,由于后续针对所述目标文本音频对筛选关联的拼接文本音频对时还需要消耗一定时间,因此可以将所述文本音频包写入所述片段数据库,当需要进行拼接处理时,再从所述片段数据库中提取文本音频对进行后续的拼接处理即可。
进一步的,所述待筛选文本音频包集合中包含的待筛选文本音频包具体是指片段数据库中除所述目标文本音频包之外的其他文本音频包,相应的,所述待筛选文本音频对即为所述待筛选文本音频包中包含的文本音频对;所述拼接文本音频包具体是指可以与所述目标文本音频包进行拼接的文本音频对所属的文本音频包。
基于此,首先将所述多个文本音频对中每个文本音频的音频片段、文本片段和音频特征进行整合,得到多个文本音频包,并临时写入所述片段数据库;之后在需要进行文本音频对拼接时,从所述片段数据库中选择任意一个文本音频包作为目标文本音频包,并提取所述目标文本音频包中包含的文本音频对作为所述目标文本音频对,同时再从所述片段数据库中选择除所述目标文本音频包之外的其他文本音频包作为待筛选文本音频包,并组成待筛选文本音频包集合。
其次,提取所述待筛选文本音频包集合包含的各个待筛选文本音频包中的文本音频对作为所述待筛选文本音频对,并根据各个待筛选文本音频包中整合的音频特征确定各个待筛选文本音频对中音频片段的音频特征。
最后,基于目标文本音频对中音频片段的音频特征和各个待筛选文本音频对中音频片段的音频特征,即可计算出目标文本音频对与各个待筛选文本音频对的匹配程度,之后选择匹配程度较高的待筛选文本音频对所属的文本音频包作为所述拼接文本音频包即可,即将所述拼接文本音频包中的文本音频对作为与所述目标文本音频对对应的拼接文本音频对,以用于后续对二者进行拼接,得到满足写入训练数据库需求的样本数据。
更进一步的,在基于目标文本音频对中音频片段的音频特征和待筛选文本音频对中音频片段的音频特征,筛选所述拼接文本音频包的过程中,为了能够筛选出与所述目标文本音频对匹配度较高的文本音频对,本实施例中,可以采用如下的方式得到拼接文本音频包,从而将所述拼接文本音频包中的文本音频对作为所述拼接文本音频对,以用于后续与目标文本音频对进行拼接,得到满足写入训练数据库的样本数据,具体实现方式如下所述:
根据所述目标文本音频包确定所述目标文本音频对的音频片段的第一音频特征,以及根据所述待筛选文本音频包确定所述待筛选文本音频对的音频片段的第二音频特征;
计算所述第一音频特征和所述第二音频特征之间的特征距离;
将所述特征距离小于预设距离阈值的待筛选文本音频对所属的待筛选文本音频包确定为所述拼接文本音频包。
具体的,所述第一音频特征即为所述目标文本音频对中音频片段的音频特征,所述第二音频特征即为所述待筛选文本音频对中音频片段的音频特征,相应的,所述特征距离具体是指评价文本音频对之间匹配程度的数值,所述特征距离越大,表明所述文本音频对之间的匹配程度越低,反之,所述特征距离越小,表明所述文本音频对之间的匹配程度越高。
基于此,在确定所述述目标文本音频对的音频片段的第一音频特征,,以及所述待筛选文本音频对的第二音频片段的音频特征的基础上,可以根据所述第一音频特征和所述第二音频特征,计算出所述目标文本音频对与各个待筛选文本音频对之间的特征距离,之后选择特征距离小于预设距离阈值的待筛选文本音频对作为所述拼接文本音频对即可,以用于后续拼接处理。
在计算所述特征距离的过程中,可以采用如下公式(1)实现:
L = (F0_e - F0_s)² + (E_e - E_s)²      (1)
其中，L表示特征距离，F0_e表示目标文本音频对中音频片段的结束基音频率特征，F0_s表示待筛选文本音频对中音频片段的起始基音频率特征，E_e表示目标文本音频对中音频片段的结束音频帧特征，E_s表示待筛选文本音频对中音频片段的起始音频帧特征。
沿用上例,在得到文本音频对TA 1~TA 5,以及各个文本音频对中音频片段的音频特征之后,可以将文本音频对及其对应的音频特征整合为文本音频包(TP 1~TP 5)写入片段数据库D中,以用于后续拼接时可以从中选择文本音频包进行拼接处理。
进一步的,在片段数据库D选择文本音频包TP 1作为目标文本音频包,此时确定文本音频包TP 1中包含的文本音频对TA 1为目标文本音频对;同时在片段数据库D中选择文本音频包TP 2、TP 3、TP 4、TP 5作为待筛选文本音频包,则将各个待筛选文本音频包中的文本音频对TA 2、TA 3、TA 4、TA 5作为待筛选文本音频对,并且根据文本音频包TP 1可以确定目标文本音频对TA 1的音频特征为[(F0 s1=N s1,E s1=M s1),(F0 e1=N e1,E e1=M e1)];根据文本音频包TP 2可以确定待筛选文本音频对TA 2的音频特征为[(F0 s2=N s2,E s2=M s2),(F0 e2=N e2, E e2=M e2)];根据文本音频包TP 3可以确定待筛选文本音频对TA 3的音频特征为[(F0 s3=N s3,E s3=M s3),(F0 e3=N e3,E e3=M e3)];根据文本音频包TP 4可以确定待筛选文本音频对TA 4的音频特征为[(F0 s4=N s4,E s4=M s4),(F0 e4=N e4,E e4=M e4)];根据文本音频包TP 5可以确定待筛选文本音频对TA 5的音频特征为[(F0 s5=N s5,E s5=M s5),(F0 e5=N e5,E e5=M e5)]。
更进一步的,根据上述公式(1)计算目标文本音频对与各个待筛选文本音频对的特征距离,确定目标文本音频对TA 1与待筛选文本音频对TA 2的特征距离L 1=(F0 e1-F0 s2) 2+(E e1-E s2) 2=(N e1-N s2) 2+(M e1-M s2) 2;确定目标文本音频对TA 1与待筛选文本音频对TA 3的特征距离L 2=(F0 e1-F0 s3) 2+(E e1-E s3) 2=(N e1-N s3) 2+(M e1-M s3) 2;确定目标文本音频对TA 1与待筛选文本音频对TA 4的特征距离L 3=(F0 e1-F0 s4) 2+(E e1-E s4) 2=(N e1-N s4) 2+(M e1-M s4) 2;确定目标文本音频对TA 1与待筛选文本音频对TA 5的特征距离L 4=(F0 e1-F0 s5) 2+(E e1-E s5) 2=(N e1-N s5) 2+(M e1-M s5) 2
最后,由于特征距离越小表明目标文本音频对与待筛选文本音频对的匹配程度越高,故在将特征距离L 1~L 4分别与预设距离阈值L T进行比较时,选择小于距离阈值L T的待筛选文本音频对作为能够与目标文本音频对TA 1拼接的拼接文本音频对,通过比较结果确定特征距离L 1、L 3和L 4小于距离阈值L T,进一步表明在将目标文本音频对TA 1与待筛选文本音频对TA 2、TA 4和TA 5进行拼接时,可以保证彼此之间的音色、韵律都是较为接近的,以满足后续可以拼接出质量较高的样本数据,即确定待筛选文本音频对TA 2、TA 4和TA 5可以与目标文本音频对TA 1进行拼接,则将文本音频对TA 2、TA 4和TA 5确定为目标文本音频对TA 1的拼接文本音频对。
此外,为了能够提高后续计算其他文本音频对作为目标文本音频对时特征距离的计算效率,可以将所述目标文本音频对作为后向文本音频对,待筛选文本音频对作为前向文本音频对,再计算彼此之间的特征距离。如在计算目标文本音频对TA 1与待筛选文本音频对TA 2的特征距离L 1时,还可以计算目标文本音频对TA 1与待筛选文本音频对TA 2的特征距离L 11,其中L 11=(F0 e2-F0 s1) 2+(E e2-E s1) 2=(N e2-N s1) 2+(M e2-M s1) 2,特征距离L 11表征在将待筛选文本音频对TA 2作为拼接后的前向文本音频对,以及将目标文本音频对TA 1作为拼接后的后向文本音频对时二者之间的匹配程度;实现在计算目标文本音频对与各个待筛选文本音频对的特征距离时,将目标文本音频对作为前向文本音频对和后向文本音频对的特征距离都计算完成,节省在将待筛选文本音频对TA 2作为目标文本音频对时,计算文本音频对TA 1与文本音频对TA 2特征距离的计算过程,从而提高后续计算特征距离的效率。
综上,为了能够从所述多个文本音频对中筛选出与所述目标文本音频对对应的拼接文本音频对,将结合音频特征对所述拼接文本音频对进行筛选,从而实现筛选出的文本音频对与所述目标文本音频对在音色、韵律等属性上是彼此接近的,可以在后续拼接出满足使用需求的待检测文本音频对,以实现对所述训练数据库进行扩充,供下游业务所使用。
步骤S106,将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,并对所述待检测文本音频对进行检测。
具体的,在上述基于所述音频特征得到与所述目标文本音频对对应的拼接文本音频对的基础上,进一步的,将对所述目标文本音频对和所述拼接文本音频对进行拼接,得到所述待检测文本音频对,并且由于所述待检测文本音频对是由两个文本音频对拼接而成,为了能够进一步保证写入所述训练数据库中的文本音频对是有质量保证的(拼接后的待检测文本音频对效果较好),在写入所述训练数据库之前,还可以对所述待检测文本音频对进行检测,检测所述待检测文本音频对中音频片段是否清晰,文本片段长度是否合适等等,从而得到质量更佳的文本音频对写入所述训练数据库。
而在此之前,由于所述目标文本音频对可能满足写入所述训练数据库的要求,即无需与其他文本音频对进行拼接即可写入所述训练数据库,因此为了提高所述训练数据库的丰富程度,以及避免遗漏可以写入所述训练数据库的样本数据,可以在进行拼接处理前,判断目标文本音频对是否满足预设检测条件,本实施例中,具体实现方式如下所述:
对所述目标文本音频对中的音频片段进行采样处理获得目标采样信息,以及确定所述目标文本音频对中的文本片段的目标文本信息;
判断所述目标采样信息和所述目标文本信息是否满足所述预设检测条件;
若否,则执行将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,并对所述待检测文本音频对进行检测的步骤。
若是,则将所述目标文本音频对写入所述训练数据库。
具体的,所述目标采样信息具体是指对所述目标文本音频对中音频片段进行随机采样时的采样位数和采样频率,所述采样位数可以理解为处理音频片段时的解析度,采样位数越大,解析度越高,音频片段的真实性也就越高,反之采样位数越小,解析度越低,音频片段的真实性也就越低;所述采样频率是指在一秒内对音频片段的采样次数,采样频率越高音频片段的还原就越真实越自然,反之采样频率越低音频片段的还原就越不真实越不自然;所述目标文本信息具体是指所述目标文本音频对中文本片段的长度信息,字符数量信息等;相 应的,所述预设检测条件具体是指检测音频片段和文本片段是否符合写入训练数据库的条件,可以在文本音频对中的音频片段和文本片段都满足预设检测条件的情况下,将其写入所述训练数据库,也可以在文本音频对中的音频片段或文本片段满足预设检测条件的情况下,将其写入所述训练数据库。
基于此,在得到所述目标文本音频对和其对应的拼接文本音频对的基础上,进一步的,将对所述目标文本音频对中的音频片段在[0,1]之间进行随机采样处理,得到所述目标采样信息,同时确定所述目标文本音频对中文本片段的目标文本信息,之后判断所述目标采样信息和所述目标文本信息是否满足所述预设检测条件,若是,说明所述目标文本音频对已经满足了写入所述训练数据库的要求,则可以直接将所述目标文本音频对写入所述训练数据库,作为下游业务的样本数据;若否,说明所述目标文本音频对并未满足写入所述训练数据库的要求,则可以执行步骤S106,再对拼接后的待检测文本音频对进行检测,从而实现均衡写入训练数据库中文本音频对中音频片段和文本片段,保证训练数据库中文本音频对中文本片段和音频片段在形式上(音频长度,文本长度或音频能量等)都是相近或相同的,更加方便下游业务使用。
沿用上例,在从文本音频对TA 1~TA 5中选择文本音频对TA 1作为目标文本音频对的基础上,进一步的,此时将对文本音频对TA 1中的第一音频片段(0s~3s)在[0,1]之间进行随机采样,得到第一音频片段的目标采样信息为U,同时确定文本音频对TA 1中的第一文本片段(我看了)的长度为X个字符,此时判断第一音频片段的目标采样信息U是否大于预设采样Ut,且第一文本片段的长度X是否小于预设文本长度Xt,若是,说明文本音频对TA 1中的第一音频片段和第一文本片段都满足了写入训练数据库T的要求,则此时可以将文本音频对TA 1写入训练数据库T,作为下游业务训练语音合成模型时的样本数据;若否,说明文本音频对TA 1中的第一音频片段或第一文本片段无法满足写入训练数据库T的要求,则此时可以将拼接文本音频对TA 2、TA 4和TA 5与文本音频对TA 1进行拼接,得到多个待检测文本音频对,再通过对待检测文本音频对进行检测,以实现得到满足写入训练数据库T的要求的文本音频对。
综上,在对拼接文本音频对和目标文本音频对进行拼接之前,可以对所述目标文本音频对进行检测,以避免遗漏满足写入训练数据库的文本音频对,进而提高了训练数据库的丰富程度。
进一步的,在拼接所述目标文本音频对和所述拼接文本音频对的过程中,由于各文本音频对中包含文本片段和音频片段,因此需要拼接文本片段的同时拼接音频片段,以生成所述待检测文本音频段,本实施例中,具体实现方式如下所述:
提取所述目标文本音频对中的目标文本片段和目标音频片段,以及所述拼接文本音频对中的拼接文本片段和拼接音频片段;
将所述目标文本片段和所述拼接文本片段拼接为待检测文本片段,以及将所述目标音频片段和所述拼接音频片段拼接为待检测音频片段;
基于所述待检测文本片段和所述待检测音频片段组成所述待检测文本音频对。
具体的,首先提取所述目标文本音频对中的目标文本片段和目标音频片段,同时提取所述拼接文本音频对中的拼接文本片段和拼接音频片段,其次将所述目标文本片段和所述拼接文本片段拼接为待检测文本片段,以及将所述目标音频片段和所述拼接音频片段拼接为待检测音频片段;最后基于所述待检测文本片段和所述待检测音频片段即可组成所述待检测文本音频对。
更进一步的,在对所述待检测文本音频对进行检测的过程中,为了能够保证待检测文本音频对的质量,不仅可以对所述待检测文本音频对中的待检测文本片段进行检测,还可以同时对待检测文本音频对中的待检测音频片段进行检测,从而保证写入训练数据库的文本音频对中的文本片段和音频片段都是满足写入要求的,本实施例中,具体实现方式如下所述:
对所述待检测音频片段进行采样处理获得待检测采样信息,以及确定所述待检测文本片段的待检测文本信息;
基于所述预设检测条件对所述待检测采样信息和所述待检测文本信息进行检测;
在所述待检测采样信息和所述待检测文本信息均满足所述预设检测条件的情况下,将所述待检测文本音频对写入所述训练数据库。
具体的,在拼接出所述待检测文本音频对的基础上,进一步的,对所述待检测文本音频对中的待检测音频片段进行随机采样处理,得到所述待检测音频片段的待检测采样信息,同时确定所述待检测文本音频对中所述待检测文本片段的待检测文本信息,之后基于所述预设的检测条件对所述待检测采样信息和所述待检测文本信息进行检测;若所述待检测采样信息和所述待检测文本信息均满足所述预设检测条件,说明所述待检测文本音频对可以写入所述训练数据库,则将所述待检测文本音频对作为样本数据写入所述训练数据库即可,若所述待检测采样信息或所述待检测文本信息不满足所述预设检测条件的情况下,说明所述待检测文本音频对不可以写入所述训练数据库,则舍弃所述待检测文本音频对即可。
沿用上例,在确定目标文本音频对TA 1和拼接文本音频对TA 2、TA 4和TA 5 的基础上,进一步的,将对目标文本音频对和拼接文本音频对进行拼接处理,即提取目标文本音频对TA 1中的第一音频片段(0s~3s)和第一文本片段(我看了),同时提取拼接文本音频对TA 2中的第二音频片段(3s~4s)和第二文本片段(一场),提取拼接文本音频对TA 4中的第四频片段(6s~8s)和第四文本片段(足球),以及提取拼接文本音频对TA 5中第五音频片段(8s~10s)和第五文本片段(比赛)。
进一步的,将第一音频片段和第二音频片段进行拼接,得到第一待检测音频片段(长度为4s),将第一音频片段和第四音频片段进行拼接,得到第二待检测音频片段(长度为5s),以及将第一音频片段和第五音频片段进行拼接,得到第三待检测音频片段(长度为5s);同时将第一文本片段和第二文本片段进行拼接,得到第一待检测文本片段(我看了一场),将第一文本片段和第四文本片段进行拼接,得到第二待检测文本片段(我看了足球),将第一文本片段和第五文本片段进行拼接,得到第三待检测文本片段(我看了比赛);此时将第一待检测音频片段和第一待检测文本片段组合为第一待检测文本音频对,将第二待检测音频片段和第二待检测文本片段组合为第二待检测文本音频对,将第三待检测音频片段和第三待检测文本片段组合为第三待检测文本音频对。
更进一步的,在得到第一待检测文本音频对、第二待检测文本音频对和第三待检测文本音频对的基础上,进一步的将对上述三个待检测文本音频对进行检测,选择能够写入训练数据库的文本音频对作为样本数据,基于此,分别对各个待检测文本音频对中的待检测音频片段在[0,1]之间进行随机采样,确定第一待检测文本音频对中第一待检测音频片段的采样结果为U 1,确定第二待检测文本音频对中第二待检测音频片段的采样结果为U 2,确定第三待检测文本音频对中第三待检测音频片段的采样结果为U 3;同时确定第一待检测文本音频对中第一待检测文本片段的文本长度为X 1,确定第二待检测文本音频对中第二待检测文本片段的文本长度为X 2,确定第三待检测文本音频对中第三待检测文本片段的文本长度为X 3
最后,分别判断采样结果U 1、U 2和U 3是否大于预设采样结果Ut,且文本长度X 1、X 2和X 3是否小于预设文本长度Xt,根据判断结果确定采样结果U 2大于预设采样结果Ut,且文本长度X 2小于预设文本长度Xt,以及采样结果U 3大于预设采样结果Ut,且文本长度X 3小于预设文本长度Xt,即第二待检测文本音频对和第三待检测文本音频对符合写入训练数据库T,则将第二待检测文本音频对(音频5s,文本“我看了足球”),以及第三待检测文本音频对(音频5s,文本“我看了比赛”)作为样本数据写入训练数据库T,以用于后续训练语音合成模型使用。
综上,通过音频维度和文本维度对所述待检测文本音频对进行检测,实现了写入所述训练数据库中的文本音频对都是满足写入要求的,有效的提高了所述训练数据库中样本数据的质量。
步骤S108,在所述待检测文本音频对满足预设检测条件的情况下,将所述待检测文本音频对写入训练数据库。
具体的,在所述待检测文本音频对满足预设检测条件的情况下,说明所述待检测文本音频对满足写入所述训练数据库的要求,则将所述待检测文本音频对作为样本数据写入所述训练数据库即可,以用于后续在训练语音合成模型时,可以从训练数据库中提取满足训练要求的样本数据,提高训练出的语音合成模型的预测精准度。
基于此,通过对多个文本中的各个文本进行上述处理,即可得到大量符合写入训练数据库的样本数据,并且写入所述训练数据库中的样本数据无论是数量和质量都满足下游训练模型的需求,从而实现在训练语音合成模型前,可以节省数据准备阶段的成本,提高样本数据的丰富程度。
此外,在向所述训练数据库写入样本数据的过程中,考虑到数据库的容量以及下游业务的需求,可以对写入训练数据库的数据条数进行限制,即在向所述训练数据库写入满足预设检测条件的待检测文本音频对时,检测所述训练数据库中文本音频对的条数是否小于等于预设数据量阈值,若小于,说明训练数据库还可以继续写入文本音频对,则将满足预设检测条件的文本音频对写入所述训练数据库即可,若大于,说明训练数据库无法继续写入文本音频对,则停止后续拼接文本音频对的处理即可。
并且在可以向所述训练数据库中写入文本音频对时,为了避免重复存储占用过多的存储资源,还可以在写入训练数据库前,检测训练数据库中是否存在该文本音频对,若存在,则舍弃,继续进行其他文本音频对的拼接处理即可,若不存在,则写入训练数据库后,再继续其他文本音频对的拼接处理即可。
进一步的,在完成所述训练数据库的扩增之后,此时所述训练数据库中的文本音频对即可作为样本数据(样本文本音频对)对下游业务中的语音合成模型进行训练,本实施例中,训练模型的过程如下所述:
在所述训练数据库中提取样本文本音频对,所述样本文本音频对中包含样本文本片段和样本音频片段;
基于所述样本文本片段和所述样本音频片段对语音合成模型进行训练,获得目标语音合成模型。
实际应用中,在对所述语音合成模型进行训练的过程中,可以从所述训练 数据库中提取大量的样本文本音频对,之后基于样本文本音频对中样本文本片段和样本音频片段对所述语音合成模型进行训练,直至得到满足训练停止条件的语音合成模型即可,将其作为目标语音合成模型进行存储,以用于在语音合成场景中可以将文本转换为音频。如文本为“我喜欢看足球比赛”,将该文本输入至语音合成模型进行处理,即可得到该文本对应的音频,实现将文本转换为语音的处理。
此外,在拼接后的待检测文本音频对未满足预设检测条件的情况下,说明所述待检测文本音频对中的待检测音频片段或待检测文本片段可能不满足所述预设检测条件,为了能够得到满足条件的样本数据,此时可以根据所述音频特征在所述多个文本音频对中筛选出与所述拼接文本音频对对应的多度拼接文本音频对,再进行拼接后检测处理,直至得到满足写入训练数据库的文本音频对,需要说明的是,如果一直进行拼接检测处理,所得到的待检测文本音频对可能一直无法满足写入训练数据库的要求,因此在持续进行拼接和检测的过程中,可以设置停止条件,当拼接次数达到一定条件的情况下,即可停止对该文本音频对进行处理,将其舍弃即可,本实施例中,具体实现方式如下所述:
在所述待检测文本音频对未满足预设检测条件的情况下,根据所述音频特征在所述多个文本音频对中筛选出所述拼接文本音频对对应的多度拼接文本音频对;
将所述多度拼接文本音频对和所述待检测文本音频对拼接为多度待检测文本音频对,并判断所述多度待检测文本音频对是否满足所述预设检测条件;
若是,则将所述多度待检测文本音频对写入所述训练数据库;
若否,则将所述多度拼接文本音频对作为所述拼接文本音频对,以及将所述多度待检测文本音频对作为所述待检测文本音频对,并执行所述根据所述音频特征在所述多个文本音频对中筛选出所述拼接文本音频对对应的多度拼接文本音频对步骤。
具体的,所述多度拼接文本音频对具体是指与所述拼接文本音频对可以进行拼接的文本音频对;基于此,在所述待检测文本音频对未满足预设检测条件的情况下,说明目标文本音频对和所述拼接文本音频对拼接后的待检测文本音频对不符合写入所述训练数据库的要求,为了能够得到满足写入要求的文本音频对,可以再根据音频特征在所述多个文本音频对中选择能够与拼接文本音频对进行拼接的多度拼接文本音频对,之后将待检测文本音频对和多度拼接文本音频对进行拼接,得到多度待检测文本音频对,再对所述多度待检测文本音频对进行检测,若所述多度待检测文本音频对满足预设检测条件,将其写入所述 训练数据库即可,若所述多度待检测文本音频对不满足预设检测条件,可以将多度待检测文本音频对作为待检测文本音频对,以及将多度拼接文本音频对作为拼接文本音频对,返回执行筛选多度拼接文本音频对的过程,直至得到满足写入训练数据库需求的文本音频对即可,或者直至达到停止拼接条件后舍弃该文本音频对即可。
沿用上例,在得到第一待检测文本音频对不符合写入训练数据库T的要求情况下,由于第一待检测文本音频对中的第一待检测音频片段是由第一音频片段和第二音频片段组成,以及第一待检测文本片段是由第一文本片段和第二为本片段组成,因此可以选择与第二文本音频对TA 2具有拼接可能的第三文本音频对TA 3作为多度拼接文本音频对,之后将多度拼接文本音频对TA 3与第一待检测文本音频对(TA 1+TA 2)进行拼接,得到多度待检测文本音频对(TA 1+TA 2+TA 3)。
同时,确定多度待检测文本音频对中的多度待检测音频片段由第一音频片段、第二音频片段和第三音频片段组成,多度待检测文本音频对中的多度待检测文本片段由第一文本片段、第二文本片段和第三文本片段组成,即多度待检测文本音频对为(音频片段6s,文本片段“我看了一场精彩的”),此时对多度待检测文本音频对进行检测,若该多度待检测文本音频对满足预设检测条件,则将多度待检测文本音频对写入训练数据库T中即可,若多度待检测文本音频对不满足预设检测条件,可以再选择与第三文本音频对TA 3具有拼接可能的文本音频对进行拼接和检测处理,或者舍弃该多度待检测文本音频对,选择其他文本音频对进行上述处理,以得到满足写入训练数据库T的样本数据即可。
综上,通过循环拼接的方式保证所述训练数据库中样本数据的均衡性,不仅可以方便下游业务在训练模型时使用,还能够提高所述训练数据库的丰富程度,进而有效的保证了下游业务的使用需求。
本说明书提供一种样本生成方法,在获取到多个文本音频对后,计算所述多个文本音频对中每个文本音频对的音频片段的音频特征,并根据所述音频特征在所述多个文本音频对中筛选出目标文本音频对,以及所述目标文本音频对对应的拼接文本音频对,之后将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,并对所述待检测文本音频对进行检测,在所述待检测文本音频对满足预设检测条件的情况下,将所述待检测文本音频对写入所述训练数据库,实现在样本数据准备阶段,可以通过拼接的方式得到质量高且满足下游业务使用需求的样本数据,节省数据准备阶段的资源消耗成本,并且拼接后写入所述训练数据库中的样本数据的数据量较多,有效的解决了下游业务样本数据量少、样本数据中音频长短分布不均匀导致语音合成效果不佳的问题, 从而提高下游业务的业务处理效率。
下述结合附图5,以本说明书提供的样本生成方法在语音合成场景中的应用为例,对所述样本生成方法进行进一步说明。其中,图5示出了本说明书一实施例提供的一种应用于语音合成场景中的样本生成方法的处理流程图,具体包括以下步骤:
步骤S502,获取目标文本以及目标文本对应的音频。
实际应用中,在基于神经网络的端到端语音合成方法中,由于该方法的特性,需要在模型训练前准备部分质量较高的样本数据,以实现训练出满足使用需求的语音合成模型;而这部分样本数据的往往需要在专业的录音棚中进行录制后,再进行修剪和整理后才能够用于训练模型,不仅需要消耗较多的时间才能够完成数据的准备,而且成本消耗上也是一笔较大的开销;同时由于样本数据的要求较为严格,导致最终能够用于训练模型的数据少之更少,进而无法得到覆盖长度和韵律较为全面的样本数据,从而影响语音合成时的音色不像、韵律(音调起伏)不自然等问题。因此在样本数据的准备阶段如何生成质量高且属性较为丰富的样本数据是亟需解决的问题。
本实施例提供一种应用于语音合成场景中的样本生成方法,以解决上述问题。
步骤S504,对音频进行预处理获得目标音频,并将目标文本转换为音素序列。
步骤S506,将音素序列与目标音频进行对齐处理,根据对齐处理结果得到音素音频文件,并确定音素音频文件的切分位置。
步骤S508,按照切分位置对音素音频文件进行切分获得多个音素音频对,并基于目标文本确定多个音素音频对中的每个音素音频对的音素片段对应的文本片段。
步骤S510,根据每个音素音频对中音素片段对应的文本片段,以及每个音素音频对中的音频片段生成多个文本音频对。
步骤S512,提取多个文本音频对中每个文本音频对的音频片段,并对每个文本音频对的音频片段进行分帧处理,获得每个文本音频对的音频帧集合。
步骤S514,基于多个文本音频对中每个文本音频对的音频帧集合包含的音频帧,计算每个文本音频对的音频片段的基音频率特征和音频帧特征。
步骤S516,将多个文本音频对中每个文本音频对的音频片段、文本片段、基音频率特征和音频帧特征进行整合,获得每个文本音频对对应的文本音频包,并写入片段数据库。
步骤S518,在片段数据库中选择任意一个文本音频包作为目标文本音频包,并将目标文本音频包中的文本音频对确定为目标文本音频对。
步骤S520,在片段数据库中选择除目标文本音频包外的文本音频包组成待筛选文本音频包集合。
步骤S522,将待筛选文本音频包集合中包含的各个待筛选文本音频包的文本音频对确定为待筛选文本音频对。
步骤S524,根据目标文本音频包确定目标文本音频对的音频片段的基音频率特征和音频帧特征,以及根据待筛选文本音频包确定待筛选文本音频对的音频片段的基音频率特征和音频帧特征。
步骤S526,基于目标文本音频对的音频片段的基音频率特征和音频帧特征,以及待筛选文本音频对的音频片段的基音频率特征和音频帧特征计算特征距离。
步骤S528,将特征距离小于预设距离阈值的待筛选文本音频对所属的待筛选文本音频包确定为拼接文本音频包。
步骤S530,将拼接文本音频包中的文本音频对作为拼接文本音频对。
步骤S532,提取目标文本音频对中的目标文本片段和目标音频片段,以及拼接文本音频对中的拼接文本片段和拼接音频片段。
步骤S534,将目标文本片段和拼接文本片段拼接为待检测文本片段,以及将目标音频片段和拼接音频片段拼接为待检测音频片段。
步骤S536,基于待检测文本片段和待检测音频片段组成待检测文本音频对。
步骤S538,对待检测文本音频对中的待检测音频片段进行采样处理获得待检测采样信息,以及确定待检测文本音频对中的待检测文本片段的待检测文本信息。
步骤S540,在待检测采样信息和待检测文本信息均满足预设检测条件的情况下,将待检测文本音频对写入训练数据库。
本说明书提供一种样本生成方法,实现在样本数据准备阶段,可以通过拼接的方式得到质量高且满足下游业务使用需求的样本数据,节省数据准备阶段的资源消耗成本,并且拼接后写入所述训练数据库中的样本数据的数据量较多,有效的解决了下游业务样本数据量少、样本数据中音频长短分布不均匀导致语音合成效果不佳的问题,从而提高下游业务的业务处理效率。
与上述方法实施例相对应,本说明书还提供了样本生成装置实施例,图5 示出了本说明书一实施例提供的一种样本生成装置的结构示意图。如图5所示,该装置包括:
获取模块602,被配置为获取多个文本音频对,其中每个文本音频对中包含文本片段和音频片段;
计算模块604,被配置为计算所述多个文本音频对中每个文本音频对的音频片段的音频特征,并根据所述音频特征在所述多个文本音频对中筛选出目标文本音频对和所述目标文本音频对对应的拼接文本音频对;
拼接模块606,被配置为将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,并对所述待检测文本音频对进行检测;
写入模块608,被配置为在所述待检测文本音频对满足预设检测条件的情况下,将所述待检测文本音频对写入训练数据库。
一个可选的实施例中,所述获取模块602进一步被配置为:
获取目标文本以及所述目标文本对应的音频;对所述音频进行预处理获得目标音频,并将所述目标文本转换为音素序列;将所述音素序列与所述目标音频进行对齐处理,并根据对齐处理结果生成所述多个文本音频对。
一个可选的实施例中,所述获取模块602进一步被配置为:
根据对齐处理结果得到音素音频文件,并确定所述音素音频文件的切分位置;按照所述切分位置对所述音素音频文件进行切分,获得多个音素音频对,其中每个音素音频对中包含音素片段和音频片段;基于所述目标文本确定所述多个音素音频对中的每个音素音频对的音素片段对应的文本片段;根据每个音素音频对中音素片段对应的文本片段,以及每个音素音频对中的音频片段生成所述多个文本音频对。
一个可选的实施例中,所述计算模块604进一步被配置为:
提取所述多个文本音频对中每个文本音频对的音频片段,并对每个文本音频对的音频片段进行分帧处理,获得每个文本音频对的音频帧集合;基于所述多个文本音频对中每个文本音频对的音频帧集合包含的音频帧,计算每个文本音频对的音频片段的基音频率特征和音频帧特征;根据每个文本音频对的音频片段的所述基音频率特征和所述音频帧特征,确定每个文本音频对的音频片段的所述音频特征。
一个可选的实施例中,所述计算模块604进一步被配置为:
将所述多个文本音频对中每个文本音频对的音频片段、文本片段和音频特征进行整合,获得每个文本音频对对应的文本音频包,并写入片段数据库;在所述片段数据库中选择任意一个文本音频包作为目标文本音频包,并将所述目 标文本音频包中的文本音频对确定为所述目标文本音频对;基于所述片段数据库中除所述目标文本音频包外的文本音频包和所述音频特征确定拼接文本音频包,并将所述拼接文本音频包中的文本音频对作为所述拼接文本音频对。
一个可选的实施例中,所述计算模块604进一步被配置为:
在所述片段数据库中选择除所述目标文本音频包外的文本音频包组成待筛选文本音频包集合;将所述待筛选文本音频包集合中包含的各个待筛选文本音频包的文本音频对确定为待筛选文本音频对;基于所述目标文本音频对的音频片段的音频特征和所述待筛选文本音频对的音频片段的音频特征,在所述待筛选文本音频包集合中筛选出所述拼接文本音频包。
一个可选的实施例中,所述计算模块604进一步被配置为:
根据所述目标文本音频包确定所述目标文本音频对的音频片段的第一音频特征,以及根据所述待筛选文本音频包确定所述待筛选文本音频对的音频片段的第二音频特征;计算所述第一音频特征和所述第二音频特征之间的特征距离;将所述特征距离小于预设距离阈值的待筛选文本音频对所属的待筛选文本音频包确定为所述拼接文本音频包。
一个可选的实施例中,所述样本生成装置,还包括:
采样模块,被配置为对所述目标文本音频对中的音频片段进行采样处理获得目标采样信息,以及确定所述目标文本音频对中的文本片段的目标文本信息;判断所述目标采样信息和所述目标文本信息是否满足所述预设检测条件;
若否,则运行所述拼接模块606。
一个可选的实施例中,若所述采样模块的判断结果为是,则将所述目标文本音频对写入所述训练数据库。
一个可选的实施例中,所述拼接模块606进一步被配置为:
提取所述目标文本音频对中的目标文本片段和目标音频片段,以及所述拼接文本音频对中的拼接文本片段和拼接音频片段;将所述目标文本片段和所述拼接文本片段拼接为待检测文本片段,以及将所述目标音频片段和所述拼接音频片段拼接为待检测音频片段;基于所述待检测文本片段和所述待检测音频片段组成所述待检测文本音频对。
一个可选的实施例中,所述拼接模块606进一步被配置为:
对所述待检测音频片段进行采样处理获得待检测采样信息,以及确定所述待检测文本片段的待检测文本信息;基于所述预设检测条件对所述待检测采样信息和所述待检测文本信息进行检测;
相应的,所述写入模块608进一步被配置为:
在所述待检测采样信息和所述待检测文本信息均满足所述预设检测条件的情况下,将所述待检测文本音频对写入所述训练数据库。
一个可选的实施例中,所述样本生成装置,还包括:
筛选模块,被配置为在所述待检测文本音频对未满足预设检测条件的情况下,根据所述音频特征在所述多个文本音频对中筛选出所述拼接文本音频对对应的多度拼接文本音频对;将所述多度拼接文本音频对和所述待检测文本音频对拼接为多度待检测文本音频对,并判断所述多度待检测文本音频对是否满足所述预设检测条件;
若是,则将所述多度待检测文本音频对写入所述训练数据库;
若否,则将所述多度拼接文本音频对作为所述拼接文本音频对,以及将所述多度待检测文本音频对作为所述待检测文本音频对,并运行所述筛选模块。
一个可选的实施例中,所述样本生成装置,还包括:
训练模块,被配置为在所述训练数据库中提取样本文本音频对,所述样本文本音频对中包含样本文本片段和样本音频片段;基于所述样本文本片段和所述样本音频片段对语音合成模型进行训练,获得目标语音合成模型。
本实施例提供的样本生成装置,在获取到多个文本音频对后,计算所述多个文本音频对中每个文本音频对的音频片段的音频特征,并根据所述音频特征在所述多个文本音频对中筛选出目标文本音频对,以及所述目标文本音频对对应的拼接文本音频对,之后将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,并对所述待检测文本音频对进行检测,在所述待检测文本音频对满足预设检测条件的情况下,将所述待检测文本音频对写入所述训练数据库,实现在样本数据准备阶段,可以通过拼接的方式得到质量高且满足下游业务使用需求的样本数据,节省数据准备阶段的资源消耗成本,并且拼接后写入所述训练数据库中的样本数据的数据量较多,有效的解决了下游业务样本数据量少、样本数据中音频长短分布不均匀导致语音合成效果不佳的问题,从而提高下游业务的业务处理效率。
上述为本实施例的一种样本生成装置的示意性方案。需要说明的是,该样本生成装置的技术方案与上述的样本生成方法的技术方案属于同一构思,样本生成装置的技术方案未详细描述的细节内容,均可以参见上述样本生成方法的技术方案的描述。
图7示出了根据本说明书一实施例提供的一种计算设备700的结构框图。该计算设备700的部件包括但不限于存储器710和处理器720。处理器720与 存储器710通过总线730相连接,数据库750用于保存数据。
计算设备700还包括接入设备740,接入设备740使得计算设备700能够经由一个或多个网络760通信。这些网络的示例包括公用交换电话网(PSTN)、局域网(LAN)、广域网(WAN)、个域网(PAN)或诸如因特网的通信网络的组合。接入设备740可以包括有线或无线的任何类型的网络接口(例如,网络接口卡(NIC))中的一个或多个,诸如IEEE502.11无线局域网(WLAN)无线接口、全球微波互联接入(Wi-MAX)接口、以太网接口、通用串行总线(USB)接口、蜂窝网络接口、蓝牙接口、近场通信(NFC)接口,等等。
在本说明书的一个实施例中,计算设备700的上述部件以及图7中未示出的其他部件也可以彼此相连接,例如通过总线。应当理解,图7所示的计算设备结构框图仅仅是出于示例的目的,而不是对本说明书范围的限制。本领域技术人员可以根据需要,增添或替换其他部件。
计算设备700可以是任何类型的静止或移动计算设备,包括移动计算机或移动计算设备(例如,平板计算机、个人数字助理、膝上型计算机、笔记本计算机、上网本等)、移动电话(例如,智能手机)、可佩戴的计算设备(例如,智能手表、智能眼镜等)或其他类型的移动设备,或者诸如台式计算机或PC的静止计算设备。计算设备700还可以是移动式或静止式的服务器。
其中,处理器720用于执行如下计算机可执行指令:
获取多个文本音频对,其中每个文本音频对中包含文本片段和音频片段;
计算所述多个文本音频对中每个文本音频对的音频片段的音频特征,并根据所述音频特征在所述多个文本音频对中筛选出目标文本音频对和所述目标文本音频对对应的拼接文本音频对;
将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,并对所述待检测文本音频对进行检测;
在所述待检测文本音频对满足预设检测条件的情况下,将所述待检测文本音频对写入训练数据库。
上述为本实施例的一种计算设备的示意性方案。需要说明的是,该计算设备的技术方案与上述的样本生成方法的技术方案属于同一构思,计算设备的技术方案未详细描述的细节内容,均可以参见上述样本生成方法的技术方案的描述。
本说明书一实施例还提供一种计算机可读存储介质,其存储有计算机指令,该指令被处理器执行时以用于:
获取多个文本音频对,其中每个文本音频对中包含文本片段和音频片段;
计算所述多个文本音频对中每个文本音频对的音频片段的音频特征,并根据所述音频特征在所述多个文本音频对中筛选出目标文本音频对和所述目标文本音频对对应的拼接文本音频对;
将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,并对所述待检测文本音频对进行检测;
在所述待检测文本音频对满足预设检测条件的情况下,将所述待检测文本音频对写入训练数据库。
上述为本实施例的一种计算机可读存储介质的示意性方案。需要说明的是,该存储介质的技术方案与上述的样本生成方法的技术方案属于同一构思,存储介质的技术方案未详细描述的细节内容,均可以参见上述样本生成方法的技术方案的描述。
上述对本说明书特定实施例进行了描述。其它实施例在所附权利要求书的范围内。在一些情况下,在权利要求书中记载的动作或步骤可以按照不同于实施例中的顺序来执行并且仍然可以实现期望的结果。另外,在附图中描绘的过程不一定要求示出的特定顺序或者连续顺序才能实现期望的结果。在某些实施方式中,多任务处理和并行处理也是可以的或者可能是有利的。
所述计算机指令包括计算机程序代码,所述计算机程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。所述计算机可读介质可以包括:能够携带所述计算机程序代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、电载波信号、电信信号以及软件分发介质等。需要说明的是,所述计算机可读介质包含的内容可以根据司法管辖区内立法和专利实践的要求进行适当的增减,例如在某些司法管辖区,根据立法和专利实践,计算机可读介质不包括电载波信号和电信信号。
需要说明的是,对于前述的各方法实施例,为了简便描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本说明书并不受所描述的动作顺序的限制,因为依据本说明书,某些步骤可以采用其它顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定都是本说明书所必须的。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其它实施例的相关描述。
以上公开的本说明书优选实施例只是用于帮助阐述本说明书。可选实施例并没有详尽叙述所有的细节,也不限制该发明仅为所述的具体实施方式。显然, 根据本说明书的内容,可作很多的修改和变化。本说明书选取并具体描述这些实施例,是为了更好地解释本说明书的原理和实际应用,从而使所属技术领域技术人员能很好地理解和利用本说明书。本说明书仅受权利要求书及其全部范围和等效物的限制。

Claims (16)

  1. 一种样本生成方法,其特征在于,包括:
    获取多个文本音频对,其中每个文本音频对中包含文本片段和音频片段;
    计算所述多个文本音频对中每个文本音频对的音频片段的音频特征,并根据所述音频特征在所述多个文本音频对中筛选出目标文本音频对和所述目标文本音频对对应的拼接文本音频对;
    将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,并对所述待检测文本音频对进行检测;
    在所述待检测文本音频对满足预设检测条件的情况下,将所述待检测文本音频对写入训练数据库。
  2. 根据权利要求1所述的样本生成方法,其特征在于,所述获取多个文本音频对,包括:
    获取目标文本以及所述目标文本对应的音频;
    对所述音频进行预处理获得目标音频,并将所述目标文本转换为音素序列;
    将所述音素序列与所述目标音频进行对齐处理,并根据对齐处理结果生成所述多个文本音频对。
  3. 根据权利要求2所述的样本生成方法,其特征在于,所述根据对齐处理结果生成所述多个文本音频对,包括:
    根据对齐处理结果得到音素音频文件,并确定所述音素音频文件的切分位置;
    按照所述切分位置对所述音素音频文件进行切分,获得多个音素音频对,其中每个音素音频对中包含音素片段和音频片段;
    基于所述目标文本确定所述多个音素音频对中的每个音素音频对的音素片段对应的文本片段;
    根据每个音素音频对中音素片段对应的文本片段,以及每个音素音频对中的音频片段生成所述多个文本音频对。
  4. 根据权利要求1所述的样本生成方法,其特征在于,所述计算所述多个文本音频对中每个文本音频对的音频片段的音频特征,包括:
    提取所述多个文本音频对中每个文本音频对的音频片段,并对每个文本音频对的音频片段进行分帧处理,获得每个文本音频对的音频帧集合;
    基于所述多个文本音频对中每个文本音频对的音频帧集合包含的音频帧,计算每个文本音频对的音频片段的基音频率特征和音频帧特征;
    根据每个文本音频对的音频片段的所述基音频率特征和所述音频帧特征,确定每个文本音频对的音频片段的所述音频特征。
  5. 根据权利要求1所述的样本生成方法,其特征在于,所述根据所述音频特征在所述多个文本音频对中筛选出目标文本音频对和所述目标文本音频对对应的拼接文本音频对,包括:
    将所述多个文本音频对中每个文本音频对的音频片段、文本片段和音频特征进行整合,获得每个文本音频对对应的文本音频包,并写入片段数据库;
    在所述片段数据库中选择任意一个文本音频包作为目标文本音频包,并将所述目标文本音频包中的文本音频对确定为所述目标文本音频对;
    基于所述片段数据库中除所述目标文本音频包外的文本音频包和所述音频特征确定拼接文本音频包,并将所述拼接文本音频包中的文本音频对作为所述拼接文本音频对。
  6. 根据权利要求5所述的样本生成方法,其特征在于,所述基于所述片段数据库中除所述目标文本音频包外的文本音频包和所述音频特征确定拼接文本音频包,包括:
    在所述片段数据库中选择除所述目标文本音频包外的文本音频包组成待筛选文本音频包集合;
    将所述待筛选文本音频包集合中包含的各个待筛选文本音频包的文本音频对确定为待筛选文本音频对;
    基于所述目标文本音频对的音频片段的音频特征和所述待筛选文本音频对的音频片段的音频特征,在所述待筛选文本音频包集合中筛选出所述拼接文本音频包。
  7. 根据权利要求6所述的样本生成方法,其特征在于,所述基于所述目标文本音频对的音频片段的音频特征和所述待筛选文本音频对的音频片段的音频特征,在所述待筛选文本音频包集合中筛选出所述拼接文本音频包,包括:
    根据所述目标文本音频包确定所述目标文本音频对的音频片段的第一音频特征,以及根据所述待筛选文本音频包确定所述待筛选文本音频对的音频片段的第二音频特征;
    计算所述第一音频特征和所述第二音频特征之间的特征距离;
    将所述特征距离小于预设距离阈值的待筛选文本音频对所属的待筛选文本音频包确定为所述拼接文本音频包。
  8. 根据权利要求1所述的样本生成方法,其特征在于,所述将所述目标 文本音频对和所述拼接文本音频对拼接为待检测文本音频对,并对所述待检测文本音频对进行检测步骤执行之前,还包括:
    对所述目标文本音频对中的音频片段进行采样处理获得目标采样信息,以及确定所述目标文本音频对中的文本片段的目标文本信息;
    判断所述目标采样信息和所述目标文本信息是否满足所述预设检测条件;
    若否,则执行将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,并对所述待检测文本音频对进行检测的步骤。
  9. 根据权利要求8所述的样本生成方法,其特征在于,若所述判断所述采样信息和所述文本信息是否满足所述预设检测条件的判断结果为是,则将所述目标文本音频对写入所述训练数据库。
  10. 根据权利要求1所述的样本生成方法,其特征在于,所述将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,包括:
    提取所述目标文本音频对中的目标文本片段和目标音频片段,以及所述拼接文本音频对中的拼接文本片段和拼接音频片段;
    将所述目标文本片段和所述拼接文本片段拼接为待检测文本片段,以及将所述目标音频片段和所述拼接音频片段拼接为待检测音频片段;
    基于所述待检测文本片段和所述待检测音频片段组成所述待检测文本音频对。
  11. 根据权利要求10所述的样本生成方法,其特征在于,所述对所述待检测文本音频对进行检测,包括:
    对所述待检测音频片段进行采样处理获得待检测采样信息,以及确定所述待检测文本片段的待检测文本信息;
    基于所述预设检测条件对所述待检测采样信息和所述待检测文本信息进行检测;
    相应的,所述在所述待检测文本音频对满足预设检测条件的情况下,将所述待检测文本音频对写入训练数据库,包括:
    在所述待检测采样信息和所述待检测文本信息均满足所述预设检测条件的情况下,将所述待检测文本音频对写入所述训练数据库。
  12. 根据权利要求1所述的样本生成方法,其特征在于,所述将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,并对所述待检测文本音频对进行检测步骤执行之后,还包括:
    在所述待检测文本音频对未满足预设检测条件的情况下,根据所述音频特 征在所述多个文本音频对中筛选出所述拼接文本音频对对应的多度拼接文本音频对;
    将所述多度拼接文本音频对和所述待检测文本音频对拼接为多度待检测文本音频对,并判断所述多度待检测文本音频对是否满足所述预设检测条件;
    若是,则将所述多度待检测文本音频对写入所述训练数据库;
    若否,则将所述多度拼接文本音频对作为所述拼接文本音频对,以及将所述多度待检测文本音频对作为所述待检测文本音频对,并执行所述根据所述音频特征在所述多个文本音频对中筛选出所述拼接文本音频对对应的多度拼接文本音频对步骤。
  13. 根据权利要求1所述的样本生成方法,其特征在于,所述将所述待检测文本音频对写入训练数据库步骤执行之后,还包括:
    在所述训练数据库中提取样本文本音频对,所述样本文本音频对中包含样本文本片段和样本音频片段;
    基于所述样本文本片段和所述样本音频片段对语音合成模型进行训练,获得目标语音合成模型。
  14. 一种样本生成装置,其特征在于,包括:
    获取模块,被配置为获取多个文本音频对,其中每个文本音频对中包含文本片段和音频片段;
    计算模块,被配置为计算所述多个文本音频对中每个文本音频对的音频片段的音频特征,并根据所述音频特征在所述多个文本音频对中筛选出目标文本音频对和所述目标文本音频对对应的拼接文本音频对;
    拼接模块,被配置为将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,并对所述待检测文本音频对进行检测;
    写入模块,被配置为在所述待检测文本音频对满足预设检测条件的情况下,将所述待检测文本音频对写入训练数据库。
  15. 一种计算设备,其特征在于,包括:
    存储器和处理器;
    所述存储器用于存储计算机可执行指令,所述处理器用于执行所述计算机可执行指令,以实现下述方法:
    获取多个文本音频对,其中每个文本音频对中包含文本片段和音频片段;
    计算所述多个文本音频对中每个文本音频对的音频片段的音频特征,并根据所述音频特征在所述多个文本音频对中筛选出目标文本音频对和所述目标 文本音频对对应的拼接文本音频对;
    将所述目标文本音频对和所述拼接文本音频对拼接为待检测文本音频对,并对所述待检测文本音频对进行检测;
    在所述待检测文本音频对满足预设检测条件的情况下,将所述待检测文本音频对写入训练数据库。
  16. 一种计算机可读存储介质,其存储有计算机指令,其特征在于,该指令被处理器执行时实现权利要求1至13任意一项所述样本生成方法的步骤。
PCT/CN2021/130459 2020-11-20 2021-11-12 样本生成方法及装置 WO2022105693A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
KR1020237017827A KR20230079503A (ko) 2020-11-20 2021-11-12 샘플 생성 방법 및 장치
US18/253,717 US11810546B2 (en) 2020-11-20 2021-11-12 Sample generation method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011309190.7 2020-11-20
CN202011309190.7A CN112133277B (zh) 2020-11-20 2020-11-20 样本生成方法及装置

Publications (1)

Publication Number Publication Date
WO2022105693A1 true WO2022105693A1 (zh) 2022-05-27

Family

ID=73852445

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/130459 WO2022105693A1 (zh) 2020-11-20 2021-11-12 样本生成方法及装置

Country Status (4)

Country Link
US (1) US11810546B2 (zh)
KR (1) KR20230079503A (zh)
CN (1) CN112133277B (zh)
WO (1) WO2022105693A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118366478A (zh) * 2024-06-19 2024-07-19 中国科学院自动化研究所 基于音素间隔序列的生成音频鉴别与生成区域定位方法

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112133277B (zh) * 2020-11-20 2021-02-26 北京猿力未来科技有限公司 样本生成方法及装置
CN112686041B (zh) * 2021-01-06 2024-06-04 北京猿力未来科技有限公司 一种拼音标注方法及装置
CN112863530B (zh) * 2021-01-07 2024-08-27 广州欢城文化传媒有限公司 一种声音作品的生成方法和装置
CN113241054B (zh) * 2021-05-10 2023-03-21 北京声智科技有限公司 语音平滑处理模型生成方法、语音平滑处理方法及装置
CN113658581B (zh) * 2021-08-18 2024-03-01 北京百度网讯科技有限公司 声学模型的训练、语音处理方法、装置、设备及存储介质
CN114694629B (zh) * 2022-04-08 2024-09-10 思必驰科技股份有限公司 用于语音合成的语音数据扩增方法及系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060229876A1 (en) * 2005-04-07 2006-10-12 International Business Machines Corporation Method, apparatus and computer program providing a multi-speaker database for concatenative text-to-speech synthesis
CN105336322A (zh) * 2015-09-30 2016-02-17 百度在线网络技术(北京)有限公司 多音字模型训练方法、语音合成方法及装置
CN109817198A (zh) * 2019-03-06 2019-05-28 广州多益网络股份有限公司 用于语音合成的多发音训练方法、语音合成方法与装置
CN110310626A (zh) * 2019-05-23 2019-10-08 平安科技(深圳)有限公司 语音训练数据生成方法、装置、设备及可读存储介质
CN112133277A (zh) * 2020-11-20 2020-12-25 北京猿力未来科技有限公司 样本生成方法及装置

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6618699B1 (en) * 1999-08-30 2003-09-09 Lucent Technologies Inc. Formant tracking based on phoneme information
US7983919B2 (en) * 2007-08-09 2011-07-19 At&T Intellectual Property Ii, L.P. System and method for performing speech synthesis with a cache of phoneme sequences
CA2841883A1 (en) * 2011-07-25 2013-01-31 Frank RUDZICZ System and method for acoustic transformation
US9881631B2 (en) * 2014-10-21 2018-01-30 Mitsubishi Electric Research Laboratories, Inc. Method for enhancing audio signal using phase information
GB2544070B (en) * 2015-11-04 2021-12-29 The Chancellor Masters And Scholars Of The Univ Of Cambridge Speech processing system and method
US11961589B2 (en) * 2017-11-28 2024-04-16 Grail, Llc Models for targeted sequencing
US11170761B2 (en) * 2018-12-04 2021-11-09 Sorenson Ip Holdings, Llc Training of speech recognition systems
CN110428811B (zh) * 2019-09-17 2021-09-07 北京声智科技有限公司 一种数据处理方法、装置及电子设备
CN110689879B (zh) * 2019-10-10 2022-02-25 中国科学院自动化研究所 端到端语音转写模型的训练方法、系统、装置
CN111126001A (zh) * 2019-11-19 2020-05-08 深圳追一科技有限公司 文字标注方法、装置、设备及存储介质
US11514948B1 (en) * 2020-01-09 2022-11-29 Amazon Technologies, Inc. Model-based dubbing to translate spoken audio in a video
CN111862942B (zh) * 2020-07-28 2022-05-06 思必驰科技股份有限公司 普通话和四川话的混合语音识别模型的训练方法及系统

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060229876A1 (en) * 2005-04-07 2006-10-12 International Business Machines Corporation Method, apparatus and computer program providing a multi-speaker database for concatenative text-to-speech synthesis
CN105336322A (zh) * 2015-09-30 2016-02-17 百度在线网络技术(北京)有限公司 多音字模型训练方法、语音合成方法及装置
CN109817198A (zh) * 2019-03-06 2019-05-28 广州多益网络股份有限公司 用于语音合成的多发音训练方法、语音合成方法与装置
CN110310626A (zh) * 2019-05-23 2019-10-08 平安科技(深圳)有限公司 语音训练数据生成方法、装置、设备及可读存储介质
CN112133277A (zh) * 2020-11-20 2020-12-25 北京猿力未来科技有限公司 样本生成方法及装置

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118366478A (zh) * 2024-06-19 2024-07-19 中国科学院自动化研究所 基于音素间隔序列的生成音频鉴别与生成区域定位方法

Also Published As

Publication number Publication date
US20230317052A1 (en) 2023-10-05
CN112133277B (zh) 2021-02-26
KR20230079503A (ko) 2023-06-07
CN112133277A (zh) 2020-12-25
US11810546B2 (en) 2023-11-07

Similar Documents

Publication Publication Date Title
WO2022105693A1 (zh) 样本生成方法及装置
US10878803B2 (en) Speech conversion method, computer device, and storage medium
WO2020024690A1 (zh) 语音标注方法、装置及设备
US20180349495A1 (en) Audio data processing method and apparatus, and computer storage medium
US8731936B2 (en) Energy-efficient unobtrusive identification of a speaker
US8165874B2 (en) System, method, and program product for processing speech ratio difference data variations in a conversation between two persons
US8326610B2 (en) Producing phonitos based on feature vectors
BR112016025110B1 (pt) Gerenciamento de perfil de voz e geração de sinal de fala
CN103377651B (zh) 语音自动合成装置及方法
CN111433847A (zh) 语音转换的方法及训练方法、智能装置和存储介质
US9058384B2 (en) System and method for identification of highly-variable vocalizations
CN108877779B (zh) 用于检测语音尾点的方法和装置
Mittal et al. Study of characteristics of aperiodicity in Noh voices
CN104252872A (zh) 歌词生成方法和智能终端
JP2012108451A (ja) 音声処理装置および方法、並びにプログラム
CN109584888A (zh) 基于机器学习的鸣笛识别方法
CN112420015A (zh) 一种音频合成方法、装置、设备及计算机可读存储介质
CN106098081A (zh) 声音文件的音质识别方法及装置
Wei et al. RMVPE: A robust model for vocal pitch estimation in polyphonic music
CN114302301B (zh) 频响校正方法及相关产品
CN107025902B (zh) 数据处理方法及装置
CN114373478A (zh) 歌曲音频标注与对齐模型训练方法、设备及存储介质
CN109495786B (zh) 视频处理参数信息的预配置方法、装置及电子设备
CN111341298A (zh) 一种语音识别算法评分方法
CN115206345B (zh) 基于时频结合的音乐人声分离方法、装置、设备及介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21893840

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202317034541

Country of ref document: IN

ENP Entry into the national phase

Ref document number: 20237017827

Country of ref document: KR

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21893840

Country of ref document: EP

Kind code of ref document: A1