WO2020062680A1 - Waveform splicing method and apparatus based on double syllable mixing, and device, and storage medium - Google Patents

Waveform splicing method and apparatus based on double syllable mixing, and device, and storage medium

Info

Publication number
WO2020062680A1
WO2020062680A1 PCT/CN2018/124440 CN2018124440W WO2020062680A1 WO 2020062680 A1 WO2020062680 A1 WO 2020062680A1 CN 2018124440 W CN2018124440 W CN 2018124440W WO 2020062680 A1 WO2020062680 A1 WO 2020062680A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
phrase
text
syllable
word
Prior art date
Application number
PCT/CN2018/124440
Other languages
French (fr)
Chinese (zh)
Inventor
房树明
程宁
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020062680A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/06: Elementary speech units used in speech synthesisers; Concatenation rules
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; Stress or intonation
    • G10L2013/105: Duration
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present application relates to the field of speech splicing synthesis, and in particular to a method, a device, a computer device, and a storage medium for waveform splicing based on a two-syllable mashup.
  • Existing speech synthesis methods fall into two categories: those based on speech feature parameters and those based on waveform splicing. Compared with the parameter-based approach, speech synthesis based on waveform splicing yields higher-quality synthesized speech that sounds more natural and closer to the original voice of the speaker. For this reason, current mainstream online speech synthesis relies on waveform-splicing solutions.
  • Waveform splicing uses recordings of different lengths as the basic units of a speech database, from which speech of any length can be synthesized.
  • Splicing together the corresponding basic units from the sound library is a simple and effective way to generate very natural speech.
  • It is also less complex than all other speech synthesis schemes.
  • As a general principle, the longer the selected speech unit, the more natural the synthesized speech, but also the larger the speech database, which may become too large to cover the entire continuous pronunciation system within a given engineering cycle.
  • The technical problem to be solved by this application is the conflict, in the prior art, between the naturalness of synthesized speech and the need to keep the speech database small.
  • To this end, a method, a device, a computer device, and a storage medium for waveform splicing based on a two-syllable mashup are proposed, which can guarantee the synthesis of high-quality continuous speech and can cover the continuous pronunciation system of a specific scenario in a short time.
  • a method for waveform splicing based on a two-syllable mashup includes the following steps:
  • Sound library production: the standard audio of a two-syllable word is divided into front, middle, and back segments according to the vowels, and each segment is saved to the sound library as a primitive speech segment required for waveform splicing;
  • Text preprocessing: the text to be converted into speech is regularized, the regularized text is segmented according to the speaking rules to form phrases, and the pinyin and tones are marked;
  • Phrase waveform splicing: taking each phrase obtained by word segmentation as a unit, every two adjacent words in the phrase are treated as a two-syllable word to be converted; the front and middle primitive speech segments of the first two-syllable word to be converted in the phrase, the middle and back primitive speech segments of the last two-syllable word to be converted, and the middle primitive speech segment of every other two-syllable word to be converted are looked up in the sound library; and the found primitive speech segments are spliced in turn, following the order of the two-syllable words in the phrase, into an audio file of the phrase;
  • Text audio splicing: following the order of the phrases in the text to be converted into speech, the audio files of the phrases are directly spliced in turn into a speech file of the text.
  • This application also discloses a waveform splicing device based on a two-syllable mashup, including:
  • a sound library production module, which is used to divide the standard audio of a two-syllable word into front, middle, and back segments according to the vowels, each segment being saved to the sound library as a primitive speech segment required for waveform splicing;
  • a text preprocessing module, which is used to regularize the text to be converted into speech, segment the regularized text according to the speaking rules to form phrases, and mark the pinyin and tones;
  • a phrase waveform splicing module, which is used to take each phrase obtained by word segmentation as a unit, treat every two adjacent words in the phrase as a two-syllable word to be converted, look up in the sound library the front and middle primitive speech segments of the first two-syllable word to be converted in the phrase, the middle and back primitive speech segments of the last two-syllable word to be converted, and the middle primitive speech segment of every other two-syllable word to be converted, and splice the found primitive speech segments in turn, following the order of the two-syllable words in the phrase, into an audio file of the phrase;
  • a text audio splicing module, which is used to directly splice the audio files of the phrases in turn into a speech file of the text, following the order of the phrases in the text to be converted into speech.
  • the present application also discloses a computer device including a memory and a processor.
  • the memory stores a computer program, and when the computer program is executed by the processor, the following steps are implemented:
  • Sound library production: the standard audio of a two-syllable word is divided into front, middle, and back segments according to the vowels, and each segment is saved to the sound library as a primitive speech segment required for waveform splicing;
  • Text preprocessing: the text to be converted into speech is regularized, the regularized text is segmented according to the speaking rules to form phrases, and the pinyin and tones are marked;
  • Phrase waveform splicing: taking each phrase obtained by word segmentation as a unit, every two adjacent words in the phrase are treated as a two-syllable word to be converted; the front and middle primitive speech segments of the first two-syllable word to be converted in the phrase, the middle and back primitive speech segments of the last two-syllable word to be converted, and the middle primitive speech segment of every other two-syllable word to be converted are looked up in the sound library; and the found primitive speech segments are spliced in turn, following the order of the two-syllable words in the phrase, into an audio file of the phrase;
  • Text audio splicing: following the order of the phrases in the text to be converted into speech, the audio files of the phrases are directly spliced in turn into a speech file of the text.
  • the present application also discloses a computer-readable storage medium.
  • a computer program is stored in the computer-readable storage medium, and the computer program can be executed by at least one processor to implement the following steps:
  • Sound library production: the standard audio of a two-syllable word is divided into front, middle, and back segments according to the vowels, and each segment is saved to the sound library as a primitive speech segment required for waveform splicing;
  • Text preprocessing: the text to be converted into speech is regularized, the regularized text is segmented according to the speaking rules to form phrases, and the pinyin and tones are marked;
  • Phrase waveform splicing: taking each phrase obtained by word segmentation as a unit, every two adjacent words in the phrase are treated as a two-syllable word to be converted; the front and middle primitive speech segments of the first two-syllable word to be converted in the phrase, the middle and back primitive speech segments of the last two-syllable word to be converted, and the middle primitive speech segment of every other two-syllable word to be converted are looked up in the sound library; and the found primitive speech segments are spliced in turn, following the order of the two-syllable words in the phrase, into an audio file of the phrase;
  • Text audio splicing: following the order of the phrases in the text to be converted into speech, the audio files of the phrases are directly spliced in turn into a speech file of the text.
  • FIG. 1 shows a flowchart of Embodiment 1 of a method for waveform splicing based on a two-syllable mashup in the present application
  • FIG. 2 shows a flowchart of the text preprocessing step in the first embodiment of the method for waveform splicing based on a two-syllable mashup
  • FIG. 3 shows a flowchart of a second embodiment of a method for waveform splicing based on a two-syllable mashup
  • FIG. 4 shows the waveform of the original audio
  • FIG. 5 shows the waveform of the standard audio
  • FIG. 6 shows a structural diagram of a first embodiment of a waveform splicing device based on a two-syllable mashup in the present application
  • FIG. 7 is a structural diagram of a second embodiment of a waveform splicing device based on a two-syllable mashup in the present application.
  • FIG. 8 is a schematic diagram of a hardware architecture of an embodiment of a computer device of the present application.
  • This application proposes a waveform splicing method based on a two-syllable mashup.
  • In the first embodiment, the waveform splicing method based on a two-syllable mashup includes the following steps:
  • Step 10, sound library production: the standard audio of each two-syllable word is divided into front, middle, and back segments according to the vowels, and each segment is saved to the sound library as a primitive speech segment required for waveform splicing.
  • Standard audio here means audio that contains only the pronounced part.
  • During segmentation, the zero point to the left of the highest point in the middle of each Chinese vowel's sound waveform is used as the demarcation point. (When a professional customer-service speaker reads a two-syllable word aloud, the resulting sound wave can be displayed as a waveform; the vowel sound waveform is the part of that waveform corresponding to the vowel.)
  • The three audio segments obtained by segmentation are saved to the sound library as primitive speech segments.
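  • The demarcation rule can be sketched as follows. This is only an illustration under stated assumptions, not the applicants' implementation: the audio is assumed to be a list of signed PCM samples, the sample range of each vowel is assumed to be known already (the text does not say how vowels are located), and `find_split_point` and `split_three` are hypothetical names.

```python
def find_split_point(samples, vowel_start, vowel_end):
    """Return the index of the zero point left of the vowel's highest peak.

    samples: list of signed PCM sample values.
    vowel_start, vowel_end: assumed, externally supplied bounds of the vowel.
    """
    # Locate the highest point of the waveform inside the vowel.
    peak = max(range(vowel_start, vowel_end), key=lambda i: samples[i])
    # Walk left until the waveform touches or drops below zero.
    i = peak
    while i > vowel_start and samples[i] > 0:
        i -= 1
    return i


def split_three(samples, first_vowel, second_vowel):
    """Cut a two-syllable word into front, middle and back segments."""
    a = find_split_point(samples, *first_vowel)
    b = find_split_point(samples, *second_vowel)
    return samples[:a], samples[a:b], samples[b:]
```

  • Cutting at a zero point keeps the amplitude continuous where two segments later meet, which is presumably why the rule avoids cutting at an arbitrary sample.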
  • The file name of each primitive speech segment consists of the pinyin, tone, and position of the corresponding two-syllable word.
  • Tones are generally written as the numbers 1-4, representing the first to fourth tones, and the tone of each word directly follows the pinyin of that word.
  • The position indicates the order of the three audio segments after segmentation: the numbers 0-2 denote the first to third audio segments.
  • For example, the standard audio file for the two-syllable word "hello" (ni hao) is "ni2_hao3.wav"; the first split position is in the middle of the vowel of "ni" ("you"), and the second split position is in the middle of the vowel of "hao" ("good");
  • The three resulting audio segments are saved to the sound library as primitive speech segments.
  • The file names of the three primitive speech segments are "ni2_hao3_0.wav", "ni2_hao3_1.wav", and "ni2_hao3_2.wav".
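  • The naming convention is simple enough to express as a small helper. The function name `segment_filenames` is hypothetical; only the file-name pattern comes from the text.

```python
def segment_filenames(pinyin_a, pinyin_b):
    """Build the three sound-library file names for one two-syllable word.

    pinyin_a, pinyin_b: tone-numbered pinyin of each syllable, e.g. "ni2".
    """
    stem = f"{pinyin_a}_{pinyin_b}"
    # Position 0 = front, 1 = middle, 2 = back segment.
    return [f"{stem}_{position}.wav" for position in range(3)]


print(segment_filenames("ni2", "hao3"))
# ['ni2_hao3_0.wav', 'ni2_hao3_1.wav', 'ni2_hao3_2.wav']
```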
  • Step 20, text preprocessing: regularize the text to be converted into speech, segment the regularized text according to the speaking rules to form phrases, and mark the pinyin and tones.
  • The text preprocessing specifically includes the following three steps:
  • Step 21, text regularization: characters in the text that are neither Chinese nor English are converted according to preset processing rules, so that the text contains only Chinese characters, English letters, and spaces.
  • English uses its own speech waveform splicing method, which differs from the Chinese one.
  • This application addresses only Chinese speech waveform splicing.
  • The English part is therefore simply retained during text regularization.
  • The preset processing rule may specifically be to replace Arabic numerals with Chinese characters and punctuation marks with spaces. For example, the eleven-digit telephone number "13888886666" is processed as "幺三八八八八八六六六六" (yao san ba ba ba ba ba liu liu liu liu). Any letters in the text are left unprocessed.
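  • A minimal sketch of such a regularization rule is given below. The digit table is an assumption: it reads "1" as 幺 (yao), which matches the file name "yao1_san1_ba1.wav" used later in the text, but the readings of the other digits and the handling of repeated spaces are not specified by the source. `DIGITS` and `regularize` are hypothetical names.

```python
import re

# Assumed digit readings; "1" is read as 幺 (yao), as in telephone numbers.
DIGITS = dict(zip("0123456789", "零幺二三四五六七八九"))


def regularize(text):
    """Keep only Chinese characters, English letters and spaces."""
    # Replace each Arabic numeral with its Chinese character.
    text = "".join(DIGITS.get(ch, ch) for ch in text)
    # Replace everything that is not Chinese, an ASCII letter or a space.
    text = re.sub(r"[^\u4e00-\u9fffA-Za-z ]", " ", text)
    # Collapse the runs of spaces left behind by punctuation.
    return re.sub(r" +", " ", text).strip()
```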
  • Step 22, text segmentation: divide the text into several phrases according to the Chinese speaking rules, and add a space between phrases to indicate a pause.
  • The speaking rules are the rules by which sentences are broken up when Chinese is read aloud. Taking a telephone number consisting of an area code plus a 7- or 8-digit number as an example, speakers usually pause after the area code and split the 7- or 8-digit number into two parts with a pause in between; when reading text aloud, speakers usually pause at punctuation marks and within long sentences.
  • For example, the aforementioned telephone number "幺三八八八八八六六六六" becomes "幺三八 八八八八 六六六六" after segmentation.
  • Consecutive letters are treated as one phrase; for example, "一二三BC四五" (one two three BC four five) becomes "一二三 BC 四五" after segmentation.
  • Step 23, pinyin labeling: label the segmented text with pinyin and tones, where the tones are indicated by the numbers 1-4.
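  • Steps 22 and 23 can be illustrated for the telephone-number example as follows. Everything here is a sketch under assumptions: the 3-4-4 grouping is inferred from the example rather than stated as a rule, and `PINYIN` is a deliberately tiny, hypothetical lookup table whose tone numbers are copied from the file names used in this document; a real system would need a full lexicon and handling of polyphonic characters.

```python
# Hypothetical, deliberately tiny pronunciation table. The tone numbers are
# copied from the file names used in this document.
PINYIN = {"幺": "yao1", "三": "san1", "八": "ba1", "六": "liu1"}


def segment_mobile_number(chars):
    """Split an 11-character number into 3-4-4 groups (assumed grouping)."""
    return f"{chars[:3]} {chars[3:7]} {chars[7:]}"


def label_pinyin(segmented_text):
    """Label every space-separated phrase with tone-numbered pinyin."""
    return [[PINYIN[ch] for ch in phrase] for phrase in segmented_text.split()]
```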
  • Step 30, phrase waveform splicing: taking each phrase obtained by word segmentation as a unit, treat every two adjacent words in the phrase as a two-syllable word to be converted; look up in the sound library the front and middle primitive speech segments of the first two-syllable word to be converted in the phrase, the middle and back primitive speech segments of the last two-syllable word to be converted, and the middle primitive speech segment of every other two-syllable word to be converted; then, following the order of the two-syllable words in the phrase, splice the found primitive speech segments in turn into an audio file of the phrase.
  • The audio of each phrase obtained by word segmentation is the smallest audio file, and it is obtained by splicing several primitive speech segments.
  • A phrase here consists of several words and/or expressions spoken without a pause within a sentence. Because the primitive speech segments are cut from the audio of two-syllable words, the speech waveforms must be spliced pair by pair so that the speech blends together smoothly. Therefore, every two adjacent words in the phrase are taken as one two-syllable word to be converted; that is, a phrase of n words is divided into n - 1 two-syllable words to be converted, and the second word of each two-syllable word to be converted is the first word of the next one.
  • The n - 1 two-syllable words to be converted are sorted by their order in the phrase, which determines the first and the (n - 1)-th, that is, the last, two-syllable word to be converted.
  • When the phrase is divided into n - 1 two-syllable words to be converted, the pinyin and tones of the phrase are divided by the same rule, so that the n - 1 resulting pinyin-and-tone pairs correspond one-to-one to the n - 1 two-syllable words to be converted.
  • The marked pinyin and tones correspond one-to-one to the words in the phrase; that is, each word in the phrase is marked with one pinyin-and-tone pair.
  • For example, in the pinyin of the first phrase, reading from the first letter, the first "1" marks the end of the pinyin and tone of the first word "幺", which is "yao1"; then, starting from the next letter "s", the second "1" marks the end of the pinyin and tone of the second word "三" (three), which is "san1". The pinyin and tones of the first two-syllable word to be converted, "幺三", are therefore "yao1 san1".
  • The pinyin and tones of the second two-syllable word to be converted, "三八" (three-eight), are divided in the same way, which is not repeated here.
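  • The tone digit acts as the delimiter between syllables, so the division described above can be sketched with a regular expression followed by pairing of neighbours. The function names are hypothetical, and the pattern assumes lowercase ASCII pinyin with tone digits 1-4 only.

```python
import re


def split_pinyin(labelled):
    """Split e.g. "yao1san1ba1" at each tone digit."""
    return re.findall(r"[a-z]+[1-4]", labelled)


def adjacent_pairs(syllables):
    """Return every pair of adjacent syllables, in phrase order."""
    return list(zip(syllables, syllables[1:]))


syllables = split_pinyin("yao1san1ba1")
print(adjacent_pairs(syllables))
# [('yao1', 'san1'), ('san1', 'ba1')]
```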
  • Specifically, for each two-syllable word to be converted, obtain the pinyin and tones marked for it, and find in the sound library the primitive speech segments whose file names contain that pinyin and those tones.
  • Of these, the first two-syllable word takes its front and middle primitive speech segments, the last two-syllable word takes its middle and back primitive speech segments, and any other two-syllable word in between takes only its middle primitive speech segment. That is, a phrase of n words is composed of n + 1 primitive speech segments.
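  • The selection rule can be written down compactly. The sketch below is illustrative only; `segments_for_phrase` is a hypothetical name, and the behaviour for a phrase with a single pair (which is both first and last, so all three segments are taken) is an assumption that the text does not spell out.

```python
def segments_for_phrase(syllables):
    """List the sound-library files to splice for one phrase, in order."""
    pairs = list(zip(syllables, syllables[1:]))
    names = []
    for index, (a, b) in enumerate(pairs):
        positions = [1]                 # every pair contributes its middle
        if index == 0:
            positions.insert(0, 0)      # first pair also gives its front
        if index == len(pairs) - 1:
            positions.append(2)         # last pair also gives its back
        names += [f"{a}_{b}_{p}.wav" for p in positions]
    return names


print(segments_for_phrase(["yao1", "san1", "ba1"]))
# ['yao1_san1_0.wav', 'yao1_san1_1.wav', 'san1_ba1_1.wav', 'san1_ba1_2.wav']
```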
  • For example, the first phrase "幺三八" is divided into the two two-syllable words to be converted "幺三" and "三八"; the front and middle primitive speech segments of "幺三" and the middle and back primitive speech segments of "三八" are found in the sound library.
  • After these speech segments are spliced by waveform, the audio file of the first phrase "幺三八" is obtained.
  • According to the audio file naming rule (the file name is the pinyin and tones marked on the phrase, plus a suffix), the file name of this audio file is set to "yao1_san1_ba1.wav" for temporary storage.
  • The second phrase "八八八八" (eight eight eight eight) is divided into three two-syllable words to be converted, each of them "八八".
  • The front and middle primitive speech segments of the first are "ba1_ba1_0" and "ba1_ba1_1"; the middle primitive speech segment of the second is "ba1_ba1_1"; and the middle and back primitive speech segments of the third are "ba1_ba1_1" and "ba1_ba1_2".
  • The third phrase "六六六六" (six six six six) is likewise divided into three two-syllable words to be converted, each of them "六六".
  • The front and middle primitive speech segments of the first are "liu1_liu1_0" and "liu1_liu1_1"; the middle primitive speech segment of the second is "liu1_liu1_1"; and the middle and back primitive speech segments of the third are "liu1_liu1_1" and "liu1_liu1_2".
  • After waveform splicing, the audio file of the third phrase "六六六六" is obtained.
  • According to the audio file naming rule, the file name of this audio file is temporarily set to "liu1_liu1_liu1_liu1.wav".
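  • The splicing itself amounts to concatenating the sample data of the selected segments. A minimal sketch using Python's standard `wave` module follows; it assumes that every segment is an uncompressed WAV file with the same channel count, sample width, and sample rate, which the source does not state, and `splice_wavs` is a hypothetical name.

```python
import wave


def splice_wavs(segment_paths, output_path):
    """Concatenate WAV files that share one audio format into one file."""
    params = None
    frames = []
    for path in segment_paths:
        with wave.open(path, "rb") as src:
            if params is None:
                params = src.getparams()   # format taken from first segment
            frames.append(src.readframes(src.getnframes()))
    with wave.open(output_path, "wb") as dst:
        dst.setparams(params)   # header frame count is corrected on close
        dst.writeframes(b"".join(frames))
```

  • For the first phrase, the output path would be "yao1_san1_ba1.wav" and the inputs the four segment files selected by the rule above.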
  • Step 40, text audio splicing: following the order of the phrases in the text to be converted into speech, the audio files of the phrases are directly spliced in turn into the speech file of the text.
  • The audio files of the phrases can be spliced directly into the speech file of the text, but there should be a pause between phrases. Preferably, therefore, silence of a suitable length is inserted between the audio files of adjacent phrases as needed.
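  • Inserting the pause can be sketched as joining the raw sample data of the phrases with a block of zero-valued samples. The pause length is not given in the text, so `pause_ms` is a hypothetical parameter, and the sketch assumes mono, signed PCM data (16-bit or wider), for which silence is all-zero bytes.

```python
def join_with_silence(phrase_pcm, sample_rate, sample_width, pause_ms):
    """Join mono PCM byte strings, inserting silence between phrases.

    phrase_pcm: list of bytes objects, one per phrase, in text order.
    sample_width: bytes per sample (2 for 16-bit audio).
    """
    n_frames = sample_rate * pause_ms // 1000
    # For signed PCM, silence is a run of zero bytes.
    silence = b"\x00" * (n_frames * sample_width)
    return silence.join(phrase_pcm)
```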
  • In the second embodiment, the waveform splicing method based on a two-syllable mashup includes the following steps:
  • Step 01, audio recording: record the two-syllable words read aloud by a professional customer-service speaker, and save the recordings as original audio files, one per two-syllable word.
  • Because the audio files are used for waveform splicing, and Chinese has many words that sound the same but are written differently, such homophones need to be recorded only once in the original audio files.
  • For example, the two-syllable words "balance" and "jiejie" need only be recorded once.
  • The number of two-syllable words is thus determined by pinyin and tone: words with the same pinyin and tones are treated as the same two-syllable word when recording audio.
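  • One way to derive the recording list is to group candidate words by their pinyin-and-tone key and record each key once. The sketch below is illustrative; `recording_list` is a hypothetical name, and the sample words 权利 and 权力 (both read quan2 li4) are an example chosen here, not taken from the source.

```python
def recording_list(words):
    """Map each distinct pinyin-and-tone key to the words that share it.

    words: iterable of (written form, tone-numbered pinyin) pairs.
    """
    by_key = {}
    for text, pinyin in words:
        by_key.setdefault(pinyin, []).append(text)
    return by_key


todo = recording_list([
    ("权利", "quan2_li4"),
    ("权力", "quan2_li4"),   # homophone of the previous word
    ("你好", "ni2_hao3"),
])
print(len(todo))  # 2 recordings cover 3 written words
```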
  • Step 02, mute segment division: cut off the silent parts before and after the speech in the original audio file, and save the pronounced part as the standard audio of the two-syllable word.
  • The original audio contains silent parts.
  • Its waveform is shown in FIG. 4.
  • The part in the middle with large fluctuations is the pronounced part, and the parts at both ends with small fluctuations are the silent parts.
  • The waveform of the standard audio is shown in FIG. 5.
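  • The text does not say how the silent parts are detected. One common approach, shown here purely as an assumption, is an amplitude threshold: everything before the first and after the last sample that reaches the threshold is discarded. `trim_silence` and `threshold` are hypothetical names.

```python
def trim_silence(samples, threshold):
    """Drop leading and trailing samples whose magnitude is below threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []            # the whole recording counts as silence
    # Quiet samples between the first and last loud sample are kept.
    return samples[loud[0]:loud[-1] + 1]
```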
  • Steps 10 to 40 are the same as those in the first embodiment, and details are not described herein again.
  • The present application also proposes a waveform splicing device based on a two-syllable mashup.
  • The device 20 can be divided into one or more modules.
  • FIG. 6 shows a structural diagram of a first embodiment of the two-syllable mashup-based waveform splicing device 20.
  • In this embodiment, the device 20 may be divided into a sound library production module 201, a text preprocessing module 202, a phrase waveform splicing module 203, and a text audio splicing module 204. The specific functions of the modules 201-204 are described below.
  • The sound library production module 201 is configured to divide the standard audio of a two-syllable word into front, middle, and back segments according to the vowels, each segment being saved to the sound library as a primitive speech segment required for waveform splicing;
  • The text preprocessing module 202 is configured to regularize the text to be converted into speech, segment the regularized text according to the speaking rules to form phrases, and mark the pinyin and tones;
  • The phrase waveform splicing module 203 is configured to take each phrase obtained by word segmentation as a unit, treat every two adjacent words in the phrase as a two-syllable word to be converted, look up in the sound library the front and middle primitive speech segments of the first two-syllable word to be converted in the phrase, the middle and back primitive speech segments of the last two-syllable word to be converted, and the middle primitive speech segment of every other two-syllable word to be converted, and splice the found primitive speech segments in turn, following the order of the two-syllable words in the phrase, into an audio file of the phrase;
  • The text audio splicing module 204 is configured to directly splice the audio files of the phrases in turn into the speech file of the text, following the order of the phrases in the text to be converted into speech.
  • FIG. 7 shows a structural diagram of a second embodiment of the two-syllable mashup-based waveform splicing device 20.
  • In this embodiment, the device 20 is divided into the sound library production module 201, the text preprocessing module 202, the phrase waveform splicing module 203, the text audio splicing module 204, an audio recording module 205, and a mute segment segmentation module 206.
  • the modules 201-204 are the same as those in the first embodiment, and details are not described herein again.
  • The audio recording module 205 is configured to record the two-syllable words read aloud by a professional customer-service speaker, and to save the recordings as original audio files, one per two-syllable word;
  • The mute segment segmentation module 206 is configured to cut off the silent parts before and after the speech in the original audio file, and to save the pronounced part as the standard audio of the two-syllable word.
  • this application also proposes a computer device.
  • FIG. 8 is a schematic diagram of a hardware architecture of a computer device according to an embodiment of the present application.
  • The computer device 2 is a device capable of automatically performing numerical calculation and/or information processing according to instructions that are set or stored in advance.
  • It can be a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of multiple servers).
  • The computer device 2 includes at least, but is not limited to, a memory 21, a processor 22, and a network interface 23, which can communicate with each other through a system bus. Specifically:
  • The memory 21 includes at least one type of computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, or an optical disk.
  • The memory 21 may be an internal storage unit of the computer device 2, such as a hard disk or internal memory of the computer device 2.
  • The memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card.
  • the memory 21 may also include both an internal storage unit of the computer device 2 and an external storage device thereof.
  • the memory 21 is generally used to store an operating system and various application software installed on the computer device 2, such as a computer program used to implement the dual-syllable mashup-based waveform splicing method.
  • the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
  • the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip.
  • the processor 22 is generally used to control the overall operation of the computer device 2, for example, to perform control and processing related to data interaction or communication with the computer device 2.
  • the processor 22 is configured to run program code or process data stored in the memory 21, for example, to run a computer program used to implement the dual-syllable mashup-based waveform splicing method.
  • the network interface 23 may include a wireless network interface or a wired network interface.
  • the network interface 23 is generally used to establish a communication connection between the computer device 2 and other computer devices.
  • the network interface 23 is configured to connect the computer device 2 and an external terminal through a network, and establish a data transmission channel and a communication connection between the computer device 2 and the external terminal.
  • The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communication (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, or Wi-Fi.
  • FIG. 8 shows only the computer device 2 with components 21-23, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.
  • The computer program stored in the memory 21 for implementing the two-syllable mashup-based waveform splicing method may be executed by one or more processors (the processor 22 in this embodiment) to complete the following steps:
  • Step 10, sound library production: the standard audio of each two-syllable word is divided into front, middle, and back segments according to the vowels, and each segment is saved to the sound library as a primitive speech segment required for waveform splicing;
  • Step 20, text preprocessing: regularize the text to be converted into speech, segment the regularized text according to the speaking rules to form phrases, and mark the pinyin and tones;
  • Step 30, phrase waveform splicing: taking each phrase obtained by word segmentation as a unit, treat every two adjacent words in the phrase as a two-syllable word to be converted; look up in the sound library the front and middle primitive speech segments of the first two-syllable word to be converted in the phrase, the middle and back primitive speech segments of the last two-syllable word to be converted, and the middle primitive speech segment of every other two-syllable word to be converted; then, following the order of the two-syllable words in the phrase, splice the found primitive speech segments in turn into an audio file of the phrase;
  • Step 40, text audio splicing: following the order of the phrases in the text to be converted into speech, the audio files of the phrases are directly spliced in turn into a speech file of the text.
  • Before step 10, the method further includes the following steps:
  • Step 01, audio recording: record the two-syllable words read aloud by a professional customer-service speaker, and save the recordings as original audio files, one per two-syllable word;
  • Step 02, mute segment division: cut off the silent parts before and after the speech in the original audio file, and save the pronounced part as the standard audio of the two-syllable word.
  • The computer-readable storage medium is a non-volatile readable storage medium in which a computer program is stored, and the computer program can be executed by at least one processor to implement the operation of the above-mentioned two-syllable mashup-based waveform splicing method or device.
  • The computer-readable storage medium includes a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc.
  • The computer-readable storage medium may be an internal storage unit of a computer device, such as a hard disk or internal memory of the computer device.
  • The computer-readable storage medium may also be an external storage device of a computer device, such as a plug-in hard disk equipped on the computer device, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card.
  • the computer-readable storage medium may also include both the internal storage unit of the computer device and its external storage device.
  • the computer-readable storage medium is generally used to store an operating system and various application software installed on a computer device, such as the aforementioned computer program for implementing the dual-syllable mashup-based waveform splicing method.
  • the computer-readable storage medium can also be used to temporarily store various types of data that have been output or will be output.

Abstract

A waveform splicing method based on double syllable mixing, belonging to the field of speech splicing synthesis. The method comprises: sound library production (step 10): dividing the standard audio of a disyllabic word at the Chinese finals into front, middle and back audio sections, each of which is saved to a sound library as a primitive speech segment required for waveform splicing; text preprocessing (step 20): regularizing the text to be converted into speech, segmenting the regularized text into phrases according to speaking rules, and annotating pinyin and tone; phrase waveform splicing (step 30): taking each segmented phrase as a unit, treating every two adjacent characters in the phrase as a disyllabic word to be converted, and searching the sound library, according to a splicing rule, for the primitive speech segments corresponding to each disyllabic word to be converted; and text audio splicing (step 40): splicing the audio file of each phrase, in the order of the phrases, into a speech file of the text. According to the present invention, very realistic Chinese speech can be synthesized offline and in real time by means of double syllable mixing and segmentation at the Chinese finals.

Description

Waveform splicing method, apparatus, device, and storage medium based on double syllable mixing
This application claims priority to Chinese patent application No. 201811153693.2, filed on September 30, 2018 and entitled "Waveform splicing method, apparatus, device, and storage medium based on double syllable mixing", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of speech splicing synthesis, and in particular to a waveform splicing method, apparatus, device, and storage medium based on double syllable mixing.
Background
Existing speech synthesis methods fall into two categories: those based on speech feature parameters and those based on waveform splicing. Compared with parameter-based methods, speech synthesis based on waveform splicing yields higher-quality synthesized speech that sounds more natural and is closer to the timbre of the original speaker. Mainstream online speech synthesis therefore currently favors schemes based on waveform splicing.
Waveform splicing uses recordings of different lengths as the basic units of a speech library and combines them to synthesize utterances of arbitrary length. Splicing the corresponding basic units from the sound library according to the input text is a simple and effective way to produce very natural speech. In terms of computational complexity, it is also lower than all other speech synthesis schemes.
Before waveforms can be spliced, however, the most suitable speech unit must be chosen, which is an important task in waveform splicing. As a general rule, the longer the selected speech unit, the more natural the synthesized speech, but the larger the speech library becomes, possibly so large that the whole continuous pronunciation system cannot be covered within a given engineering cycle.
Summary of the Invention
The technical problem to be solved by this application is the conflict in the prior art between the naturalness of synthesized speech and reducing the size of the speech library. To this end, a waveform splicing method, apparatus, device, and storage medium based on double syllable mixing are proposed, which can both synthesize high-quality continuous speech and cover the continuous pronunciation system of a specific scenario within a relatively short time.
This application solves the above technical problem through the following technical solutions:
A waveform splicing method based on double syllable mixing includes the following steps:
Sound library production: divide the standard audio of each disyllabic word at the finals into front, middle, and back audio sections, and save each section to the sound library as a primitive speech segment required for waveform splicing;
Text preprocessing: regularize the text to be converted into speech, segment the regularized text into phrases according to speaking rules, and annotate pinyin and tones;
Phrase waveform splicing: taking each segmented phrase as a unit, treat every two adjacent characters in the phrase as a disyllabic word to be converted; look up in the sound library the front and middle primitive speech segments of the first disyllabic word to be converted in the phrase, the middle and back primitive speech segments of the last disyllabic word to be converted, and the middle primitive speech segment of every other disyllabic word to be converted; and splice the retrieved segments, in the order of the disyllabic words in the phrase, into the audio file of the phrase;
Text audio splicing: splice the obtained audio files of the phrases directly, in the order of the phrases in the text to be converted into speech, into the speech file of the text.
This application also discloses a waveform splicing apparatus based on double syllable mixing, including:
a sound library production module, configured to divide the audio of each disyllabic word at the finals into front, middle, and back audio sections, and to save each section to the sound library as a primitive speech segment required for waveform splicing;
a text preprocessing module, configured to regularize the text to be converted into speech, segment the regularized text into phrases according to speaking rules, and annotate pinyin and tones;
a phrase waveform splicing module, configured to take each segmented phrase as a unit and treat every two adjacent characters in the phrase as a disyllabic word to be converted; to look up in the sound library the front and middle primitive speech segments of the first disyllabic word to be converted in the phrase, the middle and back primitive speech segments of the last disyllabic word to be converted, and the middle primitive speech segment of every other disyllabic word to be converted; and to splice the retrieved segments, in the order of the disyllabic words in the phrase, into the audio file of the phrase;
a text audio splicing module, configured to splice the obtained audio files of the phrases directly, in the order of the phrases in the text to be converted into speech, into the speech file of the text.
This application also discloses a computer device including a memory and a processor, the memory storing a computer program that, when executed by the processor, implements the following steps:
Sound library production: divide the standard audio of each disyllabic word at the finals into front, middle, and back audio sections, and save each section to the sound library as a primitive speech segment required for waveform splicing;
Text preprocessing: regularize the text to be converted into speech, segment the regularized text into phrases according to speaking rules, and annotate pinyin and tones;
Phrase waveform splicing: taking each segmented phrase as a unit, treat every two adjacent characters in the phrase as a disyllabic word to be converted; look up in the sound library the front and middle primitive speech segments of the first disyllabic word to be converted in the phrase, the middle and back primitive speech segments of the last disyllabic word to be converted, and the middle primitive speech segment of every other disyllabic word to be converted; and splice the retrieved segments, in the order of the disyllabic words in the phrase, into the audio file of the phrase;
Text audio splicing: splice the obtained audio files of the phrases directly, in the order of the phrases in the text to be converted into speech, into the speech file of the text.
This application also discloses a computer-readable storage medium storing a computer program that can be executed by at least one processor to implement the following steps:
Sound library production: divide the standard audio of each disyllabic word at the finals into front, middle, and back audio sections, and save each section to the sound library as a primitive speech segment required for waveform splicing;
Text preprocessing: regularize the text to be converted into speech, segment the regularized text into phrases according to speaking rules, and annotate pinyin and tones;
Phrase waveform splicing: taking each segmented phrase as a unit, treat every two adjacent characters in the phrase as a disyllabic word to be converted; look up in the sound library the front and middle primitive speech segments of the first disyllabic word to be converted in the phrase, the middle and back primitive speech segments of the last disyllabic word to be converted, and the middle primitive speech segment of every other disyllabic word to be converted; and splice the retrieved segments, in the order of the disyllabic words in the phrase, into the audio file of the phrase;
Text audio splicing: splice the obtained audio files of the phrases directly, in the order of the phrases in the text to be converted into speech, into the speech file of the text.
The beneficial effects of this application are:
1) By combining double syllable mixing with segmentation at the finals, very realistic Chinese speech can be synthesized both offline and in real time;
2) High-quality continuous speech can be synthesized while the continuous pronunciation system of a specific scenario is covered within a relatively short time.
Brief Description of the Drawings
FIG. 1 is a flowchart of Embodiment 1 of the waveform splicing method based on double syllable mixing of the present application;
FIG. 2 is a flowchart of the text preprocessing step in Embodiment 1 of the waveform splicing method based on double syllable mixing of the present application;
FIG. 3 is a flowchart of Embodiment 2 of the waveform splicing method based on double syllable mixing of the present application;
FIG. 4 shows the waveform of the original audio;
FIG. 5 shows the waveform of the standard audio;
FIG. 6 is a structural diagram of a first embodiment of the waveform splicing apparatus based on double syllable mixing of the present application;
FIG. 7 is a structural diagram of a second embodiment of the waveform splicing apparatus based on double syllable mixing of the present application;
FIG. 8 is a schematic diagram of the hardware architecture of an embodiment of the computer device of the present application.
Detailed Description
The application is further described below by way of embodiments, which do not limit the application to the scope of the embodiments described.
First, this application proposes a waveform splicing method based on double syllable mixing.
In Embodiment 1, as shown in FIG. 1, the waveform splicing method based on double syllable mixing includes the following steps:
Step 10, sound library production: divide the standard audio of each disyllabic word at the finals into front, middle, and back audio sections, and save each section to the sound library as a primitive speech segment required for waveform splicing.
Standard audio refers to audio that contains only the voiced part.
When the standard audio is split, the cut point is preferably the zero point immediately to the left of the highest point in the middle of the waveform of the Chinese final. (When a professional customer service agent reads a disyllabic word aloud, the vibration of the voice produces a sound wave that can be displayed as a waveform; the waveform of the final is the portion of that waveform corresponding to the final.) The three audio sections obtained by splitting are saved to the sound library as primitive speech segments. Each segment's file name is formed from the pinyin, tones, and section index of the corresponding disyllabic word: tones are generally written as the digits 1-4 for the first to fourth tones, with each character's tone digit directly following its pinyin, and the section index gives the position of the section among the three, with 0-2 denoting the first to third sections.
For example, the standard audio file of the disyllabic word "你好" is "ni2_hao3.wav". The first cut point is in the middle of the vowel of "你" and the second is in the middle of the vowel of "好". After splitting, the three audio sections are saved to the sound library as primitive speech segments with the file names "ni2_hao3_0.wav", "ni2_hao3_1.wav", and "ni2_hao3_2.wav".
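The cutting and naming rule can be illustrated with a minimal Python sketch. It assumes that the sample range of each final is already known (for instance from manual labelling; the source does not say how it is obtained), and it interprets "the middle of the waveform" as the middle half of that range. The names `cut_index`, `split_three`, `segment_names`, and `final_spans` are hypothetical.

```python
from typing import List, Tuple


def cut_index(samples: List[int], start: int, end: int) -> int:
    """Pick a cut point inside samples[start:end], the span of one final.

    Assumption: "middle" means the middle half of the span. The cut is the
    first non-positive sample found when walking left from the highest
    sample in that window, i.e. the zero point to the left of the peak.
    """
    quarter = (end - start) // 4
    lo, hi = start + quarter, end - quarter
    peak = max(range(lo, hi), key=lambda i: samples[i])
    i = peak
    while i > start and samples[i] > 0:
        i -= 1
    return i


def split_three(samples: List[int],
                final_spans: Tuple[Tuple[int, int], Tuple[int, int]]) -> List[List[int]]:
    """Split one disyllabic recording into front, middle, and back sections."""
    first = cut_index(samples, *final_spans[0])
    second = cut_index(samples, *final_spans[1])
    return [samples[:first], samples[first:second], samples[second:]]


def segment_names(word_key: str) -> List[str]:
    """File names of the three sections, e.g. for the key 'ni2_hao3'."""
    return [f"{word_key}_{index}.wav" for index in range(3)]


# Toy waveform with two "finals" occupying samples 0-7 and 8-15.
toy = [0, 3, -2, 1, 5, 9, 4, -1, 0, 2, -3, 1, 6, 8, 2, -4]
parts = split_three(toy, ((0, 8), (8, 16)))
names = segment_names("ni2_hao3")
```

Cutting at a zero point keeps the amplitude continuous where two segments later meet, which is presumably why the source prefers it.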
Step 20, text preprocessing: regularize the text to be converted into speech, segment the regularized text into phrases according to speaking rules, and annotate pinyin and tones.
As shown in FIG. 2, the text preprocessing specifically includes the following three steps:
Step 21, text regularization: convert the characters in the text that are neither Chinese nor English according to preset processing rules, so that the text finally contains only Chinese characters, English letters, and spaces.
English is handled by an English speech waveform splicing method, which differs from the Chinese one. This application addresses only Chinese speech waveform splicing, so English parts are left unchanged during text regularization.
The preset processing rules may specifically be to replace Arabic numerals with Chinese characters and punctuation marks with spaces. For example, the eleven-digit telephone number "13888886666" is converted to "幺三八八八八八六六六六". If the text contains letters, the letters are left unprocessed.
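As a rough illustration of such a rule set, the following Python sketch reads digits one by one, as is done for telephone numbers, and replaces every other character that is neither Chinese nor an English letter with a space. The digit-by-digit reading (with "1" read as "幺"), the collapsing of repeated spaces, and the name `regularize` are assumptions; the source only gives the telephone-number example.

```python
import re

# Digit-by-digit telephone-style reading; "1" is read as "幺" as in the example.
DIGITS = {"0": "零", "1": "幺", "2": "二", "3": "三", "4": "四",
          "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}


def regularize(text: str) -> str:
    """Keep only Chinese characters, English letters, and spaces."""
    text = "".join(DIGITS.get(ch, ch) for ch in text)
    # Everything that is not a CJK ideograph, an ASCII letter, or a space
    # (punctuation, symbols) becomes a space.
    text = re.sub(r"[^\u4e00-\u9fffA-Za-z ]", " ", text)
    # Assumption: runs of spaces are collapsed and the ends are trimmed.
    return re.sub(r" +", " ", text).strip()


phone = regularize("13888886666")
sentence = regularize("你好,世界!")
```

A production rule set would also need context-dependent readings for dates, amounts, and ordinary numbers, which are outside the scope of this sketch.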
Step 22, text segmentation: divide the text into phrases according to the speaking rules of Chinese, and insert a space between phrases to mark a pause.
The speaking rules are the rules for where to pause when Chinese is read aloud. Taking a telephone number consisting of an area code plus a 7- or 8-digit number as an example, speakers habitually pause after the area code, and the 7- or 8-digit number is usually divided into two parts with a pause between them. Taking reading aloud as an example, speakers usually pause at punctuation marks and also in the middle of long sentences.
For example, the telephone number "幺三八八八八八六六六六" above becomes "幺三八 八八八八 六六六六" after segmentation. If the text contains letters, consecutive letters are treated like a phrase; for example, "一二三BC四五" becomes "一二三 BC 四五" after segmentation.
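The two examples can be reproduced with a small Python sketch. The 3-4-4 grouping of an 11-digit number and the helper names `split_mobile_number` and `split_letter_runs` are assumptions made for illustration; the source does not spell out a complete rule set for segmentation.

```python
import re
from typing import List


def split_mobile_number(chars: str) -> List[str]:
    """Split the reading of an 11-digit number into 3-4-4 groups (assumed rule)."""
    if len(chars) != 11:
        raise ValueError("expected the reading of an 11-digit number")
    return [chars[:3], chars[3:7], chars[7:]]


def split_letter_runs(text: str) -> List[str]:
    """Treat each run of consecutive English letters as its own phrase."""
    return re.findall(r"[A-Za-z]+|[^A-Za-z\s]+", text)


number_phrases = " ".join(split_mobile_number("幺三八八八八八六六六六"))
mixed_phrases = " ".join(split_letter_runs("一二三BC四五"))
```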
Step 23, pinyin annotation: annotate the segmented text with pinyin and tones, with tones written as the digits 1-4.
For example, the segmented text "幺三八 八八八八 六六六六" is annotated with the pinyin "yao1 san1 ba1 ba1 ba1 ba1 ba1 liu4 liu4 liu4 liu4", where the space between the pinyin of two adjacent characters can represent a configurable, adjustable duration of silence.
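A minimal Python sketch of the annotation step is shown below. It uses a tiny hand-written lookup table that covers only the characters of the example; the table `PINYIN` and the function `annotate` are hypothetical, and a real system would need a full pronunciation dictionary that also handles polyphonic characters and tone changes.

```python
from typing import Dict, List

# Hypothetical lookup table covering only the characters in the example.
PINYIN: Dict[str, str] = {"幺": "yao1", "三": "san1", "八": "ba1", "六": "liu4"}


def annotate(segmented: str) -> List[List[str]]:
    """Return one list of pinyin-plus-tone syllables per phrase."""
    return [[PINYIN[ch] for ch in phrase] for phrase in segmented.split()]


annotated = annotate("幺三八 八八八八 六六六六")
flat = " ".join(syllable for phrase in annotated for syllable in phrase)
```

Keeping the phrases as separate lists preserves the phrase boundaries needed by the later splicing steps.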
Step 30, phrase waveform splicing: taking each segmented phrase as a unit, treat every two adjacent characters in the phrase as a disyllabic word to be converted; look up in the sound library the front and middle primitive speech segments of the first disyllabic word to be converted in the phrase, the middle and back primitive speech segments of the last disyllabic word to be converted, and the middle primitive speech segment of every other disyllabic word to be converted; and splice the retrieved segments, in the order of the disyllabic words in the phrase, into the audio file of the phrase.
The audio of each segmented phrase is the smallest audio file; it is obtained by splicing several primitive speech segments.
A phrase here consists of several characters and/or words within a sentence that are spoken without a pause. Because the primitive speech segments are cut from the audio of disyllabic words, the speech waveforms must be spliced pairwise so that adjacent sounds blend together. The phrase is therefore divided by treating every two adjacent characters as a disyllabic word to be converted: if the phrase consists of n characters, the division yields n-1 disyllabic words to be converted, and the second character of each is the first character of the next. Note that the n-1 disyllabic words are ordered by their position in the phrase, so that the first and the last (the (n-1)-th) disyllabic words to be converted can be identified.
When the phrase is divided into n-1 disyllabic words to be converted, the pinyin and tones annotated on the phrase are divided by the same rule, so that the n-1 resulting pinyin-and-tone pairs correspond one-to-one with the n-1 disyllabic words. Note that the annotated pinyin and tones correspond one-to-one with the characters of the phrase: every character carries one pinyin-and-tone annotation, and during parsing each digit encountered marks the end of one character's annotation. Take the first phrase of the 11-digit telephone number above, "幺三八", as an example. When "幺三八" is divided into the two disyllabic words "幺三" and "三八", its annotation "yao1 san1 ba1" is divided by the same rule. Starting from the first letter "y", reaching the first "1" marks the end of the annotation of the first character "幺", namely "yao1"; continuing from the next letter "s", reaching the second "1" marks the end of the annotation of the second character "三", namely "san1". This gives "yao1 san1" as the annotation of the first disyllabic word "幺三". The annotation of the second disyllabic word "三八" is obtained in the same way and is not repeated here.
The primitive speech segments are then looked up according to the annotation of each disyllabic word to be converted: for each such word, its annotated pinyin-and-tone text is obtained, and the sound library is searched for primitive speech segments whose file names contain that pinyin and tone. According to the splicing rule, the front and middle segments are taken for the first disyllabic word, the middle and back segments for the last, and only the middle segment for any disyllabic word in between. In other words, a phrase of n characters is spliced from n+1 primitive speech segments.
Take the 11-digit telephone number above as an example:
The first phrase, "幺三八", is divided into the two disyllabic words to be converted "幺三" and "三八". The front and middle primitive speech segments of "幺三" are "yao1_san1_0.wav" and "yao1_san1_1.wav", and the middle and back segments of "三八" are "san1_ba1_1" and "san1_ba1_2". Splicing the waveforms of these four segments gives the audio file of the first phrase "幺三八". Following the audio file naming rule (the file name is the pinyin and tones annotated on the phrase, followed by a suffix), the file is temporarily stored as "yao1_san1_ba1.wav".
The second phrase, "八八八八", is divided into three disyllabic words to be converted: "八八", "八八", and "八八". The front and middle segments of the first "八八" are "ba1_ba1_0" and "ba1_ba1_1", the middle segment of the second "八八" is "ba1_ba1_1", and the middle and back segments of the third "八八" are "ba1_ba1_1" and "ba1_ba1_2". Splicing the waveforms of these five segments gives the audio file of the second phrase "八八八八", which, following the naming rule, is temporarily stored as "ba1_ba1_ba1_ba1.wav".
The third phrase, "六六六六", is divided into three disyllabic words to be converted: "六六", "六六", and "六六". The front and middle segments of the first "六六" are "liu4_liu4_0" and "liu4_liu4_1", the middle segment of the second "六六" is "liu4_liu4_1", and the middle and back segments of the third "六六" are "liu4_liu4_1" and "liu4_liu4_2". Splicing the waveforms of these five segments gives the audio file of the third phrase "六六六六", which, following the naming rule, is temporarily stored as "liu4_liu4_liu4_liu4.wav".
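The selection rule walked through in these examples can be sketched in Python as follows. The function names are hypothetical, every segment name is given a ".wav" extension for uniformity, and the handling of a single-syllable phrase (which has no adjacent pair) is not covered by the source, so the sketch simply rejects it.

```python
from typing import List


def phrase_segment_files(syllables: List[str]) -> List[str]:
    """List the primitive segment files to splice for one phrase.

    `syllables` holds the annotated pinyin of the phrase, one entry per
    character, e.g. ["yao1", "san1", "ba1"].
    """
    if len(syllables) < 2:
        # Not specified by the source; this sketch treats it as an error.
        raise ValueError("a phrase needs at least two syllables")
    # Every two adjacent syllables form one disyllabic word to be converted.
    pairs = [f"{a}_{b}" for a, b in zip(syllables, syllables[1:])]
    files: List[str] = []
    last = len(pairs) - 1
    for i, pair in enumerate(pairs):
        if i == 0:
            files.append(f"{pair}_0.wav")   # front section of the first pair
        files.append(f"{pair}_1.wav")       # middle section of every pair
        if i == last:
            files.append(f"{pair}_2.wav")   # back section of the last pair
    return files


def phrase_file_name(syllables: List[str]) -> str:
    """Name of the spliced phrase audio, e.g. 'yao1_san1_ba1.wav'."""
    return "_".join(syllables) + ".wav"


first_phrase = phrase_segment_files(["yao1", "san1", "ba1"])
second_phrase = phrase_segment_files(["ba1"] * 4)
```

With this rule a three-syllable phrase yields two pairs and four segments, and a four-syllable phrase yields three pairs and five segments, matching the worked examples.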
Step 40, text audio splicing: splice the obtained audio files of the phrases directly, in the order of the phrases in the text to be converted into speech, into the speech file of the text.
The audio files of the phrases can simply be concatenated into the speech file of the text. Because there are pauses between phrases, however, silence of a suitable length is preferably inserted between the phrase audio files as needed during concatenation.
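A minimal Python sketch of this concatenation, operating on lists of PCM samples rather than files, might look like this. The function name `join_phrases` and the parameters `sample_rate` and `pause_seconds` are assumptions; the source does not fix a pause length.

```python
from typing import List


def join_phrases(phrases: List[List[int]], sample_rate: int,
                 pause_seconds: float) -> List[int]:
    """Concatenate phrase audio, inserting silence between phrases."""
    # Silence is represented as zero-valued samples.
    silence = [0] * int(round(sample_rate * pause_seconds))
    joined: List[int] = []
    for i, phrase in enumerate(phrases):
        if i > 0:
            joined.extend(silence)
        joined.extend(phrase)
    return joined


speech = join_phrases([[1, 2], [3]], sample_rate=10, pause_seconds=0.2)
```

Setting `pause_seconds` to zero reduces this to plain direct concatenation.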
In Embodiment 2, which builds on Embodiment 1 and is shown in FIG. 3, the waveform splicing method based on double syllable mixing includes the following steps:
Step 01, audio recording: record disyllabic words read aloud by a professional customer service agent, and save one original audio file per disyllabic word.
Because the audio files here are used for waveform splicing and Chinese has many homophones written with different characters, such homophones need to be recorded only once when the original audio files are made. For example, the disyllabic words "结余" and "婕妤" need to be recorded only once. In other words, the number of disyllabic words is determined by pinyin and tone: words with the same pinyin and tones are treated as the same disyllabic word when recording.
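The effect on the recording list can be sketched in Python by grouping words under their pinyin-and-tone key. The function name `recording_groups` and the input format (word paired with its key) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def recording_groups(words: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group words by pinyin-and-tone key; each key is recorded once."""
    groups: Dict[str, List[str]] = {}
    for word, key in words:
        groups.setdefault(key, []).append(word)
    return groups


groups = recording_groups([
    ("结余", "jie2_yu2"),
    ("婕妤", "jie2_yu2"),
    ("你好", "ni2_hao3"),
])
```

Here three words require only two recordings, one per distinct key.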
Step 02, silent segment removal: cut off the silent parts before and after the speech in the original audio file, and save the voiced part as the standard audio of the disyllabic word.
Generally, the original audio contains silent parts. In the waveform shown in FIG. 4, the middle part with large fluctuations is the voiced part, and the parts at both ends with small fluctuations are silent. After the silent parts are cut off, the waveform of the resulting standard audio is as shown in FIG. 5.
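One simple way to perform such trimming is an amplitude threshold, sketched below in Python. The source does not state how the silent parts are detected, so the threshold approach, the function name `trim_silence`, and the parameter `threshold` are all assumptions.

```python
from typing import List


def trim_silence(samples: List[int], threshold: int) -> List[int]:
    """Drop leading and trailing samples whose magnitude is at most `threshold`."""
    voiced = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not voiced:
        return []  # the whole recording is below the threshold
    # Quiet samples inside the voiced part are kept.
    return samples[voiced[0]:voiced[-1] + 1]


standard = trim_silence([0, 1, -1, 50, -60, 2, 40, 1, 0], threshold=5)
```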
Steps 10 to 40 are the same as in Embodiment 1 and are not described again here.
Secondly, the present application provides a waveform splicing apparatus based on double-syllable mixing. The apparatus 20 may be divided into one or more modules.
For example, FIG. 6 shows a structural diagram of a first embodiment of the waveform splicing apparatus 20 based on double-syllable mixing. In this embodiment, the apparatus 20 may be divided into a sound bank production module 201, a text preprocessing module 202, a phrase waveform splicing module 203, and a text audio splicing module 204. The functions of modules 201-204 are described below.
The sound bank production module 201 is configured to cut the standard audio of each disyllabic word, at its finals, into front, middle, and back audio segments, and to save each segment in the sound bank as a primitive speech segment for waveform splicing;
The text preprocessing module 202 is configured to normalize the text to be converted into speech, segment the normalized text into phrases according to speaking rules, and annotate it with pinyin and tones;
The phrase waveform splicing module 203 is configured to, taking each phrase obtained by segmentation as a unit, treat every two adjacent characters in the phrase as a disyllabic word to be converted; look up in the sound bank the front and middle primitive speech segments of the first disyllabic word to be converted in the phrase, the middle and back primitive speech segments of the last disyllabic word to be converted, and the middle primitive speech segment of every other disyllabic word to be converted; and splice the retrieved primitive speech segments, in the order of the disyllabic words in the phrase, into the audio file of the phrase;
The text audio splicing module 204 is configured to directly splice the audio files of the phrases, in the order in which the phrases appear in the text to be converted into speech, into the speech file of the text.
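The selection rule applied by the phrase waveform splicing module 203 can be sketched as follows. The function takes the toned pinyin of each character in a phrase and lists, in splicing order, which segment of which disyllabic word is needed. The position labels `front`, `middle`, and `back`, the function name, and the error raised for one-character phrases are assumptions; for a two-character phrase the sketch infers that the single disyllabic word is both first and last, so all three of its segments are used.

```python
def plan_segments(syllables):
    """Return the (disyllable, position) pairs needed for one phrase.

    `syllables` holds the toned pinyin of each character in the phrase.
    """
    if len(syllables) < 2:
        # Handling of one-character phrases is not covered here (assumption).
        raise ValueError("a phrase needs at least two characters")
    # Every two adjacent characters form one disyllabic word to be converted.
    pairs = [(syllables[i], syllables[i + 1]) for i in range(len(syllables) - 1)]
    plan = []
    last = len(pairs) - 1
    for i, pair in enumerate(pairs):
        if i == 0:
            plan.append((pair, "front"))   # first word: front and middle
        plan.append((pair, "middle"))      # every word contributes its middle
        if i == last:
            plan.append((pair, "back"))    # last word: middle and back
    return plan

plan = plan_segments(["ni3", "hao3", "ma5"])
```

For a phrase of n characters the plan contains n + 1 segments: one front, n - 1 middles, and one back.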
As another example, FIG. 7 shows a structural diagram of a second embodiment of the waveform splicing apparatus 20 based on double-syllable mixing. In this embodiment, the apparatus 20 may be divided into a sound bank production module 201, a text preprocessing module 202, a phrase waveform splicing module 203, a text audio splicing module 204, an audio recording module 205, and a silent-segment cutting module 206.
Modules 201-204 are the same as in the first embodiment and are not described again here.
The audio recording module 205 is configured to record disyllabic words read aloud by a professional customer service agent and to save each disyllabic word as a separate original audio file;
The silent-segment cutting module 206 is configured to cut off the silent portions before and after the audio in the original audio file and to save the voiced portion of the audio as the standard audio of the disyllabic word.
Thirdly, the present application further provides a computer device.
FIG. 8 is a schematic diagram of the hardware architecture of an embodiment of the computer device of the present application. In this embodiment, the computer device 2 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. For example, it may be a smartphone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including a standalone server or a server cluster composed of multiple servers). As shown in the figure, the computer device 2 includes at least, but is not limited to, a memory 21, a processor 22, and a network interface 23 that are communicatively connected to one another through a system bus. Specifically:
The memory 21 includes at least one type of computer-readable storage medium, including flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, such as a hard disk or internal memory of the computer device 2. In other embodiments, the memory 21 may be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the computer device 2. The memory 21 may also include both an internal storage unit of the computer device 2 and an external storage device thereof. In this embodiment, the memory 21 is generally used to store the operating system and the application software installed on the computer device 2, for example the computer program implementing the waveform splicing method based on double-syllable mixing. In addition, the memory 21 may be used to temporarily store data that has been output or is to be output.
In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally used to control the overall operation of the computer device 2, for example to perform control and processing related to data interaction or communication with the computer device 2. In this embodiment, the processor 22 is configured to run the program code stored in the memory 21 or to process data, for example to run the computer program implementing the waveform splicing method based on double-syllable mixing.
The network interface 23 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 2 and other computer devices. For example, the network interface 23 is used to connect the computer device 2 to an external terminal through a network and to establish a data transmission channel and a communication connection between the computer device 2 and the external terminal. The network may be a wireless or wired network such as an intranet, the Internet, Global System of Mobile communication (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.
It should be noted that FIG. 8 shows only the computer device 2 with components 21-23. It should be understood, however, that not all of the illustrated components are required, and more or fewer components may be implemented instead.
In this embodiment, the computer program stored in the memory 21 for implementing the waveform splicing method based on double-syllable mixing may be executed by one or more processors (the processor 22 in this embodiment) to perform the following steps:
Step 10, sound bank production: the standard audio of each disyllabic word is cut, at its finals, into front, middle, and back audio segments, and each segment is saved in the sound bank as a primitive speech segment for waveform splicing;
Step 20, text preprocessing: the text to be converted into speech is normalized, the normalized text is segmented into phrases according to speaking rules, and pinyin and tones are annotated;
Step 30, phrase waveform splicing: taking each phrase obtained by segmentation as a unit, every two adjacent characters in the phrase are treated as a disyllabic word to be converted; the front and middle primitive speech segments of the first disyllabic word to be converted in the phrase, the middle and back primitive speech segments of the last disyllabic word to be converted, and the middle primitive speech segment of every other disyllabic word to be converted are looked up in the sound bank; and the retrieved primitive speech segments are spliced, in the order of the disyllabic words in the phrase, into the audio file of the phrase;
Step 40, text audio splicing: the audio files of the phrases are directly spliced, in the order in which the phrases appear in the text to be converted into speech, into the speech file of the text.
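Step 10 can be illustrated with a short sketch. Claim 4 of this application places each boundary at the zero point to the left of the highest point in the middle of the final's waveform, so a disyllabic word yields two boundaries and three segments. The sketch assumes that the index range of each character's final is already known; how that range is located is not described here, and the function names and sample values are invented for illustration.

```python
def left_zero_of_peak(samples, start, end):
    """Find the highest sample in samples[start:end], then walk left to the
    start of the positive lobe that contains it (the zero point to its left)."""
    peak = max(range(start, end), key=lambda i: samples[i])
    i = peak
    while i > start and samples[i - 1] > 0:
        i -= 1
    return i

def split_three(samples, final1, final2):
    """Cut a disyllable into front, middle, and back segments.

    `final1` and `final2` are (start, end) index ranges of the finals of the
    first and second character (assumed to be given).
    """
    cut1 = left_zero_of_peak(samples, *final1)
    cut2 = left_zero_of_peak(samples, *final2)
    return samples[:cut1], samples[cut1:cut2], samples[cut2:]

audio = [0, 2, -1, 3, 9, 4, -2, 1, -3, 2, 7, 5, -1, 0]
front, middle, back = split_three(audio, final1=(2, 7), final2=(8, 13))
```

Cutting at a zero point means that every segment starts and ends near zero amplitude, which plausibly is why segments from different recordings can be joined directly without an audible click.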
In one embodiment, the following steps are further included before step 10:
Step 01, audio recording: disyllabic words read aloud by a professional customer service agent are recorded, and each disyllabic word is saved as a separate original audio file;
Step 02, silent-segment cutting: the silent portions before and after the audio in the original audio file are cut off, and the voiced portion of the audio is saved as the standard audio of the disyllabic word.
In addition, the present application provides a computer-readable storage medium. The computer-readable storage medium is a non-volatile readable storage medium storing a computer program, and the computer program can be executed by at least one processor to implement the operations of the above waveform splicing method or apparatus based on double-syllable mixing.
The computer-readable storage medium includes flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the computer-readable storage medium may be an internal storage unit of a computer device, such as a hard disk or internal memory of the computer device. In other embodiments, the computer-readable storage medium may be an external storage device of a computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the computer device. The computer-readable storage medium may also include both an internal storage unit of the computer device and an external storage device thereof. In this embodiment, the computer-readable storage medium is generally used to store the operating system and the application software installed on the computer device, for example the aforementioned computer program implementing the waveform splicing method based on double-syllable mixing. In addition, the computer-readable storage medium may be used to temporarily store data that has been output or is to be output.
Although specific embodiments of the present application are described above, those skilled in the art should understand that they are examples only, and that the scope of protection of the present application is defined by the appended claims. Those skilled in the art may make various changes or modifications to these embodiments without departing from the principle and essence of the present application, and all such changes and modifications fall within the scope of protection of the present application.

Claims (20)

  1. A waveform splicing method based on double-syllable mixing, characterized by comprising the following steps:
    sound bank production: cutting the standard audio of each disyllabic word, at its finals, into front, middle, and back audio segments, and saving each segment in the sound bank as a primitive speech segment for waveform splicing;
    text preprocessing: normalizing the text to be converted into speech, segmenting the normalized text into phrases according to speaking rules, and annotating pinyin and tones;
    phrase waveform splicing: taking each phrase obtained by segmentation as a unit, treating every two adjacent characters in the phrase as a disyllabic word to be converted; looking up in the sound bank the front and middle primitive speech segments of the first disyllabic word to be converted in the phrase, the middle and back primitive speech segments of the last disyllabic word to be converted, and the middle primitive speech segment of every other disyllabic word to be converted; and splicing the retrieved primitive speech segments, in the order of the disyllabic words in the phrase, into the audio file of the phrase;
    text audio splicing: directly splicing the audio files of the phrases, in the order in which the phrases appear in the text to be converted into speech, into the speech file of the text.
  2. The waveform splicing method based on double-syllable mixing according to claim 1, characterized by further comprising the following steps before the sound bank production:
    audio recording: recording disyllabic words read aloud by a professional customer service agent, and saving each disyllabic word as a separate original audio file;
    silent-segment cutting: cutting off the silent portions before and after the audio in the original audio file, and saving the voiced portion of the audio as the standard audio of the disyllabic word.
  3. The waveform splicing method based on double-syllable mixing according to claim 1 or 2, characterized in that the file name of each primitive speech segment is formed from the pinyin, tones, and segment position of the disyllabic word to which the primitive speech segment corresponds.
  4. The waveform splicing method based on double-syllable mixing according to claim 1 or 2, characterized in that, when the audio of the disyllabic word is cut at its finals into front, middle, and back audio segments, the zero point to the left of the highest point in the middle of the voiced waveform of a Chinese character's final is used as the boundary point.
  5. The waveform splicing method based on double-syllable mixing according to claim 1 or 2, characterized in that the text preprocessing specifically comprises the following steps:
    text normalization: converting the characters in the text that are neither Chinese nor English according to preset processing rules;
    text segmentation: dividing the text into several phrases according to Chinese speaking habits, and inserting a space between phrases to indicate a pause;
    pinyin annotation: annotating the segmented text with pinyin and tones.
  6. The waveform splicing method based on double-syllable mixing according to claim 3, characterized in that, in the phrase waveform splicing, primitive speech segments whose file names contain the pinyin and tones annotated on each disyllabic word to be converted are looked up in the sound bank; then, according to the splicing rule, the primitive speech segments whose file names contain the corresponding segment position are selected from those found.
  7. A waveform splicing apparatus based on double-syllable mixing, characterized by comprising:
    a sound bank production module, configured to cut the standard audio of each disyllabic word, at its finals, into front, middle, and back audio segments, and to save each segment in the sound bank as a primitive speech segment for waveform splicing;
    a text preprocessing module, configured to normalize the text to be converted into speech, segment the normalized text into phrases according to speaking rules, and annotate pinyin and tones;
    a phrase waveform splicing module, configured to, taking each phrase obtained by segmentation as a unit, treat every two adjacent characters in the phrase as a disyllabic word to be converted; look up in the sound bank the front and middle primitive speech segments of the first disyllabic word to be converted in the phrase, the middle and back primitive speech segments of the last disyllabic word to be converted, and the middle primitive speech segment of every other disyllabic word to be converted; and splice the retrieved primitive speech segments, in the order of the disyllabic words in the phrase, into the audio file of the phrase;
    a text audio splicing module, configured to directly splice the audio files of the phrases, in the order in which the phrases appear in the text to be converted into speech, into the speech file of the text.
  8. The waveform splicing apparatus based on double-syllable mixing according to claim 7, characterized by further comprising:
    an audio recording module, configured to record disyllabic words read aloud by a professional customer service agent and to save each disyllabic word as a separate original audio file;
    a silent-segment cutting module, configured to cut off the silent portions before and after the audio in the original audio file and to save the voiced portion of the audio as the standard audio of the disyllabic word.
  9. A computer device comprising a memory and a processor, characterized in that a computer program is stored in the memory, and the computer program, when executed by the processor, implements the following steps:
    sound bank production: cutting the standard audio of each disyllabic word, at its finals, into front, middle, and back audio segments, and saving each segment in the sound bank as a primitive speech segment for waveform splicing;
    text preprocessing: normalizing the text to be converted into speech, segmenting the normalized text into phrases according to speaking rules, and annotating pinyin and tones;
    phrase waveform splicing: taking each phrase obtained by segmentation as a unit, treating every two adjacent characters in the phrase as a disyllabic word to be converted; looking up in the sound bank the front and middle primitive speech segments of the first disyllabic word to be converted in the phrase, the middle and back primitive speech segments of the last disyllabic word to be converted, and the middle primitive speech segment of every other disyllabic word to be converted; and splicing the retrieved primitive speech segments, in the order of the disyllabic words in the phrase, into the audio file of the phrase;
    text audio splicing: directly splicing the audio files of the phrases, in the order in which the phrases appear in the text to be converted into speech, into the speech file of the text.
  10. The computer device according to claim 9, characterized in that the following steps are further included before the sound bank production:
    audio recording: recording disyllabic words read aloud by a professional customer service agent, and saving each disyllabic word as a separate original audio file;
    silent-segment cutting: cutting off the silent portions before and after the audio in the original audio file, and saving the voiced portion of the audio as the standard audio of the disyllabic word.
  11. The computer device according to claim 9 or 10, characterized in that the file name of each primitive speech segment is formed from the pinyin, tones, and segment position of the disyllabic word to which the primitive speech segment corresponds.
  12. The computer device according to claim 9 or 10, characterized in that, when the audio of the disyllabic word is cut at its finals into front, middle, and back audio segments, the zero point to the left of the highest point in the middle of the voiced waveform of a Chinese character's final is used as the boundary point.
  13. The computer device according to claim 9 or 10, characterized in that the text preprocessing specifically comprises the following steps:
    text normalization: converting the characters in the text that are neither Chinese nor English according to preset processing rules;
    text segmentation: dividing the text into several phrases according to Chinese speaking habits, and inserting a space between phrases to indicate a pause;
    pinyin annotation: annotating the segmented text with pinyin and tones.
  14. The computer device according to claim 11, characterized in that, in the phrase waveform splicing, primitive speech segments whose file names contain the pinyin and tones annotated on each disyllabic word to be converted are looked up in the sound bank; then, according to the splicing rule, the primitive speech segments whose file names contain the corresponding segment position are selected from those found.
  15. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, and the computer program can be executed by at least one processor to implement the following steps:
    sound bank production: cutting the standard audio of each disyllabic word, at its finals, into front, middle, and back audio segments, and saving each segment in the sound bank as a primitive speech segment for waveform splicing;
    text preprocessing: normalizing the text to be converted into speech, segmenting the normalized text into phrases according to speaking rules, and annotating pinyin and tones;
    phrase waveform splicing: taking each phrase obtained by segmentation as a unit, treating every two adjacent characters in the phrase as a disyllabic word to be converted; looking up in the sound bank the front and middle primitive speech segments of the first disyllabic word to be converted in the phrase, the middle and back primitive speech segments of the last disyllabic word to be converted, and the middle primitive speech segment of every other disyllabic word to be converted; and splicing the retrieved primitive speech segments, in the order of the disyllabic words in the phrase, into the audio file of the phrase;
    text audio splicing: directly splicing the audio files of the phrases, in the order in which the phrases appear in the text to be converted into speech, into the speech file of the text.
  16. The computer-readable storage medium according to claim 15, characterized in that the following steps are further included before the sound bank production:
    audio recording: recording disyllabic words read aloud by a professional customer service agent, and saving each disyllabic word as a separate original audio file;
    silent-segment cutting: cutting off the silent portions before and after the audio in the original audio file, and saving the voiced portion of the audio as the standard audio of the disyllabic word.
  17. The computer-readable storage medium according to claim 15 or 16, characterized in that the file name of each primitive speech segment is formed from the pinyin, tones, and segment position of the disyllabic word to which the primitive speech segment corresponds.
  18. The computer-readable storage medium according to claim 15 or 16, characterized in that, when the audio of the disyllabic word is cut at its finals into front, middle, and back audio segments, the zero point to the left of the highest point in the middle of the voiced waveform of a Chinese character's final is used as the boundary point.
  19. The computer-readable storage medium according to claim 15 or 16, characterized in that the text preprocessing specifically comprises the following steps:
    text normalization: converting the characters in the text that are neither Chinese nor English according to preset processing rules;
    text segmentation: dividing the text into several phrases according to Chinese speaking habits, and inserting a space between phrases to indicate a pause;
    pinyin annotation: annotating the segmented text with pinyin and tones.
  20. The computer-readable storage medium according to claim 17, characterized in that, in the phrase waveform splicing, primitive speech segments whose file names contain the pinyin and tones annotated on each disyllabic word to be converted are looked up in the sound bank; then, according to the splicing rule, the primitive speech segments whose file names contain the corresponding segment position are selected from those found.
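The file-name convention of claims 3, 11, and 17 and the two-stage lookup of claims 6, 14, and 20 can be sketched as follows. The claims state only that the file name carries the pinyin, tones, and segment position; the concrete format used here (`jie2yu2_middle.wav`), the position labels, and the function names are assumptions made for illustration.

```python
POSITIONS = ("front", "middle", "back")

def segment_filename(syllables, position):
    """Build a file name from toned pinyin and segment position
    (hypothetical format, for example 'jie2yu2_middle.wav')."""
    if position not in POSITIONS:
        raise ValueError(f"unknown position: {position}")
    return "".join(syllables) + "_" + position + ".wav"

def find_segment(bank_files, syllables, position):
    """Two-stage lookup: first keep the files whose name carries the pinyin
    and tones, then pick the one that carries the requested position."""
    key = "".join(syllables) + "_"
    candidates = [name for name in bank_files if name.startswith(key)]
    wanted = segment_filename(syllables, position)
    for name in candidates:
        if name == wanted:
            return name
    return None

bank = [segment_filename(("jie2", "yu2"), p) for p in POSITIONS]
bank.append(segment_filename(("ni3", "hao3"), "middle"))
hit = find_segment(bank, ("jie2", "yu2"), "middle")
miss = find_segment(bank, ("ni3", "hao3"), "front")
```

Because the name is derived from pronunciation rather than from characters, homophones such as "结余" and "婕妤" resolve to the same files.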
PCT/CN2018/124440 2018-09-30 2018-12-27 Waveform splicing method and apparatus based on double syllable mixing, and device, and storage medium WO2020062680A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811153693.2A CN109389968B (en) 2018-09-30 2018-09-30 Waveform splicing method, device, equipment and storage medium based on double syllable mixing and lapping
CN201811153693.2 2018-09-30

Publications (1)

Publication Number Publication Date
WO2020062680A1 (en) 2020-04-02

Family

ID=65419113

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/124440 WO2020062680A1 (en) 2018-09-30 2018-12-27 Waveform splicing method and apparatus based on double syllable mixing, and device, and storage medium

Country Status (2)

Country Link
CN (1) CN109389968B (en)
WO (1) WO2020062680A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112307183A (en) * 2020-10-30 2021-02-02 北京金堤征信服务有限公司 Search data identification method and device, electronic equipment and computer storage medium
CN113674731A (en) * 2021-05-14 2021-11-19 北京搜狗科技发展有限公司 Speech synthesis processing method, apparatus and medium
CN117672182A (en) * 2024-02-02 2024-03-08 江西拓世智能科技股份有限公司 Sound cloning method and system based on artificial intelligence

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110189744A (en) * 2019-04-09 2019-08-30 阿里巴巴集团控股有限公司 The method, apparatus and electronic equipment of text-processing
CN112562637B (en) * 2019-09-25 2024-02-06 北京中关村科金技术有限公司 Method, device and storage medium for splicing voice audios
CN111145722B (en) * 2019-12-30 2022-09-02 出门问问信息科技有限公司 Text processing method and device, computer storage medium and electronic equipment
CN111145723B (en) * 2019-12-31 2023-11-17 广州酷狗计算机科技有限公司 Method, device, equipment and storage medium for converting audio
CN110956948A (en) * 2020-01-03 2020-04-03 北京海天瑞声科技股份有限公司 End-to-end speech synthesis method, device and storage medium
CN111341293B (en) * 2020-03-09 2022-11-18 广州市百果园信息技术有限公司 Text voice front-end conversion method, device, equipment and storage medium
CN111564153B (en) * 2020-04-02 2021-10-01 湖南声广科技有限公司 Intelligent broadcasting music program system of broadcasting station
CN112102810A (en) * 2020-09-22 2020-12-18 深圳追一科技有限公司 Voice synthesis method, system and related equipment
CN112667865A (en) * 2020-12-29 2021-04-16 西安掌上盛唐网络信息有限公司 Method and system for applying Chinese-English mixed speech synthesis technology to Chinese language teaching

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN87100922A (en) * 1987-02-21 1988-11-16 杭州自动化研究所 Head-and-tail splicing synthesis method for Chinese-character computer speech
CN1455386A (en) * 2002-11-01 2003-11-12 中国科学院声学研究所 Imbedded voice synthesis method and system
CN104318920A (en) * 2014-10-07 2015-01-28 北京理工大学 Construction method of cross-syllable Chinese speech synthesis element with spectrum stable boundary

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1111811C (en) * 1997-04-14 2003-06-18 英业达股份有限公司 Articulation compounding method for computer phonetic signal
US6108627A (en) * 1997-10-31 2000-08-22 Nortel Networks Corporation Automatic transcription tool
CN1811912B (en) * 2005-01-28 2011-06-15 北京捷通华声语音技术有限公司 Minor sound base phonetic synthesis method
CN101000765B (en) * 2007-01-09 2011-03-30 黑龙江大学 Speech synthetic method based on rhythm character
CN102779508B (en) * 2012-03-31 2016-11-09 科大讯飞股份有限公司 Speech library generation apparatus and method, and speech synthesis system and method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112307183A (en) * 2020-10-30 2021-02-02 北京金堤征信服务有限公司 Search data identification method and device, electronic equipment and computer storage medium
CN112307183B (en) * 2020-10-30 2024-04-19 北京金堤征信服务有限公司 Search data identification method, apparatus, electronic device and computer storage medium
CN113674731A (en) * 2021-05-14 2021-11-19 北京搜狗科技发展有限公司 Speech synthesis processing method, apparatus and medium
CN117672182A (en) * 2024-02-02 2024-03-08 江西拓世智能科技股份有限公司 Sound cloning method and system based on artificial intelligence

Also Published As

Publication number Publication date
CN109389968B (en) 2023-08-18
CN109389968A (en) 2019-02-26

Similar Documents

Publication Publication Date Title
WO2020062680A1 (en) Waveform splicing method and apparatus based on double syllable mixing, and device, and storage medium
CN110264991A (en) Training method for speech synthesis model, speech synthesis method, device, equipment and storage medium
KR20210146368A (en) End-to-end automatic speech recognition for digit sequences
US20100268539A1 (en) System and method for distributed text-to-speech synthesis and intelligibility
CN110570876B (en) Singing voice synthesizing method, singing voice synthesizing device, computer equipment and storage medium
US11955118B2 (en) Method and apparatus with real-time translation
GB2557714A (en) Determining phonetic relationships
CN112818089B (en) Text phonetic notation method, electronic equipment and storage medium
WO2023045186A1 (en) Intention recognition method and apparatus, and electronic device and storage medium
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN114678001A (en) Speech synthesis method and speech synthesis device
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN110852075B (en) Voice transcription method and device capable of automatically adding punctuation marks and readable storage medium
CN112185341A (en) Dubbing method, apparatus, device and storage medium based on speech synthesis
JP7314450B2 (en) Speech synthesis method, device, equipment, and computer storage medium
WO2023129352A1 (en) Using token level context to generate ssml tags
US20220189455A1 (en) Method and system for synthesizing cross-lingual speech
CN113409761B (en) Speech synthesis method, speech synthesis device, electronic device, and computer-readable storage medium
CN114822489A (en) Text transfer method and text transfer device
Park et al. Jejueo datasets for machine translation and speech synthesis
CN112686041A (en) Pinyin marking method and device
US20230056128A1 (en) Speech processing method and apparatus, device and computer storage medium
US20240153484A1 (en) Massive multilingual speech-text joint semi-supervised learning for text-to-speech
US20220310061A1 (en) Regularizing Word Segmentation
Kabir et al. Real time Bengali speech to text conversion using CMU Sphinx

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 18934989; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 18934989; Country of ref document: EP; Kind code of ref document: A1