WO2020062680A1

WO2020062680A1 - Waveform splicing method and apparatus based on double syllable mixing, and device, and storage medium

Info

Publication number: WO2020062680A1
Application number: PCT/CN2018/124440
Authority: WO
Inventors: 房树明; 程宁; 王健宗; 肖京
Original assignee: 平安科技（深圳）有限公司
Priority date: 2018-09-30
Filing date: 2018-12-27
Publication date: 2020-04-02
Also published as: CN109389968B; CN109389968A

Abstract

A waveform splicing method based on double syllable mixing, which belongs to the field of speech splicing synthesis. The method comprises: sound library production (step 10): dividing standard audio of a disyllabic word into first, middle and rear sections of audio according to Chinese vowels, with each section of audio being saved into a sound library as a primitive speech segment required for waveform splicing; text preprocessing (step 20): regularizing text to be converted into speech, and word-segmenting the regularized text according to speech rules to form phrases, and marking spelling and tone; phrase waveform splicing (step 30): in units of phrases after word segmentation, using every two adjacent words in phrases as a disyllabic word to be converted, and searching, from the sound library and according to a splicing rule, for a primitive speech segment corresponding to the disyllabic word to be converted; and text audio splicing (step 40): according to the order of each phrase, sequentially splicing an audio file of each phrase into a speech file of the text. According to the present invention, extremely realistic offline and real-time Chinese speech can be synthesized by means of double syllable mixing and Chinese vowel segmentation.

Description

Waveform splicing method, apparatus, device and storage medium based on two-syllable mashup

This application claims priority to Chinese patent application No. 201811153693.2, filed on September 30, 2018 and entitled "Waveform splicing method, apparatus, device and storage medium based on two-syllable mashup", the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of speech splicing synthesis, and in particular to a waveform splicing method, apparatus, device, and storage medium based on a two-syllable mashup.

Background technique

Existing speech synthesis methods fall into two categories: those based on speech feature parameters and those based on waveform splicing. Compared with the parameter-based approach, speech synthesis based on waveform splicing produces higher-quality synthesized speech that sounds more natural and closer to the original voice of the speaker. For this reason, current mainstream online speech synthesis mostly uses waveform-splicing-based solutions.

Waveform splicing uses recordings of different lengths as the basic units of a speech database, from which speech of any length can be synthesized. Splicing together the basic units of the sound library that correspond to the input text is a simple and effective way to generate very natural speech. It also has lower computational complexity than other speech synthesis schemes.

Before waveforms can be spliced, however, the most suitable speech unit must be chosen. As a general principle, the longer the selected speech unit, the more natural the synthesized speech, but also the larger the speech database; the database may become too large to cover the entire continuous pronunciation system within a given engineering cycle.

Summary of the Invention

The technical problem to be solved by this application is the conflict in the prior art between the naturalness of synthesized speech and the need to keep the speech database small. To that end, a waveform splicing method, apparatus, device and storage medium based on a two-syllable mashup are proposed, which can ensure the synthesis of high-quality continuous speech while covering the continuous pronunciation system of a specific scenario in a short time.

This application solves the above technical problems through the following technical solutions:

A method for waveform splicing based on a two-syllable mashup includes the following steps:

Sound library production: the standard audio of a two-syllable word is divided into front, middle, and back audio segments according to the vowels, and each segment is saved to the sound library as a primitive speech segment required for waveform splicing;

Text preprocessing: the text to be converted into speech is regularized, the regularized text is segmented into phrases according to speaking rules, and the pinyin and tones are labeled;

Phrase waveform splicing: taking each phrase obtained by segmentation as a unit, every two adjacent words in the phrase are treated as a two-syllable word to be converted; the sound library is searched for the front and middle primitive speech segments of the first two-syllable word to be converted in the phrase, the middle and back primitive speech segments of the last two-syllable word to be converted, and the middle primitive speech segment of every other two-syllable word to be converted; the primitive speech segments found are then spliced, in the order of the two-syllable words in the phrase, into an audio file of the phrase;

Text audio splicing: the audio files of the phrases are spliced directly, in the order of the phrases in the text to be converted into speech, into a speech file of the text.

This application also discloses a waveform splicing apparatus based on a two-syllable mashup, including:

A sound library production module, configured to divide the standard audio of a two-syllable word into front, middle, and back audio segments according to the vowels, and to save each segment to the sound library as a primitive speech segment required for waveform splicing;

A text preprocessing module, configured to regularize the text to be converted into speech, segment the regularized text into phrases according to speaking rules, and label the pinyin and tones;

A phrase waveform splicing module, configured to take each phrase obtained by segmentation as a unit, treat every two adjacent words in the phrase as a two-syllable word to be converted, search the sound library for the front and middle primitive speech segments of the first two-syllable word to be converted in the phrase, the middle and back primitive speech segments of the last two-syllable word to be converted, and the middle primitive speech segment of every other two-syllable word to be converted, and splice the primitive speech segments found, in the order of the two-syllable words in the phrase, into an audio file of the phrase;

A text audio splicing module, configured to splice the audio files of the phrases directly, in the order of the phrases in the text to be converted into speech, into a speech file of the text.

The present application also discloses a computer device including a memory and a processor. The memory stores a computer program which, when executed by the processor, implements the steps of the above method.

The present application also discloses a computer-readable storage medium. A computer program is stored in the computer-readable storage medium, and the computer program can be executed by at least one processor to implement the steps of the above method.

The beneficial effects of this application are:

1) By combining the two-syllable mashup with segmentation at the vowels, it can synthesize very realistic Chinese speech, both offline and in real time;

2) It can ensure the synthesis of high-quality continuous speech while covering the continuous pronunciation system of a specific scenario in a short time.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a flowchart of Embodiment 1 of a method for waveform splicing based on a two-syllable mashup in the present application;

FIG. 2 shows a flowchart of the text preprocessing step in the first embodiment of the waveform splicing method based on a two-syllable mashup;

FIG. 3 shows a flowchart of a second embodiment of the waveform splicing method based on a two-syllable mashup;

FIG. 4 shows a waveform diagram of the original audio;

FIG. 5 shows a waveform diagram of the standard audio;

FIG. 6 shows a structural diagram of a first embodiment of the waveform splicing apparatus based on a two-syllable mashup in the present application;

FIG. 7 shows a structural diagram of a second embodiment of the waveform splicing apparatus based on a two-syllable mashup in the present application;

FIG. 8 is a schematic diagram of a hardware architecture of an embodiment of a computer device of the present application.

Detailed description

The following further describes the application by way of examples, but the application is not limited to the scope of the examples.

First, this application proposes a waveform splicing method based on a two-syllable mashup.

In the first embodiment, as shown in FIG. 1, the waveform splicing method based on a two-syllable mashup includes the following steps:

Step 10. Sound library production: the standard audio of a two-syllable word is divided into front, middle, and back audio segments according to the vowels, and each segment is saved to the sound library as a primitive speech segment required for waveform splicing.

The so-called standard audio refers to audio that contains only pronunciation parts.

The standard audio is preferably segmented at the vowels, using the vowel waveform as a guide. When a professional customer service agent reads the two-syllable word aloud, the resulting sound wave can be displayed as a waveform, and the vowel waveform is the part of that waveform corresponding to the vowel of a syllable. The zero point to the left of the highest point in the middle of each vowel waveform is used as the dividing point. The three audio segments obtained are saved to the sound library as primitive speech segments. Each file is named after the pinyin, tone, and position of the segment within the two-syllable word: the tones are represented by the numbers 1-4 (first to fourth tone), with the tone of each word placed directly after its pinyin, and the position is represented by the numbers 0-2 (first to third audio segment).

For example, the standard audio file for the two-syllable word "你好" (hello) is "ni2_hao3.wav". The first dividing point lies in the middle of the vowel of "你" (ni) and the second in the middle of the vowel of "好" (hao). After division, the three audio segments are saved to the sound library as primitive speech segments with the file names "ni2_hao3_0.wav", "ni2_hao3_1.wav", and "ni2_hao3_2.wav".
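The division and naming rule can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names are hypothetical, the audio is assumed to be a plain list of signed samples, and the sample ranges covering each vowel are assumed to be supplied by the caller, since the text does not say how the vowel regions are located.

```python
def zero_crossing_left_of_peak(samples, start, end):
    """Return the index of the first sample after the zero crossing that
    lies to the left of the highest sample in samples[start:end]."""
    peak = max(range(start, end), key=lambda i: samples[i])
    i = peak
    # Walk left while neighbouring samples keep the same sign.
    while i > start and samples[i] * samples[i - 1] > 0:
        i -= 1
    return i


def split_three(samples, vowel1, vowel2):
    """Split the audio into front, middle and back parts at the two vowels.

    vowel1 and vowel2 are (start, end) sample ranges of the first and
    second vowel (assumed inputs, not specified by the source).
    """
    cut1 = zero_crossing_left_of_peak(samples, *vowel1)
    cut2 = zero_crossing_left_of_peak(samples, *vowel2)
    return samples[:cut1], samples[cut1:cut2], samples[cut2:]


def segment_filenames(stem):
    """File names for the three segments, e.g. stem 'ni2_hao3'."""
    return [f"{stem}_{position}.wav" for position in range(3)]
```

Cutting at a zero crossing keeps the sample values on both sides of a splice close to zero, which tends to avoid audible clicks when segments from different recordings are later joined.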

Step 20: Text preprocessing: the text to be converted into speech is regularized, the regularized text is segmented into phrases according to speaking rules, and the pinyin and tones are labeled.

As shown in FIG. 2, the text preprocessing specifically includes the following three steps:

Step 21: Text regularization: characters in the text that are neither Chinese nor English are converted according to a preset processing rule, so that the text contains only Chinese characters, English letters, and spaces.

English speech is synthesized with an English waveform splicing method, which differs from the Chinese one. This application addresses only the Chinese speech waveform splicing method, so English content is left unchanged during text regularization.

The preset processing rule may be, specifically, to replace Arabic numerals with Chinese characters and punctuation marks with spaces. For example, the eleven-digit telephone number "13888886666" is converted to "幺三八八八八八六六六六". Any letters in the text are left unprocessed.
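A regularization rule of this kind could look like the sketch below. The digit table is an assumption: it uses the telephone-style reading "幺" for 1, as in the example, and ordinary readings for the other digits. The function name `regularize` is hypothetical.

```python
import re

# Assumed digit readings; "幺" for 1 follows the telephone number example.
DIGIT_TO_HANZI = {
    "0": "零", "1": "幺", "2": "二", "3": "三", "4": "四",
    "5": "五", "6": "六", "7": "七", "8": "八", "9": "九",
}


def regularize(text):
    """Replace Arabic numerals with Chinese characters and replace every
    remaining character that is not Chinese, an English letter or a space
    (for example punctuation) with a space."""
    text = "".join(DIGIT_TO_HANZI.get(ch, ch) for ch in text)
    return re.sub(r"[^\u4e00-\u9fffA-Za-z ]", " ", text)
```

For instance, `regularize("13888886666")` gives the eleven-character string used in the example, and English letters pass through unchanged.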

Step 22. Text segmentation: divide the text into several phrases according to the Chinese speaking rules, and add a space between each phrase to indicate a pause.

The speaking rules are the rules by which sentences are broken up when Chinese is read aloud. For a telephone number consisting of an area code followed by a 7- or 8-digit number, speakers habitually pause after the area code and usually split the 7- or 8-digit number into two parts with a pause between them. When reading text, speakers usually pause at punctuation marks and within long sentences.

For example, the aforementioned telephone number "幺三八八八八八六六六六" becomes "幺三八 八八八八 六六六六" after segmentation. Consecutive letters, if present, are treated as a phrase of their own; for example, "一二三BC四五" becomes "一二三 BC 四五" after segmentation.

Step 23: Pinyin labeling: label the segmented text with pinyin and tones, where the tones are represented by the numbers 1-4.

For example, the segmented text "幺三八 八八八八 六六六六" is labeled "yao1 san1 ba1 ba1 ba1 ba1 ba1 liu4 liu4 liu4 liu4" after pinyin labeling. The space between the pinyin of adjacent words can be used to represent an adjustable blank interval.
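The two segmentation examples can be reproduced with a small sketch. The 3-4-4 grouping of an eleven-digit number is an assumption read off the example, and both function names are hypothetical; a real system would need a fuller set of speaking rules.

```python
import re


def segment_mobile_number(chars):
    """Group an 11-character number as 3 + 4 + 4, separated by spaces
    that stand for pauses (grouping assumed from the example)."""
    if len(chars) != 11:
        raise ValueError("expected an 11-character number")
    return " ".join([chars[:3], chars[3:7], chars[7:]])


def split_letter_runs(text):
    """Treat each run of consecutive English letters as its own phrase,
    separate from the surrounding runs of Chinese characters."""
    return " ".join(re.findall(r"[A-Za-z]+|[\u4e00-\u9fff]+", text))
```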
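Pinyin labeling amounts to a lookup from each word to its pinyin and tone. The sketch below uses a tiny hypothetical table that covers only the characters of the telephone number example; a practical system would need a full pronunciation lexicon and handling of characters with several readings.

```python
# Hypothetical lookup table covering only the example characters.
PINYIN = {"幺": "yao1", "三": "san1", "八": "ba1", "六": "liu4"}


def label_pinyin(segmented_text):
    """Return one list of pinyin-and-tone labels per phrase.

    Phrases are separated by spaces in the segmented text. A character
    missing from the table raises KeyError.
    """
    return [[PINYIN[ch] for ch in phrase] for phrase in segmented_text.split()]
```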

Step 30. Phrase waveform splicing: taking each phrase obtained by segmentation as a unit, every two adjacent words in the phrase are treated as a two-syllable word to be converted; the sound library is searched for the front and middle primitive speech segments of the first two-syllable word to be converted in the phrase, the middle and back primitive speech segments of the last two-syllable word to be converted, and the middle primitive speech segment of every other two-syllable word to be converted; the primitive speech segments found are then spliced, in the order of the two-syllable words in the phrase, into an audio file of the phrase.

The audio of each phrase obtained by segmentation is the smallest audio file, and it is obtained by splicing several primitive speech segments.

A phrase here consists of several words and/or terms spoken without a pause within a sentence. Because the primitive speech segments are cut from the audio of two-syllable words, the speech waveforms must be spliced pairwise so that adjacent sounds blend together. Every two adjacent words in the phrase are therefore treated as one two-syllable word to be converted: a phrase of n words yields n - 1 two-syllable words to be converted, and the second word of each is the first word of the next. The n - 1 two-syllable words to be converted are ordered by their position in the phrase, which determines the first through the (n - 1)-th two-syllable word to be converted.
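The overlapping division into two-syllable words can be sketched in a few lines. The function name is hypothetical. As in the worked examples in this description, a three-word phrase yields two two-syllable words and a four-word phrase yields three.

```python
def adjacent_pairs(phrase):
    """Return every pair of adjacent words (characters) in the phrase,
    in order; each pair shares its second word with the next pair."""
    return [phrase[i:i + 2] for i in range(len(phrase) - 1)]
```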

When the phrase is divided into n - 1 two-syllable words to be converted, its labeled pinyin and tones are divided by the same rule, so that the n - 1 pinyin-and-tone pairs correspond one-to-one with the n - 1 two-syllable words to be converted. The labeled pinyin and tones correspond one-to-one with the words of the phrase, that is, each word carries one pinyin-and-tone label, and when the labels are parsed each digit marks the end of the label of one word. Take the first phrase "幺三八" of the aforementioned 11-digit telephone number as an example. It is divided into the two two-syllable words to be converted "幺三" and "三八", and its label "yao1 san1 ba1" is divided by the same rule. Starting from the first letter y, the first 1 marks the end of the label of the first word "幺", which is "yao1"; starting from the next letter s, the second 1 marks the end of the label of the second word "三", which is "san1". The label of the first two-syllable word to be converted, "幺三", is therefore "yao1 san1". The label of the second two-syllable word to be converted, "三八", is obtained in the same way and is not described again. Then, for each two-syllable word to be converted, the primitive speech segments whose file names contain its labeled pinyin and tones are looked up in the sound library.

According to the splicing rule, the front and middle primitive speech segments are taken for the first two-syllable word, the middle and back primitive speech segments for the last two-syllable word, and only the middle primitive speech segment for any two-syllable word in between. A phrase of n words is therefore composed of n + 1 primitive speech segments.
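The label parsing and the splicing rule can be sketched together. The function names are hypothetical, and the sketch only builds the file names to look up; it assumes the sound library stores segments under the "pinyin_pinyin_position.wav" naming scheme described earlier. Consistent with the worked examples, three syllables give four file names and four syllables give five.

```python
import re


def split_syllables(labels):
    """Split a label string at each tone digit, e.g. 'yao1 san1 ba1'."""
    return re.findall(r"[a-z]+[1-4]", labels)


def select_segments(syllables):
    """Apply the splicing rule and return the segment file names in order.

    Position 0 (front) is taken only from the first pair, position 2
    (back) only from the last pair, and position 1 (middle) from every pair.
    """
    pairs = [f"{a}_{b}" for a, b in zip(syllables, syllables[1:])]
    if not pairs:
        raise ValueError("a phrase needs at least two syllables")
    names = [f"{pairs[0]}_0.wav"]
    names += [f"{pair}_1.wav" for pair in pairs]
    names.append(f"{pairs[-1]}_2.wav")
    return names
```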

Take the aforementioned 11-digit telephone number as an example:

The first phrase "幺三八" is divided into the two two-syllable words to be converted "幺三" and "三八". The front and middle primitive speech segments found for "幺三" are "yao1_san1_0.wav" and "yao1_san1_1.wav", and the middle and back primitive speech segments found for "三八" are "san1_ba1_1.wav" and "san1_ba1_2.wav". Splicing the waveforms of these four primitive speech segments yields the audio file of the first phrase "幺三八". Under the naming rule for audio files (the pinyin and tones labeled on the phrase, followed by a file extension), this audio file is named "yao1_san1_ba1.wav" and stored temporarily.

The second phrase "八八八八" is divided into three two-syllable words to be converted, each of which is "八八". The front and middle primitive speech segments of the first "八八" are "ba1_ba1_0.wav" and "ba1_ba1_1.wav", the middle primitive speech segment of the second is "ba1_ba1_1.wav", and the middle and back primitive speech segments of the third are "ba1_ba1_1.wav" and "ba1_ba1_2.wav". Splicing the waveforms of these five primitive speech segments yields the audio file of the second phrase "八八八八", which is named "ba1_ba1_ba1_ba1.wav" under the naming rule for audio files and stored temporarily.

The third phrase "六六六六" is divided into three two-syllable words to be converted, each of which is "六六". The front and middle primitive speech segments of the first "六六" are "liu4_liu4_0.wav" and "liu4_liu4_1.wav", the middle primitive speech segment of the second is "liu4_liu4_1.wav", and the middle and back primitive speech segments of the third are "liu4_liu4_1.wav" and "liu4_liu4_2.wav". Splicing the waveforms of these five primitive speech segments yields the audio file of the third phrase "六六六六", which is named "liu4_liu4_liu4_liu4.wav" under the naming rule for audio files and stored temporarily.
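Splicing the selected segments into a phrase audio file can be sketched with Python's standard `wave` module. This is an illustration under assumptions: the function name is hypothetical, all segments are assumed to be uncompressed WAV files with the same channel count, sample width and sample rate, and the segments are simply appended without any smoothing at the joins.

```python
import wave


def concat_wavs(segment_paths, out_path):
    """Append the frames of each segment file, in order, into out_path."""
    fmt = None
    frames = []
    for path in segment_paths:
        with wave.open(path, "rb") as seg:
            seg_fmt = (seg.getnchannels(), seg.getsampwidth(), seg.getframerate())
            if fmt is None:
                fmt = seg_fmt
            elif seg_fmt != fmt:
                raise ValueError(f"{path}: format differs from the first segment")
            frames.append(seg.readframes(seg.getnframes()))
    if fmt is None:
        raise ValueError("no segments given")
    with wave.open(out_path, "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        out.writeframes(b"".join(frames))
```

For the first phrase this would be called with the four segment files in order and an output path such as "yao1_san1_ba1.wav".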

Step 40: Text audio splicing: the audio files of the phrases are spliced directly, in the order of the phrases in the text to be converted into speech, into a speech file of the text.

When the audio files of the phrases are spliced into the speech file of the text, they can be joined directly. Because there is a pause between phrases in natural speech, however, a silence of suitable length is preferably inserted between the audio files of adjacent phrases as needed.
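Inserting silence between phrases can be sketched on raw PCM data. The function name and parameters are hypothetical, and the sketch assumes mono, signed PCM audio (such as 16-bit WAV data), for which silence is a run of zero bytes; the pause length is left to the caller because the text does not specify one.

```python
def join_with_silence(phrase_pcm, sample_rate, sample_width, pause_ms):
    """Join the PCM byte strings of the phrases, in order, with pause_ms
    milliseconds of silence between adjacent phrases."""
    silent_frames = sample_rate * pause_ms // 1000
    silence = b"\x00" * (silent_frames * sample_width)
    return silence.join(phrase_pcm)
```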

In the second embodiment, which builds on the first embodiment and is shown in FIG. 3, the waveform splicing method based on a two-syllable mashup includes the following steps:

Step 01: Audio recording: record two-syllable words read aloud by a professional customer service agent, and save the recordings as original audio files, one per two-syllable word.

Because these audio files are used for waveform splicing, and Chinese has many words that are written differently but pronounced identically, such homophones need to be recorded only once in the original audio files. For example, the two-syllable words "balance" and "jiejie" need only be recorded once. In other words, two-syllable words are counted by pinyin and tone: several words with the same pinyin and tones are treated as the same two-syllable word when recording audio.
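Counting two-syllable words by pinyin and tone can be sketched as a deduplication over a word list. The function name and the sample words are illustrative assumptions, not taken from the source: "公式" and "公事" are both read "gong1 shi4", so they share one recording.

```python
def recording_list(lexicon):
    """Return the distinct recordings needed for a word list.

    lexicon is an iterable of (word, pinyin_with_tones) pairs; words with
    identical pinyin and tones collapse into a single entry.
    """
    return sorted({pinyin.replace(" ", "_") for _, pinyin in lexicon})
```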

Step 02: Silent segment segmentation: cut off the silent parts before and after the speech in the original audio file, and save the remaining pronunciation part as the standard audio of the two-syllable word.

The original audio generally contains silent parts. Its waveform is shown in FIG. 4: the part with large fluctuations in the middle is the pronunciation part, and the parts with small fluctuations at both ends are the silent parts. After the silent parts are cut off, the waveform of the standard audio is as shown in FIG. 5.
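Cutting off the silent parts can be sketched with a simple amplitude threshold. The threshold approach and the function name are assumptions, since the text does not say how the silent parts are detected; the audio is taken to be a list of signed samples.

```python
def trim_silence(samples, threshold):
    """Drop leading and trailing samples whose magnitude does not exceed
    the threshold, keeping everything between the first and last loud sample."""
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        return []
    return samples[loud[0]:loud[-1] + 1]
```

Quiet samples inside the pronunciation part are kept, because only the two ends are trimmed.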

Steps 10 to 40 are the same as those in the first embodiment, and details are not described herein again.

Secondly, the present application proposes a waveform splicing apparatus based on a two-syllable mashup. The apparatus 20 can be divided into one or more modules.

For example, FIG. 6 shows a structural diagram of a first embodiment of the waveform splicing apparatus 20 based on a two-syllable mashup. In this embodiment, the apparatus 20 may be divided into a sound library production module 201, a text preprocessing module 202, a phrase waveform splicing module 203, and a text audio splicing module 204. The functions of the modules 201-204 are described below.

The sound library production module 201 is configured to divide the standard audio of a two-syllable word into front, middle, and back audio segments according to the vowels, and to save each segment to the sound library as a primitive speech segment required for waveform splicing;

The text preprocessing module 202 is configured to regularize the text to be converted into speech, segment the regularized text into phrases according to speaking rules, and label the pinyin and tones;

The phrase waveform splicing module 203 is configured to take each phrase obtained by segmentation as a unit, treat every two adjacent words in the phrase as a two-syllable word to be converted, search the sound library for the front and middle primitive speech segments of the first two-syllable word to be converted in the phrase, the middle and back primitive speech segments of the last two-syllable word to be converted, and the middle primitive speech segment of every other two-syllable word to be converted, and splice the primitive speech segments found, in the order of the two-syllable words in the phrase, into an audio file of the phrase;

The text audio splicing module 204 is configured to splice the audio files of the phrases directly, in the order of the phrases in the text to be converted into speech, into a speech file of the text.

As another example, FIG. 7 shows a structural diagram of a second embodiment of the waveform splicing apparatus 20 based on a two-syllable mashup. In this embodiment, the apparatus 20 may be divided into the sound library production module 201, the text preprocessing module 202, the phrase waveform splicing module 203, the text audio splicing module 204, an audio recording module 205, and a silent segment segmentation module 206.

The modules 201-204 are the same as those in the first embodiment, and details are not described herein again.

The audio recording module 205 is configured to record two-syllable words read aloud by a professional customer service agent, and to save the recordings as original audio files, one per two-syllable word;

The silent segment segmentation module 206 is configured to cut off the silent parts before and after the speech in the original audio file, and to save the remaining pronunciation part as the standard audio of the two-syllable word.

Thirdly, this application also proposes a computer device.

FIG. 8 is a schematic diagram of a hardware architecture of a computer device according to an embodiment of the present application. In this embodiment, the computer device 2 is a device capable of automatically performing numerical calculation and/or information processing according to instructions that are set or stored in advance. For example, it can be a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, or a tower server (including an independent server or a server cluster composed of multiple servers). As shown in the figure, the computer device 2 includes at least, but is not limited to, a memory 21, a processor 22, and a network interface 23, which can communicate with each other through a system bus. Specifically:

The memory 21 includes at least one type of computer-readable storage medium, and the readable storage medium includes a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory, etc.), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, such as a hard disk or a memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device 2. Of course, the memory 21 may also include both an internal storage unit of the computer device 2 and an external storage device thereof. In this embodiment, the memory 21 is generally used to store an operating system and various application software installed on the computer device 2, such as a computer program used to implement the waveform splicing method based on a two-syllable mashup. In addition, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.

In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally used to control the overall operation of the computer device 2, for example, to perform control and processing related to data interaction or communication with the computer device 2. In this embodiment, the processor 22 is configured to run program code or process data stored in the memory 21, for example, to run a computer program used to implement the dual-syllable mashup-based waveform splicing method.

The network interface 23 may include a wireless network interface or a wired network interface. The network interface 23 is generally used to establish a communication connection between the computer device 2 and other computer devices. For example, the network interface 23 is configured to connect the computer device 2 and an external terminal through a network, and establish a data transmission channel and a communication connection between the computer device 2 and the external terminal. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communication (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, or Wi-Fi.

It should be noted that FIG. 8 shows only the computer device 2 with components 21-23, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.

In this embodiment, the computer program stored in the memory 21 for implementing the waveform splicing method based on a two-syllable mashup may be executed by one or more processors (the processor 22 in this embodiment) to complete the following steps:

Step 10: Sound library production: the standard audio of a two-syllable word is divided into front, middle, and back audio segments according to the vowels, and each segment is saved to the sound library as a primitive speech segment required for waveform splicing;

Step 20: Text preprocessing: the text to be converted into speech is regularized, the regularized text is segmented into phrases according to speaking rules, and the pinyin and tones are labeled;

Step 30. Phrase waveform splicing: taking each phrase obtained by segmentation as a unit, every two adjacent words in the phrase are treated as a two-syllable word to be converted; the sound library is searched for the front and middle primitive speech segments of the first two-syllable word to be converted in the phrase, the middle and back primitive speech segments of the last two-syllable word to be converted, and the middle primitive speech segment of every other two-syllable word to be converted; the primitive speech segments found are then spliced, in the order of the two-syllable words in the phrase, into an audio file of the phrase;

Step 40: Text audio splicing: the audio files of the phrases are spliced directly, in the order of the phrases in the text to be converted into speech, into a speech file of the text.

In an embodiment, before step 10, the method further includes the following steps:

Step 01: Audio recording: record two-syllable words read aloud by a professional customer service agent, and save the recordings as original audio files, one per two-syllable word;

In addition, the present application provides a computer-readable storage medium. The computer-readable storage medium is a non-volatile readable storage medium storing a computer program that can be executed by at least one processor to implement the operations of the above waveform splicing method or apparatus based on a two-syllable mashup.

The computer-readable storage medium includes a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory, etc.), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the computer-readable storage medium may be an internal storage unit of a computer device, such as a hard disk or a memory of the computer device. In other embodiments, the computer-readable storage medium may also be an external storage device of a computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device. Of course, the computer-readable storage medium may also include both the internal storage unit of the computer device and its external storage device. In this embodiment, the computer-readable storage medium is generally used to store an operating system and various application software installed on a computer device, such as the aforementioned computer program for implementing the waveform splicing method based on a two-syllable mashup. In addition, the computer-readable storage medium can also be used to temporarily store various types of data that have been output or will be output.

Although the specific implementation of the present application is described above, those skilled in the art should understand that this is only an example, and the protection scope of the present application is defined by the appended claims. Those skilled in the art can make various changes or modifications to these embodiments without departing from the principle and essence of this application, but these changes and modifications fall within the protection scope of this application.

Claims

A method for waveform splicing based on a two-syllable mashup, which includes the following steps:

Sound bank production: The standard audio of a two-syllable word is divided into three parts of the front, middle, and back according to the vowels, and each audio is saved to the sound bank as a primitive speech segment required for waveform splicing;

Text preprocessing: regularize the text to be converted into speech, segment the words according to the speaking rules to form phrases, and mark the pinyin and tone;

Phrase waveform splicing: taking each phrase obtained by word segmentation as a unit and each two adjacent words in the phrase as a two-syllable word to be converted, searching the sound bank for the front and middle primitive speech segments of the first two-syllable word to be converted in the phrase, the middle and back primitive speech segments of the last two-syllable word to be converted, and the middle primitive speech segment of each other two-syllable word to be converted; and, following the order of the two-syllable words to be converted in the phrase, splicing the found primitive speech segments in turn into an audio file of the phrase;

Text audio splicing: according to the order of the phrases in the text to be converted into speech, the audio files obtained for the phrases are directly spliced in turn into a speech file of the text.
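
As an illustrative sketch only, and not part of the claims, the segment selection rule of the phrase waveform splicing step can be written as follows. The function name, the underscore-joined key format, and the position labels "front", "middle", and "back" are hypothetical; the handling of a phrase with exactly two syllables (the single two-syllable word is both first and last, so all three sections are used) is an assumption that the claim does not state explicitly.

```python
from typing import List, Tuple


def plan_phrase_segments(syllables: List[str]) -> List[Tuple[str, str]]:
    """Return (two-syllable word, section) pairs for one phrase, in splice order.

    `syllables` holds the pinyin-with-tone of each character, e.g. ["ni3", "hao3"].
    """
    # Every two adjacent syllables form one two-syllable word to be converted.
    pairs = [f"{syllables[i]}_{syllables[i + 1]}" for i in range(len(syllables) - 1)]
    last = len(pairs) - 1
    plan: List[Tuple[str, str]] = []
    for i, pair in enumerate(pairs):
        if i == 0:
            plan.append((pair, "front"))   # only the first word contributes its front section
        plan.append((pair, "middle"))      # every word contributes its middle section
        if i == last:
            plan.append((pair, "back"))    # only the last word contributes its back section
    return plan


if __name__ == "__main__":
    for item in plan_phrase_segments(["ni3", "hao3", "ma5"]):
        print(item)
```

For a three-syllable phrase this yields four segments: the front and middle sections of the first two-syllable word, followed by the middle and back sections of the second.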
The waveform splicing method based on a two-syllable mashup according to claim 1, further comprising the following steps before the sound bank production:

Audio recording: Record the two-syllable words read aloud by professional customer service and save them as the original audio file in units of two-syllable words;

Mute segment division: Cut out the mute part before and after the audio in the original audio file, and save the pronunciation part in the audio as the standard audio of the two-syllable word.
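
One possible way to realize the mute segment division is amplitude thresholding, sketched below. This is an assumption for illustration: the claim does not say how silence is detected, and the function name and the default threshold of 500 (for 16-bit PCM samples) are hypothetical.

```python
from typing import List


def trim_silence(samples: List[int], threshold: int = 500) -> List[int]:
    """Drop leading and trailing samples whose magnitude is at or below `threshold`."""
    voiced = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not voiced:
        return []  # the whole recording is below the threshold
    # Keep everything between the first and last voiced sample, inclusive,
    # so that quiet samples inside the word are preserved.
    return samples[voiced[0]: voiced[-1] + 1]


if __name__ == "__main__":
    print(trim_silence([0, 3, -2, 900, -1200, 40, 800, 1, 0]))
```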
The method according to claim 1 or 2, wherein the file name of the primitive speech segment is formed from the pinyin, tone, and position of the two-syllable word corresponding to the primitive speech segment.
The waveform splicing method based on a two-syllable mashup according to claim 1 or 2, characterized in that, when the audio of a two-syllable word is divided into front, middle, and back sections according to the vowels, the zero point to the left of the highest point in the middle of the utterance waveform of the Chinese vowel is used as the demarcation point.
The waveform splicing method based on a two-syllable mashup according to claim 1 or 2, wherein the text preprocessing specifically includes the following steps:
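
The demarcation rule can be sketched as follows, under the assumption that the sample range of the vowel is already known (for example from manual labeling) and that "highest point" means the largest sample value within that range. The function name and parameters are hypothetical.

```python
from typing import Sequence


def find_demarcation(samples: Sequence[float], vowel_start: int, vowel_end: int) -> int:
    """Return the index of the nearest zero point to the left of the vowel's peak.

    The vowel occupies samples[vowel_start:vowel_end].
    """
    if not 0 <= vowel_start < vowel_end <= len(samples):
        raise ValueError("invalid vowel range")
    # Highest point of the vowel's waveform (first index if there are ties).
    peak = max(range(vowel_start, vowel_end), key=lambda i: samples[i])
    # Walk left from the peak until the waveform is no longer positive,
    # i.e. until it touches or crosses zero.
    i = peak
    while i > 0 and samples[i] > 0:
        i -= 1
    return i


if __name__ == "__main__":
    wave = [0, 2, -1, 0, 3, 7, 4, -2, 1]
    print(find_demarcation(wave, 2, 8))
```

Cutting at a zero point rather than at an arbitrary sample keeps the amplitude continuous where two segments meet, which is presumably why the claim selects it.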

Text regularization: converting non-Chinese and English characters contained in the text according to a preset processing rule;

Text word segmentation: divide the text into several phrases according to Chinese speaking habits, and add a space between each phrase to indicate a pause;

Pinyin labeling: Label the text after the word segmentation with pinyin and tone.
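
A minimal sketch of one possible preset rule for the text regularization step is given below: ASCII digits are read out one by one as Chinese numerals. The rule itself and the names `DIGITS` and `regularize` are hypothetical; the claim only requires that some preset processing rule be applied, and word segmentation and pinyin labeling are not shown.

```python
import re

# Chinese numerals for the digits 0 through 9, indexed by digit value.
DIGITS = "零一二三四五六七八九"


def regularize(text: str) -> str:
    """Replace every ASCII digit with the corresponding Chinese numeral."""
    return re.sub(r"[0-9]", lambda m: DIGITS[int(m.group())], text)


if __name__ == "__main__":
    print(regularize("2018年"))
```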
The method according to claim 3, characterized in that, in the phrase waveform splicing, the sound bank is first searched, according to the pinyin and tone marked on each two-syllable word to be converted, for the primitive speech segments whose file names contain that pinyin and tone; then, according to the splicing rule, the primitive speech segment whose file name contains the corresponding section is obtained from the segments found.
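
The two-stage lookup by file name can be sketched as follows. The naming scheme `<pinyin+tone>_<pinyin+tone>_<section>.wav` (for example `ni3_hao3_front.wav`) is an assumption made for illustration; the claims state only that the file name carries the pinyin, tone, and position.

```python
from typing import Iterable, Optional


def find_segment(bank_files: Iterable[str], disyllable: str, section: str) -> Optional[str]:
    """Look up one primitive speech segment by file name.

    `disyllable` is the pinyin-with-tone key of the word, e.g. "ni3_hao3";
    `section` is "front", "middle", or "back".
    """
    # Stage 1: keep the files whose names carry the marked pinyin and tones.
    candidates = [name for name in bank_files if name.startswith(disyllable + "_")]
    # Stage 2: among those, pick the file whose name carries the required section.
    for name in candidates:
        if name == f"{disyllable}_{section}.wav":
            return name
    return None  # the sound bank has no such segment


if __name__ == "__main__":
    bank = ["ni3_hao3_front.wav", "ni3_hao3_middle.wav", "ni3_hao3_back.wav"]
    print(find_segment(bank, "ni3_hao3", "middle"))
```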
A waveform splicing device based on a two-syllable mashup, which includes:

A sound bank production module, which is used to divide the standard audio of a two-syllable word into three parts of front, middle, and back according to the vowels, and each piece of audio is saved to the sound bank as the primitive speech segment required for waveform splicing;

A text preprocessing module, which is used to regularize the text to be converted into speech, segment the words according to the speaking rules to form phrases, and mark the pinyin and tone;

A phrase waveform splicing module, which is used to take each phrase obtained by word segmentation as a unit and each two adjacent words in the phrase as a two-syllable word to be converted, to search the sound bank for the front and middle primitive speech segments of the first two-syllable word to be converted in the phrase, the middle and back primitive speech segments of the last two-syllable word to be converted, and the middle primitive speech segment of each other two-syllable word to be converted, and, following the order of the two-syllable words to be converted in the phrase, to splice the found primitive speech segments in turn into an audio file of the phrase;

A text audio splicing module, which is used to directly splice the audio files obtained for the phrases, in turn, into a speech file of the text according to the order of the phrases in the text to be converted into speech.
The waveform splicing device based on a two-syllable mashup according to claim 7, further comprising:

An audio recording module, which is used to record the two-syllable words read aloud by professional customer service and save them as the original audio file in units of two-syllable words;

A mute segment segmentation module, which is used to cut out the mute part before and after the audio in the original audio file, and save the pronunciation part in the audio as the standard audio of the two-syllable word.
A computer device includes a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the following steps are implemented:

Sound bank production: The standard audio of a two-syllable word is divided into three parts of the front, middle, and back according to the vowels, and each audio is saved to the sound bank as a primitive speech segment required for waveform splicing;

Text preprocessing: regularize the text to be converted into speech, segment the words according to the speaking rules to form phrases, and mark the pinyin and tone;

Phrase waveform splicing: taking each phrase obtained by word segmentation as a unit and each two adjacent words in the phrase as a two-syllable word to be converted, searching the sound bank for the front and middle primitive speech segments of the first two-syllable word to be converted in the phrase, the middle and back primitive speech segments of the last two-syllable word to be converted, and the middle primitive speech segment of each other two-syllable word to be converted; and, following the order of the two-syllable words to be converted in the phrase, splicing the found primitive speech segments in turn into an audio file of the phrase;

Text audio splicing: according to the order of the phrases in the text to be converted into speech, the audio files obtained for the phrases are directly spliced in turn into a speech file of the text.
The computer device according to claim 9, wherein the following steps are further implemented before the sound bank production:

Audio recording: Record the two-syllable words read aloud by professional customer service and save them as the original audio file in units of two-syllable words;

Mute segment division: Cut out the mute part before and after the audio in the original audio file, and save the pronunciation part in the audio as the standard audio of the two-syllable word.
The computer device according to claim 9 or 10, wherein the file name of the primitive speech segment is formed from the pinyin, tone, and position of the two-syllable word corresponding to the primitive speech segment.
The computer device according to claim 9 or 10, characterized in that, when the audio of a two-syllable word is divided into front, middle, and back sections according to the vowels, the zero point to the left of the highest point in the middle of the utterance waveform of the Chinese vowel is used as the demarcation point.
The computer device according to claim 9 or 10, wherein the text preprocessing specifically includes the following steps:

Text regularization: converting non-Chinese and English characters contained in the text according to a preset processing rule;

Text word segmentation: divide the text into several phrases according to Chinese speaking habits, and add a space between each phrase to indicate a pause;

Pinyin labeling: Label the text after the word segmentation with pinyin and tone.
The computer device according to claim 11, characterized in that, in the phrase waveform splicing, the sound bank is first searched, according to the pinyin and tone marked on each two-syllable word to be converted, for the primitive speech segments whose file names contain that pinyin and tone; then, according to the splicing rule, the primitive speech segment whose file name contains the corresponding section is obtained from the segments found.
A computer-readable storage medium is characterized in that a computer program is stored in the computer-readable storage medium, and the computer program can be executed by at least one processor to implement the following steps:

Sound bank production: The standard audio of a two-syllable word is divided into three parts of the front, middle, and back according to the vowels, and each audio is saved to the sound bank as a primitive speech segment required for waveform splicing;

Text preprocessing: regularize the text to be converted into speech, segment the words according to the speaking rules to form phrases, and mark the pinyin and tone;

Phrase waveform splicing: taking each phrase obtained by word segmentation as a unit and each two adjacent words in the phrase as a two-syllable word to be converted, searching the sound bank for the front and middle primitive speech segments of the first two-syllable word to be converted in the phrase, the middle and back primitive speech segments of the last two-syllable word to be converted, and the middle primitive speech segment of each other two-syllable word to be converted; and, following the order of the two-syllable words to be converted in the phrase, splicing the found primitive speech segments in turn into an audio file of the phrase;

Text audio splicing: according to the order of the phrases in the text to be converted into speech, the audio files obtained for the phrases are directly spliced in turn into a speech file of the text.
The computer-readable storage medium according to claim 15, wherein the following steps are further implemented before the sound bank production:

Audio recording: Record the two-syllable words read aloud by professional customer service and save them as the original audio file in units of two-syllable words;

Mute segment division: Cut out the mute part before and after the audio in the original audio file, and save the pronunciation part in the audio as the standard audio of the two-syllable word.
The computer-readable storage medium according to claim 15 or 16, characterized in that the file name of the primitive speech segment is named after the pinyin, tone, and position of the two-syllable word corresponding to the primitive speech segment.
The computer-readable storage medium according to claim 15 or 16, characterized in that, when the audio of a two-syllable word is divided into front, middle, and back sections according to the vowels, the zero point to the left of the highest point in the middle of the utterance waveform of the Chinese vowel is used as the demarcation point.
The computer-readable storage medium according to claim 15 or 16, wherein the text preprocessing specifically includes the following steps:

Text regularization: converting non-Chinese and English characters contained in the text according to a preset processing rule;

Text word segmentation: divide the text into several phrases according to Chinese speaking habits, and add a space between each phrase to indicate a pause;

Pinyin labeling: Label the text after the word segmentation with pinyin and tone.
The computer-readable storage medium according to claim 17, wherein, in the phrase waveform splicing, the sound bank is first searched, according to the pinyin and tone marked on each two-syllable word to be converted, for the primitive speech segments whose file names contain that pinyin and tone; then, according to the splicing rule, the primitive speech segment whose file name contains the corresponding section is obtained from the segments found.