WO2020088006A1 - Speech synthesis method, device, and apparatus - Google Patents

Speech synthesis method, device, and apparatus Download PDF

Info

Publication number
WO2020088006A1
Authority
WO
WIPO (PCT)
Prior art keywords
syllable
sampling points
syllables
sound intensity
speech
Prior art date
Application number
PCT/CN2019/098086
Other languages
French (fr)
Chinese (zh)
Inventor
韩喆 (Han Zhe)
陈力 (Chen Li)
吴军 (Wu Jun)
Original Assignee
阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Publication of WO2020088006A1 publication Critical patent/WO2020088006A1/en

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Definitions

  • the invention relates to the technical field of speech synthesis, in particular to a method, device and equipment for speech synthesis.
  • Voice broadcast has applications in many areas of life, such as automatic broadcast of the amount received when using Alipay or WeChat payment, and an intelligent broadcast system used in public places such as supermarkets and stations.
  • voice broadcasting requires speech synthesis technology, that is, stitching together the words of different syllables to form the passage that needs to be broadcast.
  • among current technologies, some make the broadcast speech sound natural but require high processing power from the device, while others have low processing requirements but sound unnatural.
  • the present invention provides a method, device and equipment for voice splicing.
  • this specification provides a method of speech synthesis, which includes:
  • the specified sampling points of the previous syllable are the last N sampling points of that syllable, and the specified sampling points of the following syllable are the first N sampling points of that syllable, where N is an integer
  • the sound intensity data of the specified sampling points of the two syllables are processed to obtain the synthesized speech.
  • this specification provides a speech synthesis device, which includes:
  • an acquiring unit, which acquires a voice file for each syllable in the text of the speech to be synthesized, the voice file storing sound intensity data of the sampling points of the syllable, and which acquires the sound intensity data of specified sampling points from the voice files of two adjacent syllables; the specified sampling points of the previous syllable are the last N sampling points of that syllable, and the specified sampling points of the following syllable are the first N sampling points of that syllable, where N is an integer;
  • a processing unit, which processes the sound intensity data of the specified sampling points of the two syllables to obtain the synthesized speech.
  • this specification also provides a speech synthesis device, the speech synthesis device includes: a processor and a memory;
  • the memory is used to store executable computer instructions
  • the specified sampling points of the previous syllable are the last N sampling points of that syllable, and the specified sampling points of the following syllable are the first N sampling points of that syllable, where N is an integer
  • the sound intensity data of the specified sampling points of the two syllables are processed to obtain the synthesized speech.
  • FIG. 1 is a flowchart of a speech synthesis method shown in an exemplary embodiment of the present specification
  • FIG. 2 is a schematic diagram of a speech synthesis method shown in an exemplary embodiment of the present specification
  • FIG. 3 is a logic block diagram of a speech synthesis device according to an exemplary embodiment of this specification
  • FIG. 4 is a logic block diagram of a speech synthesis apparatus according to an exemplary embodiment of this specification.
  • although the terms first, second, third, etc. may be used in the present invention to describe various information, the information should not be limited by these terms; these terms are only used to distinguish information of the same type from one another.
  • for example, first information may also be referred to as second information and, similarly, second information may also be referred to as first information.
  • depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
  • Voice broadcasts are widely used in various fields of life, such as the broadcast of train number information in stations, the broadcast of merchandise promotion information in supermarkets, and the current arrival broadcast when paying by Alipay.
  • voice broadcasting requires speech synthesis technology, that is, stitching together the words of different syllables to form the passage that needs to be broadcast.
  • some methods of speech synthesis are based on deep learning models to generate simulated speech.
  • the speech synthesized by this method sounds natural, but because it requires large amounts of training and computing resources, it is difficult to run on systems with weak processing power, such as embedded systems.
  • the main method is splicing, that is, the pronunciation of each word is recorded first, and then the recorded pronunciation of each word of the sentence to be played is played back in sequence.
  • this method places low demands on the processing capacity of the speech synthesis system, but the synthesized speech is of relatively poor quality and sounds unnatural.
  • S106: process the sound intensity data of the specified sampling points of the two syllables to obtain synthesized speech.
  • the voice file of each syllable in the text will be obtained according to the content of the text.
  • the voice file can be stored locally, so that the speech synthesis device obtains it directly; in other cases, the voice file can be stored in the cloud and downloaded by the device when needed.
  • the voice file can be a recording of different syllables made in advance, in WAV, MP3, or another format.
  • when a syllable is recorded, the analog sound signal is sampled and converted into binary sample data to obtain the final voice file.
  • each syllable can be recorded separately, or in the form of a word or idiom.
  • for example, the five syllables of the phrase "我喜欢跑步" ("I like to run"), namely "我", "喜", "欢", "跑" and "步", can be recorded and saved as five separate voice files.
  • the text to be synthesized may also be subjected to word segmentation processing before the voice files are obtained, so that the segmentation result is used to look up the corresponding syllable voice files.
  • for example, if the text to be synthesized is "我们在吃饭" ("we are eating") and the saved voice files were recorded in the word form "我们", "在", "吃饭", the text is first segmented so that the voice file of each corresponding word or character can be found. The segmentation can be completed by a word segmentation algorithm; after segmentation, "我们在吃饭" is divided into "我们", "在", "吃饭", and the voice files of these three units are then obtained for subsequent speech synthesis.
  • the word segmentation of the text may be completed by the server. Since the device's voice files are downloaded from the server, the voice files saved on the server are consistent with those on the device, so the server can segment the text to be synthesized according to the voice files and then send the segmented text to the device.
  • if the text of the speech to be synthesized is Chinese text, storing the pinyin of every Chinese character would make the voice files very large and consume memory; instead, only the four tones of the Chinese syllables need be stored, which reduces the size of the stored voice files and saves memory.
  • the voice file records the audio duration of the syllable, the sound intensity data of the sampling points, the sampling frequency, the sampling precision, and/or the number of sampling points.
  • the audio duration is the pronunciation duration of each syllable and characterizes its length; the shorter the audio duration, the shorter the syllable's pronunciation.
  • the sampling frequency is the number of sound intensity samples collected per second; for example, a sampling frequency of 48K means that 48K sound intensity values are collected per second.
  • the sampling precision is the resolution with which the capture card processes sound and reflects the accuracy of the sound waveform amplitude (that is, the sound intensity); the higher the sampling precision, the more realistic the recorded and replayed sound.
  • the sampling precision is also called the number of sampling bits. Since the sound signal is saved in binary form, it can be stored with 8 or 16 bits: with 8 bits, the collected sound intensity values lie between 0 and 255; with 16 bits, they lie between 0 and 65535. More bits give higher sound quality but require more storage space.
  • before processing, the sound intensity data are usually normalized; for example, with a sampling precision of 8 bits the sound intensity values lie between 0 and 255, and they are normalized to values between 0 and 1 to facilitate subsequent processing.
  • the sound intensity data of the specified sampling points of two adjacent syllables can then be obtained from their voice files, where the specified sampling points of the previous syllable are the last N sampling points of that syllable and the specified sampling points of the following syllable are the first N sampling points of that syllable, with N an integer.
  • after the sound intensity data of the last N sampling points of the previous syllable and the first N sampling points of the following syllable are processed, the synthesized speech is obtained.
  • FIG. 2 is a schematic diagram of a text in speech synthesis.
  • the sound intensity of the specified sampling points of the previous syllable and of the following syllable can be processed pair by pair to obtain the synthesized speech
  • the 4.5% and 5% in the figure represent the ratio of the number of processed sampling points to the number of sampling points of the previous syllable.
  • the number N of sampling points to be processed may be calculated based on whether the two adjacent syllables form a word or a four-character idiom, the numbers of sampling points of the two adjacent syllables, the average sound intensity of the last M1 sampling points of the previous syllable, and/or the average sound intensity of the first M2 sampling points of the following syllable, where M1 and M2 are integers. If two syllables can form a word or idiom, somewhat more sampling points can be processed, so N can be determined according to whether the two adjacent syllables can form a word.
  • the sound intensity at the beginning and end of each syllable is also a factor that deserves attention during processing, so when calculating N it is also possible to use the average sound intensity of the last M1 sampling points of the previous syllable or the average sound intensity of the first M2 sampling points of the following syllable.
  • when the sampling frequency is fixed, the number of sampling points reflects the audio duration of each syllable, and the difference between the audio durations of two adjacent syllables has a considerable influence on the synthesized speech: a large difference means the two syllables differ in emphasis and speed and more sampling points need to be processed, while a small difference means fewer sampling points need to be processed. Therefore, the numbers of sampling points of the syllables can also be considered when calculating N.
  • to account for the silent gap between adjacent syllables, the average sound intensity at the beginning and at the end of the two adjacent syllables can also be considered when calculating the number of sampling points to be processed.
  • the average sound intensity at the end can be obtained by averaging the last M1 sampling points of the previous syllable, and the average sound intensity at the beginning by averaging the first M2 sampling points of the following syllable, where M1 and M2 can be set according to the characteristics of the syllables themselves
  • for example, M1 may be 10% of the total number of sampling points of the previous syllable and M2 may be 5% of the total number of sampling points of the following syllable, or M1 may be 1000 and M2 may be 2000; this specification places no limitation on these values.
  • in one embodiment, after repeated experiments by the applicant, to achieve a good synthesis effect with no obvious pause between the syllables after synthesis, M1 can be taken as 20% of the total number of audio sampling points of the previous syllable and M2 as 20% of the total number of audio sampling points of the following syllable.
  • the number N of sampling points to be processed can be calculated by the following formula:
  • Nw indicates whether the current two adjacent syllables form a word or a four-character idiom
  • SNpre indicates the number of sampling points of the previous syllable
  • SNnext indicates the number of sampling points of the following syllable
  • the tail average sound intensity (pre) indicates the average sound intensity of the last M1 sampling points of the previous syllable; the head average sound intensity (next) indicates the average sound intensity of the first M2 sampling points of the following syllable
  • M1 and M2 are integers.
  • when calculating the number N of sampling points to be processed, whether the two adjacent syllables form a word or idiom can be considered. To simplify the calculation, this influence factor is quantified: different values of Nw represent whether the two adjacent syllables form a word or idiom. Generally, if the two adjacent syllables can form a word, Nw is larger than when they cannot.
  • in order to achieve a better synthesis effect, if the two adjacent syllables form one word, Nw is taken as 2; if they are not in one word or four-character idiom, Nw is taken as 1; and if they are in one four-character idiom, Nw is taken as 2.
  • of course, the value of Nw can be set according to the specific circumstances, and this specification places no limitation on it.
  • the specific way of processing the sound intensities of the specified sampling points of the two syllables can be chosen according to the characteristics of the syllables.
  • in some embodiments, the sound intensity of the last N sampling points of the previous syllable is directly added to the sound intensity of the first N sampling points of the following syllable to obtain the superimposed sound intensity.
  • for example, suppose the sound intensities of the last five sampling points of the previous syllable and the first five sampling points of the following syllable are to be processed, the intensities of the last five sampling points of the previous syllable being 0.15, 0.10, 0.05, 0.03 and 0.01 and the intensities of the first five sampling points of the following syllable being 0.005, 0.01, 0.04, 0.06 and 0.07
  • the intensities of the superimposed portion of the processed speech are then 0.155, 0.11, 0.09, 0.09 and 0.08.
  • the sound intensity of the last N sampling points of the previous syllable and the sound intensity of the first N sampling points of the following syllable can also each be multiplied by preset weights and then added to obtain the superimposed sound intensity, the preset weights being set based on the order of the syllables and the order of the sampling points.
  • in the earlier part of the processed region the previous syllable should carry more weight, so its weight can be larger there; in the later part the following syllable should carry more weight, so its weight can be larger there.
  • for example, suppose the last five sampling points of the previous syllable and the first five sampling points of the following syllable are to be processed. The intensities of the last five sampling points of the previous syllable are 0.5, 0.4, 0.3, 0.2 and 0.1, with weights of 90%, 80%, 70%, 60% and 50% respectively, and the intensities of the first five sampling points of the following syllable are 0.1, 0.2, 0.3, 0.4 and 0.5, with weights of 10%, 20%, 30%, 40% and 50% respectively. The processed sound intensities are then 0.5 × 90% + 0.1 × 10%, 0.4 × 80% + 0.2 × 20%, 0.3 × 70% + 0.3 × 30%, 0.2 × 60% + 0.4 × 40% and 0.1 × 50% + 0.5 × 50%, namely 0.46, 0.36, 0.3, 0.28 and 0.3.
  • to ensure that the processed syllables do not sound broken, the sound intensity of the specified sampling points to be processed is generally kept small.
  • in one embodiment, the ratio of the sound intensity of each specified sampling point to the maximum sound intensity among the sampling points of that syllable is less than 0.5; for example, if the loudest sampling point of a syllable has a sound intensity of 1, then every specified sampling point to be processed has a sound intensity below 0.5.
  • suppose a voice device needs to synthesize the phrase "我喜欢跑步" ("I like to run").
  • five voice files with the pronunciations of the five Chinese characters "我", "喜", "欢", "跑" and "步" were recorded in advance and stored on the server, and the configuration information of each voice file is recorded at its beginning.
  • the sampling frequency is 48K and the sampling precision is 16 bits
  • the audio durations of "我", "喜", "欢", "跑" and "步" are 1 s, 0.5 s, 1 s, 1.5 s and 0.8 s respectively.
  • after receiving the text "我喜欢跑步" to be synthesized, the speech synthesis device downloads the five syllable voice files from the server and then processes consecutive syllable pairs one by one in text order. For example, to process "我" and "喜" first, the sound intensities of the last sampling points of "我" and the first sampling points of "喜" must be processed; before processing, the number of sampling points to be processed is calculated according to the formula
  • the different values of Nw indicate whether the current two adjacent syllables form a word or four-character idiom: if the two adjacent syllables form one word, Nw is taken as 2; if they are not in one word or four-character idiom, Nw is taken as 1; and if they are in one four-character idiom, Nw is taken as 2.
  • SNpre represents the number of sampling points of the previous syllable
  • SNnext represents the number of sampling points of the following syllable
  • the tail average sound intensity (pre) represents the average sound intensity of the last 20% of the sampling points of the previous syllable, and the head average sound intensity (next) represents the average sound intensity of the first 20% of the sampling points of the following syllable; M1 and M2 are integers.
  • substituting these data into the formula gives 711 sampling points to be processed; that is, the sound intensity data of the last 711 sampling points are obtained from the voice file of the syllable "我" and the sound intensity data of the first 711 sampling points are obtained from the voice file of the syllable "喜", and the obtained sound intensity data are then added directly to obtain the processed sound intensity.
  • the pairs "喜" and "欢", "欢" and "跑", and "跑" and "步" are processed in the same way, giving the synthesized speech for the text "我喜欢跑步".
  • in another example, the text the voice device needs to synthesize is "我们爱天安门" ("We love Tiananmen").
  • the voice files were recorded in word form, i.e. they comprise the three units "我们" ("we"), "爱" ("love") and "天安门" ("Tiananmen")
  • the voice files were downloaded from the server in advance and saved in a local directory of the voice device.
  • after receiving the text "我们爱天安门" to be synthesized, the server performs word segmentation on the text according to the form of the voice files; the segmentation may be completed by a word segmentation algorithm, dividing the text into "我们 / 爱 / 天安门", and the segmented text is then sent to the speech synthesis device.
  • after receiving the text, the speech synthesis device first obtains the voice files of the three units "我们", "爱" and "天安门", in which the sampling frequency is 48K, the sampling precision is 8 bits, and the audio durations of the three pronunciations are 2 s, 1 s and 3 s. "我们" and "爱" are processed first; before processing, the number of sampling points to be processed is calculated according to the formula
  • the different values of Nw indicate whether the current two adjacent syllables form a word or four-character idiom: if the two adjacent syllables form one word, Nw is taken as 2; if they are not in one word or four-character idiom, Nw is taken as 1; and if they are in one four-character idiom, Nw is taken as 2.
  • SNpre represents the number of sampling points of the previous syllable
  • SNnext represents the number of sampling points of the following syllable
  • the tail average sound intensity (pre) represents the average sound intensity of the last 15% of the sampling points of the previous syllable, and the head average sound intensity (next) represents the average sound intensity of the first 15% of the sampling points of the following syllable; M1 and M2 are integers.
  • Nw is taken as 1 here, since "我们" and "爱" do not form one word.
  • substituting these data into the formula gives 5689 sampling points to be processed; that is, the sound intensity data of the last 5689 sampling points of "我们" and the sound intensity data of the first 5689 sampling points of "爱" are obtained from the voice files.
  • the speech synthesis device 300 includes:
  • the obtaining unit 301 obtains the voice file of each syllable in the text of the speech to be synthesized, the voice file storing the sound intensity data of the sampling points of the syllable, and obtains the sound intensity data of specified sampling points from the voice files of two adjacent syllables; the specified sampling points of the previous syllable are the last N sampling points of that syllable, and the specified sampling points of the following syllable are the first N sampling points of that syllable, where N is an integer;
  • the processing unit 302 processes the sound intensity data of the specified sampling points of the two syllables to obtain synthesized speech.
  • the voice file records: the audio duration of the syllable, the sound intensity data of the sampling points, the sampling frequency, the sampling precision, and/or the number of sampling points.
  • processing the sound intensity data of the specified sampling points of the two syllables specifically includes the direct or weighted superposition described above.
  • when the text of the speech to be synthesized is Chinese, the voice files are recorded with the four tones of the syllables of the Chinese characters.
  • the ratio of the sound intensity of each specified sampling point to the maximum sound intensity among the sampling points of the syllable is less than 0.5.
  • N is calculated based on whether the two adjacent syllables form a word or a four-character idiom, the numbers of sampling points of the two adjacent syllables, the average sound intensity of the last M1 sampling points of the previous syllable, and/or the average sound intensity of the first M2 sampling points of the following syllable, where M1 and M2 are integers
  • M1 is 20% of the total number of audio sampling points of the previous syllable, and M2 is 20% of the total number of audio sampling points of the following syllable.
  • if the two adjacent syllables form one word, the conversion coefficient is 2; if they are not in one word or four-character idiom, the conversion coefficient is 1; and if they are in one four-character idiom, the conversion coefficient is 2.
  • Nw indicates whether the current two adjacent syllables form a word or a four-character idiom
  • SNpre indicates the number of sampling points of the previous syllable
  • SNnext indicates the number of sampling points of the following syllable
  • the tail average sound intensity (pre) indicates the average sound intensity of the last M1 sampling points of the previous syllable; the head average sound intensity (next) indicates the average sound intensity of the first M2 sampling points of the following syllable
  • M1 and M2 are integers.
  • before acquiring the voice files of each syllable in the text of the speech to be synthesized, the method further includes: performing word segmentation processing on the text.
  • the word segmentation processing of the text is done by the server.
  • for relevant parts, reference may be made to the description of the method embodiments.
  • the device embodiments described above are only schematic; units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solution in this specification, and those of ordinary skill in the art can understand and implement it without creative effort.
  • the speech synthesis device includes: a processor 401 and a memory 402;
  • the memory is used to store executable computer instructions
  • the specified sampling points of the previous syllable are the last N sampling points of that syllable, and the specified sampling points of the following syllable are the first N sampling points of that syllable, where N is an integer
  • the sound intensity data of the specified sampling points of the two syllables are processed to obtain synthesized speech.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Document Processing Apparatus (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

A speech synthesis method, a device, an apparatus, and a storage medium. The method comprises: acquiring a voice file of each syllable in a text awaiting speech synthesis, the voice file storing sound intensity data of sampling points of a given syllable (S102); acquiring sound intensity data of specified sampling points from the voice files of two adjacent syllables, respectively, wherein the specified sampling points of the leading syllable are the last N sampling points of that syllable, the specified sampling points of the trailing syllable are the first N sampling points of that syllable, and N is an integer (S104); and processing the sound intensities of the specified sampling points of the two syllables to obtain synthesized speech data (S106). By processing the specified sampling points at the tail and head of two adjacent syllables, the invention achieves more natural speech synthesis; and because only a portion of the sampling points of adjacent syllables undergoes simple processing, excessive computation is avoided, ensuring applicability to apparatuses with low processing power, such as embedded apparatuses.

Description

Speech synthesis method, device, and equipment
Technical Field
The present invention relates to the technical field of speech synthesis, and in particular to a speech synthesis method, device, and equipment.
Background Art
Voice broadcasting is used in many areas of daily life, for example the automatic announcement of the received amount when paying with Alipay or WeChat Pay, and the intelligent announcement systems used in public places such as supermarkets and railway stations. Voice broadcasting requires speech synthesis technology, that is, stitching together the words of different syllables to form the passage that needs to be broadcast. Among current technologies for producing broadcast speech, some can make the broadcast speech sound natural but demand high processing power from the device, while others demand little processing power but sound unnatural.
Summary of the Invention
To overcome the problems in the related art, the present invention provides a speech splicing method, device, and equipment.
First, this specification provides a speech synthesis method, the method comprising:
acquiring a voice file of each syllable in the text of the speech to be synthesized, the voice file storing sound intensity data of the sampling points of the syllable;
acquiring the sound intensity data of specified sampling points from the voice files of two adjacent syllables, respectively, wherein the specified sampling points of the previous syllable are the last N sampling points of that syllable, and the specified sampling points of the following syllable are the first N sampling points of that syllable, N being an integer;
processing the sound intensity data of the specified sampling points of the two syllables to obtain synthesized speech.
Secondly, this specification provides a speech synthesis device, the device comprising:
an acquiring unit, which acquires a voice file of each syllable in the text of the speech to be synthesized, the voice file storing sound intensity data of the sampling points of the syllable, and which acquires the sound intensity data of specified sampling points from the voice files of two adjacent syllables, respectively, wherein the specified sampling points of the previous syllable are the last N sampling points of that syllable, and the specified sampling points of the following syllable are the first N sampling points of that syllable, N being an integer;
a processing unit, which processes the sound intensity data of the specified sampling points of the two syllables to obtain synthesized speech.
In addition, this specification provides a speech synthesis apparatus, the apparatus comprising a processor and a memory;
the memory is used to store executable computer instructions;
the processor implements the following steps when executing the computer instructions:
acquiring a voice file of each syllable in the text of the speech to be synthesized, the voice file storing sound intensity data of the sampling points of the syllable;
acquiring the sound intensity data of specified sampling points from the voice files of two adjacent syllables, respectively, wherein the specified sampling points of the previous syllable are the last N sampling points of that syllable, and the specified sampling points of the following syllable are the first N sampling points of that syllable, N being an integer;
processing the sound intensity data of the specified sampling points of the two syllables to obtain synthesized speech.
Beneficial effects of this specification: during speech synthesis, the sound intensities of the specified sampling points at the tail of the previous syllable and the head of the following syllable of each adjacent pair are processed, making the synthesized speech more natural. Moreover, since no training of a learning model is needed and only a portion of the sampling points of adjacent syllables undergoes simple processing, high-intensity computation is avoided, making the solution widely applicable and suitable for devices with low processing capacity, such as embedded devices.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory and do not limit the present invention.
Brief Description of the Drawings
The drawings herein are incorporated into and constitute a part of this specification, show embodiments consistent with the present invention, and together with the specification serve to explain the principles of the present invention.
FIG. 1 is a flowchart of a speech synthesis method according to an exemplary embodiment of this specification;
FIG. 2 is a schematic diagram of a speech synthesis method according to an exemplary embodiment of this specification;
FIG. 3 is a logic block diagram of a speech synthesis device according to an exemplary embodiment of this specification;
FIG. 4 is a logic block diagram of a speech synthesis apparatus according to an exemplary embodiment of this specification.
Detailed Description
Exemplary embodiments will be described in detail here, examples of which are shown in the drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of devices and methods consistent with some aspects of the invention as detailed in the appended claims.
The terminology used in the present invention is for the purpose of describing specific embodiments only and is not intended to limit the present invention. The singular forms "a", "said", and "the" used in the present invention and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the present invention to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present invention, first information may also be referred to as second information and, similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
Voice broadcasting is widely used in many areas of daily life, such as the announcement of train information in railway stations, the broadcast of merchandise promotions in supermarkets, and the now-common arrival announcement when paying with Alipay. Voice broadcasting requires speech synthesis technology, that is, stitching together the words of different syllables to form the passage that needs to be broadcast. Some current speech synthesis methods are based on deep learning models that generate simulated speech; the speech synthesized in this way sounds natural, but because large amounts of training and computing resources are required, such methods are difficult to run on systems with weak processing power, such as embedded systems. For such systems, the main current method is splicing: the pronunciation of each word is recorded first, and then the recorded pronunciation of each word of the sentence to be played is played back in sequence. This places low demands on the processing capacity of the speech synthesis system, but the synthesized speech is of relatively poor quality and sounds unnatural.
To solve the problem that speech synthesized by the splicing method sounds unnatural, this specification provides a speech synthesis method, which can be used in a device implementing speech synthesis. The flowchart of the speech synthesis method is shown in FIG. 1 and comprises steps S102 to S106:
S102: acquire a voice file of each syllable in the text of the speech to be synthesized, the voice file storing sound intensity data of the sampling points of the syllable;
S104: acquire the sound intensity data of specified sampling points from the voice files of two adjacent syllables, respectively, wherein the specified sampling points of the previous syllable are the last N sampling points of that syllable, and the specified sampling points of the following syllable are the first N sampling points of that syllable, N being an integer;
S106: process the sound intensity data of the specified sampling points of the two syllables to obtain synthesized speech.
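As an illustration only, the three steps might be sketched as follows in Python, assuming each syllable's normalized sound intensity data has already been loaded into a NumPy array and that N is passed in as a constant (the filing computes N per syllable pair; all names here are hypothetical):

```python
import numpy as np

def synthesize(syllables: list[np.ndarray], n: int) -> np.ndarray:
    """Splice syllables by processing the last n intensity samples of each
    syllable together with the first n samples of the next one (S104/S106).
    Here "processing" is the direct addition described later in the text."""
    out = syllables[0]
    for nxt in syllables[1:]:
        overlap = out[-n:] + nxt[:n]  # superimpose the specified sampling points
        out = np.concatenate([out[:-n], overlap, nxt[n:]])
    return out
```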
After the text to be synthesized is received, the voice file of each syllable in the text is obtained according to the content of the text. In some cases the voice files are stored locally and the speech synthesis device obtains them directly; in other cases they are stored in the cloud and the device downloads them when needed.
The voice files can be recordings of the different syllables made in advance, in WAV, MP3, or another format. When a syllable is recorded, the analog sound signal is sampled and converted into binary sample data to obtain the final voice file. When syllables are recorded and saved as voice files, each syllable can be recorded separately or in the form of a word or idiom. For example, the syllables of the phrase "我喜欢跑步" ("I like to run") can be recorded and saved as five separate voice files, one for each of the five syllables "我", "喜", "欢", "跑" and "步", or the words can be combined and recorded as three voice files, "我", "喜欢" and "跑步". Voice files can be recorded according to actual needs, and this specification places no limitation on this.
In one embodiment, if the syllables were recorded in the form of word combinations, word segmentation may be performed on the text to be synthesized before the voice files of its syllables are acquired, so that the segmentation result can be used to look up the voice files. For example, suppose the text to be synthesized is "我们在吃饭" ("we are eating") and the saved voice files were recorded and stored in the word form "我们", "在", "吃饭". Before obtaining the voice files of these syllables, the text "我们在吃饭" is first segmented so that the voice file of each corresponding word or character can be found; the segmentation can be completed by a word segmentation algorithm. After segmentation, "我们在吃饭" is divided into "我们", "在", "吃饭", and the voice files of these three units are then obtained for subsequent speech synthesis.
For devices with weak processing capabilities, such as embedded systems, running a word segmentation algorithm in addition to performing speech synthesis may consume considerable memory and power and slow down processing. To reduce the resource consumption of the speech synthesis device, in one embodiment the word segmentation of the text is completed on the server side. Since the device's voice files are all downloaded from the server, the voice files saved on the server are consistent with those on the device, so the server can segment the text to be synthesized according to the voice files and then send the segmented text down to the device.
In addition, if the text of the speech to be synthesized is Chinese text, then because of the large number of Chinese characters, storing the pinyin of every character when recording the syllable voice files would make the files very large and consume memory. Instead, only the four tones of the Chinese syllables need be stored, without storing the pinyin of each individual character, which reduces the size of the stored voice files and saves memory.
In one embodiment, the voice file records the audio duration of the syllable, the sound intensity data of the sampling points, the sampling frequency, the sampling precision, and/or the number of sampling points. The audio duration is the pronunciation duration of each syllable and characterizes its length; the shorter the audio duration, the shorter the syllable's pronunciation. The sampling frequency is the number of sound intensity samples collected per second; for example, a sampling frequency of 48K means that 48K sound intensity values are collected per second. The number of sampling points of each syllable is the product of its audio duration and the sampling frequency; for example, if the audio duration of the syllable "我" is 1.2 s and the sampling frequency is 48K, the syllable "我" has 1.2 × 48K = 57.6K sampling points in total. The sampling precision is the resolution with which the capture card processes sound and reflects the accuracy of the sound waveform amplitude (that is, the sound intensity); the higher the sampling precision, the more realistic the recorded and replayed sound. The sampling precision is also called the number of sampling bits: since the sound signal is saved in binary form, it can be stored with 8 or 16 bits. With 8 bits, the collected sound intensity values lie between 0 and 255; with 16 bits, between 0 and 65535. More bits give higher sound quality but require more storage space. Before the sound intensity is processed, the data are usually normalized; for example, with a sampling precision of 8 bits the sound intensity values lie between 0 and 255 and are normalized to values between 0 and 1 to facilitate subsequent processing.
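For example, the relationship between duration, sampling frequency, and sample count, together with the normalization step, can be sketched as follows (assuming unsigned 8-bit PCM samples; the function names are illustrative):

```python
import numpy as np

SAMPLE_RATE = 48_000  # 48K samples per second, as in the examples above

def sample_count(duration_s: float, rate: int = SAMPLE_RATE) -> int:
    """Number of sampling points = audio duration x sampling frequency,
    e.g. 1.2 s x 48K = 57.6K sampling points for the syllable '我'."""
    return int(duration_s * rate)

def normalize_8bit(raw: np.ndarray) -> np.ndarray:
    """Map unsigned 8-bit sound intensity values (0..255) into [0, 1]."""
    return raw.astype(np.float32) / 255.0

print(sample_count(1.2))                        # 57600
print(normalize_8bit(np.array([0, 128, 255])))  # approximately [0.0, 0.502, 1.0]
```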
After the voice files of the syllables in the text are obtained, the sound intensity data of the specified sampling points of each pair of adjacent syllables can be obtained from the voice files, where the specified sampling points of the previous syllable are the last N sampling points of that syllable and the specified sampling points of the following syllable are the first N sampling points of that syllable, N being an integer. After the sound intensity data of the last N sampling points of the previous syllable and the first N sampling points of the following syllable are processed, the synthesized speech is obtained. For example, the sound intensity data of the last 1000 sampling points of the previous syllable can be processed together with the data of the first 1000 sampling points of the following syllable, so that the transition between the two syllables is more natural after synthesis. FIG. 2 is a schematic diagram of a text undergoing speech synthesis: when synthesizing the sentence "我喜欢跑步", the sound intensities of the specified sampling points of the previous syllable and of the following syllable are processed pair by pair to obtain the synthesized speech, where the 4.5% and 5% in the figure represent the ratio of the number of processed sampling points to the number of sampling points of the previous syllable. By processing the sound intensity data of the specified sampling points at the tail and head of adjacent syllables, synthesized speech with fairly natural transitions is obtained.
When processing two adjacent syllables, the inherent characteristics of both syllables must be preserved, so the processed portion cannot be too large; the leading and trailing silence of the two syllables must also be considered, because if the silent gap is too long the processed speech will contain an obvious pause, making the synthesized speech sound unnatural. Taking these factors into account, in one embodiment the number N of sampling points to be processed when determining the specified sampling points can be calculated based on whether the two adjacent syllables form a word or a four-character idiom, the numbers of sampling points of the two adjacent syllables, the average sound intensity of the last M1 sampling points of the previous syllable, and/or the average sound intensity of the first M2 sampling points of the following syllable, where M1 and M2 are integers. If two syllables can form a word or idiom, somewhat more sampling points can be processed, so N can be determined according to whether the two adjacent syllables can form a word. The sound intensity at the beginning and end of each syllable also deserves attention during processing, so N can also be calculated from the average sound intensity of the last M1 sampling points of the previous syllable or of the first M2 sampling points of the following syllable. Furthermore, when the sampling frequency is fixed, the number of sampling points reflects the audio duration of each syllable, and the difference between the audio durations of two adjacent syllables has a considerable influence on the synthesized speech: a large difference means the two syllables differ in emphasis and speed and more sampling points need to be processed, while a small difference means fewer sampling points need to be processed. Therefore, the numbers of sampling points of the syllables can also be considered when calculating N.
To account for the silent gap between adjacent syllables, the average sound intensity at the beginning and at the end of the two adjacent syllables can also be considered when calculating the number of sampling points to be processed. The average sound intensity at the end can be obtained by averaging the sound intensity of the last M1 sampling points of the syllable, and the average sound intensity at the beginning by averaging the first M2 sampling points of the syllable, where M1 and M2 can be set according to the characteristics of the syllables themselves; for example, M1 may be 10% of the total number of sampling points of the previous syllable and M2 may be 5% of the total number of sampling points of the following syllable, or M1 may be 1000 and M2 2000, and this specification places no limitation on this. In one embodiment, after repeated experiments by the applicant, to achieve a good synthesis effect with no obvious pause between the syllables after synthesis, M1 can be taken as 20% of the total number of audio sampling points of the previous syllable and M2 as 20% of the total number of audio sampling points of the following syllable.
Further, in one embodiment, the number N of sampling points to be processed can be calculated by the following formula:
[Formula: N is computed from Nw, SNpre, SNnext, and the tail and head average sound intensities defined below; the exact expression is given as an equation image (PCTCN2019098086-appb-000001) in the original filing.]
Here the different values of Nw indicate whether the current two adjacent syllables form a word or four-character idiom; SNpre denotes the number of sampling points of the previous syllable and SNnext the number of sampling points of the following syllable; the tail average sound intensity (pre) denotes the average sound intensity of the last M1 sampling points of the previous syllable; the head average sound intensity (next) denotes the average sound intensity of the first M2 sampling points of the following syllable; M1 and M2 are integers.
Whether the two adjacent syllables form a word or idiom can thus be considered when calculating N. To simplify the calculation, this influence factor is quantified: different values of Nw represent whether the two adjacent syllables form a word or idiom. Generally, if the two adjacent syllables can form a word, Nw is larger than when they cannot. In one embodiment, to achieve a good synthesis effect, if the two adjacent syllables form one word, Nw is taken as 2; if they are not in one word or four-character idiom, Nw is taken as 1; and if they are in one four-character idiom, Nw is taken as 2. Of course, the value of Nw can be set according to the specific circumstances, and this specification places no limitation on it.
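As a small illustration, the quantified factor from this embodiment could be written as follows (the function name is hypothetical):

```python
def word_factor(forms_word: bool, in_idiom: bool) -> int:
    """Nw from the embodiment: 2 if the adjacent syllables form one word
    or lie in one four-character idiom, otherwise 1. Other values are
    permitted by the specification."""
    return 2 if (forms_word or in_idiom) else 1
```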
例如,需要合成“我”、“不”两个音节,其中“我”这个音节的采样为96K,“不”这个音节的采样数量为48K,即SNpre=96K,SNnext=48K,这个音节不组成词语,所以Nw可以取1,即Nw=1,取“我”这个音节的最后2K的采样点的音强,计算最后2K个采样点的平均音强为0.3,即末尾平均音强pre=0.3,取“不”这个音节的前面2K个采样点的音强,计算前面2K个采样点的平均音强为0.2,开头平均音强next=0.2,代入公式计算,可得到N的值为3920。即取前一个音节的最后3920个采样点与后一个音节前3920个采样点的音强数据,将这些音强数据处理后得到合成的语音。For example, you need to synthesize two syllables of "I" and "No", where the sample of the syllable of "I" is 96K, and the number of samples of the syllable of "No" is 48K, that is, SNpre = 96K, SNnext = 48K, this syllable is not composed Words, so Nw can be taken as 1, that is, Nw = 1, taking the sound intensity of the last 2K sampling points of the syllable "me", and calculating the average sound intensity of the last 2K sampling points as 0.3, that is, the average sound intensity at the end pre = 0.3 , Take the sound intensity of the first 2K sampling points of the "no" syllable, calculate the average sound intensity of the first 2K sampling points as 0.2, the average sound intensity of the beginning next = 0.2, and substitute the formula to calculate, the value of N can be obtained as 3920. That is, the sound intensity data of the last 3920 sampling points of the previous syllable and the 3920 sampling points of the following syllable are taken, and the synthesized speech is obtained after processing these sound intensity data.
After the sound intensity data of the designated sampling points is acquired, the specific way of processing the intensities of the designated sampling points of the two syllables can also be chosen according to the characteristics of the syllables. For example, in some embodiments, the intensities of the last N sampling points of the preceding syllable are added directly to those of the first N sampling points of the following syllable to obtain the superimposed intensities. Suppose the last five sampling points of the preceding syllable and the first five of the following syllable are to be processed, the last five intensities of the preceding syllable being 0.15, 0.10, 0.05, 0.03 and 0.01 and the first five of the following syllable being 0.005, 0.01, 0.04, 0.06 and 0.07; the intensities of the superimposed portion of the speech are then 0.155, 0.11, 0.09, 0.09 and 0.08.
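A minimal sketch of this direct-addition mode, reproducing the five-point example above (assuming the intensities are held in arrays; not part of the original filing):

```python
import numpy as np

def overlap_add(prev_tail, next_head):
    # Direct sum of the last-N samples of the preceding syllable
    # and the first-N samples of the following syllable.
    return np.asarray(prev_tail) + np.asarray(next_head)

print(overlap_add([0.15, 0.10, 0.05, 0.03, 0.01],
                  [0.005, 0.01, 0.04, 0.06, 0.07]))
# prints approximately: [0.155 0.11  0.09  0.09  0.08 ]
```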
Of course, to obtain a higher-quality and more natural synthesis effect, in some embodiments the intensities of the last N sampling points of the preceding syllable and of the first N sampling points of the following syllable can each be multiplied by a preset weight before being added, giving the superimposed intensities, where the preset weights are set based on the order of the syllables and the order of the sampling points. When processing the intensities of the two adjacent syllables, each intensity can be multiplied by a weight before the addition: in the earlier part of the processed region the preceding syllable should carry more, so its weight can be larger there, while in the later part the following syllable should carry more, so its weight can be larger there. For example, suppose the last five sampling points of the preceding syllable, with intensities 0.5, 0.4, 0.3, 0.2 and 0.1 and weights 90%, 80%, 70%, 60% and 50%, are to be processed with the first five sampling points of the following syllable, with intensities 0.1, 0.2, 0.3, 0.4 and 0.5 and weights 10%, 20%, 30%, 40% and 50%. The processed intensities are then 0.5 × 90% + 0.1 × 10%, 0.4 × 80% + 0.2 × 20%, 0.3 × 70% + 0.3 × 30%, 0.2 × 60% + 0.4 × 40% and 0.1 × 50% + 0.5 × 50%, i.e. 0.46, 0.36, 0.3, 0.28 and 0.3.
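The weighted mode as a matching sketch, reproducing the numbers above; the per-sample weights are the ones from the example, and a linear crossfade like this is only one possible choice:

```python
import numpy as np

def weighted_overlap(prev_tail, next_head, w_prev, w_next):
    # Earlier overlap samples lean on the preceding syllable,
    # later ones on the following syllable.
    return (np.asarray(prev_tail) * np.asarray(w_prev)
            + np.asarray(next_head) * np.asarray(w_next))

print(weighted_overlap([0.5, 0.4, 0.3, 0.2, 0.1],
                       [0.1, 0.2, 0.3, 0.4, 0.5],
                       [0.9, 0.8, 0.7, 0.6, 0.5],
                       [0.1, 0.2, 0.3, 0.4, 0.5]))
# prints approximately: [0.46 0.36 0.3  0.28 0.3 ]
```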
To ensure that the processed syllables do not clip, the intensity of the designated sampling points to be processed is generally kept small enough that no clipping occurs after processing. In one embodiment, the ratio of the intensity of a designated sampling point to the maximum intensity among the sampling points of that syllable is less than 0.5. For example, if the sampling point with the largest intensity among all sampling points of a syllable has an intensity of 1, then every designated sampling point to be processed has an intensity below 0.5.
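A small helper expressing this guard (an assumption about how one might check it, not something the filing specifies):

```python
import numpy as np

def edges_safe(wave, n, ratio=0.5):
    # True if both edge windows stay below `ratio` times the syllable's
    # peak intensity, so the summed overlap cannot clip.
    wave = np.abs(np.asarray(wave, dtype=float))
    peak = wave.max()
    return bool((wave[:n] < ratio * peak).all()
                and (wave[-n:] < ratio * peak).all())
```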
Several specific embodiments are used below to further explain the speech synthesis method provided in this specification. Suppose a speech device needs to synthesize the sentence "我喜欢跑步" ("I like running"). Before synthesis, five voice files containing the pronunciations of the five Chinese characters "我", "喜", "欢", "跑" and "步" are recorded in advance and stored on a server. The beginning of each of the five voice files records its configuration information: a sampling frequency of 48K, a sampling precision of 16 bits, and the audio duration of the pronunciation. The audio durations of "我", "喜", "欢", "跑" and "步" are 1 s, 0.5 s, 1 s, 1.5 s and 0.8 s respectively. On receiving the text to be synthesized, "我喜欢跑步", the speech synthesis device downloads the voice files of these five syllables from the server and then processes each pair of consecutive syllables in text order. For example, "我" and "喜" are processed first: the intensities of the last portion of the sampling points of "我" and of the first portion of the sampling points of "喜" need to be processed, and before processing, the number of sampling points to be processed is calculated by the following formula:
[Formula for N, reproduced in the original filing only as an image (PCTCN2019098086-appb-000002); it is the same calculation as above.]
Here, the value of Nw indicates whether the two adjacent syllables form a word or a four-character idiom: Nw is 2 if the two adjacent syllables form a word, 1 if they are in neither a word nor a four-character idiom, and 2 if they are in a four-character idiom. SNpre is the number of sampling points of the preceding syllable and SNnext that of the following syllable; the end average intensity, pre, is the average sound intensity of the last 20% of the sampling points of the preceding syllable; the beginning average intensity, next, is the average sound intensity of the first 20% of the sampling points of the following syllable; M1 and M2 are integers.
Since "我" and "喜" cannot form a word or idiom, Nw in the formula is 1. The number of sampling points of a syllable equals the sampling frequency multiplied by the audio duration, so SNpre = 1 × 48K = 48K for "我" and SNnext = 0.5 × 48K = 24K for "喜". The average intensity of the last 20% of the sampling points of "我" is 0.3, and the average intensity of the first 20% of the sampling points of "喜" is 0.1. Substituting these data into the formula gives 711 sampling points to be processed; that is, the intensity data of the last 711 sampling points is taken from the voice file of "我" and the intensity data of the first 711 sampling points from the voice file of "喜", and the acquired intensity data is added directly to obtain the processed intensities. "喜" and "欢", "欢" and "跑", and "跑" and "步" are processed in the same way, yielding the synthesized speech for the text "我喜欢跑步".
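As a hedged end-to-end sketch of this pairwise stitching (not part of the filing), the overlap computation can be wrapped in a loop over the text's syllables; because the filing reproduces the N formula only as an image, `compute_n` is left as a caller-supplied stub rather than a definitive implementation:

```python
import numpy as np

def synthesize(waves, compute_n):
    # `waves`: syllable intensity arrays in text order.
    # `compute_n(prev, nxt)`: returns the overlap length N for one
    # adjacent pair; it stands in for the patent's formula.
    waves = [np.asarray(w, dtype=float) for w in waves]
    out = waves[0]
    for nxt in waves[1:]:
        n = compute_n(out, nxt)
        n = max(1, min(n, len(out), len(nxt)))   # keep N in range
        overlap = out[-n:] + nxt[:n]             # direct-addition mode
        out = np.concatenate([out[:len(out) - n], overlap, nxt[n:]])
    return out
```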
As another example, suppose the text the speech device needs to synthesize is "我们爱天安门" ("We love Tiananmen"), and the voice files were recorded in the form of words; that is, the voice files include the three words "我们", "爱" and "天安门", downloaded from the server in advance and saved in the local directory of the speech device. After receiving the text to be synthesized, "我们爱天安门", the server performs word segmentation on the text according to the form of the voice files; the segmentation can be completed by a word-segmentation algorithm. The text is divided into "我们/爱/天安门", and the segmented text is then delivered to the speech synthesis device. On receiving the text, the speech synthesis device first obtains the voice files of the three words "我们", "爱" and "天安门", whose sampling frequency is 48K, whose sampling precision is 8 bits, and whose audio durations are 2 s, 1 s and 3 s respectively. "我们" and "爱" are then processed first; before processing, the number of sampling points to be processed is calculated by the following formula:
[Formula for N, reproduced in the original filing only as an image (PCTCN2019098086-appb-000003); it is the same calculation as above.]
Here, the value of Nw indicates whether the two adjacent syllables form a word or a four-character idiom: Nw is 2 if the two adjacent syllables form a word, 1 if they are in neither a word nor a four-character idiom, and 2 if they are in a four-character idiom. SNpre is the number of sampling points of the preceding syllable and SNnext that of the following syllable; the end average intensity, pre, is the average sound intensity of the last 15% of the sampling points of the preceding syllable; the beginning average intensity, next, is the average sound intensity of the first 20% of the sampling points of the following syllable; M1 and M2 are integers.
From the sampling frequency and audio durations, SNpre = 96K and SNnext = 48K. The average intensity of the last 15% of the sampling points of "我们" is 0.2, and the average intensity of the first 20% of the sampling points of "爱" is 0.3. The preceding and following units do not form a word, so Nw = 1. Substituting these data into the formula gives 5689 sampling points to be processed; that is, the intensity data of the last 5689 sampling points of "我们" and of the first 5689 sampling points of "爱" is obtained from the voice files. After the intensity data of the processed sampling points is acquired, the intensity of each sampling point of "我们" is multiplied by a certain weight, the intensity of each sampling point of "爱" is multiplied by a certain weight, and the results are added to obtain the intensities of the processed portion. "爱" and "天安门" are processed by the same method, yielding the synthesized speech for the text "我们爱天安门".
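For the segmentation step, the filing does not name a specific algorithm; purely as an assumed example, an off-the-shelf segmenter such as jieba could produce the "我们/爱/天安门" split:

```python
import jieba  # off-the-shelf Chinese word segmenter, one possible choice

print(jieba.lcut("我们爱天安门"))
# expected to print something like: ['我们', '爱', '天安门']
```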
Corresponding to the above speech synthesis method, this specification further provides a speech synthesis apparatus. As shown in FIG. 3, the speech synthesis apparatus 300 includes:
an acquiring unit 301, which acquires the voice file of each syllable in the text of the speech to be synthesized, the voice file storing the sound intensity data of the sampling points of the syllable, and acquires the sound intensity data of designated sampling points from the voice files of two adjacent syllables respectively, where the designated sampling points of the preceding syllable are the last N sampling points of that syllable, the designated sampling points of the following syllable are the first N sampling points of that syllable, and N is an integer; and
a processing unit 302, which processes the sound intensity data of the designated sampling points of the two syllables to obtain synthesized speech.
In one embodiment, the voice file records the audio duration of the syllable, the sound intensity data of the sampling points, the sampling frequency, the sampling precision and/or the number of sampling points.
In one embodiment, processing the sound intensity data of the designated sampling points of the two syllables specifically includes:
adding the sound intensity of the last N sampling points of the preceding syllable to the sound intensity data of the first N sampling points of the following syllable; or
multiplying the sound intensity data of the last N sampling points of the preceding syllable and the sound intensity data of the first N sampling points of the following syllable by preset weights respectively before adding them, where the preset weights are set based on the order of the syllables and the order of the sampling points.
In one embodiment, the text of the speech to be synthesized is Chinese, and the voice files are voice files in which the four tones of Chinese-character syllables are recorded.
In one embodiment, the ratio of the sound intensity data of a designated sampling point to the maximum sound intensity data among the sampling points of the syllable is less than 0.5.
In one embodiment, N is calculated based on whether the two adjacent syllables form a word or a four-character idiom, the numbers of sampling points of the two adjacent syllables, the average sound intensity of the last M1 sampling points of the preceding syllable and/or the average sound intensity of the first M2 sampling points of the following syllable, where M1 and M2 are integers.
In one embodiment, M1 is 20% of the total number of audio sampling points of the preceding syllable, and M2 is 20% of the total number of audio sampling points of the following syllable.
In one embodiment, the conversion coefficient Nw is 2 if the two adjacent syllables form a word, 1 if the two adjacent syllables are in neither a word nor a four-character idiom, and 2 if the two adjacent syllables are in a four-character idiom.
In one embodiment, N is specifically calculated by the following formula:
[Formula for N, reproduced in the original filing only as images (PCTCN2019098086-appb-000004 and PCTCN2019098086-appb-000005); it is the same calculation as in the method above.]
Here, the value of Nw indicates whether the two adjacent syllables form a word or a four-character idiom; SNpre is the number of sampling points of the preceding syllable and SNnext that of the following syllable; the end average intensity, pre, is the average sound intensity of the last M1 sampling points of the preceding syllable; the beginning average intensity, next, is the average sound intensity of the first M2 sampling points of the following syllable; M1 and M2 are integers.
In one embodiment, before acquiring the voice file of each syllable in the text of the speech to be synthesized, the method further includes:
performing word segmentation on the text.
In one embodiment, the word segmentation of the text is completed on the server side.
For the implementation of the functions and roles of each unit in the above apparatus, see the implementation of the corresponding steps in the above method for details, which are not repeated here.
As for the apparatus embodiments, since they basically correspond to the method embodiments, reference can be made to the relevant parts of the description of the method embodiments. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this specification, which those of ordinary skill in the art can understand and implement without creative effort.
In addition, this specification further provides a speech synthesis device. As shown in FIG. 4, the speech synthesis device includes a processor 401 and a memory 402;
the memory is used to store executable computer instructions; and
the processor implements the following steps when executing the computer instructions:
acquiring the voice file of each syllable in the text of the speech to be synthesized, the voice file storing the sound intensity data of the sampling points of the syllable;
acquiring the sound intensity data of designated sampling points from the voice files of two adjacent syllables respectively, where the designated sampling points of the preceding syllable are the last N sampling points of that syllable, the designated sampling points of the following syllable are the first N sampling points of that syllable, and N is an integer; and
processing the sound intensities of the designated sampling points of the two syllables to obtain synthesized speech.
The above are merely preferred embodiments of this specification and are not intended to limit it; any modification, equivalent replacement, improvement and the like made within the spirit and principles of this specification shall fall within its scope of protection.

Claims (13)

  1. A speech synthesis method, the method comprising:
    acquiring a voice file of each syllable in a text of speech to be synthesized, the voice file storing sound intensity data of sampling points of the syllable;
    acquiring sound intensity data of designated sampling points from the voice files of two adjacent syllables respectively, wherein the designated sampling points of the preceding syllable are the last N sampling points of that syllable, the designated sampling points of the following syllable are the first N sampling points of that syllable, and N is an integer; and
    processing the sound intensity data of the designated sampling points of the two syllables to obtain synthesized speech.
  2. The speech synthesis method of claim 1, wherein the voice file records an audio duration of the syllable, sound intensity data of the sampling points, a sampling frequency, a sampling precision and/or a number of sampling points.
  3. The speech synthesis method of claim 1, wherein processing the sound intensities of the designated sampling points of the two syllables specifically comprises:
    adding the sound intensity data of the last N sampling points of the preceding syllable to the sound intensity data of the first N sampling points of the following syllable; or
    multiplying the sound intensity data of the last N sampling points of the preceding syllable and the sound intensity data of the first N sampling points of the following syllable by preset weights respectively before adding them, wherein the preset weights are set based on the order of the syllables and the order of the sampling points.
  4. The speech synthesis method of claim 1, wherein the text of the speech to be synthesized is Chinese, and the voice files are voice files in which the four tones of Chinese-character syllables are recorded.
  5. The speech synthesis method of claim 1, wherein a ratio of the sound intensity data of a designated sampling point to the maximum sound intensity data among the sampling points of the syllable is less than 0.5.
  6. The speech synthesis method of claim 1, wherein N is determined based on whether the two adjacent syllables form a word or a four-character idiom, the numbers of sampling points of the two adjacent syllables, the average sound intensity of the last M1 sampling points of the preceding syllable and/or the average sound intensity of the first M2 sampling points of the following syllable, wherein M1 and M2 are integers.
  7. The speech synthesis method of claim 6, wherein M1 is 20% of the total number of sampling points of the preceding syllable, and M2 is 20% of the total number of sampling points of the following syllable.
  8. The speech synthesis method of claim 6, wherein N is specifically calculated by the following formula:
    [Formula for N, reproduced in the original filing only as an image (PCTCN2019098086-appb-100001).]
    wherein the value of Nw indicates whether the two adjacent syllables form a word or a four-character idiom, SNpre indicates the number of sampling points of the preceding syllable, and SNnext indicates the number of sampling points of the following syllable; the end average intensity, pre, indicates the average sound intensity of the last M1 sampling points of the preceding syllable; and the beginning average intensity, next, indicates the average sound intensity of the first M2 sampling points of the following syllable.
  9. The speech synthesis method of claim 8, wherein the value of Nw is 2 if the two adjacent syllables form a word, 1 if the two adjacent syllables are in neither a word nor a four-character idiom, and 2 if the two adjacent syllables do not form a word but are in a four-character idiom.
  10. The speech synthesis method of claim 1, further comprising, before acquiring the voice file of each syllable in the text of the speech to be synthesized:
    performing word segmentation on the text.
  11. The speech synthesis method of claim 10, wherein the word segmentation of the text is completed on a server side.
  12. A speech synthesis apparatus, the apparatus comprising:
    an acquiring unit, which acquires a voice file of each syllable in a text of speech to be synthesized, the voice file storing sound intensity data of sampling points of the syllable, and acquires sound intensity data of designated sampling points from the voice files of two adjacent syllables respectively, wherein the designated sampling points of the preceding syllable are the last N sampling points of that syllable, the designated sampling points of the following syllable are the first N sampling points of that syllable, and N is an integer; and
    a processing unit, which processes the sound intensity data of the designated sampling points of the two syllables to obtain synthesized speech.
  13. A speech synthesis device, the speech synthesis device comprising a processor and a memory;
    wherein the memory is used to store executable computer instructions; and
    the processor implements the steps of the method of any one of claims 1 to 11 when executing the computer instructions.
PCT/CN2019/098086 2018-10-29 2019-07-29 Speech synthesis method, device, and apparatus WO2020088006A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811269226.6A CN109599090B (en) 2018-10-29 2018-10-29 Method, device and equipment for voice synthesis
CN201811269226.6 2018-10-29

Publications (1)

Publication Number Publication Date
WO2020088006A1 true WO2020088006A1 (en) 2020-05-07

Family

ID=65958614

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/098086 WO2020088006A1 (en) 2018-10-29 2019-07-29 Speech synthesis method, device, and apparatus

Country Status (3)

Country Link
CN (1) CN109599090B (en)
TW (1) TWI731382B (en)
WO (1) WO2020088006A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109599090B (en) * 2018-10-29 2020-10-30 创新先进技术有限公司 Method, device and equipment for voice synthesis
CN111145723B (en) * 2019-12-31 2023-11-17 广州酷狗计算机科技有限公司 Method, device, equipment and storage medium for converting audio
CN111883100B (en) * 2020-07-22 2021-11-09 马上消费金融股份有限公司 Voice conversion method, device and server
CN112185338B (en) * 2020-09-30 2024-01-23 北京大米科技有限公司 Audio processing method, device, readable storage medium and electronic equipment
CN112562635B (en) * 2020-12-03 2024-04-09 云知声智能科技股份有限公司 Method, device and system for solving generation of pulse signals at splicing position in speech synthesis

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5748840A (en) * 1990-12-03 1998-05-05 Audio Navigation Systems, Inc. Methods and apparatus for improving the reliability of recognizing words in a large database when the words are spelled or spoken
CN1118493A (en) * 1994-08-01 1996-03-13 中国科学院声学研究所 Language and speech converting system with synchronous fundamental tone waves
NZ304418A (en) * 1995-04-12 1998-02-26 British Telecomm Extension and combination of digitised speech waveforms for speech synthesis
US20030135374A1 (en) * 2002-01-16 2003-07-17 Hardwick John C. Speech synthesizer
US7328076B2 (en) * 2002-11-15 2008-02-05 Texas Instruments Incorporated Generalized envelope matching technique for fast time-scale modification
CN1262987C (en) * 2003-10-24 2006-07-05 无敌科技股份有限公司 Smoothly processing method for conversion of intervowel
CN101000766B (en) * 2007-01-09 2011-02-02 黑龙江大学 Chinese intonation base frequency contour generating method based on intonation model
CN101710488B (en) * 2009-11-20 2011-08-03 安徽科大讯飞信息科技股份有限公司 Method and device for voice synthesis
CN107039033A (en) * 2017-04-17 2017-08-11 海南职业技术学院 A kind of speech synthetic device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1787072A (en) * 2004-12-07 2006-06-14 北京捷通华声语音技术有限公司 Method for synthesizing pronunciation based on rhythm model and parameter selecting voice
CN103020232A (en) * 2012-12-14 2013-04-03 沈阳美行科技有限公司 Method for recording individual characters into navigation system
CN105895076A (en) * 2015-01-26 2016-08-24 科大讯飞股份有限公司 Speech synthesis method and system
US20160365087A1 (en) * 2015-06-12 2016-12-15 Geulah Holdings Llc High end speech synthesis
CN109599090A (en) * 2018-10-29 2019-04-09 阿里巴巴集团控股有限公司 A kind of method, device and equipment of speech synthesis

Also Published As

Publication number Publication date
CN109599090A (en) 2019-04-09
TW202036534A (en) 2020-10-01
TWI731382B (en) 2021-06-21
CN109599090B (en) 2020-10-30

Similar Documents

Publication Publication Date Title
WO2020088006A1 (en) Speech synthesis method, device, and apparatus
US10115389B2 (en) Speech synthesis method and apparatus
CN108831437B (en) Singing voice generation method, singing voice generation device, terminal and storage medium
CN111048064B (en) Voice cloning method and device based on single speaker voice synthesis data set
CN107423364B (en) Method, device and storage medium for answering operation broadcasting based on artificial intelligence
WO2021083071A1 (en) Method, device, and medium for speech conversion, file generation, broadcasting, and voice processing
CN108573694B (en) Artificial intelligence based corpus expansion and speech synthesis system construction method and device
WO2020113733A1 (en) Animation generation method and apparatus, electronic device, and computer-readable storage medium
US10971125B2 (en) Music synthesis method, system, terminal and computer-readable storage medium
US8682678B2 (en) Automatic realtime speech impairment correction
CN107705782B (en) Method and device for determining phoneme pronunciation duration
US20190371291A1 (en) Method and apparatus for processing speech splicing and synthesis, computer device and readable medium
JP2007242012A (en) Method, system and program for email administration for email rendering on digital audio player (email administration for rendering email on digital audio player)
WO2019007308A1 (en) Voice broadcasting method and device
CN111105779B (en) Text playing method and device for mobile client
CN110675886A (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
JP2019015951A (en) Wake up method for electronic device, apparatus, device and computer readable storage medium
CN113724683B (en) Audio generation method, computer device and computer readable storage medium
WO2016165334A1 (en) Voice processing method and apparatus, and terminal device
US8655466B2 (en) Correlating changes in audio
WO2022143530A1 (en) Audio processing method and apparatus, computer device, and storage medium
CN109495786B (en) Pre-configuration method and device of video processing parameter information and electronic equipment
CN114333758A (en) Speech synthesis method, apparatus, computer device, storage medium and product
CN112837688A (en) Voice transcription method, device, related system and equipment
CN114822492B (en) Speech synthesis method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19880300

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19880300

Country of ref document: EP

Kind code of ref document: A1