WO2020088006A1 - Speech synthesis method, device, and apparatus - Google Patents

Speech synthesis method, device, and apparatus Download PDF

Info

Publication number
WO2020088006A1
Authority
WO
WIPO (PCT)
Prior art keywords
syllable
sampling points
syllables
sound intensity
speech
Prior art date
Application number
PCT/CN2019/098086
Other languages
French (fr)
Chinese (zh)
Inventor
韩喆 (Han Zhe)
陈力 (Chen Li)
吴军 (Wu Jun)
Original Assignee
阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Publication of WO2020088006A1 publication Critical patent/WO2020088006A1/en

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Definitions

  • the invention relates to the technical field of speech synthesis, in particular to a method, device and equipment for speech synthesis.
  • Voice broadcast has applications in many areas of life, such as automatic broadcast of the amount received when using Alipay or WeChat payment, and an intelligent broadcast system used in public places such as supermarkets and stations.
  • voice broadcasting requires speech synthesis technology, that is, stitching together the words of different syllables to form the passage that needs to be broadcast.
  • among current technologies, some make the broadcast speech sound natural but require high processing power from the device, while others have low processing requirements but sound unnatural.
  • the present invention provides a method, device and equipment for voice splicing.
  • this specification provides a method of speech synthesis, which includes:
  • the specified sampling points of the previous syllable are the last N sampling points of that syllable, and the specified sampling points of the following syllable are the first N sampling points of that syllable, where N is an integer
  • the sound intensity data of the specified sampling points of the two syllables are processed to obtain the synthesized speech.
  • this specification provides a speech synthesis device, which includes:
  • an acquiring unit, which acquires a voice file for each syllable in the text of the speech to be synthesized, the voice file storing sound intensity data of the sampling points of the syllable, and which acquires the sound intensity data of specified sampling points from the voice files of two adjacent syllables; the specified sampling points of the previous syllable are the last N sampling points of that syllable, and the specified sampling points of the following syllable are the first N sampling points of that syllable, where N is an integer;
  • a processing unit, which processes the sound intensity data of the specified sampling points of the two syllables to obtain the synthesized speech.
  • this specification also provides a speech synthesis device, the speech synthesis device includes: a processor and a memory;
  • the memory is used to store executable computer instructions
  • the specified sampling points of the previous syllable are the last N sampling points of that syllable, and the specified sampling points of the following syllable are the first N sampling points of that syllable, where N is an integer
  • the sound intensity data of the specified sampling points of the two syllables are processed to obtain the synthesized speech.
  • FIG. 1 is a flowchart of a speech synthesis method shown in an exemplary embodiment of the present specification
  • FIG. 2 is a schematic diagram of a speech synthesis method shown in an exemplary embodiment of the present specification
  • FIG. 3 is a logic block diagram of a speech synthesis device according to an exemplary embodiment of this specification
  • FIG. 4 is a logic block diagram of a speech synthesis apparatus according to an exemplary embodiment of this specification.
  • although the terms first, second, third, etc. may be used in the present invention to describe various information, the information should not be limited by these terms; these terms are only used to distinguish information of the same type from one another.
  • for example, first information may also be referred to as second information and, similarly, second information may also be referred to as first information.
  • depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
  • Voice broadcasts are widely used in various fields of life, such as the broadcast of train number information in stations, the broadcast of merchandise promotion information in supermarkets, and the current arrival broadcast when paying by Alipay.
  • voice broadcasting requires speech synthesis technology, that is, stitching together the words of different syllables to form the passage that needs to be broadcast.
  • some methods of speech synthesis are based on deep learning models to generate simulated speech.
  • the speech synthesized by this method sounds natural, but because it requires large amounts of training and computing resources, it is difficult to run on systems with weak processing power, such as embedded systems.
  • the main method is splicing, that is, the pronunciation of each word is recorded first, and then the recorded pronunciation of each word of the sentence to be played is played back in sequence.
  • this method places low demands on the processing capacity of the speech synthesis system, but the synthesized speech is of relatively poor quality and sounds unnatural.
  • S106: process the sound intensity data of the specified sampling points of the two syllables to obtain synthesized speech.
  • the voice file of each syllable in the text will be obtained according to the content of the text.
  • the voice file can be stored locally, so that the speech synthesis device obtains it directly; in other cases, the voice file can be stored in the cloud and downloaded by the device when needed.
  • the voice file can be a recording of different syllables made in advance, in WAV, MP3, or another format.
  • when a syllable is recorded, the analog sound signal is sampled and converted into binary sample data to obtain the final voice file.
  • each syllable can be recorded separately, or in the form of a word or idiom.
  • for example, the five syllables of the phrase "我喜欢跑步" ("I like to run"), namely "我", "喜", "欢", "跑" and "步", can be recorded and saved as five separate voice files.
  • the text to be synthesized may also be subjected to word segmentation processing before the voice files are obtained, so that the segmentation result is used to look up the corresponding syllable voice files.
  • for example, if the text to be synthesized is "我们在吃饭" ("we are eating") and the saved voice files were recorded in the word form "我们", "在", "吃饭", the text is first segmented so that the voice file of each corresponding word or character can be found. The segmentation can be completed by a word segmentation algorithm; after segmentation, "我们在吃饭" is divided into "我们", "在", "吃饭", and the voice files of these three units are then obtained for subsequent speech synthesis.
  • the word segmentation of the text may be completed by the server. Since the device's voice files are downloaded from the server, the voice files saved on the server are consistent with those on the device, so the server can segment the text to be synthesized according to the voice files and then send the segmented text to the device.
  • if the text of the speech to be synthesized is Chinese text, storing the pinyin of every Chinese character would make the voice files very large and consume memory; instead, only the four tones of the Chinese syllables need be stored, which reduces the size of the stored voice files and saves memory.
  • the voice file records the audio duration of the syllable, the sound intensity data of the sampling points, the sampling frequency, the sampling precision, and/or the number of sampling points.
  • the audio duration is the pronunciation duration of each syllable and characterizes its length; the shorter the audio duration, the shorter the syllable's pronunciation.
  • the sampling frequency is the number of sound intensity samples collected per second; for example, a sampling frequency of 48K means that 48K sound intensity values are collected per second.
  • the sampling precision is the resolution with which the capture card processes sound and reflects the accuracy of the sound waveform amplitude (that is, the sound intensity); the higher the sampling precision, the more realistic the recorded and replayed sound.
  • the sampling precision is also called the number of sampling bits. Since the sound signal is saved in binary form, it can be stored with 8 or 16 bits: with 8 bits, the collected sound intensity values lie between 0 and 255; with 16 bits, they lie between 0 and 65535. More bits give higher sound quality but require more storage space.
  • before processing, the sound intensity data are usually normalized; for example, with a sampling precision of 8 bits the sound intensity values lie between 0 and 255, and they are normalized to values between 0 and 1 to facilitate subsequent processing.
  • the sound intensity data of the specified sampling points of two adjacent syllables can then be obtained from their voice files, where the specified sampling points of the previous syllable are the last N sampling points of that syllable and the specified sampling points of the following syllable are the first N sampling points of that syllable, with N an integer.
  • after the sound intensity data of the last N sampling points of the previous syllable and the first N sampling points of the following syllable are processed, the synthesized speech is obtained.
  • FIG. 2 is a schematic diagram of a text in speech synthesis.
  • the sound intensity of the specified sampling points of the previous syllable and of the following syllable can be processed pair by pair to obtain the synthesized speech
  • the 4.5% and 5% in the figure represent the ratio of the number of processed sampling points to the number of sampling points of the previous syllable.
  • the number N of sampling points to be processed may be calculated based on whether the two adjacent syllables form a word or a four-character idiom, the numbers of sampling points of the two adjacent syllables, the average sound intensity of the last M1 sampling points of the previous syllable, and/or the average sound intensity of the first M2 sampling points of the following syllable, where M1 and M2 are integers. If two syllables can form a word or idiom, somewhat more sampling points can be processed, so N can be determined according to whether the two adjacent syllables can form a word.
  • the sound intensity at the beginning and end of each syllable is also a factor that deserves attention during processing, so when calculating N it is also possible to use the average sound intensity of the last M1 sampling points of the previous syllable or the average sound intensity of the first M2 sampling points of the following syllable.
  • when the sampling frequency is fixed, the number of sampling points reflects the audio duration of each syllable, and the difference between the audio durations of two adjacent syllables has a considerable influence on the synthesized speech: a large difference means the two syllables differ in emphasis and speed and more sampling points need to be processed, while a small difference means fewer sampling points need to be processed. Therefore, the numbers of sampling points of the syllables can also be considered when calculating N.
  • to account for the silent gap between adjacent syllables, the average sound intensity at the beginning and at the end of the two adjacent syllables can also be considered when calculating the number of sampling points to be processed.
  • the average sound intensity at the end can be obtained by averaging the last M1 sampling points of the previous syllable, and the average sound intensity at the beginning by averaging the first M2 sampling points of the following syllable, where M1 and M2 can be set according to the characteristics of the syllables themselves
  • for example, M1 may be 10% of the total number of sampling points of the previous syllable and M2 may be 5% of the total number of sampling points of the following syllable, or M1 may be 1000 and M2 may be 2000; this specification places no limitation on these values.
  • in one embodiment, after repeated experiments by the applicant, to achieve a good synthesis effect with no obvious pause between the syllables after synthesis, M1 can be taken as 20% of the total number of audio sampling points of the previous syllable and M2 as 20% of the total number of audio sampling points of the following syllable.
  • the number N of sampling points to be processed can be calculated by the following formula:
  • Nw indicates whether the current two adjacent syllables form a word or a four-character idiom
  • SNpre indicates the number of sampling points of the previous syllable
  • SNnext indicates the number of sampling points of the following syllable
  • the tail average sound intensity (pre) indicates the average sound intensity of the last M1 sampling points of the previous syllable; the head average sound intensity (next) indicates the average sound intensity of the first M2 sampling points of the following syllable
  • M1 and M2 are integers.
  • when calculating the number N of sampling points to be processed, whether the two adjacent syllables form a word or idiom can be considered. To simplify the calculation, this influence factor is quantified: different values of Nw represent whether the two adjacent syllables form a word or idiom. Generally, if the two adjacent syllables can form a word, Nw is larger than when they cannot.
  • in order to achieve a better synthesis effect, if the two adjacent syllables form one word, Nw is taken as 2; if they are not in one word or four-character idiom, Nw is taken as 1; and if they are in one four-character idiom, Nw is taken as 2.
  • of course, the value of Nw can be set according to the specific circumstances, and this specification places no limitation on it.
  • the specific way of processing the sound intensities of the specified sampling points of the two syllables can be chosen according to the characteristics of the syllables.
  • in some embodiments, the sound intensity of the last N sampling points of the previous syllable is directly added to the sound intensity of the first N sampling points of the following syllable to obtain the superimposed sound intensity.
  • for example, suppose the sound intensities of the last five sampling points of the previous syllable and the first five sampling points of the following syllable are to be processed, the intensities of the last five sampling points of the previous syllable being 0.15, 0.10, 0.05, 0.03 and 0.01 and the intensities of the first five sampling points of the following syllable being 0.005, 0.01, 0.04, 0.06 and 0.07
  • the intensities of the superimposed portion of the processed speech are then 0.155, 0.11, 0.09, 0.09 and 0.08.
  • the sound intensity of the last N sampling points of the previous syllable and the sound intensity of the first N sampling points of the following syllable can also each be multiplied by preset weights and then added to obtain the superimposed sound intensity, the preset weights being set based on the order of the syllables and the order of the sampling points.
  • in the earlier part of the processed region the previous syllable should carry more weight, so its weight can be larger there; in the later part the following syllable should carry more weight, so its weight can be larger there.
  • for example, suppose the last five sampling points of the previous syllable and the first five sampling points of the following syllable are to be processed. The intensities of the last five sampling points of the previous syllable are 0.5, 0.4, 0.3, 0.2 and 0.1, with weights of 90%, 80%, 70%, 60% and 50% respectively, and the intensities of the first five sampling points of the following syllable are 0.1, 0.2, 0.3, 0.4 and 0.5, with weights of 10%, 20%, 30%, 40% and 50% respectively. The processed sound intensities are then 0.5 × 90% + 0.1 × 10%, 0.4 × 80% + 0.2 × 20%, 0.3 × 70% + 0.3 × 30%, 0.2 × 60% + 0.4 × 40% and 0.1 × 50% + 0.5 × 50%, namely 0.46, 0.36, 0.3, 0.28 and 0.3.
  • to ensure that the processed syllables do not sound broken, the sound intensity of the specified sampling points to be processed is generally kept small.
  • in one embodiment, the ratio of the sound intensity of each specified sampling point to the maximum sound intensity among the sampling points of that syllable is less than 0.5; for example, if the loudest sampling point of a syllable has a sound intensity of 1, then every specified sampling point to be processed has a sound intensity below 0.5.
  • suppose a voice device needs to synthesize the phrase "我喜欢跑步" ("I like to run").
  • five voice files with the pronunciations of the five Chinese characters "我", "喜", "欢", "跑" and "步" were recorded in advance and stored on the server, and the configuration information of each voice file is recorded at its beginning.
  • the sampling frequency is 48K and the sampling precision is 16 bits
  • the audio durations of "我", "喜", "欢", "跑" and "步" are 1 s, 0.5 s, 1 s, 1.5 s and 0.8 s respectively.
  • after receiving the text "我喜欢跑步" to be synthesized, the speech synthesis device downloads the five syllable voice files from the server and then processes consecutive syllable pairs one by one in text order. For example, to process "我" and "喜" first, the sound intensities of the last sampling points of "我" and the first sampling points of "喜" must be processed; before processing, the number of sampling points to be processed is calculated according to the formula
  • the different values of Nw indicate whether the current two adjacent syllables form a word or four-character idiom: if the two adjacent syllables form one word, Nw is taken as 2; if they are not in one word or four-character idiom, Nw is taken as 1; and if they are in one four-character idiom, Nw is taken as 2.
  • SNpre represents the number of sampling points of the previous syllable
  • SNnext represents the number of sampling points of the following syllable
  • the tail average sound intensity (pre) represents the average sound intensity of the last 20% of the sampling points of the previous syllable, and the head average sound intensity (next) represents the average sound intensity of the first 20% of the sampling points of the following syllable; M1 and M2 are integers.
  • substituting these data into the formula gives 711 sampling points to be processed; that is, the sound intensity data of the last 711 sampling points are obtained from the voice file of the syllable "我" and the sound intensity data of the first 711 sampling points are obtained from the voice file of the syllable "喜", and the obtained sound intensity data are then added directly to obtain the processed sound intensity.
  • the pairs "喜" and "欢", "欢" and "跑", and "跑" and "步" are processed in the same way, giving the synthesized speech for the text "我喜欢跑步".
  • in another example, the text the voice device needs to synthesize is "我们爱天安门" ("We love Tiananmen").
  • the voice files were recorded in word form, i.e. they comprise the three units "我们" ("we"), "爱" ("love") and "天安门" ("Tiananmen")
  • the voice files were downloaded from the server in advance and saved in a local directory of the voice device.
  • after receiving the text "我们爱天安门" to be synthesized, the server performs word segmentation on the text according to the form of the voice files; the segmentation may be completed by a word segmentation algorithm, dividing the text into "我们 / 爱 / 天安门", and the segmented text is then sent to the speech synthesis device.
  • after receiving the text, the speech synthesis device first obtains the voice files of the three units "我们", "爱" and "天安门", in which the sampling frequency is 48K, the sampling precision is 8 bits, and the audio durations of the three pronunciations are 2 s, 1 s and 3 s. "我们" and "爱" are processed first; before processing, the number of sampling points to be processed is calculated according to the formula
  • the different values of Nw indicate whether the current two adjacent syllables form a word or four-character idiom: if the two adjacent syllables form one word, Nw is taken as 2; if they are not in one word or four-character idiom, Nw is taken as 1; and if they are in one four-character idiom, Nw is taken as 2.
  • SNpre represents the number of sampling points of the previous syllable
  • SNnext represents the number of sampling points of the following syllable
  • the tail average sound intensity (pre) represents the average sound intensity of the last 15% of the sampling points of the previous syllable, and the head average sound intensity (next) represents the average sound intensity of the first 15% of the sampling points of the following syllable; M1 and M2 are integers.
  • Nw is taken as 1 here, since "我们" and "爱" do not form one word.
  • substituting these data into the formula gives 5689 sampling points to be processed; that is, the sound intensity data of the last 5689 sampling points of "我们" and the sound intensity data of the first 5689 sampling points of "爱" are obtained from the voice files.
  • the speech synthesis device 300 includes:
  • the obtaining unit 301 obtains the voice file of each syllable in the text of the speech to be synthesized, the voice file storing the sound intensity data of the sampling points of the syllable, and obtains the sound intensity data of specified sampling points from the voice files of two adjacent syllables; the specified sampling points of the previous syllable are the last N sampling points of that syllable, and the specified sampling points of the following syllable are the first N sampling points of that syllable, where N is an integer;
  • the processing unit 302 processes the sound intensity data of the specified sampling points of the two syllables to obtain synthesized speech.
  • the voice file records: the audio duration of the syllable, the sound intensity data of the sampling points, the sampling frequency, the sampling precision, and/or the number of sampling points.
  • processing the sound intensity data of the specified sampling points of the two syllables specifically includes the direct or weighted superposition described above.
  • when the text of the speech to be synthesized is Chinese, the voice files are recorded with the four tones of the syllables of the Chinese characters.
  • the ratio of the sound intensity of each specified sampling point to the maximum sound intensity among the sampling points of the syllable is less than 0.5.
  • N is calculated based on whether the two adjacent syllables form a word or a four-character idiom, the numbers of sampling points of the two adjacent syllables, the average sound intensity of the last M1 sampling points of the previous syllable, and/or the average sound intensity of the first M2 sampling points of the following syllable, where M1 and M2 are integers
  • M1 is 20% of the total number of audio sampling points of the previous syllable, and M2 is 20% of the total number of audio sampling points of the following syllable.
  • if the two adjacent syllables form one word, the conversion coefficient is 2; if they are not in one word or four-character idiom, the conversion coefficient is 1; and if they are in one four-character idiom, the conversion coefficient is 2.
  • Nw indicates whether the current two adjacent syllables form a word or a four-character idiom
  • SNpre indicates the number of sampling points of the previous syllable
  • SNnext indicates the number of sampling points of the following syllable
  • the tail average sound intensity (pre) indicates the average sound intensity of the last M1 sampling points of the previous syllable; the head average sound intensity (next) indicates the average sound intensity of the first M2 sampling points of the following syllable
  • M1 and M2 are integers.
  • before acquiring the voice files of each syllable in the text of the speech to be synthesized, the method further includes: performing word segmentation processing on the text.
  • the word segmentation processing of the text is done by the server.
  • for relevant parts, reference may be made to the description of the method embodiments.
  • the device embodiments described above are only schematic; units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solution in this specification, and those of ordinary skill in the art can understand and implement it without creative effort.
  • the speech synthesis device includes: a processor 401 and a memory 402;
  • the memory is used to store executable computer instructions
  • the specified sampling points of the previous syllable are the last N sampling points of that syllable, and the specified sampling points of the following syllable are the first N sampling points of that syllable, where N is an integer
  • the sound intensity data of the specified sampling points of the two syllables are processed to obtain synthesized speech.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Document Processing Apparatus (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

A speech synthesis method, a device, an apparatus, and a storage medium. The method comprises: acquiring a voice file of each syllable in a text awaiting speech synthesis, the voice file storing sound intensity data of sampling points of a given syllable (S102); acquiring sound intensity data of specified sampling points from the voice files of two adjacent syllables, respectively, wherein the specified sampling points of the leading syllable are the last N sampling points of that syllable, the specified sampling points of the trailing syllable are the first N sampling points of that syllable, and N is an integer (S104); and processing the sound intensities of the specified sampling points of the two syllables to obtain synthesized speech data (S106). By processing the specified sampling points at the tail and head of two adjacent syllables, the invention achieves more natural speech synthesis; and because only a portion of the sampling points of adjacent syllables undergoes simple processing, excessive computation is avoided, ensuring applicability to apparatuses with low processing power, such as embedded apparatuses.

Description

Speech synthesis method, device, and equipment
Technical Field
The present invention relates to the technical field of speech synthesis, and in particular to a speech synthesis method, device, and equipment.
Background Art
Voice broadcasting is used in many areas of daily life, for example the automatic announcement of the received amount when paying with Alipay or WeChat Pay, and the intelligent announcement systems used in public places such as supermarkets and railway stations. Voice broadcasting requires speech synthesis technology, that is, stitching together the words of different syllables to form the passage that needs to be broadcast. Among current technologies for producing broadcast speech, some can make the broadcast speech sound natural but demand high processing power from the device, while others demand little processing power but sound unnatural.
Summary of the Invention
To overcome the problems in the related art, the present invention provides a speech splicing method, device, and equipment.
First, this specification provides a speech synthesis method, the method comprising:
acquiring a voice file of each syllable in the text of the speech to be synthesized, the voice file storing sound intensity data of the sampling points of the syllable;
acquiring the sound intensity data of specified sampling points from the voice files of two adjacent syllables, respectively, wherein the specified sampling points of the previous syllable are the last N sampling points of that syllable, and the specified sampling points of the following syllable are the first N sampling points of that syllable, N being an integer;
processing the sound intensity data of the specified sampling points of the two syllables to obtain synthesized speech.
Secondly, this specification provides a speech synthesis device, the device comprising:
an acquiring unit, which acquires a voice file of each syllable in the text of the speech to be synthesized, the voice file storing sound intensity data of the sampling points of the syllable, and which acquires the sound intensity data of specified sampling points from the voice files of two adjacent syllables, respectively, wherein the specified sampling points of the previous syllable are the last N sampling points of that syllable, and the specified sampling points of the following syllable are the first N sampling points of that syllable, N being an integer;
a processing unit, which processes the sound intensity data of the specified sampling points of the two syllables to obtain synthesized speech.
In addition, this specification provides a speech synthesis apparatus, the apparatus comprising a processor and a memory;
the memory is used to store executable computer instructions;
the processor implements the following steps when executing the computer instructions:
acquiring a voice file of each syllable in the text of the speech to be synthesized, the voice file storing sound intensity data of the sampling points of the syllable;
acquiring the sound intensity data of specified sampling points from the voice files of two adjacent syllables, respectively, wherein the specified sampling points of the previous syllable are the last N sampling points of that syllable, and the specified sampling points of the following syllable are the first N sampling points of that syllable, N being an integer;
processing the sound intensity data of the specified sampling points of the two syllables to obtain synthesized speech.
Beneficial effects of this specification: during speech synthesis, the sound intensities of the specified sampling points at the tail of the previous syllable and the head of the following syllable of each adjacent pair are processed, making the synthesized speech more natural. Moreover, since no training of a learning model is needed and only a portion of the sampling points of adjacent syllables undergoes simple processing, high-intensity computation is avoided, making the solution widely applicable and suitable for devices with low processing capacity, such as embedded devices.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory and do not limit the present invention.
Brief Description of the Drawings
The drawings herein are incorporated into and constitute a part of this specification, show embodiments consistent with the present invention, and together with the specification serve to explain the principles of the present invention.
FIG. 1 is a flowchart of a speech synthesis method according to an exemplary embodiment of this specification;
FIG. 2 is a schematic diagram of a speech synthesis method according to an exemplary embodiment of this specification;
FIG. 3 is a logic block diagram of a speech synthesis device according to an exemplary embodiment of this specification;
FIG. 4 is a logic block diagram of a speech synthesis apparatus according to an exemplary embodiment of this specification.
Detailed Description
Exemplary embodiments will be described in detail here, examples of which are shown in the drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of devices and methods consistent with some aspects of the invention as detailed in the appended claims.
The terminology used in the present invention is for the purpose of describing specific embodiments only and is not intended to limit the present invention. The singular forms "a", "said", and "the" used in the present invention and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the present invention to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present invention, first information may also be referred to as second information and, similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
Voice broadcasting is widely used in many areas of daily life, such as the announcement of train information in railway stations, the broadcast of merchandise promotions in supermarkets, and the now-common arrival announcement when paying with Alipay. Voice broadcasting requires speech synthesis technology, that is, stitching together the words of different syllables to form the passage that needs to be broadcast. Some current speech synthesis methods are based on deep learning models that generate simulated speech; the speech synthesized in this way sounds natural, but because large amounts of training and computing resources are required, such methods are difficult to run on systems with weak processing power, such as embedded systems. For such systems, the main current method is splicing: the pronunciation of each word is recorded first, and then the recorded pronunciation of each word of the sentence to be played is played back in sequence. This places low demands on the processing capacity of the speech synthesis system, but the synthesized speech is of relatively poor quality and sounds unnatural.
To solve the problem that speech synthesized by the splicing method sounds unnatural, this specification provides a speech synthesis method, which can be used in a device implementing speech synthesis. The flowchart of the speech synthesis method is shown in FIG. 1 and comprises steps S102 to S106:
S102: acquire a voice file of each syllable in the text of the speech to be synthesized, the voice file storing sound intensity data of the sampling points of the syllable;
S104: acquire the sound intensity data of specified sampling points from the voice files of two adjacent syllables, respectively, wherein the specified sampling points of the previous syllable are the last N sampling points of that syllable, and the specified sampling points of the following syllable are the first N sampling points of that syllable, N being an integer;
S106: process the sound intensity data of the specified sampling points of the two syllables to obtain synthesized speech.
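As an illustration only, the three steps might be sketched as follows in Python, assuming each syllable's normalized sound intensity data has already been loaded into a NumPy array and that N is passed in as a constant (the filing computes N per syllable pair; all names here are hypothetical):

```python
import numpy as np

def synthesize(syllables: list[np.ndarray], n: int) -> np.ndarray:
    """Splice syllables by processing the last n intensity samples of each
    syllable together with the first n samples of the next one (S104/S106).
    Here "processing" is the direct addition described later in the text."""
    out = syllables[0]
    for nxt in syllables[1:]:
        overlap = out[-n:] + nxt[:n]  # superimpose the specified sampling points
        out = np.concatenate([out[:-n], overlap, nxt[n:]])
    return out
```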
After the text to be synthesized is received, the voice file of each syllable in the text is obtained according to the content of the text. In some cases the voice files are stored locally and the speech synthesis device obtains them directly; in other cases they are stored in the cloud and the device downloads them when needed.
The voice files can be recordings of the different syllables made in advance, in WAV, MP3, or another format. When a syllable is recorded, the analog sound signal is sampled and converted into binary sample data to obtain the final voice file. When syllables are recorded and saved as voice files, each syllable can be recorded separately or in the form of a word or idiom. For example, the syllables of the phrase "我喜欢跑步" ("I like to run") can be recorded and saved as five separate voice files, one for each of the five syllables "我", "喜", "欢", "跑" and "步", or the words can be combined and recorded as three voice files, "我", "喜欢" and "跑步". Voice files can be recorded according to actual needs, and this specification places no limitation on this.
In one embodiment, if the syllables were recorded in the form of word combinations, word segmentation may be performed on the text to be synthesized before the voice files of its syllables are acquired, so that the segmentation result can be used to look up the voice files. For example, suppose the text to be synthesized is "我们在吃饭" ("we are eating") and the saved voice files were recorded and stored in the word form "我们", "在", "吃饭". Before obtaining the voice files of these syllables, the text "我们在吃饭" is first segmented so that the voice file of each corresponding word or character can be found; the segmentation can be completed by a word segmentation algorithm. After segmentation, "我们在吃饭" is divided into "我们", "在", "吃饭", and the voice files of these three units are then obtained for subsequent speech synthesis.
For devices with weak processing capabilities, such as embedded systems, running a word segmentation algorithm in addition to performing speech synthesis may consume considerable memory and power and slow down processing. To reduce the resource consumption of the speech synthesis device, in one embodiment the word segmentation of the text is completed on the server side. Since the device's voice files are all downloaded from the server, the voice files saved on the server are consistent with those on the device, so the server can segment the text to be synthesized according to the voice files and then send the segmented text down to the device.
In addition, if the text of the speech to be synthesized is Chinese text, then because of the large number of Chinese characters, storing the pinyin of every character when recording the syllable voice files would make the files very large and consume memory. Instead, only the four tones of the Chinese syllables need be stored, without storing the pinyin of each individual character, which reduces the size of the stored voice files and saves memory.
In one embodiment, the voice file records the audio duration of the syllable, the sound intensity data of the sampling points, the sampling frequency, the sampling precision, and/or the number of sampling points. The audio duration is the pronunciation duration of each syllable and characterizes its length; the shorter the audio duration, the shorter the syllable's pronunciation. The sampling frequency is the number of sound intensity samples collected per second; for example, a sampling frequency of 48K means that 48K sound intensity values are collected per second. The number of sampling points of each syllable is the product of its audio duration and the sampling frequency; for example, if the audio duration of the syllable "我" is 1.2 s and the sampling frequency is 48K, the syllable "我" has 1.2 × 48K = 57.6K sampling points in total. The sampling precision is the resolution with which the capture card processes sound and reflects the accuracy of the sound waveform amplitude (that is, the sound intensity); the higher the sampling precision, the more realistic the recorded and replayed sound. The sampling precision is also called the number of sampling bits: since the sound signal is saved in binary form, it can be stored with 8 or 16 bits. With 8 bits, the collected sound intensity values lie between 0 and 255; with 16 bits, between 0 and 65535. More bits give higher sound quality but require more storage space. Before the sound intensity is processed, the data are usually normalized; for example, with a sampling precision of 8 bits the sound intensity values lie between 0 and 255 and are normalized to values between 0 and 1 to facilitate subsequent processing.
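For example, the relationship between duration, sampling frequency, and sample count, together with the normalization step, can be sketched as follows (assuming unsigned 8-bit PCM samples; the function names are illustrative):

```python
import numpy as np

SAMPLE_RATE = 48_000  # 48K samples per second, as in the examples above

def sample_count(duration_s: float, rate: int = SAMPLE_RATE) -> int:
    """Number of sampling points = audio duration x sampling frequency,
    e.g. 1.2 s x 48K = 57.6K sampling points for the syllable '我'."""
    return int(duration_s * rate)

def normalize_8bit(raw: np.ndarray) -> np.ndarray:
    """Map unsigned 8-bit sound intensity values (0..255) into [0, 1]."""
    return raw.astype(np.float32) / 255.0

print(sample_count(1.2))                        # 57600
print(normalize_8bit(np.array([0, 128, 255])))  # approximately [0.0, 0.502, 1.0]
```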
After the voice files of the syllables in the text are obtained, the sound intensity data of the specified sampling points of each pair of adjacent syllables can be obtained from the voice files, where the specified sampling points of the previous syllable are the last N sampling points of that syllable and the specified sampling points of the following syllable are the first N sampling points of that syllable, N being an integer. After the sound intensity data of the last N sampling points of the previous syllable and the first N sampling points of the following syllable are processed, the synthesized speech is obtained. For example, the sound intensity data of the last 1000 sampling points of the previous syllable can be processed together with the data of the first 1000 sampling points of the following syllable, so that the transition between the two syllables is more natural after synthesis. FIG. 2 is a schematic diagram of a text undergoing speech synthesis: when synthesizing the sentence "我喜欢跑步", the sound intensities of the specified sampling points of the previous syllable and of the following syllable are processed pair by pair to obtain the synthesized speech, where the 4.5% and 5% in the figure represent the ratio of the number of processed sampling points to the number of sampling points of the previous syllable. By processing the sound intensity data of the specified sampling points at the tail and head of adjacent syllables, synthesized speech with fairly natural transitions is obtained.
When processing two adjacent syllables, the inherent characteristics of both syllables must be preserved, so the processed portion cannot be too large; the leading and trailing silence of the two syllables must also be considered, because if the silent gap is too long the processed speech will contain an obvious pause, making the synthesized speech sound unnatural. Taking these factors into account, in one embodiment the number N of sampling points to be processed when determining the specified sampling points can be calculated based on whether the two adjacent syllables form a word or a four-character idiom, the numbers of sampling points of the two adjacent syllables, the average sound intensity of the last M1 sampling points of the previous syllable, and/or the average sound intensity of the first M2 sampling points of the following syllable, where M1 and M2 are integers. If two syllables can form a word or idiom, somewhat more sampling points can be processed, so N can be determined according to whether the two adjacent syllables can form a word. The sound intensity at the beginning and end of each syllable also deserves attention during processing, so N can also be calculated from the average sound intensity of the last M1 sampling points of the previous syllable or of the first M2 sampling points of the following syllable. Furthermore, when the sampling frequency is fixed, the number of sampling points reflects the audio duration of each syllable, and the difference between the audio durations of two adjacent syllables has a considerable influence on the synthesized speech: a large difference means the two syllables differ in emphasis and speed and more sampling points need to be processed, while a small difference means fewer sampling points need to be processed. Therefore, the numbers of sampling points of the syllables can also be considered when calculating N.
To account for the silent gap between adjacent syllables, the average sound intensity at the beginning and at the end of the two adjacent syllables can also be considered when calculating the number of sampling points to be processed. The average sound intensity at the end can be obtained by averaging the sound intensity of the last M1 sampling points of the syllable, and the average sound intensity at the beginning by averaging the first M2 sampling points of the syllable, where M1 and M2 can be set according to the characteristics of the syllables themselves; for example, M1 may be 10% of the total number of sampling points of the previous syllable and M2 may be 5% of the total number of sampling points of the following syllable, or M1 may be 1000 and M2 2000, and this specification places no limitation on this. In one embodiment, after repeated experiments by the applicant, to achieve a good synthesis effect with no obvious pause between the syllables after synthesis, M1 can be taken as 20% of the total number of audio sampling points of the previous syllable and M2 as 20% of the total number of audio sampling points of the following syllable.
Further, in one embodiment, the number N of sampling points to be processed can be calculated by the following formula:
[Formula: N is computed from Nw, SNpre, SNnext, and the tail and head average sound intensities defined below; the exact expression is given as an equation image (PCTCN2019098086-appb-000001) in the original filing.]
Here the different values of Nw indicate whether the current two adjacent syllables form a word or four-character idiom; SNpre denotes the number of sampling points of the previous syllable and SNnext the number of sampling points of the following syllable; the tail average sound intensity (pre) denotes the average sound intensity of the last M1 sampling points of the previous syllable; the head average sound intensity (next) denotes the average sound intensity of the first M2 sampling points of the following syllable; M1 and M2 are integers.
Whether the two adjacent syllables form a word or idiom can thus be considered when calculating N. To simplify the calculation, this influence factor is quantified: different values of Nw represent whether the two adjacent syllables form a word or idiom. Generally, if the two adjacent syllables can form a word, Nw is larger than when they cannot. In one embodiment, to achieve a good synthesis effect, if the two adjacent syllables form one word, Nw is taken as 2; if they are not in one word or four-character idiom, Nw is taken as 1; and if they are in one four-character idiom, Nw is taken as 2. Of course, the value of Nw can be set according to the specific circumstances, and this specification places no limitation on it.
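As a small illustration, the quantified factor from this embodiment could be written as follows (the function name is hypothetical):

```python
def word_factor(forms_word: bool, in_idiom: bool) -> int:
    """Nw from the embodiment: 2 if the adjacent syllables form one word
    or lie in one four-character idiom, otherwise 1. Other values are
    permitted by the specification."""
    return 2 if (forms_word or in_idiom) else 1
```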
例如,需要合成“我”、“不”两个音节,其中“我”这个音节的采样为96K,“不”这个音节的采样数量为48K,即SNpre=96K,SNnext=48K,这个音节不组成词语,所以Nw可以取1,即Nw=1,取“我”这个音节的最后2K的采样点的音强,计算最后2K个采样点的平均音强为0.3,即末尾平均音强pre=0.3,取“不”这个音节的前面2K个采样点的音强,计算前面2K个采样点的平均音强为0.2,开头平均音强next=0.2,代入公式计算,可得到N的值为3920。即取前一个音节的最后3920个采样点与后一个音节前3920个采样点的音强数据,将这些音强数据处理后得到合成的语音。For example, you need to synthesize two syllables of "I" and "No", where the sample of the syllable of "I" is 96K, and the number of samples of the syllable of "No" is 48K, that is, SNpre = 96K, SNnext = 48K, this syllable is not composed Words, so Nw can be taken as 1, that is, Nw = 1, taking the sound intensity of the last 2K sampling points of the syllable "me", and calculating the average sound intensity of the last 2K sampling points as 0.3, that is, the average sound intensity at the end pre = 0.3 , Take the sound intensity of the first 2K sampling points of the "no" syllable, calculate the average sound intensity of the first 2K sampling points as 0.2, the average sound intensity of the beginning next = 0.2, and substitute the formula to calculate, the value of N can be obtained as 3920. That is, the sound intensity data of the last 3920 sampling points of the previous syllable and the 3920 sampling points of the following syllable are taken, and the synthesized speech is obtained after processing these sound intensity data.
After the sound intensity data of the designated sampling points is acquired, the specific way of processing the intensities of the designated sampling points of the two syllables can also be chosen according to the characteristics of the syllables. For example, in some embodiments, the intensities of the last N sampling points of the preceding syllable are added directly to those of the first N sampling points of the following syllable to obtain the superimposed intensities. Suppose the last five sampling points of the preceding syllable and the first five of the following syllable are to be processed, the last five intensities of the preceding syllable being 0.15, 0.10, 0.05, 0.03 and 0.01 and the first five of the following syllable being 0.005, 0.01, 0.04, 0.06 and 0.07; the intensities of the superimposed portion of the speech are then 0.155, 0.11, 0.09, 0.09 and 0.08.
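A minimal sketch of this direct-addition mode, reproducing the five-point example above (assuming the intensities are held in arrays; not part of the original filing):

```python
import numpy as np

def overlap_add(prev_tail, next_head):
    # Direct sum of the last-N samples of the preceding syllable
    # and the first-N samples of the following syllable.
    return np.asarray(prev_tail) + np.asarray(next_head)

print(overlap_add([0.15, 0.10, 0.05, 0.03, 0.01],
                  [0.005, 0.01, 0.04, 0.06, 0.07]))
# prints approximately: [0.155 0.11  0.09  0.09  0.08 ]
```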
Of course, to obtain a higher-quality and more natural synthesis effect, in some embodiments the intensities of the last N sampling points of the preceding syllable and of the first N sampling points of the following syllable can each be multiplied by a preset weight before being added, giving the superimposed intensities, where the preset weights are set based on the order of the syllables and the order of the sampling points. When processing the intensities of the two adjacent syllables, each intensity can be multiplied by a weight before the addition: in the earlier part of the processed region the preceding syllable should carry more, so its weight can be larger there, while in the later part the following syllable should carry more, so its weight can be larger there. For example, suppose the last five sampling points of the preceding syllable, with intensities 0.5, 0.4, 0.3, 0.2 and 0.1 and weights 90%, 80%, 70%, 60% and 50%, are to be processed with the first five sampling points of the following syllable, with intensities 0.1, 0.2, 0.3, 0.4 and 0.5 and weights 10%, 20%, 30%, 40% and 50%. The processed intensities are then 0.5 × 90% + 0.1 × 10%, 0.4 × 80% + 0.2 × 20%, 0.3 × 70% + 0.3 × 30%, 0.2 × 60% + 0.4 × 40% and 0.1 × 50% + 0.5 × 50%, i.e. 0.46, 0.36, 0.3, 0.28 and 0.3.
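The weighted mode as a matching sketch, reproducing the numbers above; the per-sample weights are the ones from the example, and a linear crossfade like this is only one possible choice:

```python
import numpy as np

def weighted_overlap(prev_tail, next_head, w_prev, w_next):
    # Earlier overlap samples lean on the preceding syllable,
    # later ones on the following syllable.
    return (np.asarray(prev_tail) * np.asarray(w_prev)
            + np.asarray(next_head) * np.asarray(w_next))

print(weighted_overlap([0.5, 0.4, 0.3, 0.2, 0.1],
                       [0.1, 0.2, 0.3, 0.4, 0.5],
                       [0.9, 0.8, 0.7, 0.6, 0.5],
                       [0.1, 0.2, 0.3, 0.4, 0.5]))
# prints approximately: [0.46 0.36 0.3  0.28 0.3 ]
```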
To ensure that the processed syllables do not clip, the intensity of the designated sampling points to be processed is generally kept small enough that no clipping occurs after processing. In one embodiment, the ratio of the intensity of a designated sampling point to the maximum intensity among the sampling points of that syllable is less than 0.5. For example, if the sampling point with the largest intensity among all sampling points of a syllable has an intensity of 1, then every designated sampling point to be processed has an intensity below 0.5.
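A small helper expressing this guard (an assumption about how one might check it, not something the filing specifies):

```python
import numpy as np

def edges_safe(wave, n, ratio=0.5):
    # True if both edge windows stay below `ratio` times the syllable's
    # peak intensity, so the summed overlap cannot clip.
    wave = np.abs(np.asarray(wave, dtype=float))
    peak = wave.max()
    return bool((wave[:n] < ratio * peak).all()
                and (wave[-n:] < ratio * peak).all())
```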
Several specific embodiments are used below to further explain the speech synthesis method provided in this specification. Suppose a speech device needs to synthesize the sentence "我喜欢跑步" ("I like running"). Before synthesis, five voice files containing the pronunciations of the five Chinese characters "我", "喜", "欢", "跑" and "步" are recorded in advance and stored on a server. The beginning of each of the five voice files records its configuration information: a sampling frequency of 48K, a sampling precision of 16 bits, and the audio duration of the pronunciation. The audio durations of "我", "喜", "欢", "跑" and "步" are 1 s, 0.5 s, 1 s, 1.5 s and 0.8 s respectively. On receiving the text to be synthesized, "我喜欢跑步", the speech synthesis device downloads the voice files of these five syllables from the server and then processes each pair of consecutive syllables in text order. For example, "我" and "喜" are processed first: the intensities of the last portion of the sampling points of "我" and of the first portion of the sampling points of "喜" need to be processed, and before processing, the number of sampling points to be processed is calculated by the following formula:
[Formula for N, reproduced in the original filing only as an image (PCTCN2019098086-appb-000002); it is the same calculation as above.]
Here, the value of Nw indicates whether the two adjacent syllables form a word or a four-character idiom: Nw is 2 if the two adjacent syllables form a word, 1 if they are in neither a word nor a four-character idiom, and 2 if they are in a four-character idiom. SNpre is the number of sampling points of the preceding syllable and SNnext that of the following syllable; the end average intensity, pre, is the average sound intensity of the last 20% of the sampling points of the preceding syllable; the beginning average intensity, next, is the average sound intensity of the first 20% of the sampling points of the following syllable; M1 and M2 are integers.
Since "我" and "喜" cannot form a word or idiom, Nw in the formula is 1. The number of sampling points of a syllable equals the sampling frequency multiplied by the audio duration, so SNpre = 1 × 48K = 48K for "我" and SNnext = 0.5 × 48K = 24K for "喜". The average intensity of the last 20% of the sampling points of "我" is 0.3, and the average intensity of the first 20% of the sampling points of "喜" is 0.1. Substituting these data into the formula gives 711 sampling points to be processed; that is, the intensity data of the last 711 sampling points is taken from the voice file of "我" and the intensity data of the first 711 sampling points from the voice file of "喜", and the acquired intensity data is added directly to obtain the processed intensities. "喜" and "欢", "欢" and "跑", and "跑" and "步" are processed in the same way, yielding the synthesized speech for the text "我喜欢跑步".
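As a hedged end-to-end sketch of this pairwise stitching (not part of the filing), the overlap computation can be wrapped in a loop over the text's syllables; because the filing reproduces the N formula only as an image, `compute_n` is left as a caller-supplied stub rather than a definitive implementation:

```python
import numpy as np

def synthesize(waves, compute_n):
    # `waves`: syllable intensity arrays in text order.
    # `compute_n(prev, nxt)`: returns the overlap length N for one
    # adjacent pair; it stands in for the patent's formula.
    waves = [np.asarray(w, dtype=float) for w in waves]
    out = waves[0]
    for nxt in waves[1:]:
        n = compute_n(out, nxt)
        n = max(1, min(n, len(out), len(nxt)))   # keep N in range
        overlap = out[-n:] + nxt[:n]             # direct-addition mode
        out = np.concatenate([out[:len(out) - n], overlap, nxt[n:]])
    return out
```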
As another example, suppose the text the speech device needs to synthesize is "我们爱天安门" ("We love Tiananmen"), and the voice files were recorded in the form of words; that is, the voice files include the three words "我们", "爱" and "天安门", downloaded from the server in advance and saved in the local directory of the speech device. After receiving the text to be synthesized, "我们爱天安门", the server performs word segmentation on the text according to the form of the voice files; the segmentation can be completed by a word-segmentation algorithm. The text is divided into "我们/爱/天安门", and the segmented text is then delivered to the speech synthesis device. On receiving the text, the speech synthesis device first obtains the voice files of the three words "我们", "爱" and "天安门", whose sampling frequency is 48K, whose sampling precision is 8 bits, and whose audio durations are 2 s, 1 s and 3 s respectively. "我们" and "爱" are then processed first; before processing, the number of sampling points to be processed is calculated by the following formula:
[Formula for N, reproduced in the original filing only as an image (PCTCN2019098086-appb-000003); it is the same calculation as above.]
Here, the value of Nw indicates whether the two adjacent syllables form a word or a four-character idiom: Nw is 2 if the two adjacent syllables form a word, 1 if they are in neither a word nor a four-character idiom, and 2 if they are in a four-character idiom. SNpre is the number of sampling points of the preceding syllable and SNnext that of the following syllable; the end average intensity, pre, is the average sound intensity of the last 15% of the sampling points of the preceding syllable; the beginning average intensity, next, is the average sound intensity of the first 20% of the sampling points of the following syllable; M1 and M2 are integers.
From the sampling frequency and audio durations, SNpre = 96K and SNnext = 48K. The average intensity of the last 15% of the sampling points of "我们" is 0.2, and the average intensity of the first 20% of the sampling points of "爱" is 0.3. The preceding and following units do not form a word, so Nw = 1. Substituting these data into the formula gives 5689 sampling points to be processed; that is, the intensity data of the last 5689 sampling points of "我们" and of the first 5689 sampling points of "爱" is obtained from the voice files. After the intensity data of the processed sampling points is acquired, the intensity of each sampling point of "我们" is multiplied by a certain weight, the intensity of each sampling point of "爱" is multiplied by a certain weight, and the results are added to obtain the intensities of the processed portion. "爱" and "天安门" are processed by the same method, yielding the synthesized speech for the text "我们爱天安门".
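For the segmentation step, the filing does not name a specific algorithm; purely as an assumed example, an off-the-shelf segmenter such as jieba could produce the "我们/爱/天安门" split:

```python
import jieba  # off-the-shelf Chinese word segmenter, one possible choice

print(jieba.lcut("我们爱天安门"))
# expected to print something like: ['我们', '爱', '天安门']
```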
Corresponding to the above speech synthesis method, this specification further provides a speech synthesis apparatus. As shown in FIG. 3, the speech synthesis apparatus 300 includes:
an acquiring unit 301, which acquires the voice file of each syllable in the text of the speech to be synthesized, the voice file storing the sound intensity data of the sampling points of the syllable, and acquires the sound intensity data of designated sampling points from the voice files of two adjacent syllables respectively, where the designated sampling points of the preceding syllable are the last N sampling points of that syllable, the designated sampling points of the following syllable are the first N sampling points of that syllable, and N is an integer; and
a processing unit 302, which processes the sound intensity data of the designated sampling points of the two syllables to obtain synthesized speech.
In one embodiment, the voice file records the audio duration of the syllable, the sound intensity data of the sampling points, the sampling frequency, the sampling precision and/or the number of sampling points.
In one embodiment, processing the sound intensity data of the designated sampling points of the two syllables specifically includes:
adding the sound intensity of the last N sampling points of the preceding syllable to the sound intensity data of the first N sampling points of the following syllable; or
multiplying the sound intensity data of the last N sampling points of the preceding syllable and the sound intensity data of the first N sampling points of the following syllable by preset weights respectively before adding them, where the preset weights are set based on the order of the syllables and the order of the sampling points.
In one embodiment, the text of the speech to be synthesized is Chinese, and the voice files are voice files in which the four tones of Chinese-character syllables are recorded.
In one embodiment, the ratio of the sound intensity data of a designated sampling point to the maximum sound intensity data among the sampling points of the syllable is less than 0.5.
In one embodiment, N is calculated based on whether the two adjacent syllables form a word or a four-character idiom, the numbers of sampling points of the two adjacent syllables, the average sound intensity of the last M1 sampling points of the preceding syllable and/or the average sound intensity of the first M2 sampling points of the following syllable, where M1 and M2 are integers.
In one embodiment, M1 is 20% of the total number of audio sampling points of the preceding syllable, and M2 is 20% of the total number of audio sampling points of the following syllable.
In one embodiment, the conversion coefficient Nw is 2 if the two adjacent syllables form a word, 1 if the two adjacent syllables are in neither a word nor a four-character idiom, and 2 if the two adjacent syllables are in a four-character idiom.
In one embodiment, N is specifically calculated by the following formula:
[Formula for N, reproduced in the original filing only as images (PCTCN2019098086-appb-000004 and PCTCN2019098086-appb-000005); it is the same calculation as in the method above.]
Here, the value of Nw indicates whether the two adjacent syllables form a word or a four-character idiom; SNpre is the number of sampling points of the preceding syllable and SNnext that of the following syllable; the end average intensity, pre, is the average sound intensity of the last M1 sampling points of the preceding syllable; the beginning average intensity, next, is the average sound intensity of the first M2 sampling points of the following syllable; M1 and M2 are integers.
In one embodiment, before acquiring the voice file of each syllable in the text of the speech to be synthesized, the method further includes:
performing word segmentation on the text.
In one embodiment, the word segmentation of the text is completed on the server side.
For the implementation of the functions and roles of each unit in the above apparatus, see the implementation of the corresponding steps in the above method for details, which are not repeated here.
As for the apparatus embodiments, since they basically correspond to the method embodiments, reference can be made to the relevant parts of the description of the method embodiments. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this specification, which those of ordinary skill in the art can understand and implement without creative effort.
In addition, this specification further provides a speech synthesis device. As shown in FIG. 4, the speech synthesis device includes a processor 401 and a memory 402;
the memory is used to store executable computer instructions; and
the processor implements the following steps when executing the computer instructions:
acquiring the voice file of each syllable in the text of the speech to be synthesized, the voice file storing the sound intensity data of the sampling points of the syllable;
acquiring the sound intensity data of designated sampling points from the voice files of two adjacent syllables respectively, where the designated sampling points of the preceding syllable are the last N sampling points of that syllable, the designated sampling points of the following syllable are the first N sampling points of that syllable, and N is an integer; and
processing the sound intensities of the designated sampling points of the two syllables to obtain synthesized speech.
The above are merely preferred embodiments of this specification and are not intended to limit it; any modification, equivalent replacement, improvement and the like made within the spirit and principles of this specification shall fall within its scope of protection.

Claims (13)

  1. A speech synthesis method, the method comprising:
    acquiring a voice file of each syllable in a text of speech to be synthesized, the voice file storing sound intensity data of sampling points of the syllable;
    acquiring sound intensity data of designated sampling points from the voice files of two adjacent syllables respectively, wherein the designated sampling points of the preceding syllable are the last N sampling points of that syllable, the designated sampling points of the following syllable are the first N sampling points of that syllable, and N is an integer; and
    processing the sound intensity data of the designated sampling points of the two syllables to obtain synthesized speech.
  2. The speech synthesis method of claim 1, wherein the voice file records an audio duration of the syllable, sound intensity data of the sampling points, a sampling frequency, a sampling precision and/or a number of sampling points.
  3. The speech synthesis method of claim 1, wherein processing the sound intensities of the designated sampling points of the two syllables specifically comprises:
    adding the sound intensity data of the last N sampling points of the preceding syllable to the sound intensity data of the first N sampling points of the following syllable; or
    multiplying the sound intensity data of the last N sampling points of the preceding syllable and the sound intensity data of the first N sampling points of the following syllable by preset weights respectively before adding them, wherein the preset weights are set based on the order of the syllables and the order of the sampling points.
  4. The speech synthesis method of claim 1, wherein the text of the speech to be synthesized is Chinese, and the voice files are voice files in which the four tones of Chinese-character syllables are recorded.
  5. The speech synthesis method of claim 1, wherein a ratio of the sound intensity data of a designated sampling point to the maximum sound intensity data among the sampling points of the syllable is less than 0.5.
  6. The speech synthesis method of claim 1, wherein N is determined based on whether the two adjacent syllables form a word or a four-character idiom, the numbers of sampling points of the two adjacent syllables, the average sound intensity of the last M1 sampling points of the preceding syllable and/or the average sound intensity of the first M2 sampling points of the following syllable, wherein M1 and M2 are integers.
  7. The speech synthesis method of claim 6, wherein M1 is 20% of the total number of sampling points of the preceding syllable, and M2 is 20% of the total number of sampling points of the following syllable.
  8. The speech synthesis method of claim 6, wherein N is specifically calculated by the following formula:
    [Formula for N, reproduced in the original filing only as an image (PCTCN2019098086-appb-100001).]
    wherein the value of Nw indicates whether the two adjacent syllables form a word or a four-character idiom, SNpre indicates the number of sampling points of the preceding syllable, and SNnext indicates the number of sampling points of the following syllable; the end average intensity, pre, indicates the average sound intensity of the last M1 sampling points of the preceding syllable; and the beginning average intensity, next, indicates the average sound intensity of the first M2 sampling points of the following syllable.
  9. The speech synthesis method of claim 8, wherein the value of Nw is 2 if the two adjacent syllables form a word, 1 if the two adjacent syllables are in neither a word nor a four-character idiom, and 2 if the two adjacent syllables do not form a word but are in a four-character idiom.
  10. The speech synthesis method of claim 1, further comprising, before acquiring the voice file of each syllable in the text of the speech to be synthesized:
    performing word segmentation on the text.
  11. The speech synthesis method of claim 10, wherein the word segmentation of the text is completed on a server side.
  12. A speech synthesis apparatus, the apparatus comprising:
    an acquiring unit, which acquires a voice file of each syllable in a text of speech to be synthesized, the voice file storing sound intensity data of sampling points of the syllable, and acquires sound intensity data of designated sampling points from the voice files of two adjacent syllables respectively, wherein the designated sampling points of the preceding syllable are the last N sampling points of that syllable, the designated sampling points of the following syllable are the first N sampling points of that syllable, and N is an integer; and
    a processing unit, which processes the sound intensity data of the designated sampling points of the two syllables to obtain synthesized speech.
  13. A speech synthesis device, the speech synthesis device comprising a processor and a memory;
    wherein the memory is used to store executable computer instructions; and
    the processor implements the steps of the method of any one of claims 1 to 11 when executing the computer instructions.
PCT/CN2019/098086 2018-10-29 2019-07-29 Speech synthesis method, device, and apparatus WO2020088006A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811269226.6A CN109599090B (en) 2018-10-29 2018-10-29 Method, device and equipment for voice synthesis
CN201811269226.6 2018-10-29

Publications (1)

Publication Number Publication Date
WO2020088006A1 true WO2020088006A1 (en) 2020-05-07

Family

ID=65958614

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/098086 WO2020088006A1 (en) 2018-10-29 2019-07-29 Speech synthesis method, device, and apparatus

Country Status (3)

Country Link
CN (1) CN109599090B (en)
TW (1) TWI731382B (en)
WO (1) WO2020088006A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109599090B (en) * 2018-10-29 2020-10-30 创新先进技术有限公司 Method, device and equipment for voice synthesis
CN111145723B (en) * 2019-12-31 2023-11-17 广州酷狗计算机科技有限公司 Method, device, equipment and storage medium for converting audio
CN111883100B (en) * 2020-07-22 2021-11-09 马上消费金融股份有限公司 Voice conversion method, device and server
CN112185338B (en) * 2020-09-30 2024-01-23 北京大米科技有限公司 Audio processing method, device, readable storage medium and electronic equipment
CN112562635B (en) * 2020-12-03 2024-04-09 云知声智能科技股份有限公司 Method, device and system for solving generation of pulse signals at splicing position in speech synthesis

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5748840A (en) * 1990-12-03 1998-05-05 Audio Navigation Systems, Inc. Methods and apparatus for improving the reliability of recognizing words in a large database when the words are spelled or spoken
CN1118493A (en) * 1994-08-01 1996-03-13 中国科学院声学研究所 Language and speech converting system with synchronous fundamental tone waves
NZ304418A (en) * 1995-04-12 1998-02-26 British Telecomm Extension and combination of digitised speech waveforms for speech synthesis
US20030135374A1 (en) * 2002-01-16 2003-07-17 Hardwick John C. Speech synthesizer
US7328076B2 (en) * 2002-11-15 2008-02-05 Texas Instruments Incorporated Generalized envelope matching technique for fast time-scale modification
CN1262987C (en) * 2003-10-24 2006-07-05 无敌科技股份有限公司 Smoothly processing method for conversion of intervowel
CN101000766B (en) * 2007-01-09 2011-02-02 黑龙江大学 Chinese intonation base frequency contour generating method based on intonation model
CN101710488B (en) * 2009-11-20 2011-08-03 安徽科大讯飞信息科技股份有限公司 Method and device for voice synthesis
CN107039033A (en) * 2017-04-17 2017-08-11 海南职业技术学院 A kind of speech synthetic device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1787072A (en) * 2004-12-07 2006-06-14 北京捷通华声语音技术有限公司 Method for synthesizing pronunciation based on rhythm model and parameter selecting voice
CN103020232A (en) * 2012-12-14 2013-04-03 沈阳美行科技有限公司 Method for recording individual characters into navigation system
CN105895076A (en) * 2015-01-26 2016-08-24 科大讯飞股份有限公司 Speech synthesis method and system
US20160365087A1 (en) * 2015-06-12 2016-12-15 Geulah Holdings Llc High end speech synthesis
CN109599090A (en) * 2018-10-29 2019-04-09 阿里巴巴集团控股有限公司 A kind of method, device and equipment of speech synthesis

Also Published As

Publication number Publication date
CN109599090A (en) 2019-04-09
TW202036534A (en) 2020-10-01
TWI731382B (en) 2021-06-21
CN109599090B (en) 2020-10-30

Similar Documents

Publication Publication Date Title
WO2020088006A1 (en) Speech synthesis method, device, and apparatus
US10115389B2 (en) Speech synthesis method and apparatus
CN108831437B (en) Singing voice generation method, singing voice generation device, terminal and storage medium
CN111048064B (en) Voice cloning method and device based on single speaker voice synthesis data set
CN107423364B (en) Method, device and storage medium for answering operation broadcasting based on artificial intelligence
WO2021083071A1 (en) Method, device, and medium for speech conversion, file generation, broadcasting, and voice processing
CN108573694B (en) Artificial intelligence based corpus expansion and speech synthesis system construction method and device
WO2020113733A1 (en) Animation generation method and apparatus, electronic device, and computer-readable storage medium
US10971125B2 (en) Music synthesis method, system, terminal and computer-readable storage medium
US8682678B2 (en) Automatic realtime speech impairment correction
CN107705782B (en) Method and device for determining phoneme pronunciation duration
US20190371291A1 (en) Method and apparatus for processing speech splicing and synthesis, computer device and readable medium
JP2007242012A (en) Method, system and program for email administration for email rendering on digital audio player (email administration for rendering email on digital audio player)
WO2019007308A1 (en) Voice broadcasting method and device
CN111105779B (en) Text playing method and device for mobile client
CN110675886A (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
JP2019015951A (en) Wake up method for electronic device, apparatus, device and computer readable storage medium
CN113724683B (en) Audio generation method, computer device and computer readable storage medium
WO2016165334A1 (en) Voice processing method and apparatus, and terminal device
US8655466B2 (en) Correlating changes in audio
WO2022143530A1 (en) Audio processing method and apparatus, computer device, and storage medium
CN109495786B (en) Pre-configuration method and device of video processing parameter information and electronic equipment
CN114333758A (en) Speech synthesis method, apparatus, computer device, storage medium and product
CN112837688A (en) Voice transcription method, device, related system and equipment
CN114822492B (en) Speech synthesis method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19880300

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19880300

Country of ref document: EP

Kind code of ref document: A1