CN112562635B - Method, device and system for solving generation of pulse signals at splicing position in speech synthesis - Google Patents

Method, device and system for solving generation of pulse signals at splicing position in speech synthesis

Info

Publication number
CN112562635B
CN112562635B CN202011396383.0A
Authority
CN
China
Prior art keywords
coefficient vector
voice
sampling point
point value
fading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011396383.0A
Other languages
Chinese (zh)
Other versions
CN112562635A (en)
Inventor
高洋 (Gao Yang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202011396383.0A priority Critical patent/CN112562635B/en
Publication of CN112562635A publication Critical patent/CN112562635A/en
Application granted granted Critical
Publication of CN112562635B publication Critical patent/CN112562635B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 - Voice editing, e.g. manipulating the voice of the synthesiser

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrophonic Musical Instruments (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a method, a device and a system for solving the problem of pulse signals being generated at splice points in speech synthesis, wherein the method comprises the following steps: extracting two speech segments to be spliced from a database as a first speech segment and a second speech segment, and reserving N sampling points from each of the first and second speech segments as an overlapping portion, wherein N is greater than 256; calculating a fade-out coefficient vector and a fade-in coefficient vector from the sampling points; obtaining the sampling point values of the overlapping portion using the fade-out coefficient vector and the fade-in coefficient vector; and completing the splicing of the first and second speech segments based on the obtained sampling point values of the overlapping portion. The method does not depend on the source of the pulse signal generated at the splice during speech synthesis: by taking a weighted average of the sampling point values of the preceding and following segments, it smooths the speech, greatly reduces the negative impact of pulse signals on the overall prosody, timbre and listening experience of the speech, and reduces or avoids audible noise.

Description

Method, device and system for solving generation of pulse signals at splicing position in speech synthesis
Technical Field
One or more embodiments of the present invention relate to the field of natural language processing, and in particular to a method, an apparatus, and a system for solving the problem of pulse signal generation at splice points in speech synthesis.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Accordingly, unless indicated otherwise, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
With the rapid development of intelligent speech technology, voice interaction has become a standard scheme for human-machine interaction in many smart devices: more and more enterprises and institutions are launching products based on voice interaction technology, such as voice ordering systems. Such a product analyzes the user's speech input using speech recognition, natural language processing and related technologies to complete the corresponding operation or task, such as an ordering operation, and an essential link in this human-machine interaction process is speech synthesis.
Most existing speech synthesis applications run in online scenarios. Unlike an offline scenario, an online scenario cannot wait for the complete speech to be synthesized before transmitting or playing it over the network; it must use streaming processing, in which the speech is generated segment by segment and transmitted or played as each segment is produced.
However, the prior art has the following problem: in streaming speech synthesis, one speech segment is returned at a time, and consecutive segments are computed independently of each other. As a result, the two segments often produce a pulse signal where they are spliced, which has a strongly negative impact on the overall prosody, timbre and listening experience of the speech.
In view of this, a technique is needed to solve the problem of pulse signals being generated at splice points in streaming speech synthesis, so as to eliminate their negative impact on the overall prosody, timbre and listening experience.
Disclosure of Invention
One or more embodiments of the present disclosure describe a method, an apparatus, and a system for solving the problem of pulse signals being generated at splice points in speech synthesis, addressing the prior-art issue that two speech segments often produce a pulse signal where they are spliced, which easily results in audible noise.
One or more embodiments of the present disclosure provide the following technical solutions:
In a first aspect, the present invention provides a method for solving the problem of pulse signal generation at splice points in speech synthesis, wherein the method comprises the steps of:
extracting two speech segments to be spliced from a database as a first speech segment and a second speech segment, and reserving N sampling points from each of the first and second speech segments as an overlapping portion; wherein N is greater than 256;
calculating a fade-out coefficient vector and a fade-in coefficient vector from the sampling points;
obtaining the sampling point values of the overlapping portion using the fade-out coefficient vector and the fade-in coefficient vector;
and completing the splicing of the first speech segment and the second speech segment based on the obtained sampling point values of the overlapping portion.
In one possible implementation, obtaining the sampling point values of the overlapping portion using the fade-out coefficient vector and the fade-in coefficient vector specifically comprises:
obtaining the sampling point values of the overlapping portion from the sampling point values in the first speech segment, the sampling point values in the second speech segment, the fade-out coefficient vector and the fade-in coefficient vector.
In one possible implementation, the fade-out coefficient vector is calculated as:
fadeout_coef = ((N-1)/N, (N-2)/N, ..., 1/N)
In one possible implementation, the fade-in coefficient vector is calculated as:
fadein_coef = (1/N, 2/N, ..., (N-1)/N)
In one possible implementation, the sampling point values of the overlapping portion are calculated as:
samples = fadeout_coef * samplesout + fadein_coef * samplesin
wherein fadeout_coef is the fade-out coefficient vector; fadein_coef is the fade-in coefficient vector; samplesout is the vector of sampling point values in the first speech segment; samplesin is the vector of sampling point values in the second speech segment; the multiplications are element-wise.
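A minimal Python sketch of this weighted average (the function name `crossfade` and the pure-Python list implementation are illustrative assumptions; the linear ramps follow the fade coefficients described in the detailed embodiment below):

```python
def crossfade(samplesout, samplesin):
    """Blend the overlapping regions of two speech chunks.

    samplesout: last n sampling points of the first speech segment.
    samplesin:  first n sampling points of the second speech segment.
    Returns samples = fadeout_coef * samplesout + fadein_coef * samplesin,
    with element-wise multiplication and linear fade ramps.
    """
    n = len(samplesout)
    assert n == len(samplesin), "both overlap vectors must hold n points"
    denom = n + 1  # linear ramp denominator, as in the N = 256 + 1 example
    fadeout_coef = [(denom - i) / denom for i in range(1, n + 1)]  # n/denom ... 1/denom
    fadein_coef = [i / denom for i in range(1, n + 1)]             # 1/denom ... n/denom
    return [fo * so + fi * si
            for fo, so, fi, si in zip(fadeout_coef, samplesout,
                                      fadein_coef, samplesin)]
```

For a one-point overlap the blended value is simply the plain average of the two segments.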
In a second aspect, the invention provides a device for solving the problem of pulse signal generation at splice points in speech synthesis, comprising a database unit, a sampling point selection unit, a first calculation unit, a second calculation unit and a splicing unit;
the database unit is used for storing speech segments;
the sampling point selection unit is used for reserving N sampling points from each of the first and second speech segments as the overlapping portion;
the first calculation unit is used for calculating the fade-out coefficient vector and the fade-in coefficient vector;
the second calculation unit is used for obtaining the sampling point values of the overlapping portion based on the fade-out coefficient vector and the fade-in coefficient vector;
and the splicing unit is used for completing the splicing of the first speech segment and the second speech segment based on the obtained sampling point values of the overlapping portion.
In a third aspect, the present invention provides a system for resolving the generation of a pulse signal at a splice in speech synthesis, the system comprising at least one processor and a memory;
the memory is used for storing one or more program instructions;
the processor is configured to execute one or more program instructions to perform the method as described in one or more of the first aspects.
In a fourth aspect, the present invention provides a chip coupled to a memory in a system such that the chip, when run, invokes program instructions stored in the memory to implement a method as described in one or more of the first aspects.
In a fifth aspect, the present invention provides a computer readable storage medium comprising one or more program instructions executable by a system as described in the third aspect to implement a method as described in one or more of the first aspects.
The method provided by the embodiments of the invention does not depend on the source of the pulse signal generated at the splice during speech synthesis. By taking a weighted average of the sampling point values of the preceding and following segments, it smooths the speech, greatly reduces the negative impact of pulse signals on the overall prosody, timbre and listening experience of the speech, and reduces or avoids audible noise.
Drawings
Fig. 1 is a schematic flow chart of a method for solving the problem of generating a pulse signal at a splicing position in speech synthesis according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a three-stage return process for speech synthesis according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a device for solving the problem of generating a pulse signal at a splicing position in speech synthesis according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a system for solving the problem of generating a pulse signal at a splicing position in speech synthesis according to an embodiment of the present invention.
Detailed Description
The present application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 shows a flowchart of a method for solving the problem of pulse signal generation at splice points in speech synthesis. The method may be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities. As shown in Fig. 1, the method comprises the following steps:
Step 10: extract two speech segments to be spliced from a database as a first speech segment and a second speech segment, and reserve N sampling points in each of them as the overlapping portion, where N > 256.
Specifically, when the preceding and following speech segments are computed, N sampling points are reserved as the OverLap (overlapping portion); the preceding segment is called the first speech segment and the following segment is called the second speech segment. It is recommended that N > 256.
The last N sampling points of the first speech segment are:
samplesout = (sampout_1, sampout_2, ..., sampout_N)
The first N sampling points of the second speech segment are:
samplesin = (sampin_1, sampin_2, ..., sampin_N)
Step 20: calculate the fade-out coefficient vector and the fade-in coefficient vector from the sampling points.
Specifically, the fade-out coefficient vector is applied to samplesout and the fade-in coefficient vector to samplesin.
The fade-out coefficient vector is calculated as:
fadeout_coef = ((N-1)/N, (N-2)/N, ..., 1/N)
The fade-in coefficient vector is calculated as:
fadein_coef = (1/N, 2/N, ..., (N-1)/N)
Step 30: obtain the sampling point values of the overlapping portion using the fade-out coefficient vector and the fade-in coefficient vector.
Specifically, the sampling point values of the overlapping portion are obtained from the sampling point values in the first speech segment, the sampling point values in the second speech segment, the fade-out coefficient vector and the fade-in coefficient vector.
The sampling point values of the overlap are calculated as:
samples = fadeout_coef * samplesout + fadein_coef * samplesin
wherein fadeout_coef is the fade-out coefficient vector; fadein_coef is the fade-in coefficient vector; samplesout is the vector of sampling point values in the first speech segment; samplesin is the vector of sampling point values in the second speech segment; the multiplications are element-wise.
Taking a weighted average of the sampling point values at the splice solves the problem of pulse signals being generated during synthesis very well. The splice in streaming processing is extremely prone to pulse signals; although a pulse usually involves only a few points, its amplitude is large. In the weighted-average process, such a large sampling point value is given a small weight, so the product becomes small, which reduces or avoids audible noise.
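A toy numeric illustration of this suppression effect (all numbers hypothetical): a pulse of amplitude 100 at the very end of the first chunk's overlap receives the smallest fade-out weight, so little of it survives in the spliced output.

```python
n = 4                                # overlap length (toy value)
denom = n + 1
samplesout = [0.0, 0.0, 0.0, 100.0]  # pulse at the splice end of chunk 1
samplesin = [0.0, 0.0, 0.0, 0.0]     # clean start of chunk 2
fadeout_coef = [(denom - i) / denom for i in range(1, n + 1)]  # 4/5 ... 1/5
fadein_coef = [i / denom for i in range(1, n + 1)]             # 1/5 ... 4/5
samples = [fo * so + fi * si
           for fo, so, fi, si in zip(fadeout_coef, samplesout,
                                     fadein_coef, samplesin)]
# The amplitude-100 pulse is weighted by only 1/5, leaving 20.0 in the output.
```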
Step 40: complete the splicing of the first speech segment and the second speech segment based on the obtained sampling point values of the overlapping portion.
The method is described in detail below with reference to specific examples:
Fig. 2 is a schematic diagram of a three-segment streaming return in speech synthesis according to an embodiment of the present invention. As shown in Fig. 2, OR denotes the Right OverLap and OL denotes the Left OverLap.
OR is a section of several frames of sampling points with a given linear fade-out coefficient. If the number of sampling points is N, the sampling points of this section are multiplied by the coefficients (N-1)/N, (N-2)/N, (N-3)/N, ... respectively; this section is recorded in memory and not returned.
OL is a section of several frames of sampling points with a given linear fade-in coefficient. If the number of sampling points is N, the sampling points of this section are multiplied by the coefficients 1/N, 2/N, 3/N, ... respectively.
Assume the total length of each returned speech chunk is 500 ms and the sampling rate is 22050 Hz; the number of sampling points per chunk is then 22050 × 500 / 1000 = 11025. Assume the OverLap uses N = 256 + 1 = 257.
The fade-out coefficients are: (256/257, 255/257, ..., 2/257, 1/257)
The fade-in coefficients are: (1/257, 2/257, ..., 255/257, 256/257)
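The chunk arithmetic in this example can be checked directly (the variable names are illustrative):

```python
# Worked example: 500 ms chunks at a 22050 Hz sampling rate, 256-point overlap.
sample_rate = 22050
chunk_ms = 500
chunk_len = sample_rate * chunk_ms // 1000  # sampling points per chunk
overlap = 256                               # OverLap size (denominator N = 256 + 1)
first_return = chunk_len - overlap          # points returned first; OR is withheld
```

This reproduces the 11025 sampling points per chunk and the 11025 - 256 points of the first return.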
The first return contains 11025 - 256 sampling points in total; the OR part is withheld for the next return.
In the second return, the OverLap values are first calculated from the previous chunk's OR and the current chunk's OL, as follows:
samples = fadeout_coef * samplesout + fadein_coef * samplesin
the following is a brief description:
let 256 values of OR and OL be r1, r2, … r256 and l1, l2, …, l256, respectively.
The segment computes as: overLap= (r1×256/257+l1×1/257, r2×255/257+l2×2/257, …, r256×1/257+l256×256/257), 256 values in total, the second segment removes the OL & OR segment, leaving the DATA segment in total 11025-256×2, and the above calculated OverLap is returned with the current segment DATA, leaving the OR segment untreated.
The third return contains all remaining speech. As in the second return, the OverLap is calculated first; since all remaining speech is returned at once, no OR section needs to be cut off, and the two parts are combined and returned together.
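The whole three-return procedure can be sketched end to end (the helper name `stream_splice`, and the assumption that each chunk arrives as raw samples with its OL head and OR tail in place, are illustrative rather than the patent's exact interface):

```python
def stream_splice(chunks, n):
    """Splice streamed chunks with an n-point crossfaded OverLap.

    The first chunk withholds its n-point OR tail; every later chunk
    crossfades the stored OR with its own n-point OL head, and only the
    final chunk returns everything without withholding a new OR.
    """
    denom = n + 1
    out = []
    held_or = None  # OR sampling points withheld from the previous chunk
    for idx, chunk in enumerate(chunks):
        data = list(chunk)
        if held_or is not None:
            ol, data = data[:n], data[n:]
            # OverLap = (r1*(n/denom) + l1*(1/denom), ..., rn*(1/denom) + ln*(n/denom))
            out.extend((denom - i) / denom * r + i / denom * l
                       for i, (r, l) in enumerate(zip(held_or, ol), start=1))
        if idx < len(chunks) - 1:
            data, held_or = data[:-n], data[-n:]
        out.extend(data)
    return out
```

Because opposing fade coefficients sum to 1, splicing constant-valued chunks leaves the signal flat across both splices.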
This solves the problem that the splice in streaming processing, for various reasons, is extremely prone to pulse signals. The method does not care about the source of the pulse: after the weighted average of the sampling point values of the preceding and following sections, the speech is smoothed and the phenomenon is greatly alleviated. Corresponding to the method of the above embodiment, the present invention further provides a device for solving the problem of pulse signal generation at splice points in speech synthesis. As shown in Fig. 3, the device comprises: a database unit 310, a sampling point selection unit 320, a first calculation unit 330, a second calculation unit 340 and a splicing unit 350. Specifically:
the database 310 is used for storing voice fragments;
the sampling point selecting unit 330 is configured to reserve N sampling points from the first speech segment and the second speech segment as overlapping portions;
the first calculating unit 330 is configured to calculate a fading-out coefficient vector and a fading-in coefficient vector;
the second calculating unit 340 is configured to obtain a sampling point value of the overlapping portion based on the fading-out coefficient vector and the fading-in coefficient vector;
the splicing unit 350 is configured to finish the splicing of the first speech segment and the second speech segment based on the obtained sampling point value of the overlapping portion.
The functions performed by each component of the device provided by this embodiment of the invention have been described in detail in the method above, so they are not repeated here.
Corresponding to the above embodiment, the present invention further provides a system for solving the problem of pulse signal generation at splice points in speech synthesis. As shown in Fig. 4, the system includes at least one processor 410 and a memory 420;
the memory 420 is used for storing one or more program instructions;
the processor 410 is configured to execute the one or more program instructions to perform any of the method steps of the method for solving the problem of pulse signal generation at splice points in speech synthesis described in the above embodiments.
Corresponding to the above embodiment, the embodiment of the present invention further provides a chip, which is coupled to the memory in the above system, so that the chip invokes the program instructions stored in the memory when running, to implement the method for solving the problem of generating a pulse signal at the splice in speech synthesis as described in the above embodiment.
Corresponding to the above embodiments, the present invention further provides a computer storage medium containing one or more program instructions, which are executed by the system for solving the problem of pulse signal generation at splice points in speech synthesis to implement the method described above.
The method provided by the embodiments of the invention does not depend on the source of the pulse signal generated at the splice during speech synthesis. It takes a weighted average of the preceding and following sampling point values; a pulse sample with a large value is given a small weight, so the product becomes small. This smooths the speech, reduces or avoids audible noise, and greatly reduces the negative impact of pulse signals on the overall prosody, timbre and listening experience of the speech.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of function in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may reside in Random Access Memory (RAM), flash memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing detailed description of the invention has been presented for purposes of illustration and description, and it should be understood that the invention is not limited to the particular embodiments disclosed, but is intended to cover all modifications, equivalents, alternatives, and improvements within the spirit and principles of the invention.

Claims (6)

1. A method for solving the problem of pulse signal generation at splice points in speech synthesis, the method comprising the steps of:
extracting two speech segments to be spliced from a database as a first speech segment and a second speech segment, and reserving N sampling points from each of the first and second speech segments as an overlapping portion; wherein N is greater than 256;
calculating a fade-out coefficient vector and a fade-in coefficient vector from the sampling points;
obtaining the sampling point values of the overlapping portion using the fade-out coefficient vector and the fade-in coefficient vector;
completing the splicing of the first speech segment and the second speech segment based on the obtained sampling point values of the overlapping portion;
the fade-out coefficient vector is calculated as:
fadeout_coef = ((N-1)/N, (N-2)/N, ..., 1/N)
the fade-in coefficient vector is calculated as:
fadein_coef = (1/N, 2/N, ..., (N-1)/N)
the sampling point values of the overlap are calculated as:
samples = fadeout_coef * samplesout + fadein_coef * samplesin
wherein fadeout_coef is the fade-out coefficient vector; fadein_coef is the fade-in coefficient vector;
samplesout is the vector of sampling point values in the first speech segment; samplesin is the vector of sampling point values in the second speech segment.
2. The method according to claim 1, wherein obtaining the sampling point values of the overlapping portion using the fade-out coefficient vector and the fade-in coefficient vector specifically comprises:
obtaining the sampling point values of the overlapping portion from the sampling point values in the first speech segment, the sampling point values in the second speech segment, the fade-out coefficient vector and the fade-in coefficient vector.
3. A device for solving the problem of pulse signal generation at splice points in speech synthesis, characterized by comprising a database unit, a sampling point selection unit, a first calculation unit, a second calculation unit and a splicing unit;
the database unit is used for storing speech segments;
the sampling point selection unit is used for reserving N sampling points from each of the first and second speech segments as the overlapping portion;
the first calculation unit is used for calculating the fade-out coefficient vector and the fade-in coefficient vector;
the second calculation unit is used for obtaining the sampling point values of the overlapping portion based on the fade-out coefficient vector and the fade-in coefficient vector;
the splicing unit is used for completing the splicing of the first speech segment and the second speech segment based on the obtained sampling point values of the overlapping portion;
the fade-out coefficient vector is calculated as:
fadeout_coef = ((N-1)/N, (N-2)/N, ..., 1/N)
the fade-in coefficient vector is calculated as:
fadein_coef = (1/N, 2/N, ..., (N-1)/N)
the sampling point values of the overlap are calculated as:
samples = fadeout_coef * samplesout + fadein_coef * samplesin
wherein fadeout_coef is the fade-out coefficient vector; fadein_coef is the fade-in coefficient vector;
samplesout is the vector of sampling point values in the first speech segment; samplesin is the vector of sampling point values in the second speech segment.
4. A system for solving the problem of pulse signal generation at splice points in speech synthesis, characterized by comprising at least one processor and a memory;
the memory is used for storing one or more program instructions;
the processor is configured to execute one or more program instructions for performing the method of claim 1 or 2.
5. A chip, characterized in that the chip is coupled to a memory in a system such that the chip, when running, invokes program instructions stored in the memory, implementing the method according to claim 1 or 2.
6. A computer readable storage medium comprising one or more program instructions executable by the system of claim 4 to implement the method of claim 1 or 2.
CN202011396383.0A 2020-12-03 2020-12-03 Method, device and system for solving generation of pulse signals at splicing position in speech synthesis Active CN112562635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011396383.0A CN112562635B (en) 2020-12-03 2020-12-03 Method, device and system for solving generation of pulse signals at splicing position in speech synthesis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011396383.0A CN112562635B (en) 2020-12-03 2020-12-03 Method, device and system for solving generation of pulse signals at splicing position in speech synthesis

Publications (2)

Publication Number Publication Date
CN112562635A CN112562635A (en) 2021-03-26
CN112562635B true CN112562635B (en) 2024-04-09

Family

ID=75047752

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011396383.0A Active CN112562635B (en) 2020-12-03 2020-12-03 Method, device and system for solving generation of pulse signals at splicing position in speech synthesis

Country Status (1)

Country Link
CN (1) CN112562635B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003066983A (en) * 2001-08-30 2003-03-05 Sharp Corp Voice synthesizing apparatus and method, and program recording medium
JP2005091747A (en) * 2003-09-17 2005-04-07 Mitsubishi Electric Corp Speech synthesizer
WO2011030424A1 (en) * 2009-09-10 2011-03-17 株式会社東芝 Voice synthesizing apparatus and program
CN104517605A (en) * 2014-12-04 2015-04-15 北京云知声信息技术有限公司 Speech segment assembly system and method for speech synthesis
CN109036386A (en) * 2018-09-14 2018-12-18 北京网众共创科技有限公司 A kind of method of speech processing and device
CN109599090A (en) * 2018-10-29 2019-04-09 阿里巴巴集团控股有限公司 A kind of method, device and equipment of speech synthesis
CN109767783A (en) * 2019-02-15 2019-05-17 深圳市汇顶科技股份有限公司 Sound enhancement method, device, equipment and storage medium
CN111640456A (en) * 2020-06-04 2020-09-08 合肥讯飞数码科技有限公司 Overlapped sound detection method, device and equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013011634A1 (en) * 2011-07-19 2013-01-24 日本電気株式会社 Waveform processing device, waveform processing method, and waveform processing program
US8744854B1 (en) * 2012-09-24 2014-06-03 Chengjun Julian Chen System and method for voice transformation


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
High-Quality Speech Synthesis System Based on Waveform Concatenation of Phoneme Segments; Hirokawa, T. et al.; IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences; Vol. E76A, No. 11; pp. 1964-1970 *
Research on waveform concatenation transition algorithms in speech synthesis systems; Zhang Peng et al.; Journal of Natural Science of Heilongjiang University; Vol. 28, No. 6; pp. 867-870 *

Also Published As

Publication number Publication date
CN112562635A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
US20210233550A1 (en) Voice separation device, voice separation method, voice separation program, and voice separation system
US9767790B2 (en) Voice retrieval apparatus, voice retrieval method, and non-transitory recording medium
CN117043855A (en) Unsupervised parallel Tacotron non-autoregressive and controllable text-to-speech
WO2021183229A1 (en) Cross-speaker style transfer speech synthesis
US10636412B2 (en) System and method for unit selection text-to-speech using a modified Viterbi approach
CN113327575B (en) Speech synthesis method, device, computer equipment and storage medium
CN111508519A (en) Method and device for enhancing voice of audio signal
Luong et al. Bootstrapping non-parallel voice conversion from speaker-adaptive text-to-speech
Sainath et al. Improving the latency and quality of cascaded encoders
Oyamada et al. Non-native speech conversion with consistency-aware recursive network and generative adversarial network
CN113674733A (en) Method and apparatus for speaking time estimation
CN110164413A (en) Phoneme synthesizing method, device, computer equipment and storage medium
KR102018286B1 (en) Method and Apparatus for Removing Speech Components in Sound Source
WO2020071213A1 (en) Acoustic model learning device, voice synthesis device, and program
CN114783409A (en) Training method of speech synthesis model, speech synthesis method and device
US11017790B2 (en) Avoiding speech collisions among participants during teleconferences
US20240096332A1 (en) Audio signal processing method, audio signal processing apparatus, computer device and storage medium
CN112562635B (en) Method, device and system for solving generation of pulse signals at splicing position in speech synthesis
CN113987149A (en) Intelligent session method, system and storage medium for task robot
CN108932943A (en) Command word sound detection method, device, equipment and storage medium
JP2011170190A (en) Device, method and program for signal separation
WO2024072481A1 (en) Text to speech synthesis without using parallel text-audio data
KR102613030B1 (en) Speech synthesis method and apparatus using adversarial learning technique
US11830481B2 (en) Context-aware prosody correction of edited speech
CN114999440A (en) Avatar generation method, apparatus, device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant