CN112562635B - Method, device and system for solving generation of pulse signals at splicing position in speech synthesis - Google Patents

Method, device and system for solving generation of pulse signals at splicing position in speech synthesis

Info

Publication number
CN112562635B
CN112562635B CN202011396383.0A
Authority
CN
China
Prior art keywords
coefficient vector
voice
sampling point
point value
fading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011396383.0A
Other languages
Chinese (zh)
Other versions
CN112562635A (en)
Inventor
高洋 (Gao Yang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202011396383.0A priority Critical patent/CN112562635B/en
Publication of CN112562635A publication Critical patent/CN112562635A/en
Application granted granted Critical
Publication of CN112562635B publication Critical patent/CN112562635B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 - Voice editing, e.g. manipulating the voice of the synthesiser

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrophonic Musical Instruments (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a method, a device and a system for solving the problem of pulse signals being generated at splice points in speech synthesis, wherein the method comprises the following steps: extracting two speech segments to be spliced from a database as a first speech segment and a second speech segment, and reserving N sampling points from each of the first and second speech segments as an overlapping portion, wherein N is greater than 256; calculating a fade-out coefficient vector and a fade-in coefficient vector from the sampling points; obtaining the sampling point values of the overlapping portion using the fade-out coefficient vector and the fade-in coefficient vector; and completing the splicing of the first and second speech segments based on the obtained sampling point values of the overlapping portion. The method does not depend on the source of the pulse signal generated at the splice during speech synthesis: by taking a weighted average of the sampling point values of the preceding and following segments, it smooths the speech, greatly reduces the negative impact of pulse signals on the overall prosody, timbre and listening experience of the speech, and reduces or avoids audible noise.

Description

Method, device and system for solving generation of pulse signals at splicing position in speech synthesis
Technical Field
One or more embodiments of the present invention relate to the field of natural language processing, and in particular to a method, an apparatus, and a system for solving the problem of pulse signal generation at splice points in speech synthesis.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Accordingly, unless indicated otherwise, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
With the rapid development of intelligent speech technology, voice interaction has become a standard scheme for human-machine interaction in many smart devices: more and more enterprises and institutions are launching products based on voice interaction technology, such as voice ordering systems. Such a product analyzes the user's speech input using speech recognition, natural language processing and related technologies to complete the corresponding operation or task, such as an ordering operation, and an essential link in this human-machine interaction process is speech synthesis.
Most existing speech synthesis applications run in online scenarios. Unlike an offline scenario, an online scenario cannot wait for the complete speech to be synthesized before transmitting or playing it over the network; it must use streaming processing, in which the speech is generated segment by segment and transmitted or played as each segment is produced.
However, the prior art has the following problem: in streaming speech synthesis, one speech segment is returned at a time, and consecutive segments are computed independently of each other. As a result, the two segments often produce a pulse signal where they are spliced, which has a strongly negative impact on the overall prosody, timbre and listening experience of the speech.
In view of this, a technique is needed to solve the problem of pulse signals being generated at splice points in streaming speech synthesis, so as to eliminate their negative impact on the overall prosody, timbre and listening experience.
Disclosure of Invention
One or more embodiments of the present disclosure describe a method, an apparatus, and a system for solving the problem of pulse signals being generated at splice points in speech synthesis, addressing the prior-art issue that two speech segments often produce a pulse signal where they are spliced, which easily results in audible noise.
One or more embodiments of the present disclosure provide the following technical solutions:
In a first aspect, the present invention provides a method for solving the problem of pulse signal generation at splice points in speech synthesis, wherein the method comprises the steps of:
extracting two speech segments to be spliced from a database as a first speech segment and a second speech segment, and reserving N sampling points from each of the first and second speech segments as an overlapping portion; wherein N is greater than 256;
calculating a fade-out coefficient vector and a fade-in coefficient vector from the sampling points;
obtaining the sampling point values of the overlapping portion using the fade-out coefficient vector and the fade-in coefficient vector;
and completing the splicing of the first speech segment and the second speech segment based on the obtained sampling point values of the overlapping portion.
In one possible implementation, obtaining the sampling point values of the overlapping portion using the fade-out coefficient vector and the fade-in coefficient vector specifically comprises:
obtaining the sampling point values of the overlapping portion from the sampling point values in the first speech segment, the sampling point values in the second speech segment, the fade-out coefficient vector and the fade-in coefficient vector.
In one possible implementation, the fade-out coefficient vector is calculated as:
fadeout_coef = ((N-1)/N, (N-2)/N, ..., 1/N)
In one possible implementation, the fade-in coefficient vector is calculated as:
fadein_coef = (1/N, 2/N, ..., (N-1)/N)
In one possible implementation, the sampling point values of the overlapping portion are calculated as:
samples = fadeout_coef * samplesout + fadein_coef * samplesin
wherein fadeout_coef is the fade-out coefficient vector; fadein_coef is the fade-in coefficient vector; samplesout is the vector of sampling point values in the first speech segment; samplesin is the vector of sampling point values in the second speech segment; the multiplications are element-wise.
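A minimal Python sketch of this weighted average (the function name `crossfade` and the pure-Python list implementation are illustrative assumptions; the linear ramps follow the fade coefficients described in the detailed embodiment below):

```python
def crossfade(samplesout, samplesin):
    """Blend the overlapping regions of two speech chunks.

    samplesout: last n sampling points of the first speech segment.
    samplesin:  first n sampling points of the second speech segment.
    Returns samples = fadeout_coef * samplesout + fadein_coef * samplesin,
    with element-wise multiplication and linear fade ramps.
    """
    n = len(samplesout)
    assert n == len(samplesin), "both overlap vectors must hold n points"
    denom = n + 1  # linear ramp denominator, as in the N = 256 + 1 example
    fadeout_coef = [(denom - i) / denom for i in range(1, n + 1)]  # n/denom ... 1/denom
    fadein_coef = [i / denom for i in range(1, n + 1)]             # 1/denom ... n/denom
    return [fo * so + fi * si
            for fo, so, fi, si in zip(fadeout_coef, samplesout,
                                      fadein_coef, samplesin)]
```

For a one-point overlap the blended value is simply the plain average of the two segments.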
In a second aspect, the invention provides a device for solving the problem of pulse signal generation at splice points in speech synthesis, comprising a database unit, a sampling point selection unit, a first calculation unit, a second calculation unit and a splicing unit;
the database unit is used for storing speech segments;
the sampling point selection unit is used for reserving N sampling points from each of the first and second speech segments as the overlapping portion;
the first calculation unit is used for calculating the fade-out coefficient vector and the fade-in coefficient vector;
the second calculation unit is used for obtaining the sampling point values of the overlapping portion based on the fade-out coefficient vector and the fade-in coefficient vector;
and the splicing unit is used for completing the splicing of the first speech segment and the second speech segment based on the obtained sampling point values of the overlapping portion.
In a third aspect, the present invention provides a system for resolving the generation of a pulse signal at a splice in speech synthesis, the system comprising at least one processor and a memory;
the memory is used for storing one or more program instructions;
the processor is configured to execute one or more program instructions to perform the method as described in one or more of the first aspects.
In a fourth aspect, the present invention provides a chip coupled to a memory in a system such that the chip, when run, invokes program instructions stored in the memory to implement a method as described in one or more of the first aspects.
In a fifth aspect, the present invention provides a computer readable storage medium comprising one or more program instructions executable by a system as described in the third aspect to implement a method as described in one or more of the first aspects.
The method provided by the embodiments of the invention does not depend on the source of the pulse signal generated at the splice during speech synthesis. By taking a weighted average of the sampling point values of the preceding and following segments, it smooths the speech, greatly reduces the negative impact of pulse signals on the overall prosody, timbre and listening experience of the speech, and reduces or avoids audible noise.
Drawings
Fig. 1 is a schematic flow chart of a method for solving the problem of generating a pulse signal at a splicing position in speech synthesis according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a three-stage return process for speech synthesis according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a device for solving the problem of generating a pulse signal at a splicing position in speech synthesis according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a system for solving the problem of generating a pulse signal at a splicing position in speech synthesis according to an embodiment of the present invention.
Detailed Description
The present application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 shows a flowchart of a method for solving the problem of pulse signal generation at splice points in speech synthesis. The method may be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities. As shown in Fig. 1, the method comprises the following steps:
Step 10: extract two speech segments to be spliced from a database as a first speech segment and a second speech segment, and reserve N sampling points in each of them as the overlapping portion, where N > 256.
Specifically, when the preceding and following speech segments are computed, N sampling points are reserved as the OverLap (overlapping portion); the preceding segment is called the first speech segment and the following segment is called the second speech segment. It is recommended that N > 256.
The last N sampling points of the first speech segment are:
samplesout = (sampout_1, sampout_2, ..., sampout_N)
The first N sampling points of the second speech segment are:
samplesin = (sampin_1, sampin_2, ..., sampin_N)
Step 20: calculate the fade-out coefficient vector and the fade-in coefficient vector from the sampling points.
Specifically, the fade-out coefficient vector is applied to samplesout and the fade-in coefficient vector to samplesin.
The fade-out coefficient vector is calculated as:
fadeout_coef = ((N-1)/N, (N-2)/N, ..., 1/N)
The fade-in coefficient vector is calculated as:
fadein_coef = (1/N, 2/N, ..., (N-1)/N)
Step 30: obtain the sampling point values of the overlapping portion using the fade-out coefficient vector and the fade-in coefficient vector.
Specifically, the sampling point values of the overlapping portion are obtained from the sampling point values in the first speech segment, the sampling point values in the second speech segment, the fade-out coefficient vector and the fade-in coefficient vector.
The sampling point values of the overlap are calculated as:
samples = fadeout_coef * samplesout + fadein_coef * samplesin
wherein fadeout_coef is the fade-out coefficient vector; fadein_coef is the fade-in coefficient vector; samplesout is the vector of sampling point values in the first speech segment; samplesin is the vector of sampling point values in the second speech segment; the multiplications are element-wise.
Taking a weighted average of the sampling point values at the splice solves the problem of pulse signals being generated during synthesis very well. The splice in streaming processing is extremely prone to pulse signals; although a pulse usually involves only a few points, its amplitude is large. In the weighted-average process, such a large sampling point value is given a small weight, so the product becomes small, which reduces or avoids audible noise.
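A toy numeric illustration of this suppression effect (all numbers hypothetical): a pulse of amplitude 100 at the very end of the first chunk's overlap receives the smallest fade-out weight, so little of it survives in the spliced output.

```python
n = 4                                # overlap length (toy value)
denom = n + 1
samplesout = [0.0, 0.0, 0.0, 100.0]  # pulse at the splice end of chunk 1
samplesin = [0.0, 0.0, 0.0, 0.0]     # clean start of chunk 2
fadeout_coef = [(denom - i) / denom for i in range(1, n + 1)]  # 4/5 ... 1/5
fadein_coef = [i / denom for i in range(1, n + 1)]             # 1/5 ... 4/5
samples = [fo * so + fi * si
           for fo, so, fi, si in zip(fadeout_coef, samplesout,
                                     fadein_coef, samplesin)]
# The amplitude-100 pulse is weighted by only 1/5, leaving 20.0 in the output.
```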
Step 40: complete the splicing of the first speech segment and the second speech segment based on the obtained sampling point values of the overlapping portion.
The method is described in detail below with reference to specific examples:
Fig. 2 is a schematic diagram of a three-segment streaming return in speech synthesis according to an embodiment of the present invention. As shown in Fig. 2, OR denotes the Right OverLap and OL denotes the Left OverLap.
OR is a section of several frames of sampling points with a given linear fade-out coefficient. If the number of sampling points is N, the sampling points of this section are multiplied by the coefficients (N-1)/N, (N-2)/N, (N-3)/N, ... respectively; this section is recorded in memory and not returned.
OL is a section of several frames of sampling points with a given linear fade-in coefficient. If the number of sampling points is N, the sampling points of this section are multiplied by the coefficients 1/N, 2/N, 3/N, ... respectively.
Assume the total length of each returned speech chunk is 500 ms and the sampling rate is 22050 Hz; the number of sampling points per chunk is then 22050 × 500 / 1000 = 11025. Assume the OverLap uses N = 256 + 1 = 257.
The fade-out coefficients are: (256/257, 255/257, ..., 2/257, 1/257)
The fade-in coefficients are: (1/257, 2/257, ..., 255/257, 256/257)
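The chunk arithmetic in this example can be checked directly (the variable names are illustrative):

```python
# Worked example: 500 ms chunks at a 22050 Hz sampling rate, 256-point overlap.
sample_rate = 22050
chunk_ms = 500
chunk_len = sample_rate * chunk_ms // 1000  # sampling points per chunk
overlap = 256                               # OverLap size (denominator N = 256 + 1)
first_return = chunk_len - overlap          # points returned first; OR is withheld
```

This reproduces the 11025 sampling points per chunk and the 11025 - 256 points of the first return.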
The first return contains 11025 - 256 sampling points in total; the OR part is withheld for the next return.
In the second return, the OverLap values are first calculated from the previous chunk's OR and the current chunk's OL, as follows:
samples = fadeout_coef * samplesout + fadein_coef * samplesin
the following is a brief description:
let 256 values of OR and OL be r1, r2, … r256 and l1, l2, …, l256, respectively.
The segment computes as: overLap= (r1×256/257+l1×1/257, r2×255/257+l2×2/257, …, r256×1/257+l256×256/257), 256 values in total, the second segment removes the OL & OR segment, leaving the DATA segment in total 11025-256×2, and the above calculated OverLap is returned with the current segment DATA, leaving the OR segment untreated.
The third return contains all remaining speech. As in the second return, the OverLap is calculated first; since all remaining speech is returned at once, no OR section needs to be cut off, and the two parts are combined and returned together.
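The whole three-return procedure can be sketched end to end (the helper name `stream_splice`, and the assumption that each chunk arrives as raw samples with its OL head and OR tail in place, are illustrative rather than the patent's exact interface):

```python
def stream_splice(chunks, n):
    """Splice streamed chunks with an n-point crossfaded OverLap.

    The first chunk withholds its n-point OR tail; every later chunk
    crossfades the stored OR with its own n-point OL head, and only the
    final chunk returns everything without withholding a new OR.
    """
    denom = n + 1
    out = []
    held_or = None  # OR sampling points withheld from the previous chunk
    for idx, chunk in enumerate(chunks):
        data = list(chunk)
        if held_or is not None:
            ol, data = data[:n], data[n:]
            # OverLap = (r1*(n/denom) + l1*(1/denom), ..., rn*(1/denom) + ln*(n/denom))
            out.extend((denom - i) / denom * r + i / denom * l
                       for i, (r, l) in enumerate(zip(held_or, ol), start=1))
        if idx < len(chunks) - 1:
            data, held_or = data[:-n], data[-n:]
        out.extend(data)
    return out
```

Because opposing fade coefficients sum to 1, splicing constant-valued chunks leaves the signal flat across both splices.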
This solves the problem that the splice in streaming processing, for various reasons, is extremely prone to pulse signals. The method does not care about the source of the pulse: after the weighted average of the sampling point values of the preceding and following sections, the speech is smoothed and the phenomenon is greatly alleviated. Corresponding to the method of the above embodiment, the present invention further provides a device for solving the problem of pulse signal generation at splice points in speech synthesis. As shown in Fig. 3, the device comprises: a database unit 310, a sampling point selection unit 320, a first calculation unit 330, a second calculation unit 340 and a splicing unit 350. Specifically:
the database 310 is used for storing voice fragments;
the sampling point selecting unit 330 is configured to reserve N sampling points from the first speech segment and the second speech segment as overlapping portions;
the first calculating unit 330 is configured to calculate a fading-out coefficient vector and a fading-in coefficient vector;
the second calculating unit 340 is configured to obtain a sampling point value of the overlapping portion based on the fading-out coefficient vector and the fading-in coefficient vector;
the splicing unit 350 is configured to finish the splicing of the first speech segment and the second speech segment based on the obtained sampling point value of the overlapping portion.
The functions performed by each component of the device provided by this embodiment of the invention have been described in detail in the method above, so they are not repeated here.
Corresponding to the above embodiment, the present invention further provides a system for solving the problem of pulse signal generation at splice points in speech synthesis. As shown in Fig. 4, the system includes at least one processor 410 and a memory 420;
the memory 420 is used for storing one or more program instructions;
the processor 410 is configured to execute the one or more program instructions to perform any of the method steps of the method for solving the problem of pulse signal generation at splice points in speech synthesis described in the above embodiments.
Corresponding to the above embodiment, the embodiment of the present invention further provides a chip, which is coupled to the memory in the above system, so that the chip invokes the program instructions stored in the memory when running, to implement the method for solving the problem of generating a pulse signal at the splice in speech synthesis as described in the above embodiment.
Corresponding to the above embodiments, the present invention further provides a computer storage medium containing one or more program instructions, which are executed by the system for solving the problem of pulse signal generation at splice points in speech synthesis to implement the method described above.
The method provided by the embodiments of the invention does not depend on the source of the pulse signal generated at the splice during speech synthesis. It takes a weighted average of the preceding and following sampling point values; a pulse sample with a large value is given a small weight, so the product becomes small. This smooths the speech, reduces or avoids audible noise, and greatly reduces the negative impact of pulse signals on the overall prosody, timbre and listening experience of the speech.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of function in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may reside in Random Access Memory (RAM), flash memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing detailed description of the invention has been presented for purposes of illustration and description, and it should be understood that the invention is not limited to the particular embodiments disclosed, but is intended to cover all modifications, equivalents, alternatives, and improvements within the spirit and principles of the invention.

Claims (6)

1. A method for solving the problem of pulse signal generation at splice points in speech synthesis, the method comprising the steps of:
extracting two speech segments to be spliced from a database as a first speech segment and a second speech segment, and reserving N sampling points from each of the first and second speech segments as an overlapping portion; wherein N is greater than 256;
calculating a fade-out coefficient vector and a fade-in coefficient vector from the sampling points;
obtaining the sampling point values of the overlapping portion using the fade-out coefficient vector and the fade-in coefficient vector;
completing the splicing of the first speech segment and the second speech segment based on the obtained sampling point values of the overlapping portion;
the fade-out coefficient vector is calculated as:
fadeout_coef = ((N-1)/N, (N-2)/N, ..., 1/N)
the fade-in coefficient vector is calculated as:
fadein_coef = (1/N, 2/N, ..., (N-1)/N)
the sampling point values of the overlap are calculated as:
samples = fadeout_coef * samplesout + fadein_coef * samplesin
wherein fadeout_coef is the fade-out coefficient vector; fadein_coef is the fade-in coefficient vector;
samplesout is the vector of sampling point values in the first speech segment; samplesin is the vector of sampling point values in the second speech segment.
2. The method according to claim 1, wherein obtaining the sampling point values of the overlapping portion using the fade-out coefficient vector and the fade-in coefficient vector specifically comprises:
obtaining the sampling point values of the overlapping portion from the sampling point values in the first speech segment, the sampling point values in the second speech segment, the fade-out coefficient vector and the fade-in coefficient vector.
3. A device for solving the problem of pulse signal generation at splice points in speech synthesis, characterized by comprising a database unit, a sampling point selection unit, a first calculation unit, a second calculation unit and a splicing unit;
the database unit is used for storing speech segments;
the sampling point selection unit is used for reserving N sampling points from each of the first and second speech segments as the overlapping portion;
the first calculation unit is used for calculating the fade-out coefficient vector and the fade-in coefficient vector;
the second calculation unit is used for obtaining the sampling point values of the overlapping portion based on the fade-out coefficient vector and the fade-in coefficient vector;
the splicing unit is used for completing the splicing of the first speech segment and the second speech segment based on the obtained sampling point values of the overlapping portion;
the fade-out coefficient vector is calculated as:
fadeout_coef = ((N-1)/N, (N-2)/N, ..., 1/N)
the fade-in coefficient vector is calculated as:
fadein_coef = (1/N, 2/N, ..., (N-1)/N)
the sampling point values of the overlap are calculated as:
samples = fadeout_coef * samplesout + fadein_coef * samplesin
wherein fadeout_coef is the fade-out coefficient vector; fadein_coef is the fade-in coefficient vector;
samplesout is the vector of sampling point values in the first speech segment; samplesin is the vector of sampling point values in the second speech segment.
4. A system for solving the problem of pulse signal generation at splice points in speech synthesis, characterized by comprising at least one processor and a memory;
the memory is used for storing one or more program instructions;
the processor is configured to execute one or more program instructions for performing the method of claim 1 or 2.
5. A chip, characterized in that the chip is coupled to a memory in a system such that the chip, when running, invokes program instructions stored in the memory, implementing the method according to claim 1 or 2.
6. A computer readable storage medium comprising one or more program instructions executable by the system of claim 4 to implement the method of claim 1 or 2.
CN202011396383.0A 2020-12-03 2020-12-03 Method, device and system for solving generation of pulse signals at splicing position in speech synthesis Active CN112562635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011396383.0A CN112562635B (en) 2020-12-03 2020-12-03 Method, device and system for solving generation of pulse signals at splicing position in speech synthesis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011396383.0A CN112562635B (en) 2020-12-03 2020-12-03 Method, device and system for solving generation of pulse signals at splicing position in speech synthesis

Publications (2)

Publication Number Publication Date
CN112562635A CN112562635A (en) 2021-03-26
CN112562635B true CN112562635B (en) 2024-04-09

Family

ID=75047752

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011396383.0A Active CN112562635B (en) 2020-12-03 2020-12-03 Method, device and system for solving generation of pulse signals at splicing position in speech synthesis

Country Status (1)

Country Link
CN (1) CN112562635B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003066983A (en) * 2001-08-30 2003-03-05 Sharp Corp Voice synthesizing apparatus and method, and program recording medium
JP2005091747A (en) * 2003-09-17 2005-04-07 Mitsubishi Electric Corp Speech synthesizer
WO2011030424A1 (en) * 2009-09-10 2011-03-17 株式会社東芝 Voice synthesizing apparatus and program
CN104517605A (en) * 2014-12-04 2015-04-15 北京云知声信息技术有限公司 Speech segment assembly system and method for speech synthesis
CN109036386A (en) * 2018-09-14 2018-12-18 北京网众共创科技有限公司 A kind of method of speech processing and device
CN109599090A (en) * 2018-10-29 2019-04-09 阿里巴巴集团控股有限公司 A kind of method, device and equipment of speech synthesis
CN109767783A (en) * 2019-02-15 2019-05-17 深圳市汇顶科技股份有限公司 Sound enhancement method, device, equipment and storage medium
CN111640456A (en) * 2020-06-04 2020-09-08 合肥讯飞数码科技有限公司 Overlapped sound detection method, device and equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013011634A1 (en) * 2011-07-19 2013-01-24 日本電気株式会社 Waveform processing device, waveform processing method, and waveform processing program
US8744854B1 (en) * 2012-09-24 2014-06-03 Chengjun Julian Chen System and method for voice transformation


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
High-Quality Speech Synthesis System Based on Waveform Concatenation of Phoneme Segments; Hirokawa, T. et al.; IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences; Vol. E76A, No. 11; pp. 1964-1970 *
Research on waveform concatenation transition algorithms in speech synthesis systems; Zhang Peng et al.; Journal of Natural Science of Heilongjiang University; Vol. 28, No. 6; pp. 867-870 *

Also Published As

Publication number Publication date
CN112562635A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
US20210233550A1 (en) Voice separation device, voice separation method, voice separation program, and voice separation system
US9767790B2 (en) Voice retrieval apparatus, voice retrieval method, and non-transitory recording medium
CN117043855A (en) Unsupervised parallel Tacotron non-autoregressive and controllable text-to-speech
WO2021183229A1 (en) Cross-speaker style transfer speech synthesis
US10636412B2 (en) System and method for unit selection text-to-speech using a modified Viterbi approach
CN113327575B (en) Speech synthesis method, device, computer equipment and storage medium
CN111508519A (en) Method and device for enhancing voice of audio signal
Luong et al. Bootstrapping non-parallel voice conversion from speaker-adaptive text-to-speech
Sainath et al. Improving the latency and quality of cascaded encoders
Oyamada et al. Non-native speech conversion with consistency-aware recursive network and generative adversarial network
CN113674733A (en) Method and apparatus for speaking time estimation
CN110164413A (en) Phoneme synthesizing method, device, computer equipment and storage medium
KR102018286B1 (en) Method and Apparatus for Removing Speech Components in Sound Source
WO2020071213A1 (en) Acoustic model learning device, voice synthesis device, and program
CN114783409A (en) Training method of speech synthesis model, speech synthesis method and device
US11017790B2 (en) Avoiding speech collisions among participants during teleconferences
US20240096332A1 (en) Audio signal processing method, audio signal processing apparatus, computer device and storage medium
CN112562635B (en) Method, device and system for solving generation of pulse signals at splicing position in speech synthesis
CN113987149A (en) Intelligent session method, system and storage medium for task robot
CN108932943A (en) Command word sound detection method, device, equipment and storage medium
JP2011170190A (en) Device, method and program for signal separation
WO2024072481A1 (en) Text to speech synthesis without using parallel text-audio data
KR102613030B1 (en) Speech synthesis method and apparatus using adversarial learning technique
US11830481B2 (en) Context-aware prosody correction of edited speech
CN114999440A (en) Avatar generation method, apparatus, device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant