CN112102807A - Speech synthesis method, apparatus, computer device and storage medium

Speech synthesis method, apparatus, computer device and storage medium

Info

Publication number
CN112102807A
Authority
CN
China
Prior art keywords
voice
text
speech
synthesized
voice text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010824547.9A
Other languages
Chinese (zh)
Inventor
沈传科
赵凯
王福海
张文锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Merchants Union Consumer Finance Co Ltd
Original Assignee
Merchants Union Consumer Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Merchants Union Consumer Finance Co Ltd filed Critical Merchants Union Consumer Finance Co Ltd
Priority to CN202010824547.9A priority Critical patent/CN112102807A/en
Publication of CN112102807A publication Critical patent/CN112102807A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application relates to a speech synthesis method, apparatus, computer device and storage medium. The method comprises the following steps: splitting a voice text to be synthesized to obtain a plurality of split voice texts; classifying the split voice texts to obtain a first voice text and a second voice text, the first voice text being a voice text that contains user information and the second voice text being a voice text that does not contain user information; acquiring a first voice corresponding to the first voice text and a second voice corresponding to the second voice text, the voice parameters of the first voice corresponding to the voice parameters of the second voice; and splicing the first voice and the second voice to obtain the synthesized voice of the voice text to be synthesized. Because the voice parameters of the first voice correspond to those of the second voice, the synthesized voice obtained after splicing keeps a uniform timbre, volume, and tone, so the splicing is smooth and the abruptness of the voice at the splicing points is greatly reduced.

Description

Speech synthesis method, apparatus, computer device and storage medium
Technical Field
The present application relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method, apparatus, computer device, and storage medium.
Background
Intelligent speech interaction is based on technologies such as speech recognition, speech synthesis, and natural language understanding, and gives a product an intelligent human-machine interaction experience of being able to listen, speak, and understand the user. In an intelligent voice interaction scene, the traditional speech synthesis method produces speech by manually pre-recording segments and splicing them together.
However, with manual pre-recording it is difficult to keep the recordings uniform in emotion, tone, volume, and the like, which easily gives the spliced voice a very obvious abruptness at the splicing points.
Therefore, the speech obtained by the traditional speech synthesis method is difficult to make meet requirements in terms of emotion, tone, volume, and the like.
Disclosure of Invention
In view of the above, it is necessary to provide a speech synthesis method, apparatus, computer device, and storage medium to solve the technical problem that speech obtained by the above speech synthesis method is difficult to make meet requirements in terms of emotion, tone, volume, and the like.
A method of speech synthesis, the method comprising:
splitting a voice text to be synthesized to obtain a plurality of split voice texts;
classifying the split voice texts to obtain a first voice text and a second voice text; the first voice text is a voice text containing user information; the second voice text is a voice text that does not contain user information;
acquiring a first voice corresponding to the first voice text, and acquiring a second voice corresponding to the second voice text; the voice parameters of the first voice correspond to the voice parameters of the second voice;
and splicing the first voice and the second voice to obtain the synthesized voice of the voice text to be synthesized.
In one embodiment, before splitting the speech text to be synthesized, the method further includes:
acquiring a first response voice of a voice call object;
recognizing the first response voice through a pre-trained voice recognition model to obtain voice text information of the first response voice;
determining a second response voice text aiming at the first response voice according to the voice text information;
and replacing target information included in the second response voice text with the user information to obtain the voice text to be synthesized.
In one embodiment, the splitting the speech text to be synthesized to obtain a plurality of split speech texts includes:
acquiring a voice text splitting position in the voice text to be synthesized; the voice text splitting position is determined according to the punctuation position information in the voice text to be synthesized;
and splitting the voice text to be synthesized according to the voice text splitting position to obtain a plurality of split voice texts of the voice text to be synthesized.
In one embodiment, the obtaining of the first voice corresponding to the first voice text includes:
matching the first voice text with a voice text of preset voice in a preset voice cache;
if the first voice text is matched with the voice text of the preset voice in the preset voice cache, taking the voice of the voice text matched with the first voice text as the first voice;
if the first voice text is not matched with the voice text of the preset voice in the preset voice cache, performing voice synthesis processing on the first voice text through a pre-trained voice synthesis model to obtain the first voice; the voice synthesis model is trained on voice samples corresponding to the voice parameters of the second voice.
In one embodiment, the obtaining of the second speech corresponding to the second speech text includes:
matching the second voice text with the voice text of the preset voice in the preset voice cache;
and if the second voice text is matched with the voice text of the preset voice in the preset voice cache, taking the voice of the voice text matched with the second voice text as the second voice.
In one embodiment, after performing speech synthesis processing on the first speech text through a pre-trained speech synthesis model to obtain the first speech, the method further includes:
storing the first voice into the voice cache;
and when the obtained next voice text is the same as the voice text of the first voice, obtaining the first voice from the voice cache as the voice corresponding to the next voice text.
In one embodiment, the determining, according to the speech text information, a second answer speech text for the first answer speech includes:
according to the voice text information, determining intention information of the voice call object;
and acquiring a corresponding response voice text from a voice text database corresponding to a pre-constructed voice flow tree according to the intention information, wherein the response voice text is used as the second response voice text.
A speech synthesis apparatus, the apparatus comprising:
the voice text splitting module is used for splitting the voice text to be synthesized to obtain a plurality of split voice texts;
the voice text classification module is used for classifying the split voice texts to obtain a first voice text and a second voice text; the first voice text is a voice text containing user information; the second voice text is a voice text that does not contain user information;
the voice acquisition module is used for acquiring a first voice corresponding to the first voice text and acquiring a second voice corresponding to the second voice text; the voice parameters of the first voice correspond to the voice parameters of the second voice;
and the voice splicing module is used for splicing the first voice and the second voice to obtain the synthesized voice of the voice text to be synthesized.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
splitting a voice text to be synthesized to obtain a plurality of split voice texts;
classifying the split voice texts to obtain a first voice text and a second voice text; the first voice text is a voice text containing user information; the second voice text is a voice text that does not contain user information;
acquiring a first voice corresponding to the first voice text, and acquiring a second voice corresponding to the second voice text; the voice parameters of the first voice correspond to the voice parameters of the second voice;
and splicing the first voice and the second voice to obtain the synthesized voice of the voice text to be synthesized.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
splitting a voice text to be synthesized to obtain a plurality of split voice texts;
classifying the split voice texts to obtain a first voice text and a second voice text; the first voice text is a voice text containing user information; the second voice text is a voice text that does not contain user information;
acquiring a first voice corresponding to the first voice text, and acquiring a second voice corresponding to the second voice text; the voice parameters of the first voice correspond to the voice parameters of the second voice;
and splicing the first voice and the second voice to obtain the synthesized voice of the voice text to be synthesized.
According to the above speech synthesis method, apparatus, computer device, and storage medium, the voice text to be synthesized is first split to obtain a plurality of split voice texts; the split voice texts are then divided into a first voice text that includes user information and a second voice text that does not; after the first voice corresponding to the first voice text and the second voice corresponding to the second voice text are obtained respectively, the first voice and the second voice are spliced to obtain the synthesized voice of the voice text to be synthesized. With this method, on the basis of realizing voice synthesis, the correspondence of the voice parameters of the first voice and the second voice keeps the synthesized voice obtained after splicing uniform in timbre, volume, and tone, so that smooth splicing is realized and the abruptness of the voice at the splicing points is greatly reduced.
Drawings
FIG. 1 is a diagram of an application environment of a speech synthesis method in one embodiment;
FIG. 2 is a flow diagram illustrating a method for speech synthesis in one embodiment;
FIG. 3 is a flowchart illustrating the steps of obtaining a text of speech to be synthesized in one embodiment;
FIG. 4 is a flow diagram of the adaptive maintenance of a pre-synthesized dialog library in one embodiment;
FIG. 5 is a diagram of a scheme for implementing speech synthesis according to one embodiment;
FIG. 6 is a schematic flow diagram of an example presynthesis scheme;
FIG. 7 is a block diagram showing the structure of a speech synthesis apparatus according to an embodiment;
FIG. 8 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The speech synthesis method provided by the application can be applied to the application environment shown in fig. 1. The server 104 may be disposed in the voice interaction robot 106, or may communicate with the voice interaction robot 106 through a network. The voice call object 102 can perform voice interaction with the voice interaction robot 106, and the server 104 can be implemented by an independent server or a server cluster composed of a plurality of servers. In a speech synthesis application scene, the server 104 splits a voice text to be synthesized to obtain a plurality of split voice texts; classifies the split voice texts to obtain a first voice text, which contains user information, and a second voice text, which does not; acquires a first voice corresponding to the first voice text and a second voice corresponding to the second voice text, the voice parameters of the first voice corresponding to the voice parameters of the second voice; and splices the first voice and the second voice to obtain the synthesized voice of the voice text to be synthesized, which is sent to the voice interaction robot 106 so that the robot can converse with the voice call object 102 using the synthesized voice.
In one embodiment, as shown in fig. 2, a speech synthesis method is provided, which is exemplified by the application of the method to the server 104 in fig. 1, and includes the following steps:
step S202, splitting the voice text to be synthesized to obtain a plurality of split voice texts.
The voice text to be synthesized is the voice text that the server 104 obtains from the voice text database and uses to conduct the voice dialogue with the voice call object.
In a specific implementation, after the server 104 obtains the voice text to be synthesized, the splitting position of the voice text to be synthesized may be determined first, and the voice text to be synthesized is split into a plurality of voice texts according to the splitting position, so as to obtain a plurality of split voice texts.
For example, suppose the voice text to be synthesized is "This is the Chinese bank credit center, do you know Mr. Wang? We have important business to contact him." It can be split to obtain three split voice texts: "This is the Chinese bank credit center,", "do you know Mr. Wang?", and "We have important business to contact him."
Step S204, classifying the split voice texts to obtain a first voice text and a second voice text; the first voice text is a voice text containing user information; the second voice text is a voice text that does not contain user information.
The first voice text includes user information. For example, in the above voice text "do you know Mr. Wang?", "Mr. Wang" includes the surname and gender of the user and therefore belongs to the user information, so this voice text is a first voice text.
The second voice text does not include user information. For example, the above voice text "We have important business to contact him." contains no user information, so this voice text is a second voice text.
The user information is the user's own information or information related to the user, for example the user's name, age, transaction amount, transaction time, payment time, or whether the user has transacted a certain service.
In a specific implementation, after obtaining the split voice text, the server 104 may classify the split voice text according to a criterion of whether the voice text contains the user information, so as to obtain a first voice text including the user information and a second voice text not including the user information.
For example, the voice text to be synthesized "This is the Chinese bank credit center, do you know Mr. Wang? We have important business to contact him." is split and classified to obtain the first voice text "do you know Mr. Wang?" and the second voice texts "This is the Chinese bank credit center," and "We have important business to contact him."
Step S206, acquiring a first voice corresponding to the first voice text, and acquiring a second voice corresponding to the second voice text; the voice parameters of the first voice correspond to the voice parameters of the second voice.
The voice parameters represent characteristic attributes of sound, including volume, tone, and timbre.
In a specific implementation, after obtaining the first voice text and the second voice text, the server 104 may match them against the voice texts of the preset voices in a preset voice cache. If the voice cache contains voice texts matching the first voice text and the second voice text, the server can directly take the voice whose text matches the first voice text as the first voice and the voice whose text matches the second voice text as the second voice. If the voice cache contains no voice text matching the first voice text, the unmatched first voice text can be synthesized in real time during the conversation with the voice call object to obtain the first voice.
And S208, splicing the first voice and the second voice to obtain the synthesized voice of the voice text to be synthesized.
In a specific implementation, after the server 104 obtains the first voice and the second voice, it can splice them. Since there may be several first voices and second voices, they need to be spliced in the set order to finally obtain the synthesized voice of the voice text to be synthesized. For example, for the first voice text "do you know Mr. Wang?" and the second voice texts "This is the Chinese bank credit center," and "We have important business to contact him." obtained by splitting, the splicing is performed in the order "This is the Chinese bank credit center," then "do you know Mr. Wang?" then "We have important business to contact him.", obtaining the synthesized voice of the voice text to be synthesized "This is the Chinese bank credit center, do you know Mr. Wang? We have important business to contact him."
According to the above speech synthesis method, the voice text to be synthesized is first split to obtain a plurality of split voice texts; the split voice texts are then divided into a first voice text that includes user information and a second voice text that does not; after the first voice corresponding to the first voice text and the second voice corresponding to the second voice text are obtained respectively, the first voice and the second voice are spliced to obtain the synthesized voice of the voice text to be synthesized. With this method, on the basis of realizing voice synthesis, the correspondence of the voice parameters of the first voice and the second voice keeps the synthesized voice obtained after splicing uniform in timbre, volume, and tone, realizing smooth splicing and greatly reducing the abruptness of the voice at the splicing points.
In one embodiment, as shown in fig. 3, before the step S202, the following steps are further included:
step S302, a first response voice of the voice call object is obtained.
Step S304, recognizing the first response voice through a pre-trained voice recognition model to obtain voice text information of the first response voice.
Step S306, determining a second response voice text aiming at the first response voice according to the voice text information;
and step S308, replacing the target information included in the second response voice text with the user information to obtain the voice text to be synthesized.
The first response voice is the response voice with which the voice call object replies to the preset call voice played by the voice interaction robot.
The voice text information may represent information obtained by converting the voice content of the first response voice into a text.
The target information is the text information in the second response voice text that needs to be modified or replaced for different voice call objects. For example, in the second response voice text "May I ask if you are [userName] yourself?", the [userName] is the target information in the voice text.
In a specific implementation, after obtaining the first response voice of the voice call object, the server 104 may first convert the first response voice into text through a pre-trained voice recognition model to obtain the voice text information of the first response voice, determine the intention information of the voice call object according to the voice text information, determine the second response voice text for the first response voice according to the intention information, then determine the target information to be replaced in the second response voice text, and replace that target information with the information of the voice call object (i.e., the user information) to obtain the voice text to be synthesized.
For example, if the obtained second response voice text is "May I ask if you are [userName] yourself?", the target information [userName] is replaced with the information of the voice call object, and "May I ask if you are Mr. A yourself?" is obtained as the voice text to be synthesized.
In this embodiment, the voice text information of the first response voice is obtained by recognizing the first response voice of the voice call object, the second response voice text for the first response voice is determined according to the voice text information, and the target information in the second response voice text is replaced by the user information of the voice call object to obtain the voice text to be synthesized for that object, so that the voice text to be synthesized can be further split, synthesized, and spliced to finally obtain the synthesized voice, realizing intelligent interaction with the voice call object.
In an embodiment, the step S202 specifically includes: acquiring a voice text splitting position in a voice text to be synthesized; the splitting position of the voice text is determined according to the punctuation position information in the voice text to be synthesized; and splitting the voice text to be synthesized according to the splitting position of the voice text to obtain a plurality of split voice texts of the voice text to be synthesized.
The voice text splitting position represents a position where the voice text to be synthesized is split into a plurality of voice texts, and the voice text to be synthesized can have one or more voice text splitting positions.
In a specific implementation, after obtaining the voice text to be synthesized, the server 104 may determine one or more splitting positions according to the punctuation position information in the voice text to be synthesized, and split the voice text to be synthesized at those positions to obtain a plurality of split voice texts. For example, the voice text to be synthesized "This is the Chinese bank credit center, do you know Mr. Wang? We have important business to contact him." is split according to the punctuation position information to obtain the three split voice texts "This is the Chinese bank credit center,", "do you know Mr. Wang?", and "We have important business to contact him."
It can be understood that when the voice text splitting positions are determined, every punctuation position can be taken as a splitting position, so that each split voice text contains only one punctuation mark, or a splitting position can be chosen every one or more punctuation positions, so that each split voice text contains several punctuation marks, which reduces the number of splicing points and smooths the abruptness of the synthesized voice.
Further, in an embodiment, the voice text splitting position may be determined according to position information of the voice text including the user information in the voice text to be synthesized and punctuation position information of the voice text to be synthesized.
Specifically, the position of the voice text that includes the user information within the voice text to be synthesized may be determined first; starting from that position, the nearest punctuation position on each side is found (which amounts to extending from the position of the user-information text toward both sides of the voice text to be synthesized until the one or two nearest punctuation marks are reached), the found punctuation positions are taken as the voice text splitting positions, and the voice text to be synthesized is split there, thereby obtaining the voice text that includes user information and the voice texts that do not.
For example, consider the voice text to be synthesized "This grace period is given to you; continuing to be overdue will further affect your credit and incur penalty interest. We suggest you manually repay through the Merchants Union APP (application) by [rpyTime] at the latest, okay?", which includes the user information [rpyTime]. The punctuation positions nearest to [rpyTime] on each side are found and used as the splitting positions, yielding the voice text that includes user information, "We suggest you manually repay through the Merchants Union APP by [rpyTime] at the latest,", and the voice texts that do not, "This grace period is given to you; continuing to be overdue will further affect your credit and incur penalty interest." and "okay?", i.e., three split voice texts.
In this embodiment, the voice text splitting positions are determined from the punctuation position information in the voice text to be synthesized, and the voice text is split accordingly, which reduces the number of splicing points and avoids the defect that too many splitting positions yield very short split texts with poor synthesis quality. Furthermore, by locating the voice text that includes the user information, determining the splitting positions from that location, and splitting the voice text to be synthesized so that the user-information part becomes one voice text and the remaining parts each become their own voice texts, the splitting positions are determined faster and the number of split voice texts is reduced, saving speech synthesis time.
In an embodiment, the obtaining of the first voice corresponding to the first voice text in step S206 specifically includes: matching the first voice text with the voice text of the preset voice in the preset voice cache; if the first voice text is matched with the voice text of the preset voice in the preset voice cache, taking the voice of the voice text matched with the first voice text as the first voice; if the first voice text is not matched with the voice text of the preset voice in the preset voice cache, performing voice synthesis processing on the first voice text through a pre-trained voice synthesis model to obtain a first voice; the speech synthesis model is obtained by training the speech samples corresponding to the speech parameters of the second speech.
Where speech synthesis represents the process of converting speech text to speech.
In a specific implementation, the server 104 splits the voice text to be synthesized to obtain the first voice text. To obtain the first voice for the first voice text, the first voice text may first be matched against the voice texts of the preset voices in the preset voice cache; if a match is found, the voice of the matching voice text can be taken directly as the first voice. Otherwise, if the first voice text matches no voice text of a preset voice in the preset voice cache, a TTS (Text-To-Speech) cloud service needs to be called in real time to synthesize the first voice: the first voice text is processed by a pre-trained speech synthesis model, trained on voice corresponding to the voice parameters of the second voice, to obtain the first voice.
Further, after performing speech synthesis processing on the first speech text through a pre-trained speech synthesis model to obtain a first speech, the method further includes: storing the first voice into a voice cache; and when the acquired next voice text is the same as the voice text of the first voice, acquiring the first voice from the voice cache as the voice corresponding to the next voice text.
It should be noted that the next speech text in this embodiment refers to a speech text at any subsequent time, that is, as long as the obtained speech text is the same as the speech text of the first speech, the first speech can be directly acquired from the speech buffer and used as the synthesized speech of the speech text.
In a specific implementation, if the first voice is obtained through the speech synthesis model, the obtained first voice can be stored in the voice cache for reuse in subsequent voice interactions, thereby reducing the time consumed by real-time speech synthesis in each interaction.
In this embodiment, matching the first voice text against the voice texts of the preset voices in the preset voice cache means that, when a synthesized recording matching the first voice text already exists in the cache, the matching synthesized voice can be obtained directly, which reduces the time consumed by speech synthesis during the voice interaction and improves the user's interaction experience. When no matching synthesized recording exists in the cache, the first voice text is synthesized with a speech synthesis model trained on voice corresponding to the voice parameters of the second voice, so that the voice parameters of the first voice correspond to those of the second voice; the first voice and the second voice are thus spliced smoothly, the abruptness at the splice is reduced, and the personification of the voice is improved, overcoming the poor emotion and personification of voice synthesized purely by a speech synthesis model.
Further, in an embodiment, the method also includes maintaining a dialog library of pre-synthesized speech (first voices synthesized by invoking the TTS service in advance). Fig. 4 shows the adaptive maintenance flow of the pre-synthesized dialog library; its implementation mainly includes the following steps:
(1) Scan the strategy. The strategy represents the outbound call parameters, including validity, dialing time, dialog version, timbre, and the like; scanning the strategy means traversing the historical outbound call parameters.
(2) According to the outbound call parameters, screen out from the historical call voice records a number of first call voices that meet the set conditions (i.e., are valid and use the TTS timbre), and obtain the content of each first call voice (i.e., obtain its voice text information).
(3) Query the historical call voice record table with the obtained content of each first call voice, count the use frequency of each first call voice, and sort the first call voices that meet the set conditions by use frequency to obtain a first call voice sequence.
(4) Traverse the first call voice sequence in order of use frequency from high to low and screen out the second call voices that contain a variable (i.e., user information) that cannot be enumerated. Obtain the node information of each second call voice (i.e., dialog), look up in the voice flow tree the call voices of all predecessor nodes of the node where the second call voice is located, obtain the mapping relation between the second call voice and the call voices of its predecessor nodes, and enter the second call voice and its corresponding predecessor-node call voices into the pre-synthesis dialog configuration table according to that mapping, thereby configuring the pre-synthesized dialogs.
(5) Judge whether the number of predecessor-node dialog voices configured for a second call voice in the configuration table has reached the upper limit of configuration entries (i.e., whether it has reached the threshold); if so, adjust the order of the dialogs (i.e., the second call voices and their corresponding predecessor-node dialog voices) in the configuration table according to the voice flow tree.
In this embodiment, the most likely needed voices are cached based on experience at cold start (i.e., the dialog configuration table in step (5)), and after the system has run for a period of time the cache is maintained dynamically by the adaptive algorithm, so that the more speech synthesis is performed, the lower the average synthesis time becomes.
In an embodiment, the obtaining of the second voice corresponding to the second voice text in step S206 specifically includes: matching the second voice text with the voice text of the preset voice in the preset voice cache; and if the second voice text is matched with the voice text of the preset voice in the preset voice cache, taking the voice of the voice text matched with the second voice text as the second voice.
In specific implementation, the second voice text without the user information can be prerecorded to obtain a second voice, and the second voice is stored in a preset voice cache. After the server 104 splits the to-be-synthesized voice text to obtain the second voice text, the second voice text may be directly matched with the voice text of the preset voice in the preset voice cache, and the voice of the voice text matched with the second voice text in the preset voice cache is directly obtained as the second voice.
In this embodiment, the second speech text is matched with the speech text of the preset speech in the preset speech cache, so that when the synthesized recording matched with the second speech text exists in the preset speech cache, the matched synthesized speech can be directly acquired, time consumed for speech synthesis in the speech interaction process is reduced, and the interaction experience of the user is improved.
In an embodiment, the step S306 specifically includes: according to the voice text information, determining intention information of the voice call object; and acquiring a corresponding response voice text from a voice text database corresponding to a pre-constructed voice flow tree according to the intention information, wherein the response voice text is used as a second response voice text.
Wherein the intention information may indicate an intention of the voice call partner to respond.
In a specific implementation, a voice flow tree for voice interaction with voice call objects can be pre-constructed; the voice flow tree can be built from the historical call voices of the voice call objects, which can be converted into voice texts and stored in the voice text database. After the voice text information of the first response voice of the voice call object is obtained through the voice recognition model, the voice text information can be input into a semantic recognition model, the intention information of the voice call object can be determined from the output of the semantic recognition model, and the response voice text for the first response voice can then be obtained, according to that intention information, from the voice text database corresponding to the voice flow tree and used as the second response voice text.
In this embodiment, the intention information of the voice call object is determined according to the voice text information of the first response voice, and the response voice text for the first response voice is determined according to the intention information and is used as the second response voice text, so that the voice text to be synthesized is determined according to the second response voice text.
To describe the technical solution provided by the embodiments of the present application more clearly, the solution is explained below with reference to figs. 5 and 6. Fig. 5 is a schematic diagram of a speech synthesis scheme in an application example; the specific flow is as follows:
1. During the AI (artificial intelligence) interaction, after obtaining a voice text to be synthesized (i.e., the dialog in fig. 5), the voice interaction robot tears out the part of the voice text that includes user information, extended to the nearest punctuation on both sides, as its own voice text (i.e., "do you know Mr. Wang" in fig. 5), while the parts that do not include user information each become a voice text (i.e., "this is the xx bank credit center" and "we have important business to contact him" in fig. 5), thereby obtaining a plurality of split voice texts.
In this step, moving the splitting positions to the punctuation marks reduces the number of splicing points on the one hand and smooths the transition at the splices on the other.
2. A voice artist records audio in advance to train the speech synthesis (TTS) model; the same artist also manually pre-records the voice texts, split in step 1, that do not include user information, while the voice texts that include user information are synthesized in real time by requesting the speech synthesis model during the voice interaction.
In this step, the voice texts without user information use real human recordings, which improves the overall personification of the synthesized sound, while the voice texts with user information are synthesized by a speech synthesis model trained on the same artist's voice, which smooths the differences between the spliced recording segments and solves the problem that the variables in pre-recordings cannot be enumerated.
3. And splicing the obtained voice containing the user information and the voice not containing the user information to obtain complete synthesized voice for the voice interaction robot to use.
The customized speech synthesis model + optimal splitting + semi-synthetic speech synthesis scheme provided by this application first optimally splits the interactive voice text to be synthesized: the part that includes user information is extended to the nearest punctuation on both sides and torn out, and is synthesized with the speech synthesis model, which overcomes the poor synthesis quality of short texts, reduces the number of splicing points, and at the same time overcomes the problem that pre-recordings cannot enumerate the variables. The split voice texts without user information use pre-recordings, combined with a speech synthesis model trained on the same artist's voice for the texts with user information, realizing smooth splicing, greatly reducing abruptness, and improving the personification of the synthesized voice.
Meanwhile, on the basis of the above method, in order to reduce the time consumed by a single speech synthesis, the application also proposes caching the hot-spot recordings and, when the outbound voice call is connected, using an asynchronous two-stage pre-synthesis scheme (i.e., thread pool one and thread pool two in fig. 6). Fig. 6 is a schematic flow diagram of the pre-synthesis scheme, which mainly includes the following steps:
(1) when the voice interaction robot calls out and enters the interaction process, an asynchronous thread pool, namely a thread pool I and a thread pool II in fig. 6, is started outside the main interaction thread, wherein the thread pool I is used for preprocessing the voice text to be synthesized, and the thread pool II is used for processing voice synthesis.
(2) During the voice interaction, when the voice interaction robot is at a node where it plays voice, the dialog group to be pre-synthesized is obtained (i.e., the dialog group, equivalent to preset voice texts, fetched from the dialog library, equivalent to the above voice text database, corresponding to the flow tree of fig. 4), the variables in the dialogs are parsed, and the variable values (i.e., the call object's information) are obtained.
(3) Traverse the dialog group and replace the variables in each dialog with the corresponding variable values (i.e., determine the target information to be replaced in the voice text and replace it with the information of the outbound call object, that is, the user information) to obtain the voice texts to be synthesized.
(4) Split each dialog (i.e., voice text to be synthesized) in the set splitting manner, take the parts containing variables as the texts to synthesize (i.e., the first voice texts containing user information), and synthesize the first voice and the second voice.
(5) And storing the first voice and the second voice into a voice library for the voice interaction robot to use in a main interaction thread of a later interaction turn.
(6) Repeat steps (2) to (5) whenever the voice interaction robot is at a node where it plays voice, so that each round's voices are pre-synthesized one round in advance. The main interaction thread of the current round fetches its voices (i.e., the first voice and the second voice) preferentially from the voice library, at which point it uses the recordings pre-synthesized in the asynchronous thread pools during the previous round, reducing the speech synthesis time on the main interaction thread.
It should be understood that although the steps in the flowcharts of figs. 2-4 and 6 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in figs. 2-4 and 6 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 7, there is provided a speech synthesis apparatus including: a speech text splitting module 702, a speech text classification module 704, a speech acquisition module 706, and a speech concatenation module 708, wherein:
the speech text splitting module 702 is configured to split the speech text to be synthesized to obtain a plurality of split speech texts.
The voice text classification module 704 is configured to classify the split voice texts to obtain a first voice text and a second voice text; the first voice text is a voice text containing user information; the second voice text is a voice text that does not contain user information.
A voice obtaining module 706, configured to obtain a first voice corresponding to the first voice text and a second voice corresponding to the second voice text; the voice parameters of the first voice correspond to the voice parameters of the second voice.
The voice splicing module 708 is configured to splice the first voice and the second voice to obtain a synthesized voice of the voice text to be synthesized.
In one embodiment, the above apparatus further comprises:
the first answering voice acquisition module is used for acquiring first answering voice of a voice call object;
the text information recognition module is used for recognizing the first response voice through a pre-trained voice recognition model to obtain voice text information of the first response voice;
the second response voice text determination module is used for determining a second response voice text aiming at the first response voice according to the voice text information;
and the information replacement module is used for replacing the target information included in the second response voice text with the user information to obtain the voice text to be synthesized.
In an embodiment, the voice text splitting module 702 is specifically configured to obtain a voice text splitting position in a voice text to be synthesized; the splitting position of the voice text is determined according to the punctuation position information in the voice text to be synthesized; and splitting the voice text to be synthesized according to the splitting position of the voice text to obtain a plurality of split voice texts of the voice text to be synthesized.
In an embodiment, the voice obtaining module 706 is specifically configured to match the first voice text with the voice text of the preset voice in a preset voice cache; if the first voice text matches the voice text of a preset voice in the preset voice cache, take the voice of the matching voice text as the first voice; if the first voice text matches no voice text of a preset voice in the preset voice cache, perform voice synthesis processing on the first voice text through a pre-trained voice synthesis model to obtain the first voice; the voice synthesis model is trained on voice samples corresponding to the voice parameters of the second voice.
In an embodiment, the voice obtaining module 706 is further configured to match the second voice text with a voice text of a preset voice in a preset voice cache; and if the second voice text is matched with the voice text of the preset voice in the preset voice cache, taking the voice of the voice text matched with the second voice text as the second voice.
In one embodiment, the above apparatus further comprises:
the voice storage module is used for storing the first voice into a voice cache;
and the voice determining module is used for acquiring the first voice from the voice cache as the voice corresponding to the next voice text when the acquired next voice text is the same as the voice text of the first voice.
In an embodiment, the second response voice text determination module is specifically configured to determine the intention information of the voice call object according to the voice text information, and to acquire the corresponding response voice text, as the second response voice text, from the voice text database corresponding to the pre-constructed voice flow tree according to the intention information.
It should be noted that the speech synthesis apparatus of the present application corresponds one to one with the speech synthesis method of the present application, and the technical features and advantages described in the embodiments of the speech synthesis method all apply to the embodiments of the speech synthesis apparatus; the specific contents may be found in the description of the method embodiments and are not repeated here.
In addition, all or part of the modules in the speech synthesis apparatus may be implemented by software, hardware, or a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 8. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data generated during the implementation of the speech synthesis method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a speech synthesis method.
Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical storage, or the like. Volatile Memory can include Random Access Memory (RAM) or external cache Memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method of speech synthesis, the method comprising:
splitting a voice text to be synthesized to obtain a plurality of split voice texts;
classifying the split voice texts to obtain a first voice text and a second voice text; the first voice text is a voice text containing user information; the second voice text is a voice text that does not contain user information;
acquiring a first voice corresponding to the first voice text, and acquiring a second voice corresponding to the second voice text; the voice parameters of the first voice correspond to the voice parameters of the second voice;
and splicing the first voice and the second voice to obtain the synthesized voice of the voice text to be synthesized.
2. The method of claim 1, further comprising, prior to splitting the speech text to be synthesized:
acquiring a first response voice of a voice call object;
recognizing the first response voice through a pre-trained voice recognition model to obtain voice text information of the first response voice;
determining a second response voice text aiming at the first response voice according to the voice text information;
and replacing target information included in the second response voice text with the user information to obtain the voice text to be synthesized.
3. The method of claim 1, wherein splitting the speech text to be synthesized to obtain a plurality of split speech texts comprises:
acquiring a voice text splitting position in the voice text to be synthesized; the voice text splitting position is determined according to the punctuation position information in the voice text to be synthesized;
and splitting the voice text to be synthesized according to the voice text splitting position to obtain a plurality of split voice texts of the voice text to be synthesized.
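For illustration only: a sketch of claim 3 under the assumption that a splitting position is the index immediately after each punctuation mark.

```python
# Punctuation marks treated as split points (ASCII and Chinese; illustrative).
PUNCTUATION = set(",.;?!，。；？！")

def split_positions(text):
    """Voice text splitting positions, determined from punctuation positions."""
    return [i + 1 for i, ch in enumerate(text) if ch in PUNCTUATION]

def split_at(text, positions):
    bounds = [0] + positions + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if text[a:b].strip()]

text = "Hello, Zhang San. Your bill is due."
print(split_at(text, split_positions(text)))
# -> ['Hello,', ' Zhang San.', ' Your bill is due.']
```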
4. The method of claim 1, wherein acquiring the first voice corresponding to the first voice text comprises:
matching the first voice text against the voice texts of preset voices in a preset voice cache;
if the first voice text matches the voice text of a preset voice in the preset voice cache, taking the voice of the matched voice text as the first voice;
and if the first voice text does not match the voice text of any preset voice in the preset voice cache, performing voice synthesis processing on the first voice text through a pre-trained voice synthesis model to obtain the first voice; the voice synthesis model is trained on voice samples corresponding to the voice parameters of the second voice.
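For illustration only: a cache-first lookup matching claim 4. `tts_model` is a hypothetical object exposing a `synthesize` method; the patent only requires that the model be trained on samples matching the second voice's parameters.

```python
class VoiceStore:
    """Preset voice cache with a speech-synthesis fallback (sketch)."""

    def __init__(self, preset, tts_model):
        self.cache = dict(preset)      # voice text -> recorded voice
        self.tts_model = tts_model     # hypothetical pre-trained TTS model

    def first_voice(self, first_voice_text):
        hit = self.cache.get(first_voice_text)
        if hit is not None:
            # Matched a preset voice text: reuse its voice directly.
            return hit
        # No match: synthesize with the pre-trained speech synthesis model.
        return self.tts_model.synthesize(first_voice_text)

class _StubModel:
    def synthesize(self, text):
        return ("<tts:" + text + ">").encode()

store = VoiceStore({"Hello, ": b"<preset>"}, _StubModel())
print(store.first_voice("Hello, "))      # -> b'<preset>' (cache hit)
print(store.first_voice("Zhang San, "))  # -> b'<tts:Zhang San, >' (miss)
```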
5. The method of claim 1, wherein acquiring the second voice corresponding to the second voice text comprises:
matching the second voice text against the voice texts of the preset voices in the preset voice cache;
and if the second voice text matches the voice text of a preset voice in the preset voice cache, taking the voice of the matched voice text as the second voice.
6. The method of claim 4, wherein, after performing voice synthesis processing on the first voice text through the pre-trained voice synthesis model to obtain the first voice, the method further comprises:
storing the first voice into the preset voice cache;
and when a subsequently acquired voice text is identical to the voice text of the first voice, acquiring the first voice from the preset voice cache as the voice corresponding to that voice text.
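For illustration only: claim 6's store-and-reuse behavior can be rendered as memoization; `functools.lru_cache` here stands in for the voice cache.

```python
import functools

@functools.lru_cache(maxsize=None)
def first_voice(first_voice_text):
    # Placeholder for the pre-trained speech synthesis model call; the result
    # is stored, so an identical later voice text is served from the cache.
    return ("<tts:" + first_voice_text + ">").encode()

a = first_voice("Zhang San, ")  # synthesized, then stored in the cache
b = first_voice("Zhang San, ")  # identical text: returned from the cache
assert a is b                   # same cached object, no second synthesis
```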
7. The method of claim 2, wherein determining a second response voice text for the first response voice according to the voice text information comprises:
determining intention information of the voice call object according to the voice text information;
and acquiring, according to the intention information, a corresponding response voice text from a voice text database corresponding to a pre-constructed voice flow tree, the acquired response voice text serving as the second response voice text.
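For illustration only: one plausible shape for the pre-constructed voice flow tree and its voice text database in claim 7; node names, intents, and texts are invented for this sketch. The placeholders left in the returned text are the target information that claim 2 later replaces with user information.

```python
# Flow tree: current node -> (recognized intent -> next node).
FLOW_TREE = {
    "start": {"confirm_identity": "remind", "deny_identity": "end_call"},
    "remind": {"promise_to_pay": "thanks", "dispute": "transfer_agent"},
}
# Voice text database keyed by flow-tree node.
VOICE_TEXT_DB = {
    "remind": "Hello {name}, your bill of {amount} is due on {date}.",
    "end_call": "Sorry to have disturbed you. Goodbye.",
    "thanks": "Thank you. Goodbye.",
    "transfer_agent": "Please hold while I transfer you to an agent.",
}

def second_response_text(current_node, intent):
    """Look up the response voice text for the recognized intention."""
    next_node = FLOW_TREE[current_node][intent]
    return VOICE_TEXT_DB[next_node]

print(second_response_text("start", "confirm_identity"))
# -> Hello {name}, your bill of {amount} is due on {date}.
```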
8. A speech synthesis apparatus, characterized in that the apparatus comprises:
the voice text splitting module is used for splitting the voice text to be synthesized to obtain a plurality of split voice texts;
the voice text classification module is used for classifying the split voice texts to obtain a first voice text and a second voice text; the first voice text is a voice text containing user information; the second voice text is a voice text that does not contain user information;
the voice acquisition module is used for acquiring a first voice corresponding to the first voice text and acquiring a second voice corresponding to the second voice text; the voice parameters of the first voice correspond to the voice parameters of the second voice;
and the voice splicing module is used for splicing the first voice and the second voice to obtain the synthesized voice of the voice text to be synthesized.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202010824547.9A 2020-08-17 2020-08-17 Speech synthesis method, apparatus, computer device and storage medium Pending CN112102807A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010824547.9A CN112102807A (en) 2020-08-17 2020-08-17 Speech synthesis method, apparatus, computer device and storage medium

Publications (1)

Publication Number Publication Date
CN112102807A (en) 2020-12-18

Family

ID=73754503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010824547.9A Pending CN112102807A (en) 2020-08-17 2020-08-17 Speech synthesis method, apparatus, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN112102807A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001202092A (en) * 2000-01-18 2001-07-27 Victor Co Of Japan Ltd Voice synthesizer
US20130144624A1 (en) * 2011-12-01 2013-06-06 At&T Intellectual Property I, L.P. System and method for low-latency web-based text-to-speech without plugins
CN108777751A (en) * 2018-06-07 2018-11-09 上海航动科技有限公司 A kind of call center system and its voice interactive method, device and equipment
CN109686361A (en) * 2018-12-19 2019-04-26 深圳前海达闼云端智能科技有限公司 A kind of method, apparatus of speech synthesis calculates equipment and computer storage medium
CN110534088A (en) * 2019-09-25 2019-12-03 招商局金融科技有限公司 Phoneme synthesizing method, electronic device and storage medium
CN110782869A (en) * 2019-10-30 2020-02-11 标贝(北京)科技有限公司 Speech synthesis method, apparatus, system and storage medium
CN111009233A (en) * 2019-11-20 2020-04-14 泰康保险集团股份有限公司 Voice processing method and device, electronic equipment and storage medium
CN111243570A (en) * 2020-01-19 2020-06-05 出门问问信息科技有限公司 Voice acquisition method and device and computer readable storage medium
CN111460094A (en) * 2020-03-17 2020-07-28 云知声智能科技股份有限公司 Method and device for optimizing audio splicing based on TTS (text to speech)
CN111508466A (en) * 2019-09-12 2020-08-07 马上消费金融股份有限公司 Text processing method, device and equipment and computer readable storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113157350A (en) * 2021-03-18 2021-07-23 福建马恒达信息科技有限公司 Office auxiliary system and method based on voice recognition
CN113421549A (en) * 2021-06-30 2021-09-21 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
US11380327B2 (en) Speech communication system and method with human-machine coordination
CN111028827B (en) Interaction processing method, device, equipment and storage medium based on emotion recognition
CN112804400B (en) Customer service call voice quality inspection method and device, electronic equipment and storage medium
US8914294B2 (en) System and method of providing an automated data-collection in spoken dialog systems
US6704708B1 (en) Interactive voice response system
CN111105782B (en) Session interaction processing method and device, computer equipment and storage medium
CN107818798A (en) Customer service quality evaluating method, device, equipment and storage medium
CN104299623A (en) Automated confirmation and disambiguation modules in voice applications
CN113239147A (en) Intelligent conversation method, system and medium based on graph neural network
CN112131359A (en) Intention identification method based on graphical arrangement intelligent strategy and electronic equipment
CN115643341A (en) Artificial intelligence customer service response system
CN114818649A (en) Service consultation processing method and device based on intelligent voice interaction technology
CN112102807A (en) Speech synthesis method, apparatus, computer device and storage medium
CN114220461A (en) Customer service call guiding method, device, equipment and storage medium
CN116631412A (en) Method for judging voice robot through voiceprint matching
CN113987149A (en) Intelligent session method, system and storage medium for task robot
CN110931002B (en) Man-machine interaction method, device, computer equipment and storage medium
CN109616116B (en) Communication system and communication method thereof
CN113506565B (en) Speech recognition method, device, computer readable storage medium and processor
US11398239B1 (en) ASR-enhanced speech compression
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN110853674A (en) Text collation method, apparatus, and computer-readable storage medium
CN115022471A (en) Intelligent robot voice interaction system and method
CN117634471A (en) NLP quality inspection method and computer readable storage medium
CN116932692A (en) Method and system for analyzing target object through voice

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Applicant after: Zhaolian Consumer Finance Co.,Ltd.

Address before: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Applicant before: MERCHANTS UNION CONSUMER FINANCE Co.,Ltd.

Country or region before: China