CN111599341A - Method and apparatus for generating speech - Google Patents

Method and apparatus for generating speech

Info

Publication number
CN111599341A
Authority
CN
China
Prior art keywords: sub, dialect, dialogs, recording, information
Prior art date
Legal status
Granted
Application number
CN202010401740.1A
Other languages
Chinese (zh)
Other versions
CN111599341B (en)
Inventor
官山山
刘晓丰
唐涛
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010401740.1A
Publication of CN111599341A
Application granted
Publication of CN111599341B
Current legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/3331: Query processing
    • G06F 16/334: Query execution
    • G06F 16/3343: Query execution using phonetics
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/903: Querying
    • G06F 16/90335: Query processing
    • G06F 16/90344: Query processing by using string matching techniques
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/047: Architecture of speech synthesisers
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 2013/083: Special characters, e.g. punctuation marks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a method and apparatus for generating speech, and relates to the technical field of cloud computing. A specific implementation comprises: obtaining a dialog script for replying to a user's speech, wherein the script contains a tagged target character string; splitting the script into a plurality of sub-scripts based on the position of the target string in the script, the sub-scripts comprising a target sub-script corresponding to the target string and other sub-scripts; searching a set of sub-script recording information for the recording information corresponding to at least one of the sub-scripts; and generating speech for replying to the user's speech based on the found recording information. By splitting the script and looking up recordings for its sub-scripts, the scheme avoids synthesizing the whole script in real time and thus improves the efficiency of voice interaction.

Description

Method and apparatus for generating speech
Technical Field
Embodiments of the present application relate to the field of computer technology, in particular to speech technology, and specifically to a method and apparatus for generating speech.
Background
Text-to-Speech (TTS) technology is increasingly widely applied; in particular, speech synthesis can be used in human-computer interaction scenarios. For example, an electronic device may carry on a call with a user: during the conversation, it acquires the user's speech, converts it into text, applies natural language processing to that text, and then generates a reply voice based on the processing result.
In the related art, as interactive scripts become more complex, that is, as the number of text characters grows, the time consumed by the speech synthesis engine also grows, so a noticeable delay may pass between the user finishing speaking and the machine giving feedback.
Disclosure of Invention
A method, an apparatus, an electronic device, and a storage medium for generating speech are provided.
According to a first aspect, there is provided a method for generating speech, comprising: obtaining a dialog script for replying to a user's speech, wherein the script contains a tagged target character string; splitting the script into a plurality of sub-scripts based on the position of the target string in the script, wherein the sub-scripts comprise a target sub-script corresponding to the target string and other sub-scripts; searching a set of sub-script recording information for the recording information corresponding to at least one of the sub-scripts; and generating speech for replying to the user's speech based on the found recording information.
According to a second aspect, there is provided an apparatus for generating speech, comprising: an acquisition unit configured to obtain a dialog script for replying to a user's speech, wherein the script contains a tagged target character string; a segmentation unit configured to split the script into a plurality of sub-scripts based on the position of the target string in the script, wherein the sub-scripts comprise a target sub-script corresponding to the target string and other sub-scripts; a search unit configured to search a set of sub-script recording information for the recording information corresponding to at least one of the sub-scripts; and a generation unit configured to generate speech for replying to the user's speech based on the found recording information.
According to a third aspect, there is provided an electronic device comprising: one or more processors; and a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method according to any embodiment of the method for generating speech.
According to a fourth aspect, there is provided a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the method according to any embodiment of the method for generating speech.
According to the scheme of the application, recording information corresponding to sub-scripts is found by splitting the script, so the whole script does not need to be synthesized, which improves the efficiency of voice interaction. Moreover, the scheme is not limited to natural sentence breaks such as punctuation marks: splitting is performed around any designated target string, so a span covering several punctuation-separated sentences can form a single split result, which enriches the forms of speech synthesis and further improves generation efficiency. In addition, a specific word can form an independent split result, which helps improve the accuracy of the generated speech.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram to which some embodiments of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for generating speech according to the present application;
FIG. 3 is a schematic diagram of an application scenario of a method for generating speech according to the present application;
FIG. 4 is a flow diagram of yet another embodiment of a method for generating speech according to the present application;
FIG. 5 is a schematic diagram of an embodiment of an apparatus for generating speech according to the present application;
FIG. 6 is a block diagram of an electronic device for implementing a method for generating speech according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of these embodiments to aid understanding, and they are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the present method for generating speech or apparatus for generating speech may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as video applications, live applications, instant messaging tools, mailbox clients, social platform software, and the like, may be installed on the terminal devices 101, 102, and 103.
Here, the terminal devices 101, 102, and 103 may be hardware or software. When they are hardware, they may be various electronic devices having a display screen, including but not limited to smart phones, tablet computers, e-book readers, laptop portable computers, desktop computers, and the like. When they are software, they can be installed in the electronic devices listed above and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No specific limitation is made here.
The server 105 may be a server providing various services, such as a background server providing support for the terminal devices 101, 102, 103. The background server may analyze and otherwise process acquired data such as the dialog script, and feed back a processing result (e.g., speech replying to the user's speech) to the terminal device.
It should be noted that the method for generating speech provided in the embodiments of the present application may be executed by the server 105 or by the terminal devices 101, 102, 103; accordingly, the apparatus for generating speech may be disposed in the server 105 or in the terminal devices 101, 102, 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for generating speech according to the present application is shown. The method for generating speech comprises the following steps:
step 201, obtaining a dialect aiming at the voice of the user, wherein the dialect comprises a marked target character string.
In the present embodiment, an execution subject (for example, a server or a terminal device shown in fig. 1) on which the method for generating voice is executed may acquire a dialog for user voice generation from the present electronic device or other electronic devices. The dialect may be a sequence of characters. In practice, the electronic device described above or other electronic devices may generate words in the following manner: the method comprises the steps of obtaining the voice of a user, converting the voice of the user into characters, and generating reply characters of the characters by utilizing a natural voice processing technology, namely speaking techniques aiming at the voice of the user. The speech converted by the speech technology, i.e., the speech synthesized by the speech technology, is used to reply to the user speech.
In particular, there is a marked local character sequence, i.e. a target string, e.g. the marked target string can be expressed as {% target string% }, or ^ target string ^ or the like. The target string may include at least one character.
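As an illustrative sketch (not part of the patent), the tagged target strings could be located with a regular expression; the {%...%} marker syntax follows the example above, while the function name and the returned span format are assumptions:

```python
import re

# Assumed marker syntax from the example above: {%...%} wraps a target string.
TARGET_PATTERN = re.compile(r"\{%(.*?)%\}")

def find_target_strings(script: str):
    """Return (start, end, text) for every tagged target string in a script."""
    spans = []
    for match in TARGET_PATTERN.finditer(script):
        # match.span() covers the markers themselves; group(1) is the bare target string.
        spans.append((match.start(), match.end(), match.group(1)))
    return spans

print(find_target_strings("Dear {%XX%}, hello"))  # [(5, 11, 'XX')]
```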
Step 202, splitting the script into a plurality of sub-scripts based on the position of the target string in the script, wherein the sub-scripts include the target sub-script corresponding to the target string and other sub-scripts.
In this embodiment, the execution body may split the script into a plurality of sub-scripts. The target string corresponds to the target sub-script, whose characters are exactly the characters of the target string.
In practice, the execution body may split the script in various ways. For example, it may cut the tagged target string and any designated string out of the script, obtaining the target sub-script, a first sub-script corresponding to the designated string, and the sub-scripts other than these.
Step 203, searching the sub-script recording information set for the recording information corresponding to at least one of the plurality of sub-scripts.
In this embodiment, the execution body may search the set for recording information corresponding to at least one of the sub-scripts. The recording information in the set is previously stored recording information of sub-scripts, i.e., sub-script recording information; it may include the recording information of one or more of the sub-scripts.
In practice, the execution body may look up the recording information corresponding to each of the at least one sub-script. The at least one sub-script may be all of the plurality of sub-scripts or only some of them. Recording information may be the recording itself, or information associated with a recording, such as its identifier. A recording is audio converted from the character string contained in a sub-script, i.e., audio synthesized from that string.
Step 204, generating speech for replying to the user's speech based on the found recording information.
In this embodiment, the execution body may generate, based on the found recording information, speech that replies to the user's speech, so as to interact with the user by voice. In practice, it may generate the speech in various ways. For example, if the found recording information is the recording itself, the execution body may merge the recordings directly; otherwise, it may merge the recordings corresponding to each piece of recording information. The merged result is then used as the generated audio. Specifically, the recordings (or the recordings corresponding to the recording information) are merged in the order in which their sub-scripts appear in the script.
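For the merge step, a minimal sketch assuming the recordings are WAV files that share the same audio parameters (sample rate, sample width, channels); the function name and file layout are illustrative, not from the patent:

```python
import wave

def merge_recordings(paths, out_path):
    """Concatenate WAV recordings in the order their sub-scripts appear in the script."""
    with wave.open(out_path, "wb") as out:
        for i, path in enumerate(paths):
            with wave.open(path, "rb") as part:
                if i == 0:
                    out.setparams(part.getparams())  # copy the format of the first clip
                out.writeframes(part.readframes(part.getnframes()))

# e.g. merge_recordings(["dear.wav", "xx.wav", "hello.wav"], "reply.wav")
```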
The method provided by this embodiment finds the recording information corresponding to sub-scripts through splitting, so the whole script does not need to be synthesized, improving the efficiency of voice interaction. The embodiment also escapes the limitation of natural sentence breaks such as punctuation marks: splitting is based on any designated target string, so several punctuation-separated sentences can form one split result, enriching speech synthesis forms and further improving generation efficiency. In addition, a specific word can form an independent split result, which helps improve the accuracy of the generated speech.
In some optional implementations of this embodiment, the script is a sequence of characters, and step 202 may include: for the edge characters of each of at least one target string in the script, taking the position between an edge character and the adjacent other character as a split position, and splitting the script into at least one target sub-script and at least one other sub-script, where the other characters are characters outside the target string.
In these optional implementations, the edge characters are the characters at the two ends of the target string, i.e., its first and last characters. The other characters are adjacent to the edge characters in the script but do not belong to the target string. At least one target string, i.e., one or more, may be tagged in the script; correspondingly, one or more target sub-scripts may be obtained by splitting, and likewise one or more other sub-scripts.
For example, suppose the script is X1X2X3Y1Y2, where X1X2X3 is the target string, X1 and X3 are its edge characters, and Y1 is an adjacent other character. The execution body may take the position between X3 and Y1 as a split position, and the split result may comprise the target sub-script X1X2X3 and the other sub-script Y1Y2.
These implementations split at the edge characters of the target string, so the target sub-script corresponding to the target string and the other sub-scripts outside it are cut out accurately and completely.
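A sketch of the edge-character split under the assumptions above; the (start, end) span representation of target strings and all names are illustrative:

```python
def split_script(script: str, target_spans):
    """Split a script at the edges of each tagged target string.

    target_spans holds (start, end) index pairs of target strings inside
    script (markers already stripped). Returns (sub_script, is_target) pairs.
    """
    pieces, cursor = [], 0
    for start, end in sorted(target_spans):
        if start > cursor:
            # Text up to the target's first edge character forms an "other" sub-script.
            pieces.append((script[cursor:start], False))
        pieces.append((script[start:end], True))  # the target sub-script itself
        cursor = end  # resume right after the last edge character
    if cursor < len(script):
        pieces.append((script[cursor:], False))
    return pieces

# With X1X2X3 as the target string, the split lands between X3 and Y1.
print(split_script("XXXYY", [(0, 3)]))  # [('XXX', True), ('YY', False)]
```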
In some optional implementations of this embodiment, the recording information in the sub-script recording information set includes a message digest value corresponding to each recording. Step 203 may then include: for each sub-script of the at least one sub-script, determining the message digest value of that sub-script, and searching the set for an identical message digest value. Step 204 may include: obtaining the recording corresponding to the found message digest value, and generating speech, containing the obtained recording, for replying to the user's speech.
In these optional implementations, for each of the at least one sub-script, the execution body may compute the sub-script's digest value with a message digest algorithm, search the sub-script recording information set for the same value, obtain the recording corresponding to the found value, and then generate the speech. Various message digest algorithms may be used, such as MD5 (Message-Digest Algorithm 5) or BASE64 encoding. Each digest value in the set has a corresponding recording. The generated speech may include only the found recordings, or other recordings as well.
The correspondence between a digest value and a recording may take various forms. For example, the digest value may be associated with the storage address of the recording, so that the recording can be located through the value. As another example, the correspondence may be stored in a table mapping digest values to recordings (or recording identifiers), so the matching recording can be found from a digest value.
These implementations use message digest values to accurately find the recordings matching the sub-scripts, which can improve the accuracy of the generated speech.
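A minimal sketch of the digest-based lookup using MD5 from Python's standard library; the in-memory index and its sample entry are assumptions of this illustration:

```python
import hashlib

# Assumed cache: MD5 hex digest of a sub-script -> storage address of its recording.
recording_index = {
    "5d41402abc4b2a76b9719d911017c592": "/recordings/fixed/hello.wav",  # key is md5(b"hello")
}

def lookup_recording(sub_script: str):
    """Look up a sub-script's recording via its message digest value."""
    digest = hashlib.md5(sub_script.encode("utf-8")).hexdigest()
    return recording_index.get(digest)  # None signals a cache miss

print(lookup_recording("hello"))  # /recordings/fixed/hello.wav
```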
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for generating speech according to this embodiment. In the application scenario of fig. 3, the execution body 301 obtains a script 302 for the user's speech, such as "Dear XX, hello", where the script contains the tagged target string XX. The execution body 301 splits the script into a plurality of sub-scripts 303 based on the position of the target string in the script: "Dear", "XX", "hello". The execution body 301 then searches the sub-script recording information set for the recording information 304 corresponding to at least one sub-script, and generates speech 305 for replying to the user's speech based on the found recording information 304.
With further reference to FIG. 4, a flow 400 of yet another embodiment of a method for generating speech is shown. The process 400 includes the following steps:
step 401, obtaining a dialect for a user voice, wherein the dialect includes a marked target character string.
In the present embodiment, an execution subject (for example, a server or a terminal device shown in fig. 1) on which the method for generating voice is executed may acquire a dialect for the user voice from the present electronic device or other electronic devices. The dialect may be a sequence of characters. In practice, the electronic device described above or other electronic devices may generate words in the following manner: the method comprises the steps of obtaining the voice of a user, converting the voice of the user into characters, and generating reply characters of the characters by utilizing a natural voice processing technology, namely dialect.
Step 402, splitting the script into a plurality of sub-scripts based on the position of the target string in the script, wherein the sub-scripts include the target sub-script corresponding to the target string and other sub-scripts.
In this embodiment, the execution body may split the script into a plurality of sub-scripts. The target string corresponds to the target sub-script, whose characters are exactly the characters of the target string.
Step 403, searching the sub-script recording information set for the recording information corresponding to at least one of the plurality of sub-scripts.
In this embodiment, the execution body may search the set for recording information corresponding to at least one of the sub-scripts. The recording information in the set is previously stored recording information of sub-scripts, i.e., sub-script recording information; it may include the recording information of one or more of the sub-scripts.
Step 404, synthesizing recordings for sub-scripts of the plurality of sub-scripts other than the at least one sub-script.
In this embodiment, the execution body may synthesize, with a speech synthesis technique, the recordings of the sub-scripts other than the at least one sub-script. In practice, these may be all of the remaining sub-scripts, or only some of them; in the latter case, the execution body may determine the subset according to a preset rule or at random.
Step 405, merging the recordings corresponding to the found recording information with the synthesized recordings to generate speech for replying to the user's speech.
In this embodiment, the execution body may merge the recordings corresponding to the found recording information with the recordings synthesized in step 404, and use the merged result as the speech that replies to the user.
In this embodiment, the speech of one part of the sub-scripts is synthesized while that of the other part is looked up. This improves voice-interaction efficiency, ensures the accuracy of the generated speech through synthesis, and avoids the problem of failing to find recordings matching all of the script's content.
In some optional implementations of this embodiment, the target sub-script is one of a fixed sub-script and a variable sub-script, and the other sub-script is the other of the two. Step 403 may include: searching the sub-script recording information set for the recording information corresponding to the fixed sub-scripts among the plurality of sub-scripts. Step 404 may include: synthesizing recordings of the variable sub-scripts among the plurality of sub-scripts.
In these optional implementations, whether the target string tagged in the script marks a fixed sub-script or a variable sub-script, the execution body searches the set for the recording information of the fixed sub-scripts. The recording information found here may therefore correspond to the target sub-script or to the other sub-scripts.
The obtained script for the user's speech may include two kinds of sub-scripts. One kind is the fixed component of the script, which stays the same across different users and different times, i.e., fixed sub-scripts composed of fixed characters. The other is the variable component, which changes across users or times, i.e., variable sub-scripts composed of variable characters.
For example, the obtained script may be "XXX hello, we are contacting you because we noticed you hold a credit card with our bank ending in yyyy; the current RMB bill is aaaa yuan". This script may be split into 6 sub-scripts, including 3 variable sub-scripts, "XXX", "yyyy" and "aaaa", and 3 fixed sub-scripts: "hello, we are contacting you because we noticed you hold a credit card with our bank ending in", "; the current RMB bill is", and "yuan".
The execution body may synthesize the recordings of the variable sub-scripts with a speech synthesis technique. In practice, if the target sub-script is the fixed one, the synthesized recordings are those of the other sub-scripts among the plurality of sub-scripts; if the other sub-scripts are the fixed ones, the synthesized recording is that of the target sub-script.
These implementations handle the fixed and variable parts of a script separately: only the parts that may change are synthesized, while recordings for the fixed parts are fetched directly. This minimizes the script content that must be synthesized, effectively improving interaction efficiency while preserving the accuracy of the synthesized speech.
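The credit-card script above could be represented as alternating variable and fixed sub-scripts; the wording of the fixed parts follows the (translated) example, and the function itself is an illustrative assumption:

```python
# Three fixed sub-scripts (cached recordings) and three variable slots (synthesized per call).
FIXED = (
    " hello, we are contacting you because we noticed you hold a credit card"
    " with our bank ending in ",
    "; the current RMB bill is ",
    " yuan",
)

def build_sub_scripts(name: str, tail: str, amount: str):
    """Interleave this call's variable values with the fixed sub-scripts (6 in total)."""
    return [
        (name, "variable"), (FIXED[0], "fixed"),
        (tail, "variable"), (FIXED[1], "fixed"),
        (amount, "variable"), (FIXED[2], "fixed"),
    ]

for text, kind in build_sub_scripts("XXX", "yyyy", "aaaa"):
    print(kind, repr(text))
```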
In some optional implementations of this embodiment, step 403 may include: searching the sub-script recording information set for the recording information corresponding to each of the plurality of sub-scripts; and step 404 may include: in response to there being a sub-script among the plurality with no corresponding recording information, synthesizing the recording for that sub-script.
In these optional implementations, the execution body may look up recording information for every sub-script. If there is a sub-script for which no recording information is found, the execution body may synthesize its recording in real time, and then merge the found recordings with the synthesized ones into the speech replying to the user.
These implementations look up the recording information of each sub-script in the set as far as possible, making the most of existing recordings, reducing the amount of synthesis, effectively shortening synthesis time, and improving voice-interaction efficiency.
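Combining lookup and fallback synthesis might look like the following sketch; cache is the digest-to-recording mapping and synthesize stands in for any real-time TTS call (both are assumptions, not APIs from the patent):

```python
import hashlib

def resolve_recordings(sub_scripts, cache, synthesize):
    """Fetch each sub-script's recording from the cache, synthesizing only the misses."""
    recordings = []
    for text in sub_scripts:
        key = hashlib.md5(text.encode("utf-8")).hexdigest()
        recording = cache.get(key)
        if recording is None:             # no recording information found for this sub-script
            recording = synthesize(text)  # synthesize the recording in real time
            cache[key] = recording        # keep it for the next conversation
        recordings.append(recording)
    return recordings                     # merged afterwards, in script order
```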
In some optional implementations of this embodiment, the method may further include: for a synthesized recording, storing the recording in a storage space; determining the message digest value of the sub-script corresponding to the recording as the digest value to be cached; and caching, in the sub-script recording information set in the cache space, that digest value together with the recording's storage address.
In these optional implementations, the execution body may store the synthesized recording, specifically in a pre-designated storage space. It may also determine the message digest value of the sub-script corresponding to the synthesized recording, and cache this value and the recording's storage address together in the cache space so that the two are associated. Specifically, the sub-script recording information set may reside in the cache space, and the execution body caches the pair in that set. The storage space is distinct from the cache space.
The digest value corresponding to a sub-script may be computed from the sub-script alone, or from the sub-script together with specific information, for example the sub-script's speech synthesis configuration information.
After synthesis, these implementations cache the synthesized recording's address together with its digest value, so the digest value, i.e., the recording information, can be obtained quickly and the recording fetched immediately via its address, improving voice-interaction efficiency.
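A sketch of the store-then-cache step; the storage directory, the digest-named files, and the in-memory cache are all assumptions of this illustration:

```python
import hashlib
from pathlib import Path

STORAGE = Path("/data/recordings")  # the storage space (assumed layout)
address_cache = {}                  # the cache space: digest value -> storage address

def store_and_cache(sub_script: str, wav_bytes: bytes) -> str:
    """Persist a synthesized recording, then cache its digest with its storage address."""
    digest = hashlib.md5(sub_script.encode("utf-8")).hexdigest()
    address = STORAGE / f"{digest}.wav"
    address.write_bytes(wav_bytes)        # store the recording in the storage space
    address_cache[digest] = str(address)  # cache digest value + storage address together
    return str(address)
```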
In some optional application scenarios of these implementations, the designated sub-scripts and the plurality of sub-scripts each include fixed sub-scripts and variable sub-scripts. In the sub-script recording information set in the cache space, the digest values and storage addresses corresponding to fixed sub-scripts and those corresponding to variable sub-scripts are cached in different recording-information subsets; and/or they carry different identifiers.
In these optional application scenarios, the execution body or another electronic device may store the information corresponding to fixed sub-scripts and that corresponding to variable sub-scripts separately in the set, specifically in different recording-information subsets. The corresponding information here means the digest value and the storage address. Alternatively, one identifier may be set for the information of fixed sub-scripts and another for that of variable sub-scripts, so the cached entries of the two kinds carry different identifiers.
Storing the information of variable and fixed sub-scripts separately allows the information corresponding to a sub-script to be fetched from the cache faster and more accurately.
In some optional implementations of this embodiment, the method may further include: obtaining the speech synthesis configuration information of the script; and step 404 may include: synthesizing, according to the obtained configuration information, the recordings for the sub-scripts other than the at least one sub-script.
In these optional implementations, the execution body may obtain the script's speech synthesis configuration information (config) and synthesize the recordings according to it. Speech synthesis configuration information is the configuration required for synthesizing speech and may include at least one of the following: (speaker) voice, speaking rate, pitch, and audio format.
These implementations enable accurate synthesis of the recordings based on the obtained configuration information.
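The configuration items named in the text could be grouped as follows; the field names and default values are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SynthesisConfig:
    """Speech synthesis configuration; the fields mirror the items listed above."""
    voice: str = "female_1"    # speaker voice (illustrative value)
    speed: float = 1.0         # speaking-rate multiplier
    pitch: float = 1.0         # pitch multiplier
    audio_format: str = "wav"  # output audio format

cfg = SynthesisConfig(voice="male_2", speed=1.1)  # per-script configuration
```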
In some optional implementations of any of the above embodiments, the method for generating speech may further include: obtaining a sub-script set comprising designated sub-scripts, where sub-scripts identical to ones among the plurality of sub-scripts exist in the set; synthesizing the recording of each designated sub-script in the set and storing it in the storage space; obtaining the message digest value corresponding to each designated sub-script; and, for each designated sub-script, caching its digest value together with the storage address of its synthesized recording in the sub-script recording information set in the cache space.
In these optional implementations, the execution body may obtain a set of designated sub-scripts, some of which are identical to sub-scripts of the obtained script. It may synthesize a recording for each designated sub-script and store it in the storage space, then obtain the digest values, generated locally or by another electronic device, of the designated sub-scripts, and cache each value together with the storage address of the corresponding recording in the cache space.
By storing digest values and recording addresses in the cache in advance, these implementations make it convenient to obtain the recording corresponding to a sub-script accurately.
In some optional application scenarios of these implementations, the message digest value corresponding to any sub-script of a script is determined based on both the sub-script itself and the speech synthesis configuration information of that sub-script.
In these optional application scenarios, for any sub-script of a script, the execution body or another electronic device may determine the corresponding digest value from the sub-script and its speech synthesis configuration information. The configuration information of a sub-script is that of the script it belongs to, so every sub-script of a script shares the same configuration information.
The execution body or another electronic device may combine the two in various ways. For example, it may compute a digest over both and use that as the sub-script's digest value; it may also compute the digest over the two plus other information, such as the identifier of the script the sub-script belongs to.
By incorporating the synthesis configuration, these application scenarios cache more detailed synthesis information and help produce the desired speech.
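A sketch of a digest computed over both inputs; serializing the configuration with sorted keys keeps the value stable, and MD5 matches the example algorithm named earlier (the key format itself is an assumption):

```python
import hashlib
import json

def digest_key(sub_script: str, config: dict) -> str:
    """Digest over the sub-script and its synthesis configuration, so the same
    text under a different voice or speed maps to a different recording."""
    payload = sub_script + "|" + json.dumps(config, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

a = digest_key("hello", {"voice": "female_1", "speed": 1.0})
b = digest_key("hello", {"voice": "male_2", "speed": 1.0})
assert a != b  # different configurations yield different cache keys
```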
In some optional application scenarios of these implementations, in the sub-script recording information set in the cache space, the digest values and storage addresses corresponding to fixed sub-scripts and those corresponding to variable sub-scripts are cached in different recording-information subsets; and/or they carry different identifiers.
With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for generating speech. This apparatus embodiment corresponds to the method embodiment shown in fig. 2 and, besides the features described below, may include the same or corresponding features and effects. The apparatus can be applied to various electronic devices.
As shown in fig. 5, the apparatus 500 for generating speech of this embodiment includes an acquisition unit 501, a segmentation unit 502, a search unit 503, and a generation unit 504. The acquisition unit 501 is configured to obtain a dialog script for the user's speech, wherein the script contains a tagged target character string; the segmentation unit 502 is configured to split the script into a plurality of sub-scripts based on the position of the target string in the script, wherein the sub-scripts include the target sub-script corresponding to the target string and other sub-scripts; the search unit 503 is configured to search the sub-script recording information set for the recording information corresponding to at least one of the sub-scripts; and the generation unit 504 is configured to generate speech for replying to the user's speech based on the found recording information.
In this embodiment, for the specific processing of the acquisition unit 501, the segmentation unit 502, the search unit 503, and the generation unit 504 of the apparatus 500 and the technical effects they bring, reference may be made to the descriptions of steps 201 to 204 in the embodiment corresponding to fig. 2, which are not repeated here.
In some optional implementations of this embodiment, the apparatus further includes a synthesis unit configured to synthesize recordings for the sub-scripts of the plurality other than the at least one sub-script; and the generation unit is further configured to generate the speech replying to the user as follows: merging the recordings corresponding to the found recording information with the synthesized recordings.
In some optional implementations of this embodiment, the search unit is further configured to search the sub-script recording information set for the recording information corresponding to each of the plurality of sub-scripts; and the synthesis unit is further configured to, in response to there being a sub-script with no corresponding recording information, synthesize the recording for that sub-script.
In some optional implementations of this embodiment, the target sub-script is one of a fixed sub-script and a variable sub-script, and the other sub-script is the other of the two; the search unit is further configured to search the set for the recording information corresponding to the fixed sub-scripts among the plurality of sub-scripts; and the synthesis unit is further configured to synthesize recordings of the variable sub-scripts.
In some optional implementations of this embodiment, the recording information in the set includes message digest values corresponding to recordings; the search unit is further configured to determine, for each of the at least one sub-script, that sub-script's digest value and search the set for an identical value; and the generation unit is further configured to obtain the recording corresponding to the found digest value and generate speech, containing the obtained recording, for replying to the user's speech.
In some optional implementations of this embodiment, the apparatus further includes: a set acquisition unit configured to obtain a sub-script set comprising designated sub-scripts, where sub-scripts identical to ones among the plurality of sub-scripts exist in the set; a first storage unit configured to synthesize the recordings of the designated sub-scripts and store them in a storage space; a value acquisition unit configured to obtain the digest value corresponding to each designated sub-script; and a first cache unit configured to cache, for each designated sub-script, its digest value together with the storage address of its synthesized recording in the sub-script recording information set in the cache space.
In some optional implementations of this embodiment, the apparatus further includes: a second storage unit configured to store a synthesized recording in the storage space; a value determination unit configured to determine the digest value of the sub-script corresponding to the recording as the value to be cached; and a second cache unit configured to cache that value together with the recording's storage address in the sub-script recording information set in the cache space.
In some optional implementations of this embodiment, the designated sub-scripts and the plurality of sub-scripts each include fixed and variable sub-scripts; in the sub-script recording information set in the cache space, the digest values and storage addresses of fixed sub-scripts and those of variable sub-scripts are cached in different recording-information subsets, and/or carry different identifiers.
In some optional implementations of this embodiment, the digest value corresponding to any sub-script of a script is determined based on both the sub-script and its speech synthesis configuration information.
In some optional implementations of this embodiment, the apparatus further includes a configuration acquisition unit configured to obtain the script's speech synthesis configuration information; and the synthesis unit is further configured to synthesize, according to the obtained configuration information, the recordings for the sub-scripts other than the at least one sub-script.
In some optional implementations of this embodiment, the segmentation unit is further configured to split the script as follows: for the edge characters of each of at least one target string in the script, taking the position between an edge character and the adjacent other character as a split position, and splitting the script into at least one target sub-script and at least one other sub-script, where the other characters are characters outside the target string.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 6 is a block diagram of an electronic device for a method for generating speech according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant as examples only and are not meant to limit the implementations of the present application described and/or claimed herein.
As shown in fig. 6, the electronic device includes: one or more processors 601, a memory 602, and interfaces for connecting the components, including a high-speed interface and a low-speed interface. The components are interconnected by different buses and may be mounted on a common motherboard or in other ways as required. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to an interface). In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple electronic devices may be connected, each providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 6 takes one processor 601 as an example.
The memory 602 is a non-transitory computer-readable storage medium provided herein. The memory stores instructions executable by at least one processor, causing the at least one processor to perform the method for generating speech provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the method for generating speech provided herein.
As a non-transitory computer-readable storage medium, the memory 602 may store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for generating speech in the embodiments of the present application (for example, the acquisition unit 501, the segmentation unit 502, the search unit 503, and the generation unit 504 shown in fig. 5). By running the non-transitory software programs, instructions, and modules stored in the memory 602, the processor 601 executes the various functional applications and data processing of the server, i.e., implements the method for generating speech in the above method embodiments.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic apparatus for generating voice, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 602 optionally includes memory located remotely from the processor 601, which may be connected to an electronic device for generating speech over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the method for generating speech may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or other means, and fig. 6 illustrates the connection by a bus as an example.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function controls of the electronic apparatus for generating speech, such as an input device like a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer, one or more mouse buttons, a track ball, a joystick, etc. The output devices 604 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special- or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor including an acquisition unit, a segmentation unit, a searching unit, and a generating unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the acquisition unit may also be described as a "unit for acquiring a dialect for a user voice".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: obtain a dialect for a user voice, wherein the dialect comprises a tagged target character string; segment the dialect into a plurality of sub-dialects based on the position of the target character string in the dialect, wherein the plurality of sub-dialects comprise a target sub-dialect corresponding to the target character string and other sub-dialects; search, in a sub-dialect recording information set, for recording information corresponding to at least one sub-dialect of the plurality of sub-dialects; and generate, based on the found recording information, a voice for replying to the user voice.
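Read together, the steps above form a segment → look-up → synthesize → splice pipeline. The following minimal Python sketch illustrates one way such a pipeline could look; the {...} tagging convention, the use of MD5 as the message digest algorithm, and all function and variable names here are illustrative assumptions rather than the patented implementation.

```python
import hashlib
import re

# Hypothetical in-memory stand-ins for the cache space and storage space.
recording_info_set: dict[str, str] = {}   # digest value -> storage address
recording_storage: dict[str, bytes] = {}  # storage address -> audio bytes

def segment_dialect(dialect: str) -> list[str]:
    """Split the dialect into sub-dialects at the edges of tagged target
    strings; {...} is an assumed tagging convention, e.g.
    "Hello {name}, your order {order_id} has shipped."."""
    parts = re.split(r"(\{[^{}]*\})", dialect)
    return [p for p in parts if p]

def lookup_recording(sub_dialect: str) -> bytes | None:
    """Look up a cached recording via the sub-dialect's digest value."""
    digest = hashlib.md5(sub_dialect.encode("utf-8")).hexdigest()
    address = recording_info_set.get(digest)
    return recording_storage.get(address) if address is not None else None

def synthesize_recording(sub_dialect: str) -> bytes:
    """Stand-in for a real TTS engine call."""
    return f"<audio:{sub_dialect}>".encode("utf-8")

def generate_reply(dialect: str) -> bytes:
    """Segment, reuse cached recordings where found, synthesize the rest,
    then splice everything in order into one reply voice."""
    chunks = []
    for sub in segment_dialect(dialect):
        audio = lookup_recording(sub) or synthesize_recording(sub)
        chunks.append(audio)
    return b"".join(chunks)  # byte concatenation stands in for audio splicing
```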
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (24)

1. A method for generating speech, the method comprising:
obtaining a dialect for a user voice, wherein the dialect comprises a tagged target character string;
segmenting the dialect into a plurality of sub-dialects based on the position of the target character string in the dialect, wherein the plurality of sub-dialects comprise a target sub-dialect corresponding to the target character string and other sub-dialects;
searching, in a sub-dialect recording information set, for recording information corresponding to at least one sub-dialect of the plurality of sub-dialects;
and generating, based on the found recording information, a voice for replying to the user voice.
2. The method of claim 1, wherein the method further comprises:
synthesizing a recording for the sub-dialects other than the at least one sub-dialect among the plurality of sub-dialects; and
the generating, based on the found recording information, a voice for replying to the user voice comprises:
combining the recording corresponding to the found recording information with the synthesized recording to generate the voice for replying to the user voice.
3. The method of claim 2, wherein the searching, in the sub-dialect recording information set, for recording information corresponding to at least one sub-dialect of the plurality of sub-dialects comprises:
searching, in the sub-dialect recording information set, for recording information corresponding to each of the plurality of sub-dialects; and
the synthesizing a recording for the sub-dialects other than the at least one sub-dialect among the plurality of sub-dialects comprises:
in response to a sub-dialect of the plurality of sub-dialects having no corresponding recording information, synthesizing a recording for that sub-dialect.
4. The method of claim 2, wherein the target sub-dialect is one of a fixed sub-dialect and a variable sub-dialect, and the other sub-dialect is the other of the fixed sub-dialect and the variable sub-dialect;
the searching, in the sub-dialect recording information set, for recording information corresponding to at least one sub-dialect of the plurality of sub-dialects comprises:
searching, in the sub-dialect recording information set, for recording information corresponding to the fixed sub-dialects among the plurality of sub-dialects; and
the synthesizing a recording for the sub-dialects other than the at least one sub-dialect among the plurality of sub-dialects comprises:
synthesizing a recording for the variable sub-dialects among the plurality of sub-dialects.
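As an illustration of the fixed/variable split in claim 4, the sketch below looks up cached recordings only for fixed sub-dialects and always synthesizes the variable ones (names, amounts, dates); the is_variable flag and all names are hypothetical, not part of the claimed apparatus.

```python
from typing import Optional

fixed_recordings: dict[str, bytes] = {}  # hypothetical cache keyed by fixed text

def synthesize(text: str) -> bytes:
    """Stand-in for a real TTS engine call."""
    return f"<audio:{text}>".encode("utf-8")

def recording_for(sub_dialect: str, is_variable: bool) -> bytes:
    """Reuse a cached recording for fixed text; always synthesize variable text."""
    if is_variable:
        return synthesize(sub_dialect)
    cached: Optional[bytes] = fixed_recordings.get(sub_dialect)
    return cached if cached is not None else synthesize(sub_dialect)
```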
5. The method of any of claims 1-4, wherein the recording information in the sub-dialect recording information set includes a message digest algorithm value corresponding to the recording;
the searching, in the sub-dialect recording information set, for recording information corresponding to at least one sub-dialect of the plurality of sub-dialects comprises:
for each sub-dialect in the at least one sub-dialect, determining a message digest algorithm value corresponding to the sub-dialect;
searching, in the sub-dialect recording information set, for a message digest algorithm value identical to the message digest algorithm value corresponding to the sub-dialect; and
the generating, based on the found recording information, a voice for replying to the user voice comprises:
acquiring the recording corresponding to the found message digest algorithm value, and generating a voice that includes the acquired recording and is used for replying to the user voice.
6. The method of claim 1, wherein the method further comprises:
obtaining a sub-dialect set comprising specified sub-dialects, wherein the sub-dialect set contains sub-dialects identical to ones among the plurality of sub-dialects;
synthesizing a recording for each specified sub-dialect in the sub-dialect set and storing the recording in a storage space;
acquiring a message digest algorithm value corresponding to each specified sub-dialect; and
for each specified sub-dialect, caching, in the sub-dialect recording information set in a cache space, the message digest algorithm value corresponding to the specified sub-dialect in correspondence with the storage address, in the storage space, of the recording synthesized for the specified sub-dialect.
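Claim 6 describes an offline warm-up: specified sub-dialects are synthesized once, stored, and indexed by digest so later requests can reuse them. A minimal sketch, assuming in-memory stand-ins for the storage and cache spaces, MD5 as the digest algorithm, and a hypothetical addressing scheme:

```python
import hashlib

recording_storage: dict[str, bytes] = {}  # storage space: address -> audio
recording_info_set: dict[str, str] = {}   # cache space: digest -> storage address

def synthesize(text: str) -> bytes:
    """Stand-in for a real TTS engine call."""
    return f"<audio:{text}>".encode("utf-8")

def warm_up(specified_sub_dialects: list[str]) -> None:
    """Synthesize each specified sub-dialect once, store the audio, and
    cache its digest against the storage address for later reuse."""
    for sub in specified_sub_dialects:
        audio = synthesize(sub)
        address = f"store/{len(recording_storage)}.pcm"  # hypothetical addressing
        recording_storage[address] = audio
        digest = hashlib.md5(sub.encode("utf-8")).hexdigest()
        recording_info_set[digest] = address
```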
7. The method of claim 2, wherein the method further comprises:
storing the synthesized recording in a storage space;
determining the message digest algorithm value corresponding to the sub-dialect to which the recording corresponds as a message digest algorithm value to be cached; and
caching, in the sub-dialect recording information set in the cache space, the message digest algorithm value to be cached in correspondence with the storage address of the recording in the storage space.
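Claim 7 is the online counterpart of that warm-up: when a request forces a fresh synthesis, the new recording is persisted and its digest-to-address mapping is cached so the next identical sub-dialect hits the cache. A sketch under the same assumptions as above:

```python
import hashlib

recording_storage: dict[str, bytes] = {}  # storage space: address -> audio
recording_info_set: dict[str, str] = {}   # cache space: digest -> storage address

def cache_new_recording(sub_dialect: str, audio: bytes) -> None:
    """After a cache miss forced a fresh synthesis, persist the audio and
    cache its digest -> address mapping for reuse on the next request."""
    address = f"store/{len(recording_storage)}.pcm"  # hypothetical addressing
    recording_storage[address] = audio
    digest = hashlib.md5(sub_dialect.encode("utf-8")).hexdigest()
    recording_info_set[digest] = address
```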
8. The method of claim 6 or 7, wherein the specified sub-dialects and the plurality of sub-dialects each include fixed sub-dialects and variable sub-dialects;
in the sub-dialect recording information set in the cache space, the message digest algorithm values and storage addresses corresponding to the fixed sub-dialects and those corresponding to the variable sub-dialects are cached in different recording information subsets; and/or, in the sub-dialect recording information set in the cache space, the message digest algorithm values and storage addresses corresponding to the fixed sub-dialects and those corresponding to the variable sub-dialects are provided with different identifiers.
9. The method of claim 6 or 7, wherein the message digest algorithm value corresponding to any sub-dialect of the dialect is determined based on both: that sub-dialect, and speech synthesis configuration information for that sub-dialect.
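The point of claim 9 is that the digest keys the cache on both the text and its synthesis configuration, so the same sentence rendered with a different voice or speed does not collide with an existing entry. A hedged sketch (MD5 and the configuration serialization are assumptions):

```python
import hashlib

def digest_value(sub_dialect: str, synthesis_config: dict[str, str]) -> str:
    """Derive the digest from the text AND its synthesis configuration, so the
    same text with a different voice/speed gets a separate cache entry."""
    config_key = "|".join(f"{k}={v}" for k, v in sorted(synthesis_config.items()))
    return hashlib.md5(f"{sub_dialect}#{config_key}".encode("utf-8")).hexdigest()

# These two calls produce different digests, hence different cached recordings:
# digest_value("Your order has shipped.", {"voice": "female_1", "speed": "1.0"})
# digest_value("Your order has shipped.", {"voice": "male_2", "speed": "1.2"})
```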
10. The method of claim 2, wherein the method further comprises:
acquiring speech synthesis configuration information of the dialect; and
the synthesizing a recording for the sub-dialects other than the at least one sub-dialect among the plurality of sub-dialects comprises:
synthesizing, according to the acquired speech synthesis configuration information, a recording for the sub-dialects other than the at least one sub-dialect among the plurality of sub-dialects.
11. The method of claim 1, wherein the segmenting the dialect into a plurality of sub-dialects based on the position of the target character string in the dialect comprises:
for an edge character of each of at least one target character string in the dialect, taking the position between the edge character and an adjacent other character as a segmentation position, and segmenting the dialect into at least one target sub-dialect and at least one other sub-dialect, wherein the other characters are characters outside the target character strings.
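Claim 11's segmentation can be pictured as cutting the dialect at both edges of every occurrence of each target character string. A minimal Python sketch with hypothetical names; the example strings in the trailing comment are illustrative only:

```python
def segment_at_edges(dialect: str, target_strings: list[str]) -> list[str]:
    """Cut the dialect at the positions between each target string's edge
    characters and the adjacent characters outside the target string."""
    cuts = {0, len(dialect)}
    for target in target_strings:
        start = dialect.find(target)
        while start != -1:
            cuts.update((start, start + len(target)))  # both edge positions
            start = dialect.find(target, start + len(target))
    ordered = sorted(cuts)
    return [dialect[a:b] for a, b in zip(ordered, ordered[1:]) if dialect[a:b]]

# segment_at_edges("Hello NAME, your order ORDER_ID has shipped.",
#                  ["NAME", "ORDER_ID"])
# -> ["Hello ", "NAME", ", your order ", "ORDER_ID", " has shipped."]
```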
12. An apparatus for generating speech, the apparatus comprising:
an acquisition unit configured to acquire a dialect for a user voice, wherein the dialect includes a tagged target character string;
a segmentation unit configured to segment the dialect into a plurality of sub-dialects based on a position of the target character string in the dialect, wherein the plurality of sub-dialects include a target sub-dialect corresponding to the target character string and other sub-dialects;
a searching unit configured to search, in a sub-dialect recording information set, for recording information corresponding to at least one sub-dialect of the plurality of sub-dialects;
and a generating unit configured to generate, based on the found recording information, a voice for replying to the user voice.
13. The apparatus of claim 12, wherein the apparatus further comprises:
a synthesizing unit configured to synthesize a recording for the sub-dialects other than the at least one sub-dialect among the plurality of sub-dialects; and
the generating unit is further configured to generate, based on the found recording information, the voice for replying to the user voice by:
combining the recording corresponding to the found recording information with the synthesized recording to generate the voice for replying to the user voice.
14. The apparatus of claim 13, wherein the searching unit is further configured to search, in the sub-dialect recording information set, for recording information corresponding to at least one sub-dialect of the plurality of sub-dialects by:
searching, in the sub-dialect recording information set, for recording information corresponding to each of the plurality of sub-dialects; and
the synthesizing unit is further configured to synthesize a recording for the sub-dialects other than the at least one sub-dialect among the plurality of sub-dialects by:
in response to a sub-dialect of the plurality of sub-dialects having no corresponding recording information, synthesizing a recording for that sub-dialect.
15. The apparatus of claim 13, wherein the target sub-dialect is one of a fixed sub-dialect and a variable sub-dialect, the other sub-dialect being the other of the fixed sub-dialect and the variable sub-dialect;
the searching unit is further configured to search, in the sub-dialect recording information set, for recording information corresponding to at least one sub-dialect of the plurality of sub-dialects by:
searching, in the sub-dialect recording information set, for recording information corresponding to the fixed sub-dialects among the plurality of sub-dialects; and
the synthesizing unit is further configured to synthesize a recording for the sub-dialects other than the at least one sub-dialect among the plurality of sub-dialects by:
synthesizing a recording for the variable sub-dialects among the plurality of sub-dialects.
16. The apparatus of one of claims 12-15, wherein the recording information in the sub-dialect recording information set includes a message digest algorithm value corresponding to the recording;
the searching unit is further configured to search, in the sub-dialect recording information set, for recording information corresponding to at least one sub-dialect of the plurality of sub-dialects by:
for each sub-dialect in the at least one sub-dialect, determining a message digest algorithm value corresponding to the sub-dialect;
searching, in the sub-dialect recording information set, for a message digest algorithm value identical to the message digest algorithm value corresponding to the sub-dialect; and
the generating unit is further configured to generate, based on the found recording information, the voice for replying to the user voice by:
acquiring the recording corresponding to the found message digest algorithm value, and generating a voice that includes the acquired recording and is used for replying to the user voice.
17. The apparatus of claim 12, wherein the apparatus further comprises:
a set acquisition unit configured to acquire a sub-dialect set comprising specified sub-dialects, wherein the sub-dialect set contains sub-dialects identical to ones among the plurality of sub-dialects;
a first storage unit configured to synthesize a recording for each specified sub-dialect in the sub-dialect set and store the recording in a storage space;
a value obtaining unit configured to obtain a message digest algorithm value corresponding to each specified sub-dialect;
and a first cache unit configured to, for each specified sub-dialect, cache, in the sub-dialect recording information set in a cache space, the message digest algorithm value corresponding to the specified sub-dialect in correspondence with the storage address, in the storage space, of the recording synthesized for the specified sub-dialect.
18. The apparatus of claim 13, wherein the apparatus further comprises:
a second storage unit configured to store the synthesized recording in a storage space;
a value determining unit configured to determine the message digest algorithm value corresponding to the sub-dialect to which the recording corresponds as a message digest algorithm value to be cached;
and a second cache unit configured to cache, in the sub-dialect recording information set in the cache space, the message digest algorithm value to be cached in correspondence with the storage address of the recording in the storage space.
19. The apparatus of claim 17 or 18, wherein the specified sub-dialects and the plurality of sub-dialects each include fixed sub-dialects and variable sub-dialects;
in the sub-dialect recording information set in the cache space, the message digest algorithm values and storage addresses corresponding to the fixed sub-dialects and those corresponding to the variable sub-dialects are cached in different recording information subsets; and/or, in the sub-dialect recording information set in the cache space, the message digest algorithm values and storage addresses corresponding to the fixed sub-dialects and those corresponding to the variable sub-dialects are provided with different identifiers.
20. The apparatus of claim 17 or 18, wherein the message digest algorithm value corresponding to any sub-dialect of the dialect is determined based on both: that sub-dialect, and speech synthesis configuration information for that sub-dialect.
21. The apparatus of claim 13, wherein the apparatus further comprises:
a configuration acquisition unit configured to acquire speech synthesis configuration information of the dialect; and
the synthesizing unit is further configured to synthesize a recording for the sub-dialects other than the at least one sub-dialect among the plurality of sub-dialects by:
synthesizing, according to the acquired speech synthesis configuration information, a recording for the sub-dialects other than the at least one sub-dialect among the plurality of sub-dialects.
22. The apparatus of claim 12, wherein the segmentation unit is further configured to segment the dialect into a plurality of sub-dialects based on the position of the target character string in the dialect by:
for an edge character of each of at least one target character string in the dialect, taking the position between the edge character and an adjacent other character as a segmentation position, and segmenting the dialect into at least one target sub-dialect and at least one other sub-dialect, wherein the other characters are characters outside the target character strings.
23. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-11.
24. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1-11.
CN202010401740.1A 2020-05-13 2020-05-13 Method and device for generating voice Active CN111599341B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010401740.1A CN111599341B (en) 2020-05-13 2020-05-13 Method and device for generating voice


Publications (2)

Publication Number Publication Date
CN111599341A true CN111599341A (en) 2020-08-28
CN111599341B CN111599341B (en) 2023-06-20

Family

ID=72185339

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010401740.1A Active CN111599341B (en) 2020-05-13 2020-05-13 Method and device for generating voice

Country Status (1)

Country Link
CN (1) CN111599341B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006020074A (en) * 2004-07-01 2006-01-19 Takeaki Mori Mobile communication service system
CN1945691A (en) * 2006-10-16 2007-04-11 安徽中科大讯飞信息科技有限公司 Method for improving template sentence synthetic effect in voice synthetic system
CN101261831A (en) * 2007-03-05 2008-09-10 凌阳科技股份有限公司 A phonetic symbol decomposition and its synthesis method
JP2012042974A (en) * 2011-10-26 2012-03-01 Hitachi Ltd Voice synthesizer
CN105336329A (en) * 2015-09-25 2016-02-17 联想(北京)有限公司 Speech processing method and system
CN106899859A (en) * 2015-12-18 2017-06-27 北京奇虎科技有限公司 A kind of playing method and device of multi-medium data
US20190079918A1 (en) * 2016-03-16 2019-03-14 Sony Mobile Communications Inc. Controlling playback of speech-containing audio data
CN110534088A (en) * 2019-09-25 2019-12-03 招商局金融科技有限公司 Phoneme synthesizing method, electronic device and storage medium
CN110782869A (en) * 2019-10-30 2020-02-11 标贝(北京)科技有限公司 Speech synthesis method, apparatus, system and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SUN XUMIN;: "Improved algorithm identifying spam message of pseudo base station based on digital signature" *
王鑫: "智能语音会议纪要系统的设计与实现" *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735372A (en) * 2020-12-29 2021-04-30 竹间智能科技(上海)有限公司 Outbound voice output method, device and equipment

Also Published As

Publication number Publication date
CN111599341B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
US9697829B1 (en) Evaluating pronouns in context
US11769480B2 (en) Method and apparatus for training model, method and apparatus for synthesizing speech, device and storage medium
US8782536B2 (en) Image-based instant messaging system for providing expressions of emotions
US10558701B2 (en) Method and system to recommend images in a social application
CN107507615A (en) Interface intelligent interaction control method, device, system and storage medium
CN112365880A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
US11151321B2 (en) Anaphora resolution
US9978396B2 (en) Graphical display of phone conversations
US20160328124A1 (en) Virtual cultural attache
JP7355776B2 (en) Speech recognition method, speech recognition device, electronic device, computer readable storage medium and computer program
WO2022105188A1 (en) Speech interaction method and apparatus, device, and computer storage medium
US10762906B2 (en) Automatically identifying speakers in real-time through media processing with dialog understanding supported by AI techniques
CN111883127A (en) Method and apparatus for processing speech
CN112000781A (en) Information processing method and device in user conversation, electronic equipment and storage medium
CN110245334B (en) Method and device for outputting information
CN111599341A (en) Method and apparatus for generating speech
CN112988100A (en) Video playing method and device
CN112669855A (en) Voice processing method and device
WO2023169193A1 (en) Method and device for generating smart contract
US11966562B2 (en) Generating natural languages interface from graphic user interfaces
CN112652311B (en) Chinese and English mixed speech recognition method and device, electronic equipment and storage medium
JP7331044B2 (en) Information processing method, device, system, electronic device, storage medium and computer program
CN112579868A (en) Multi-modal graph recognition searching method, device, equipment and storage medium
CN113113017B (en) Audio processing method and device
CN112153461B (en) Method and device for positioning sound production object, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant