CN111599341B - Method and device for generating voice - Google Patents

Method and device for generating voice

Info

Publication number
CN111599341B
Authority
CN
China
Prior art keywords
sub
phone
voice
information
recording information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010401740.1A
Other languages
Chinese (zh)
Other versions
CN111599341A (en)
Inventor
官山山
刘晓丰
唐涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010401740.1A
Publication of CN111599341A
Application granted
Publication of CN111599341B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06F: ELECTRIC DIGITAL DATA PROCESSING
          • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
            • G06F 16/30: Information retrieval of unstructured textual data
              • G06F 16/33: Querying
                • G06F 16/3331: Query processing
                  • G06F 16/334: Query execution
                    • G06F 16/3343: Query execution using phonetics
            • G06F 16/90: Details of database functions independent of the retrieved data types
              • G06F 16/903: Querying
                • G06F 16/90335: Query processing
                  • G06F 16/90344: Query processing by using string matching techniques
      • G10: MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
          • G10L 13/00: Speech synthesis; Text to speech systems
            • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
              • G10L 13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
                • G10L 13/047: Architecture of speech synthesisers
            • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
              • G10L 2013/083: Special characters, e.g. punctuation marks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
          • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Telephone Function (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a method and device for generating speech, relating to the technical field of cloud computing. A specific embodiment comprises the following steps: acquiring a script for replying to a user's speech, wherein the script contains a marked target character string; segmenting the script into a plurality of sub-scripts based on the position of the target string in the script, wherein the plurality of sub-scripts comprises a target sub-script corresponding to the target string and other sub-scripts; looking up, in a set of sub-script recording information, the recording information corresponding to at least one of the sub-scripts; and generating, based on the found recording information, speech for replying to the user's speech. By segmenting the script and looking up recordings for its sub-scripts, this scheme avoids synthesizing the whole script in real time and thereby improves the efficiency of voice interaction.

Description

Method and device for generating voice
Technical Field
The embodiments of the present application relate to the field of computer technology, in particular to speech technology, and more particularly to a method and apparatus for generating speech.
Background
Text To Speech (TTS) technology is being applied ever more widely; in particular, speech synthesis can be used in human-computer interaction. For example, an electronic device may hold a call with a user: during the call, the device acquires the user's speech, converts it into text, performs natural language processing on the text, and then generates a reply voice based on the processing result.
In the related art, as the interaction language grows more complex, i.e. as the number of text characters increases, the time consumed by the speech synthesis engine also increases. As a result, after the user finishes speaking, the machine may need a noticeable period of time before giving feedback.
Disclosure of Invention
Provided are a method, an apparatus, an electronic device and a storage medium for generating speech.
According to a first aspect, there is provided a method for generating speech, comprising: acquiring a script for replying to a user's speech, wherein the script contains a marked target character string; segmenting the script into a plurality of sub-scripts based on the position of the target string in the script, wherein the plurality of sub-scripts comprises a target sub-script corresponding to the target string and other sub-scripts; looking up, in a set of sub-script recording information, the recording information corresponding to at least one of the sub-scripts; and generating, based on the found recording information, speech for replying to the user's speech.
According to a second aspect, there is provided an apparatus for generating speech, comprising: an acquisition unit configured to acquire a script for replying to a user's speech, wherein the script contains a marked target character string; a segmentation unit configured to segment the script into a plurality of sub-scripts based on the position of the target string in the script, wherein the plurality of sub-scripts comprises a target sub-script corresponding to the target string and other sub-scripts; a search unit configured to look up, in a set of sub-script recording information, the recording information corresponding to at least one of the sub-scripts; and a generation unit configured to generate, based on the found recording information, speech for replying to the user's speech.
According to a third aspect, there is provided an electronic device comprising: one or more processors; and a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method of any embodiment of the method for generating speech.
According to a fourth aspect, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any embodiment of the method for generating speech.
According to this scheme, recordings are looked up for the sub-scripts obtained by segmentation, so the whole script need not be synthesized in real time, which improves the efficiency of voice interaction. In addition, the scheme is not restricted to natural sentence boundaries such as punctuation marks: segmentation is based on arbitrarily designated target strings, so a span crossing several punctuated sentences can serve as a single segmentation result, which enriches the forms of speech synthesis and further improves the efficiency of generating speech. Moreover, a specific word can be taken as an independent segmentation result, which helps improve the accuracy of the generated speech.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading the following detailed description of non-limiting embodiments, made with reference to the accompanying drawings, in which:
FIG. 1 is an exemplary system architecture diagram in which some embodiments of the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a method for generating speech according to the present application;
FIG. 3 is a schematic illustration of one application scenario of a method for generating speech according to the present application;
FIG. 4 is a flow chart of yet another embodiment of a method for generating speech according to the present application;
FIG. 5 is a schematic structural diagram of one embodiment of an apparatus for generating speech according to the present application;
fig. 6 is a block diagram of an electronic device for implementing a method for generating speech according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the methods for generating speech or the apparatus for generating speech of the present application may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as video-type applications, live applications, instant messaging tools, mailbox clients, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices with display screens, including but not limited to smartphones, tablets, electronic book readers, laptop and desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., for providing distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
The server 105 may be a server providing various services, for example a background server providing support for the terminal devices 101, 102, 103. The background server may analyze and process acquired data such as the script, and feed the processing result (for example, the speech replying to the user's speech) back to the terminal device.
It should be noted that the method for generating speech provided in the embodiments of the present application may be performed by the server 105 or by the terminal devices 101, 102, 103; accordingly, the apparatus for generating speech may be arranged in the server 105 or in the terminal devices 101, 102, 103.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a method for generating speech according to the present application is shown. The method for generating speech comprises the following steps:
Step 201, acquiring a script for replying to a user's speech, wherein the script contains a marked target character string.
In this embodiment, the executing entity on which the method for generating speech runs (e.g., the server or a terminal device shown in Fig. 1) may acquire, from this electronic device or another electronic device, a script generated for the user's speech. The script may be a character sequence. In practice, the above electronic device or another electronic device may generate the script as follows: acquire the user's speech, convert the user's speech into text, and use natural language processing to generate a reply text for that text; this reply text is the script for the user's speech. The speech converted from the script, i.e., the speech synthesized from the script, is used to reply to the user's speech.
Specifically, the script contains a marked local character sequence, namely the target character string; for example, a marked target string may be expressed in a form such as {%target string}. The target string may include at least one character.
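As an illustration of the marking described above, the following sketch (in Python) recognizes marked target strings with a simple pattern match. It is a minimal sketch assuming the {%...} marker form shown above; the actual marker syntax and the helper names are not prescribed by this application.

    import re

    # Assumed marker form "{%...}"; a real deployment's syntax may differ.
    MARK_PATTERN = re.compile(r"\{%(.*?)\}")

    def extract_target_strings(script: str) -> list[str]:
        """Return the marked target character strings found in a script."""
        return MARK_PATTERN.findall(script)

    print(extract_target_strings("Dear {%Zhang San}, hello"))  # ['Zhang San']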
Step 202, segmenting the script into a plurality of sub-scripts based on the position of the target string in the script, wherein the plurality of sub-scripts comprises a target sub-script corresponding to the target string and other sub-scripts.
In this embodiment, the executing entity may segment the script into a plurality of sub-scripts. The target string corresponds to the target sub-script, and the characters of the target sub-script are exactly the characters of the target string.
In practice, the executing entity may segment the script in various ways. For example, it may cut the marked target string and a designated character string out of the script, obtaining the target sub-script and a first sub-script corresponding to the designated string, together with the sub-scripts other than the target sub-script and the first sub-script.
Step 203, looking up, in a set of sub-script recording information, the recording information corresponding to at least one of the plurality of sub-scripts.
In this embodiment, the executing entity may look up, in the set of sub-script recording information, the recording information corresponding to at least one of the plurality of sub-scripts. The recording information in the set is pre-stored recording information of sub-scripts, and may include the recording information of one or more sub-scripts of the script.
In practice, the executing entity may look up the recording information corresponding to each of the at least one sub-script, where the at least one sub-script may be all or part of the plurality of sub-scripts. The recording information here may be the recording itself, or information related to the recording, such as an identifier of the recording. The recording is the audio converted from the character string of the sub-script, i.e., the audio synthesized from that character string.
Step 204, generating, based on the found recording information, speech for replying to the user's speech.
In this embodiment, the executing entity may generate, based on the found recording information, the speech used to reply to the user's speech for voice interaction with the user. In practice, the executing entity may generate the speech in various ways. For example, when the found recording information is the recordings themselves, it may combine them directly; otherwise it may combine the recordings corresponding to the found recording information. The executing entity may then take the combined result as the generated speech. Specifically, the recordings are combined according to the order in which their corresponding sub-scripts appear in the script.
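The combining step itself can be as simple as concatenation in script order. The following is a minimal sketch assuming each recording is already available as raw audio bytes; handling of audio container formats and the recording_of mapping are illustrative assumptions.

    def merge_recordings(sub_scripts: list[str],
                         recording_of: dict[str, bytes]) -> bytes:
        """Concatenate recordings in the order their sub-scripts appear in the script."""
        return b"".join(recording_of[s] for s in sub_scripts)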
According to the method provided by this embodiment, recordings are looked up for the sub-scripts obtained by segmentation, so the whole script need not be synthesized in real time, which improves the efficiency of voice interaction. In addition, this embodiment is not restricted to natural sentence boundaries such as punctuation marks: segmentation is based on an arbitrarily designated target string, so a span crossing several punctuated sentences can serve as a single segmentation result, which enriches the forms of speech synthesis and further improves the efficiency of generating speech. Moreover, this embodiment can take a specific word as an independent segmentation result, which helps improve the accuracy of the generated speech.
In some optional implementations of this embodiment, the script is a character sequence, and step 202 may include: for the edge characters of each of at least one target string in the script, taking the position between an edge character and its adjacent other character as a segmentation position, and segmenting the script into at least one target sub-script and at least one other sub-script, where the other characters are the characters outside the target strings.
In these optional implementations, the edge characters are the characters at the two ends of a target string, i.e., its first character and last character. An other character is adjacent to an edge character in the script and does not belong to the target string. The number of target strings marked in the script may be at least one, i.e., one or more; correspondingly, one or more target sub-scripts may be obtained by segmentation, and one or more other sub-scripts may be obtained as well.
For example, suppose the script is X1X2X3Y1Y2, where X1X2X3 is the target string, X1 and X3 are both edge characters, and Y1 is an other character. The executing entity may take the position between X3 and Y1 as a segmentation position, and the segmentation result may then include the target sub-script X1X2X3 and the other sub-script Y1Y2.
These implementations segment at the edge characters of the target string, so that the target sub-script corresponding to the target string and the other sub-scripts besides it can be cut out accurately and completely.
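A minimal sketch of this segmentation follows, under the same assumed {%...} marker form as above: cutting at a target string's edge characters is equivalent to cutting at the boundaries of its marker match.

    import re

    MARK_PATTERN = re.compile(r"\{%(.*?)\}")  # assumed marker form, as above

    def split_script(script: str) -> list[tuple[str, bool]]:
        """Return (sub_script, is_target) pairs in their order within the script."""
        pieces, pos = [], 0
        for m in MARK_PATTERN.finditer(script):
            if m.start() > pos:
                pieces.append((script[pos:m.start()], False))  # other sub-script
            pieces.append((m.group(1), True))                  # target sub-script
            pos = m.end()
        if pos < len(script):
            pieces.append((script[pos:], False))
        return pieces

    # The X1X2X3Y1Y2 example above, with X1X2X3 marked:
    print(split_script("{%X1X2X3}Y1Y2"))  # [('X1X2X3', True), ('Y1Y2', False)]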
In some optional implementations of this embodiment, the recording information in the set of sub-script recording information includes the message-digest value corresponding to each recording. Step 203 may include: for each of the at least one sub-script, determining the message-digest value of that sub-script, and searching the set for a message-digest value identical to it. Step 204 may include: acquiring the recording corresponding to the found message-digest value, and generating speech, containing the acquired recording, for replying to the user's speech.
In these optional implementations, the executing entity may determine, for each of the at least one sub-script, the message-digest value of the sub-script using a message-digest algorithm. The executing entity may then search the set of sub-script recording information for an identical digest value, obtain the recording corresponding to the found value, and thereby generate the speech. Various message-digest algorithms may be used here, such as MD5 (Message-Digest Algorithm 5) or BASE64. Each digest value in the set has a corresponding recording. The generated speech may contain only the found recordings, or may also contain other recordings.
The correspondence between a digest value and a recording may be embodied in various forms. For example, the digest value may be associated with the storage address of the recording, so that the recording can be found through the digest value. For another example, the correspondence may be stored in a correspondence table that maps digest values to recordings (or recording identifiers), so that the corresponding recording can be found from the digest value.
These implementations use message-digest values to find exactly the recording that matches a sub-script, which helps improve the accuracy of the generated speech.
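A minimal sketch of the digest-based lookup is given below, assuming MD5 as the message-digest algorithm and a plain in-memory dictionary standing in for the cached set of sub-script recording information; all names are illustrative.

    import hashlib
    from typing import Optional

    def digest_key(sub_script: str, config: str = "") -> str:
        """MD5 digest of a sub-script (optionally combined with its synthesis config)."""
        return hashlib.md5((sub_script + "|" + config).encode("utf-8")).hexdigest()

    # Digest value -> storage address of the corresponding recording.
    recording_info_set: dict[str, str] = {}

    def find_recording_address(sub_script: str, config: str = "") -> Optional[str]:
        return recording_info_set.get(digest_key(sub_script, config))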
With continued reference to Fig. 3, Fig. 3 is a schematic diagram of an application scenario of the method for generating speech according to this embodiment. In the application scenario of Fig. 3, the executing entity 301 acquires a script 302 for the user's speech, for example "Dear XX, hello", where the script contains the marked target string "XX". Based on the position of the target string in the script, the executing entity 301 segments the script into a plurality of sub-scripts 303: "Dear", "XX", "hello". The executing entity 301 looks up, in the set of sub-script recording information, the recording information 304 corresponding to at least one of the sub-scripts, and generates, based on the found recording information 304, speech 305 for replying to the user's speech.
With further reference to fig. 4, a flow 400 of yet another embodiment of a method for generating speech is shown. The process 400 includes the steps of:
step 401, obtaining a speaking operation aiming at the voice of the user, wherein the speaking operation comprises a marked target character string.
In this embodiment, an execution body (e.g., a server or a terminal device shown in fig. 1) on which the method for generating a voice is run may acquire a speaking for a user voice from the present electronic device or other electronic devices. The term may be a sequence of characters. In practice, the above-described electronic device or other electronic devices may generate speech in the following manner: the method comprises the steps of obtaining the voice of a user, converting the voice of the user into characters, and generating reply characters of the characters, namely speaking, by utilizing a natural voice processing technology.
Step 402, dividing the phone into a plurality of sub-phones based on the position of the target character string in the phone, wherein the plurality of sub-phones includes the target sub-phone corresponding to the target character string and other sub-phones.
In this embodiment, the execution body may divide the session into a plurality of sub-sessions. The target character string corresponds to the target sub-phone, and the characters in the target sub-phone are the characters in the target character string.
Step 403, searching recording information corresponding to at least one sub-phone in the plurality of sub-phones in the sub-phone recording information set.
In this embodiment, the executing body may search the recording information corresponding to at least one of the plurality of sub-microphone in the set of recording information. The recording information in the set is the recording information of the sub-phone operation which is stored in advance, namely the recording information of the sub-phone operation, wherein the recording information of the sub-phone operation can comprise the recording information of one or more sub-phone operations in the phone operation.
Step 404, synthesizing the sound recording for at least one sub-phone outside the sub-phone in the plurality of sub-phones.
In this embodiment, the execution body may synthesize the sound recordings of the sub-phone operation other than the at least one sub-phone operation by using a speech synthesis technique. In practice, the sub-utterances other than the at least one sub-utterances may be all of the plurality of sub-utterances except the at least one sub-utterances. Alternatively, the sub-phone operation other than the at least one sub-phone operation may be a partial sub-phone operation other than the at least one sub-phone operation among the plurality of sub-phone operations, and in this case, the execution subject may determine the partial sub-phone operation according to a predetermined rule or randomly.
Step 405, merging the sound record corresponding to the searched sound record information with the synthesized sound record to generate a voice for replying to the voice of the user.
In this embodiment, the executing body may combine the sound record corresponding to the searched sound record information with the sound record synthesized in step 404, and use the combined result as the voice of the reply user voice.
According to the method and the device for synthesizing the sub-phone operation in the phone operation, the voice of one sub-phone operation is synthesized, and the voice of the other sub-phone operation is searched, so that the accuracy of voice generation can be ensured through voice synthesis while the voice interaction efficiency is improved, and the problem that the voice matched with the content of all the phones cannot be searched is avoided.
In some optional implementations of this embodiment, the target sub-script is one of a fixed sub-script and a variable sub-script, and the other sub-scripts are the other of the two. Step 403 may include: looking up, in the set of sub-script recording information, the recording information corresponding to the fixed sub-scripts among the plurality of sub-scripts. Step 404 may include: synthesizing the recordings of the variable sub-scripts among the plurality of sub-scripts.
In these optional implementations, regardless of whether the target string marked in the script is fixed or variable, the executing entity looks up, in the set of sub-script recording information, the recording information corresponding to the fixed sub-scripts. Accordingly, the recording information found here may correspond to the target sub-script or to the other sub-scripts.
A script acquired for user speech may contain two kinds of sub-scripts. One kind is the fixed component of the script: sub-scripts that stay the same for different users and at different times, i.e., fixed sub-scripts composed of fixed characters. The other kind is the variable component: sub-scripts that vary across users or over time, i.e., variable sub-scripts composed of variable characters.
For example, the acquired script may be "XXX, hello. We are contacting you to remind you that the credit card ending in yyyy issued by our bank has an RMB bill of aaaa yuan for this period." This script may be divided into six sub-scripts, including three variable sub-scripts, "XXX", "yyyy" and "aaaa", and three fixed sub-scripts: ", hello. We are contacting you to remind you that the credit card ending in", "issued by our bank has an RMB bill of" and "yuan for this period."
The executing entity may synthesize, using a speech synthesis technique, the recordings of the variable sub-scripts among the plurality of sub-scripts. In practice, if the target sub-script is a fixed sub-script, what is synthesized is the recordings of the other sub-scripts; if the other sub-scripts are fixed sub-scripts, what is synthesized is the recording of the target sub-script.
These implementations treat the fixed part and the variable part of the script separately: only the part that may change is synthesized, while the recording corresponding to the fixed part is fetched directly. This minimizes the script content that has to be synthesized, effectively improving interaction efficiency while guaranteeing the accuracy of the synthesized speech.
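A minimal sketch of this fixed/variable policy follows. The tts object stands in for whatever speech synthesis engine is used, find_recording_address is the lookup helper sketched above, and the fixed sub-scripts are assumed to have been recorded and cached in advance; all names are illustrative.

    def load_recording(address: str) -> bytes:
        """Read a cached recording from its storage address (a file path here)."""
        with open(address, "rb") as f:
            return f.read()

    def render_fixed_variable(pieces: list[tuple[str, bool]], tts) -> bytes:
        """pieces: (sub_script, is_variable) pairs in script order."""
        chunks = []
        for sub_script, is_variable in pieces:
            if is_variable:
                # Variable sub-scripts ("XXX", "yyyy", "aaaa") are synthesized live.
                chunks.append(tts.synthesize(sub_script))
            else:
                # Fixed sub-scripts are assumed pre-recorded and cached.
                address = find_recording_address(sub_script)
                assert address is not None, "fixed sub-script assumed pre-cached"
                chunks.append(load_recording(address))
        return b"".join(chunks)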
In some optional implementations of this embodiment, step 403 may include: looking up, in the set of sub-script recording information, the recording information corresponding to each of the plurality of sub-scripts; and step 404 may include: in response to there being a sub-script, among the plurality, that has no corresponding recording information, synthesizing the recording corresponding to that sub-script.
In these optional implementations, the executing entity may look up, in the set of sub-script recording information, the recording information corresponding to each of the plurality of sub-scripts. If, among the plurality of sub-scripts, there is a sub-script for which no corresponding recording information is found, the executing entity may synthesize the recording corresponding to that sub-script in real time. The executing entity may then merge the found recordings and the synthesized recordings into the speech replying to the user's speech.
These implementations look up recording information in the set for every sub-script whenever possible, thereby making maximal use of existing recordings, reducing the amount of synthesis, effectively shortening the synthesis time and improving the efficiency of voice interaction.
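The look-up-first, synthesize-on-miss path can be sketched as below, reusing digest_key, recording_info_set and load_recording from the sketches above; tts again stands in for the synthesis engine.

    def get_recordings_with_fallback(sub_scripts: list[str], tts) -> list[bytes]:
        out = []
        for s in sub_scripts:
            address = recording_info_set.get(digest_key(s))
            if address is not None:
                out.append(load_recording(address))  # hit: reuse the cached recording
            else:
                out.append(tts.synthesize(s))        # miss: synthesize in real time
        return out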
In some optional implementations of this embodiment, the method may further include: storing each synthesized recording into a storage space; determining the message-digest value of the sub-script corresponding to the recording as the digest value to be cached; and caching, in the set of sub-script recording information in a cache space, the digest value to be cached in correspondence with the storage address of the recording in the storage space.
In these optional implementations, the executing entity may store the synthesized recording into a pre-designated storage space, and determine the message-digest value corresponding to the sub-script from which the recording was synthesized. The executing entity may then cache the digest value and the storage address of the recording in correspondence in the cache space, so that a correspondence exists between them. Specifically, the set of sub-script recording information may reside in the cache space, and the executing entity may cache both items, in correspondence, in that set. The storage space mentioned above is not the cache space.
The digest value corresponding to a sub-script may be determined from the sub-script alone, or from the sub-script together with designated information. Specifically, the designated information here may be, for example, the speech synthesis configuration information of the sub-script.
After a recording is synthesized, caching its storage address together with the corresponding digest value allows the digest value, i.e., the recording information, to be obtained quickly, and the recording then to be fetched by its address, which improves the efficiency of voice interaction.
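A minimal sketch of this caching step follows, reusing digest_key and recording_info_set from above; the storage directory, file naming and raw-PCM extension are illustrative assumptions.

    import os

    STORAGE_DIR = "/data/recordings"  # assumed storage space (distinct from the cache space)

    def cache_synthesized(sub_script: str, audio: bytes, config: str = "") -> str:
        key = digest_key(sub_script, config)               # digest value to be cached
        address = os.path.join(STORAGE_DIR, key + ".pcm")
        with open(address, "wb") as f:                     # store the recording itself
            f.write(audio)
        recording_info_set[key] = address                  # cache: digest value -> storage address
        return address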
In some optional application scenarios of these implementations, both the specified sub-scripts and the plurality of sub-scripts include fixed sub-scripts and variable sub-scripts. In the set of sub-script recording information in the cache space, the digest values and storage addresses corresponding to fixed sub-scripts and those corresponding to variable sub-scripts are cached in different recording-information subsets; and/or, in that set, the digest values and storage addresses corresponding to fixed sub-scripts carry identifiers different from those corresponding to variable sub-scripts.
In these optional application scenarios, the executing entity or another electronic device may store, in the set of sub-script recording information, the information corresponding to fixed sub-scripts and the information corresponding to variable sub-scripts separately. Specifically, the two kinds of information may be cached in different recording-information subsets of the set; the corresponding information here means the corresponding digest value and storage address. Alternatively, the executing entity or another electronic device may set one identifier for the information corresponding to fixed sub-scripts and another identifier for the information corresponding to variable sub-scripts, so that the two kinds of cached information carry different identifiers.
These application scenarios store the information corresponding to variable sub-scripts and that corresponding to fixed sub-scripts separately, so that the information corresponding to a sub-script can be fetched from the cache more quickly and accurately.
In some optional implementations of this embodiment, the method may further include: acquiring the speech synthesis configuration information of the script; and step 404 may include: synthesizing, according to the acquired speech synthesis configuration information, the recordings of the sub-scripts, among the plurality, other than the at least one sub-script.
In these optional implementations, the executing entity may acquire the speech synthesis configuration information (config) of the script and synthesize the recordings according to it. Speech synthesis configuration information here means the configuration required for synthesizing speech, and may include, for example, at least one of: the speaker's voice, the speech speed, the pitch, and the audio format.
These implementations enable accurate synthesis of recordings based on the acquired speech synthesis configuration information.
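As an illustration, such configuration information could be modeled as follows; the field names and default values are assumptions covering only the items listed above, and as_key shows how the configuration can later be folded into the digest value.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class SynthesisConfig:
        voice: str = "female_1"    # the speaker's voice
        speed: float = 1.0         # speech speed
        pitch: float = 1.0         # pitch
        audio_format: str = "mp3"  # audio format

        def as_key(self) -> str:
            """Canonical string to fold into the digest together with the sub-script."""
            return f"{self.voice}|{self.speed}|{self.pitch}|{self.audio_format}"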
In some optional implementations of any of the above embodiments, the method for generating speech may further include: acquiring a sub-script set containing specified sub-scripts, where sub-scripts identical to ones among the plurality of sub-scripts exist in the set; synthesizing the recording of each specified sub-script in the set and storing it into the storage space; acquiring the digest value corresponding to each specified sub-script; and, for each specified sub-script, storing in correspondence, in the set of sub-script recording information in the cache space, the digest value corresponding to the specified sub-script and the storage address, in the storage space, of the recording synthesized from it.
In these optional implementations, the executing entity may acquire a sub-script set containing specified sub-scripts, where some specified sub-scripts in the set are identical to sub-scripts among the plurality of sub-scripts of the acquired script. The executing entity may synthesize a recording for each specified sub-script and store the synthesized recordings into the storage space. It may then acquire the digest value corresponding to each specified sub-script, generated locally or by another electronic device, and cache, in the cache space, the digest value obtained for each specified sub-script in correspondence with the storage address of the recording synthesized from it.
These implementations store the digest values and the storage addresses of the recordings in the cache in advance and in correspondence, making it convenient to fetch the recording corresponding to a sub-script quickly and accurately.
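The sketch below illustrates such offline pre-population, reusing cache_synthesized and SynthesisConfig from the sketches above; it would typically be run ahead of time for the fixed sub-scripts of known scripts, and tts again stands in for the synthesis engine.

    def prepopulate(specified_sub_scripts: list[str], tts,
                    config: SynthesisConfig) -> None:
        for s in specified_sub_scripts:
            audio = tts.synthesize(s)                     # synthesize each specified sub-script once
            cache_synthesized(s, audio, config.as_key())  # store it and cache digest -> address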
In some optional application scenarios of these implementations, the digest value corresponding to any sub-script of the script is determined based on both of the following: that sub-script, and the speech synthesis configuration information of that sub-script.
In these optional application scenarios, for any sub-script of a script, the executing entity or another electronic device may determine the digest value corresponding to the sub-script based on both the sub-script and its speech synthesis configuration information. The speech synthesis configuration information of a sub-script is that of the script to which it belongs, so all sub-scripts of one script share the same speech synthesis configuration information.
The executing entity or the other electronic device may combine the two in various ways. For example, it may compute the digest of the two together with a message-digest algorithm and take the result as the digest value corresponding to the sub-script. It may also compute the digest of the two together with additional information, such as identification information of the script to which the sub-script belongs, and take that as the digest value corresponding to the sub-script.
By folding the speech synthesis configuration into the cached information, these application scenarios cache more detailed synthesis information and thereby facilitate synthesizing the desired speech.
In some optional application scenarios of these implementations, likewise, in the set of sub-script recording information in the cache space, the digest values and storage addresses corresponding to fixed sub-scripts and those corresponding to variable sub-scripts are cached in different recording-information subsets; and/or the digest values and storage addresses corresponding to fixed sub-scripts carry identifiers different from those corresponding to variable sub-scripts.
With further reference to Fig. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for generating speech. This apparatus embodiment corresponds to the method embodiment shown in Fig. 2 and, apart from the features described below, may include the same or corresponding features and effects as that method embodiment. The apparatus may be applied to various electronic devices.
As shown in Fig. 5, the apparatus 500 for generating speech of this embodiment includes: an acquisition unit 501, a segmentation unit 502, a search unit 503 and a generation unit 504. The acquisition unit 501 is configured to acquire a script for replying to a user's speech, wherein the script contains a marked target character string; the segmentation unit 502 is configured to segment the script into a plurality of sub-scripts based on the position of the target string in the script, wherein the plurality of sub-scripts comprises a target sub-script corresponding to the target string and other sub-scripts; the search unit 503 is configured to look up, in the set of sub-script recording information, the recording information corresponding to at least one of the sub-scripts; and the generation unit 504 is configured to generate, based on the found recording information, speech for replying to the user's speech.
In this embodiment, for the specific processing of the acquisition unit 501, the segmentation unit 502, the search unit 503 and the generation unit 504 of the apparatus 500 and the technical effects they bring, reference may be made to the descriptions of steps 201, 202, 203 and 204 in the embodiment corresponding to Fig. 2, which are not repeated here.
In some optional implementations of this embodiment, the apparatus further includes a synthesizing unit configured to synthesize recordings for the sub-scripts, among the plurality, other than the at least one sub-script; and the generation unit is further configured to generate the speech for replying to the user's speech based on the found recording information as follows: merging the recordings corresponding to the found recording information with the synthesized recordings to generate the speech for replying to the user's speech.
In some optional implementations of this embodiment, the search unit is further configured to perform the lookup as follows: looking up, in the set of sub-script recording information, the recording information corresponding to each of the plurality of sub-scripts; and the synthesizing unit is further configured to perform the synthesis as follows: in response to there being a sub-script, among the plurality, with no corresponding recording information, synthesizing the recording corresponding to that sub-script.
In some optional implementations of this embodiment, the target sub-script is one of a fixed sub-script and a variable sub-script, and the other sub-scripts are the other of the two; the search unit is further configured to perform the lookup as follows: looking up, in the set of sub-script recording information, the recording information corresponding to the fixed sub-scripts among the plurality of sub-scripts; and the synthesizing unit is further configured to perform the synthesis as follows: synthesizing the recordings of the variable sub-scripts among the plurality of sub-scripts.
In some optional implementations of this embodiment, the recording information in the set of sub-script recording information includes the message-digest value corresponding to each recording; the search unit is further configured to perform the lookup as follows: for each of the at least one sub-script, determining the message-digest value of that sub-script and searching the set for an identical digest value; and the generation unit is further configured to generate the speech as follows: acquiring the recording corresponding to the found digest value, and generating speech, containing the acquired recording, for replying to the user's speech.
In some optional implementations of this embodiment, the apparatus further includes: a set acquisition unit configured to acquire a sub-script set containing specified sub-scripts, where sub-scripts identical to ones among the plurality exist in the set; a first storage unit configured to synthesize the recording of each specified sub-script in the set and store it into the storage space; a value acquisition unit configured to acquire the digest value corresponding to each specified sub-script; and a first cache unit configured to cache, for each specified sub-script, in the set of sub-script recording information in the cache space, the digest value corresponding to the specified sub-script in correspondence with the storage address, in the storage space, of the recording synthesized from it.
In some optional implementations of this embodiment, the apparatus further includes: a second storage unit configured to store the synthesized recording into the storage space; a value determination unit configured to determine the digest value of the sub-script corresponding to the recording as the digest value to be cached; and a second cache unit configured to cache, in the set of sub-script recording information in the cache space, the digest value to be cached in correspondence with the storage address of the recording in the storage space.
In some optional implementations of this embodiment, both the specified sub-scripts and the plurality of sub-scripts include fixed sub-scripts and variable sub-scripts; in the set of sub-script recording information in the cache space, the digest values and storage addresses corresponding to fixed sub-scripts and those corresponding to variable sub-scripts are cached in different recording-information subsets; and/or they carry different identifiers.
In some optional implementations of this embodiment, the digest value corresponding to any sub-script of the script is determined based on both of the following: that sub-script, and the speech synthesis configuration information of that sub-script.
In some optional implementations of this embodiment, the apparatus further includes a configuration acquisition unit configured to acquire the speech synthesis configuration information of the script; and the synthesizing unit is further configured to perform the synthesis as follows: synthesizing, according to the acquired speech synthesis configuration information, the recordings of the sub-scripts, among the plurality, other than the at least one sub-script.
In some optional implementations of this embodiment, the segmentation unit is further configured to perform the segmentation as follows: for the edge characters of each of at least one target string in the script, taking the position between an edge character and its adjacent other character as a segmentation position, and segmenting the script into at least one target sub-script and at least one other sub-script, where the other characters are the characters outside the target strings.
According to embodiments of the present application, an electronic device and a readable storage medium are also provided.
Fig. 6 is a block diagram of an electronic device for the method for generating speech according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit the implementations of the application described and/or claimed herein.
As shown in Fig. 6, the electronic device includes: one or more processors 601, a memory 602, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The various components are interconnected by different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to an interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Likewise, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 601 is taken as an example in Fig. 6.
Memory 602 is a non-transitory computer-readable storage medium provided herein. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the methods for generating speech provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method for generating speech provided by the present application.
The memory 602 is used as a non-transitory computer readable storage medium, and may be used to store a non-transitory software program, a non-transitory computer executable program, and modules, such as program instructions/modules (e.g., the acquisition unit 501, the segmentation unit 502, the search unit 503, and the generation unit 504 shown in fig. 5) corresponding to a method for generating speech in the embodiments of the present application. The processor 601 executes various functional applications of the server and data processing, i.e., implements the method for generating speech in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 602.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for a function; the storage data area may store data created according to the use of the electronic device for generating speech, etc. In addition, the memory 602 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 602 may optionally include memory located remotely from processor 601, which may be connected to the electronic device for generating speech via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the method for generating speech may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603, and the output device 604 may be connected by a bus or otherwise; in fig. 6, connection by a bus is taken as an example.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for generating speech; examples include a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, a joystick, and the like. The output device 604 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memories, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or a middleware component (e.g., an application server), or a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor comprising an acquisition unit, a segmentation unit, a search unit, and a generation unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the acquisition unit may also be described as "a unit that acquires a speech script for a user voice".
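For illustration only, the composition of these four units might be sketched as follows in Python; the class and method names below are assumptions, not the disclosed implementation:

    class SpeechGeneratorService:
        # Composes the four units named above; a sketch, not the disclosed design.
        def __init__(self, acquisition_unit, segmentation_unit, search_unit, generation_unit):
            self.acquisition_unit = acquisition_unit    # acquires the speech script
            self.segmentation_unit = segmentation_unit  # divides it into sub-scripts
            self.search_unit = search_unit              # looks up recording information
            self.generation_unit = generation_unit      # assembles the reply voice

        def reply(self, user_voice):
            script = self.acquisition_unit.acquire(user_voice)
            sub_scripts = self.segmentation_unit.split(script)
            recording_info = self.search_unit.lookup(sub_scripts)
            return self.generation_unit.generate(recording_info)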
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist alone without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquire a speech script for a user voice, wherein the speech script comprises a marked target character string; divide the speech script into a plurality of sub-scripts based on the position of the target character string in the speech script, wherein the plurality of sub-scripts comprise a target sub-script corresponding to the target character string and other sub-scripts; search, in a sub-script recording information set, for recording information corresponding to at least one of the plurality of sub-scripts; and generate, based on the found recording information, a voice for replying to the user voice.
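To make the flow above concrete, a minimal Python sketch follows; it is an illustration only, and the "{...}" marking convention, the MD5 digest, and the synthesize/concatenate helpers are assumptions rather than the disclosed implementation:

    import hashlib
    import re

    # Illustrative sub-script recording information set: digest -> recording
    # bytes (the disclosure caches a storage address; bytes keep this sketch
    # self-contained).
    sub_script_recording_set = {}

    def digest_value(sub_script, synth_config="voice=f1,rate=1.0"):
        # Digest over the sub-script text plus its speech synthesis
        # configuration; MD5 is an assumed choice of digest algorithm.
        return hashlib.md5((sub_script + "|" + synth_config).encode("utf-8")).hexdigest()

    def split_script(script):
        # Divide the speech script at the edges of the marked target string;
        # "{...}" marks the target character string.
        return [p for p in re.split(r"(\{[^}]*\})", script) if p]

    def synthesize(sub_script):
        # Stand-in for a real text-to-speech call.
        return ("TTS:" + sub_script.strip("{}")).encode("utf-8")

    def generate_reply(script):
        # Look up each sub-script's recording by digest; on a miss, synthesize
        # and cache it; concatenate the recordings into the reply voice.
        audio = b""
        for sub in split_script(script):
            key = digest_value(sub)
            if key not in sub_script_recording_set:
                sub_script_recording_set[key] = synthesize(sub)
            audio += sub_script_recording_set[key]
        return audio

    reply = generate_reply("Hello {Zhang San}, your order has shipped.")

Running this twice with different names reuses the cached recordings for the fixed fragments and synthesizes only the variable one, which is the saving the scheme aims at.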
The foregoing description covers only the preferred embodiments of the present application and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the invention referred to in this application is not limited to the specific combinations of features described above, and is intended to cover other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present application.

Claims (22)

1. A method for generating speech, the method comprising:
acquiring a speech script for a user voice, wherein the speech script comprises a marked target character string;
dividing the speech script into a plurality of sub-scripts based on a position of the target character string in the speech script, wherein the plurality of sub-scripts comprise a target sub-script corresponding to the target character string and other sub-scripts;
searching, in a sub-script recording information set, for recording information corresponding to at least one sub-script of the plurality of sub-scripts; and
generating, based on the found recording information, a voice for replying to the user voice;
wherein the recording information in the sub-script recording information set comprises message digest values corresponding to recordings;
wherein the searching, in the sub-script recording information set, for the recording information corresponding to the at least one sub-script comprises:
for each sub-script of the at least one sub-script, determining a message digest value corresponding to the sub-script; and
searching, in the sub-script recording information set, for a message digest value identical to the message digest value corresponding to the sub-script; and
wherein the generating the voice for replying to the user voice based on the found recording information comprises:
acquiring a recording corresponding to the found message digest value, and generating a voice that comprises the acquired recording and is used for replying to the user voice.
2. The method of claim 1, wherein the method further comprises:
synthesizing recordings for the sub-scripts of the plurality of sub-scripts other than the at least one sub-script; and
wherein the generating the voice for replying to the user voice based on the found recording information comprises:
combining the recordings corresponding to the found recording information with the synthesized recordings to generate the voice for replying to the user voice.
3. The method of claim 2, wherein the searching, in the sub-script recording information set, for the recording information corresponding to the at least one sub-script of the plurality of sub-scripts comprises:
searching, in the sub-script recording information set, for recording information corresponding to each of the plurality of sub-scripts; and
wherein the synthesizing the recordings for the sub-scripts other than the at least one sub-script comprises:
in response to a sub-script having no corresponding recording information, synthesizing a recording corresponding to that sub-script.
4. The method of claim 2, wherein the target sub-script is one of a fixed sub-script and a variable sub-script, and the other sub-scripts are the other of the fixed sub-script and the variable sub-script;
wherein the searching, in the sub-script recording information set, for the recording information corresponding to the at least one sub-script comprises:
searching, in the sub-script recording information set, for recording information corresponding to the fixed sub-script among the plurality of sub-scripts; and
wherein the synthesizing the recordings for the sub-scripts other than the at least one sub-script comprises:
synthesizing recordings for the variable sub-scripts among the plurality of sub-scripts.
5. The method of claim 1, wherein the method further comprises:
acquiring a sub-script set comprising designated sub-scripts, wherein the sub-script set contains sub-scripts identical to the plurality of sub-scripts;
synthesizing a recording for each designated sub-script in the sub-script set and storing the recordings in a storage space;
acquiring a message digest value corresponding to each of the designated sub-scripts; and
for each designated sub-script, caching, in the sub-script recording information set in a cache space, the message digest value corresponding to the designated sub-script in correspondence with the storage address, in the storage space, of the recording synthesized for the designated sub-script.
6. The method of claim 2, wherein the method further comprises:
storing the synthesized recording in a storage space;
determining the message digest value corresponding to the sub-script to which the recording corresponds as a message digest value to be cached; and
caching, in the sub-script recording information set in a cache space, the message digest value to be cached in correspondence with the storage address of the recording in the storage space.
7. The method of claim 5, wherein each of the designated sub-scripts and each of the plurality of sub-scripts comprises a fixed sub-script and a variable sub-script; and
wherein, in the sub-script recording information set in the cache space, the message digest value and storage address corresponding to the fixed sub-script and the message digest value and storage address corresponding to the variable sub-script are cached in different recording information subsets, respectively; and/or, in the sub-script recording information set in the cache space, the message digest value and storage address corresponding to the fixed sub-script are identified differently from the message digest value and storage address corresponding to the variable sub-script.
8. The method of claim 5 or 6, wherein the message digest value corresponding to any sub-script of the speech script is determined based on both the sub-script itself and the speech synthesis configuration information of the sub-script.
9. The method of claim 2, wherein the method further comprises:
acquiring speech synthesis configuration information of the speech script; and
wherein the synthesizing the recordings for the sub-scripts other than the at least one sub-script comprises:
synthesizing, according to the acquired speech synthesis configuration information, recordings for the sub-scripts of the plurality of sub-scripts other than the at least one sub-script.
10. The method of claim 1, wherein the dividing the speech script into the plurality of sub-scripts based on the position of the target character string in the speech script comprises:
for an edge character of at least one target character string in the speech script, dividing the speech script into at least one target sub-script and at least one other sub-script by taking the position between the edge character and an adjacent other character as a division position, wherein the other character is a character outside the target character string.
11. An apparatus for generating speech, the apparatus comprising:
an acquisition unit configured to acquire a speech script for a user voice, wherein the speech script comprises a marked target character string;
a segmentation unit configured to divide the speech script into a plurality of sub-scripts based on a position of the target character string in the speech script, wherein the plurality of sub-scripts comprise a target sub-script corresponding to the target character string and other sub-scripts;
a search unit configured to search, in a sub-script recording information set, for recording information corresponding to at least one sub-script of the plurality of sub-scripts; and
a generation unit configured to generate, based on the found recording information, a voice for replying to the user voice;
wherein the recording information in the sub-script recording information set comprises message digest values corresponding to recordings;
wherein the search unit is further configured to search for the recording information corresponding to the at least one sub-script as follows:
for each sub-script of the at least one sub-script, determining a message digest value corresponding to the sub-script; and
searching, in the sub-script recording information set, for a message digest value identical to the message digest value corresponding to the sub-script; and
wherein the generation unit is further configured to generate the voice for replying to the user voice based on the found recording information as follows:
acquiring a recording corresponding to the found message digest value, and generating a voice that comprises the acquired recording and is used for replying to the user voice.
12. The apparatus of claim 11, wherein the apparatus further comprises:
a synthesis unit configured to synthesize recordings for the sub-scripts of the plurality of sub-scripts other than the at least one sub-script; and
wherein the generation unit is further configured to generate the voice for replying to the user voice based on the found recording information as follows:
combining the recordings corresponding to the found recording information with the synthesized recordings to generate the voice for replying to the user voice.
13. The apparatus of claim 12, wherein the search unit is further configured to search, in the sub-script recording information set, for the recording information corresponding to the at least one sub-script of the plurality of sub-scripts as follows:
searching, in the sub-script recording information set, for recording information corresponding to each of the plurality of sub-scripts; and
wherein the synthesis unit is further configured to synthesize the recordings for the sub-scripts other than the at least one sub-script as follows:
in response to a sub-script having no corresponding recording information, synthesizing a recording corresponding to that sub-script.
14. The apparatus of claim 12, wherein the target sub-script is one of a fixed sub-script and a variable sub-script, and the other sub-scripts are the other of the fixed sub-script and the variable sub-script;
wherein the search unit is further configured to search for the recording information corresponding to the at least one sub-script as follows:
searching, in the sub-script recording information set, for recording information corresponding to the fixed sub-script among the plurality of sub-scripts; and
wherein the synthesis unit is further configured to synthesize the recordings for the sub-scripts other than the at least one sub-script as follows:
synthesizing recordings for the variable sub-scripts among the plurality of sub-scripts.
15. The apparatus of claim 11, wherein the apparatus further comprises:
a set acquisition unit configured to acquire a sub-script set comprising designated sub-scripts, wherein the sub-script set contains sub-scripts identical to the plurality of sub-scripts;
a first storage unit configured to synthesize a recording for each designated sub-script in the sub-script set and store the recordings in a storage space;
a value acquisition unit configured to acquire a message digest value corresponding to each of the designated sub-scripts; and
a first cache unit configured to cache, for each designated sub-script, in the sub-script recording information set in a cache space, the message digest value corresponding to the designated sub-script in correspondence with the storage address, in the storage space, of the recording synthesized for the designated sub-script.
16. The apparatus of claim 12, wherein the apparatus further comprises:
a second storage unit configured to store the synthesized recording in a storage space;
a value determination unit configured to determine the message digest value of the sub-script to which the recording corresponds as a message digest value to be cached; and
a second cache unit configured to cache, in the sub-script recording information set in a cache space, the message digest value to be cached in correspondence with the storage address of the recording in the storage space.
17. The apparatus of claim 15, wherein each of the designated sub-scripts and each of the plurality of sub-scripts comprises a fixed sub-script and a variable sub-script; and
wherein, in the sub-script recording information set in the cache space, the message digest value and storage address corresponding to the fixed sub-script and the message digest value and storage address corresponding to the variable sub-script are cached in different recording information subsets, respectively; and/or, in the sub-script recording information set in the cache space, the message digest value and storage address corresponding to the fixed sub-script are identified differently from the message digest value and storage address corresponding to the variable sub-script.
18. The apparatus of claim 15 or 16, wherein the message digest value corresponding to any sub-script of the speech script is determined based on both the sub-script itself and the speech synthesis configuration information of the sub-script.
19. The apparatus of claim 12, wherein the apparatus further comprises:
a configuration acquisition unit configured to acquire speech synthesis configuration information of the speech script; and
wherein the synthesis unit is further configured to synthesize the recordings for the sub-scripts other than the at least one sub-script as follows:
synthesizing, according to the acquired speech synthesis configuration information, recordings for the sub-scripts of the plurality of sub-scripts other than the at least one sub-script.
20. The apparatus of claim 11, wherein the segmentation unit is further configured to divide the speech script into the plurality of sub-scripts based on the position of the target character string in the speech script as follows:
for an edge character of at least one target character string in the speech script, dividing the speech script into at least one target sub-script and at least one other sub-script by taking the position between the edge character and an adjacent other character as a division position, wherein the other character is a character outside the target character string.
21. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-10.
22. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method of any one of claims 1-10.
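For illustration only, and not as part of the claimed subject matter, the cache organization of claims 5-7 and 15-17 might be sketched in Python as follows; the MD5 digest, the dictionary layout, and the "blob/..." address scheme are all assumed details:

    import hashlib

    # Sub-script recording information set: message digest value -> storage
    # address, kept in separate subsets for fixed and variable sub-scripts.
    recording_info = {"fixed": {}, "variable": {}}
    storage = {}  # the storage space: address -> synthesized recording bytes

    def cache_recording(sub_script, synth_config, kind, audio):
        # Digest over the sub-script text plus its speech synthesis
        # configuration; store the recording, then cache the digest in
        # correspondence with its storage address.
        key = hashlib.md5((sub_script + "|" + synth_config).encode("utf-8")).hexdigest()
        address = "blob/" + key  # illustrative address scheme
        storage[address] = audio
        recording_info[kind][key] = address
        return address

    cache_recording("Hello ", "voice=f1", "fixed", b"\x00\x01")
    cache_recording("{name}", "voice=f1", "variable", b"\x02\x03")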
CN202010401740.1A 2020-05-13 2020-05-13 Method and device for generating voice Active CN111599341B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010401740.1A CN111599341B (en) 2020-05-13 2020-05-13 Method and device for generating voice

Publications (2)

Publication Number Publication Date
CN111599341A (en) 2020-08-28
CN111599341B (en) 2023-06-20

Family

ID=72185339

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010401740.1A Active CN111599341B (en) 2020-05-13 2020-05-13 Method and device for generating voice

Country Status (1)

Country Link
CN (1) CN111599341B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735372A (en) * 2020-12-29 2021-04-30 竹间智能科技(上海)有限公司 Outbound voice output method, device and equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006020074A (en) * 2004-07-01 2006-01-19 Takeaki Mori Mobile communication service system
CN1945691A (en) * 2006-10-16 2007-04-11 安徽中科大讯飞信息科技有限公司 Method for improving template sentence synthetic effect in voice synthetic system
CN101261831A (en) * 2007-03-05 2008-09-10 凌阳科技股份有限公司 A phonetic symbol decomposition and its synthesis method
JP2012042974A (en) * 2011-10-26 2012-03-01 Hitachi Ltd Voice synthesizer
CN105336329A (en) * 2015-09-25 2016-02-17 联想(北京)有限公司 Speech processing method and system
CN106899859A (en) * 2015-12-18 2017-06-27 北京奇虎科技有限公司 A kind of playing method and device of multi-medium data
CN110534088A (en) * 2019-09-25 2019-12-03 招商局金融科技有限公司 Phoneme synthesizing method, electronic device and storage medium
CN110782869A (en) * 2019-10-30 2020-02-11 标贝(北京)科技有限公司 Speech synthesis method, apparatus, system and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10606950B2 (en) * 2016-03-16 2020-03-31 Sony Mobile Communications, Inc. Controlling playback of speech-containing audio data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Sun Xumin. Improved algorithm identifying spam message of pseudo base station based on digital signature. 2017 7th IEEE International Conference on Electronics Information and Emergency Communication (ICEIEC), 2017, full text. *
Wang Xin. Design and Implementation of an Intelligent Voice Meeting Minutes System. China Master's Theses Full-text Database, Information Science and Technology Series, 2020, full text. *

Also Published As

Publication number Publication date
CN111599341A (en) 2020-08-28

Similar Documents

Publication Publication Date Title
US8782536B2 (en) Image-based instant messaging system for providing expressions of emotions
US10970603B2 (en) Object recognition and description using multimodal recurrent neural network
US11151321B2 (en) Anaphora resolution
CN110727434B (en) Rendering method, rendering device, electronic equipment and storage medium
US20170270951A1 (en) Graphical display of phone conversations
US11316818B1 (en) Context-based consolidation of communications across different communication platforms
US20220019747A1 (en) Method for information processing in user conversation, electronic device and storage medium thereof
US20190121852A1 (en) Cognitive collaborative moments
CN111883127A (en) Method and apparatus for processing speech
JP2023505917A (en) VOICE INTERACTION METHOD, APPARATUS, APPARATUS AND COMPUTER STORAGE MEDIA
US10599693B2 (en) Contextual-based high precision search for mail systems
CN111599341B (en) Method and device for generating voice
US20210089623A1 (en) Context-based topic recognition using natural language processing
US10909146B2 (en) Providing automated hashtag suggestions to categorize communication
WO2023169193A1 (en) Method and device for generating smart contract
CN112988100A (en) Video playing method and device
CN112669855A (en) Voice processing method and device
US11966562B2 (en) Generating natural languages interface from graphic user interfaces
CN113542802B (en) Video transition method and device
KR102608726B1 (en) Method, apparatus, device and storage medium for customizing personalized rules for entities
US20190163830A1 (en) Customer service advocacy on social networking sites using natural language query response from site-level search results
KR102583112B1 (en) Method and apparatus for processing image, electronic device, storage media and program
CN111680508B (en) Text processing method and device
CN113113017B (en) Audio processing method and device
US10546020B2 (en) Conversation purpose-based team analytics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant