CN113096637A - Speech synthesis method, apparatus and computer readable storage medium - Google Patents

Speech synthesis method, apparatus and computer readable storage medium

Info

Publication number
CN113096637A
Authority
CN
China
Prior art keywords
synthesis, subtask, target text, subtasks, sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110640245.0A
Other languages
Chinese (zh)
Other versions
CN113096637B (en)
Inventor
徐灿
叶旭文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd filed Critical Beijing Century TAL Education Technology Co Ltd
Priority to CN202110640245.0A
Publication of CN113096637A
Application granted
Publication of CN113096637B
Active legal-status Current
Anticipated expiration legal-status

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management

Abstract

The embodiments of the disclosure provide a speech synthesis method, an apparatus and a computer readable storage medium. The speech synthesis method includes: in response to a received target text, performing segmentation processing on the target text to obtain at least two sub-texts corresponding to the target text; generating at least two synthesis subtasks based on the at least two sub-texts, where each synthesis subtask is used to indicate that the corresponding sub-text is to be subjected to speech synthesis to obtain corresponding audio data, and one synthesis subtask corresponds to one sub-text; determining the processing priority of each of the at least two synthesis subtasks based on the request time of the target text and the order of the sub-texts in the target text; and executing the synthesis subtasks based on the processing priorities to obtain the audio data corresponding to the synthesis subtasks. The disclosed embodiments are used for speech synthesis.

Description

Speech synthesis method, apparatus and computer readable storage medium
Technical Field
The disclosed embodiments relate to the field of computer technologies, and in particular, to a speech synthesis method and apparatus, and a computer-readable storage medium.
Background
Speech synthesis technology converts text data into audio data for playback and is widely applied in many fields, such as the online education industry and the translation industry. Generally, a speech synthesis model can be used to perform speech synthesis on text data to obtain audio data, but the speech synthesis model often cannot meet real-time requirements in terms of speed.
Disclosure of Invention
In view of the above, embodiments of the present disclosure provide a speech synthesis method, apparatus and computer readable storage medium to overcome the defect of poor real-time speech synthesis performance caused by the limited processing speed of the synthesis model.
In a first aspect, an embodiment of the present disclosure provides a speech synthesis method, which includes: in response to a received target text, performing segmentation processing on the target text to obtain at least two sub-texts corresponding to the target text; generating at least two synthesis subtasks based on the at least two sub-texts, where each synthesis subtask is used to indicate that the corresponding sub-text is to be subjected to speech synthesis to obtain corresponding audio data, and one synthesis subtask corresponds to one sub-text; determining the processing priority of each of the at least two synthesis subtasks based on the request time of the target text and the order of the sub-texts in the target text; and executing the synthesis subtasks based on the processing priorities to obtain the audio data corresponding to the synthesis subtasks.
In a second aspect, an embodiment of the present disclosure provides a speech synthesis apparatus, including: a segmentation module configured to, in response to a received target text, segment the target text to obtain at least two sub-texts corresponding to the target text; a task module configured to generate at least two synthesis subtasks based on the at least two sub-texts, where each synthesis subtask is used to indicate that the corresponding sub-text is to be subjected to speech synthesis to obtain corresponding audio data, and one synthesis subtask corresponds to one sub-text; a priority module configured to determine the processing priority of each of the at least two synthesis subtasks based on the request time of the target text and the order of the sub-texts in the target text; and a speech synthesis module configured to execute the synthesis subtasks according to the determined processing priorities to obtain the audio data corresponding to the synthesis subtasks.
In a third aspect, an embodiment of the present disclosure provides an electronic device, which includes: at least one processor and a memory. The memory stores at least one program that, when executed by the at least one processor, causes the at least one processor to implement a method according to an embodiment of the disclosure.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium having a computer program stored thereon which, when executed by a processor, causes the processor to implement the speech synthesis method according to the first aspect or any one of the embodiments of the first aspect of the present disclosure.
With the speech synthesis method, apparatus and computer-readable storage medium provided by the embodiments of the present disclosure, in response to a received target text, the target text is segmented to obtain at least two sub-texts corresponding to the target text; at least two synthesis subtasks are generated based on the at least two sub-texts, where each synthesis subtask is used to indicate that the corresponding sub-text is to be subjected to speech synthesis to obtain corresponding audio data, and one synthesis subtask corresponds to one sub-text; the processing priority of each of the at least two synthesis subtasks is determined based on the request time of the target text and the order of the sub-texts in the target text; and the synthesis subtasks are executed based on the processing priorities to obtain the audio data corresponding to the synthesis subtasks. Because the target text is segmented into at least two sub-texts and at least two synthesis subtasks are generated, one target text can be delivered in multiple parts according to the synthesis subtasks: as soon as any one synthesis subtask is completed, its corresponding segment of audio data can be output, which improves the real-time performance of speech synthesis.
Drawings
Some specific embodiments of the present disclosure are described in detail below, by way of example and not by way of limitation, with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow chart of a method of speech synthesis according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a segmentation effect according to an embodiment of the disclosure;
FIG. 3 is a schematic diagram of processing priorities according to an embodiment of the present disclosure;
FIG. 4 is a diagram illustrating an effect of a parallel process according to an embodiment of the disclosure;
FIG. 5 is an architecture diagram of a method of speech synthesis according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of a speech synthesis apparatus according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
The following further describes specific implementations of the embodiments of the present disclosure with reference to the drawings of the embodiments of the present disclosure.
Examples
FIG. 1 is a flow chart of a speech synthesis method according to an embodiment of the present disclosure. The speech synthesis method comprises the following steps:
101. In response to the received target text, perform segmentation processing on the target text to obtain at least two sub-texts corresponding to the target text.
In the present disclosure, a text contains, or is used to record, textual information; the target text is the text awaiting speech synthesis, and the number of target texts may be one or more. Speech synthesis refers to generating, from a text, the audio data corresponding to that text, that is, converting the text into audio.
It should be noted that the number of target texts may be at least one, and the segmentation process may be performed on each target text. The target text may be obtained through a speech synthesis request. Optionally, in one example, at least one speech synthesis request is received, where one speech synthesis request corresponds to at least one target text and is used to indicate that the corresponding target text is to be converted into audio data for output. This is intended to be exemplary only.
Optionally, in some embodiments of the present disclosure, the segmentation process includes word segmentation, and performing the segmentation process on the target text to obtain at least two sub-texts corresponding to the target text includes: performing word segmentation on the target text to obtain m phrases, where m is an integer greater than 0; and determining at least two sub-texts corresponding to the target text based on the m phrases. Further optionally, in other embodiments of the present disclosure, the segmentation process further includes recombination, and determining at least two sub-texts corresponding to the target text based on the m phrases includes: recombining the m phrases based on the processing speed of speech synthesis to obtain n sub-texts, where n is an integer greater than 0 and not greater than m. It should be noted that a phrase may be the smallest unit of speech synthesis, a phrase may contain one or more characters, and a sub-text may contain one or more consecutive phrases. In the present disclosure, the text may include characters of various languages, such as Chinese characters, English characters and Japanese characters. Exemplarily, FIG. 2 is a schematic diagram of the segmentation effect according to an embodiment of the present disclosure. FIG. 2 shows a target text 21 containing the characters "the people's republic of china is a great country". After word segmentation of the target text, 6 phrases 22 are obtained: "the people's republic of china", "is", "a", "great", "of" and "country". These phrases are the smallest speech synthesis units and cannot be divided further, so "the people's republic of china" is taken as one sub-text 23, "is" and "a" are combined into one sub-text 23, and "great", "of" and "country" are combined into one sub-text 23, giving 3 sub-texts in total. This is intended to be exemplary only.
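As a concrete illustration of the segmentation and recombination just described, the following Python sketch packs consecutive phrases into sub-texts without ever splitting a phrase. The max_chars budget is a hypothetical stand-in for a threshold derived from the synthesis processing speed, and joining with spaces suits the English rendering of the example; neither detail is specified by the disclosure.

    def regroup(phrases, max_chars=10):
        """Recombine m phrases into n sub-texts (0 < n <= m) by greedily
        packing consecutive phrases; a phrase is the smallest synthesis
        unit and is never split across sub-texts."""
        sub_texts, current, length = [], [], 0
        for phrase in phrases:
            if current and length + len(phrase) > max_chars:
                sub_texts.append(" ".join(current))
                current, length = [], 0
            current.append(phrase)
            length += len(phrase)
        if current:
            sub_texts.append(" ".join(current))
        return sub_texts

    phrases = ["the people's republic of china", "is", "a", "great", "of", "country"]
    print(regroup(phrases))  # 3 sub-texts; the exact grouping depends on the budget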
102. Generate at least two synthesis subtasks based on the at least two sub-texts.
Each synthesis subtask is used to indicate that the corresponding sub-text is to be subjected to speech synthesis to obtain corresponding audio data, and one synthesis subtask corresponds to one sub-text. It should be noted that, if there are multiple target texts, each target text may be segmented to obtain at least one sub-text, and the set of sub-texts obtained by segmenting all the target texts constitutes the at least two sub-texts corresponding to the target texts.
Optionally, in some embodiments of the present disclosure, each sub-text may generate one synthesis subtask, or only a part of the sub-texts may generate synthesis subtasks. Generating at least two synthesis subtasks based on the at least two sub-texts includes: in response to determining that a cache database contains the audio data corresponding to a sub-text, obtaining the audio data from the cache database; and in response to determining that the cache database does not contain the audio data corresponding to a sub-text, generating a synthesis subtask for that sub-text, thereby obtaining at least two synthesis subtasks. If the cache database contains the audio data corresponding to a sub-text, the audio data can be obtained directly from the cache database without generating a synthesis subtask for that sub-text, which reduces the computation and resource occupation; synthesis subtasks are generated only for the sub-texts whose corresponding audio data is not in the cache database. Each sub-text is checked in this way, and at least two synthesis subtasks are obtained.
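The cache-first branch above can be sketched as follows. Here cache is any mapping from sub-text to audio bytes; the disclosure does not name a particular cache database, so the interface is an assumption.

    def build_subtasks(sub_texts, cache):
        audio_pool = {}   # sub-text -> audio data already found in the cache
        subtasks = []     # sub-texts that still need a synthesis subtask
        for sub_text in sub_texts:
            audio = cache.get(sub_text)
            if audio is not None:
                audio_pool[sub_text] = audio  # reuse cached audio, no synthesis
            else:
                subtasks.append(sub_text)     # generate a synthesis subtask
        return audio_pool, subtasks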
103. Determine the processing priority of each of the at least two synthesis subtasks based on the request time of the target text and the order of the sub-texts in the target text.
With reference to the example in step 101, the request time of a target text may be the time at which the speech synthesis request corresponding to that target text is obtained, and the request times of different target texts may be the same or different.
Here, a specific embodiment is described to illustrate how the processing priorities of the at least two synthesis subtasks are determined. Optionally, in some embodiments of the present disclosure, the number of target texts is at least two, and determining the processing priority of each of the at least two synthesis subtasks based on the request time of the target text and the order of the sub-texts in the target text includes:
obtaining the delivery time of each of the at least two synthesis subtasks according to the request time of the target text to which the synthesis subtask belongs and the order of the sub-texts in the target text, where the delivery time of each synthesis subtask indicates the time by which the synthesis subtask should be completed; and determining the processing priorities of the at least two synthesis subtasks based on the delivery times.
Determining the processing priority of a synthesis subtask according to its delivery time allows the subtask that should be delivered earlier to be processed first, improving the real-time performance of speech synthesis.
Building on the foregoing embodiments, a specific implementation is described to explain how the delivery time is determined. Optionally, obtaining the delivery time of each of the at least two synthesis subtasks according to the request time of the target text to which the synthesis subtask belongs and the order of the sub-texts in the target text includes:
for each synthesis subtask, determining the offset time of the sub-text corresponding to the synthesis subtask according to the order of that sub-text in the target text, where the offset time indicates the offset of the time at which the synthesis subtask starts to execute relative to the request time of the target text to which the synthesis subtask belongs; and determining the delivery times of the at least two synthesis subtasks according to the request time of the target text to which each synthesis subtask belongs, the offset time of the sub-text corresponding to the synthesis subtask, and the processing speed of speech synthesis. It should be noted that the offset time of the sub-text corresponding to a synthesis subtask may be the sum of the processing times of all sub-texts preceding that sub-text in the target text to which it belongs.
For example, the processing time may be calculated by dividing the data amount of the sub-text corresponding to the synthesis subtask by the processing speed, and the delivery time of the synthesis subtask may be taken as the sum of this processing time, the request time of the target text to which the synthesis subtask belongs, and the offset time of the sub-text corresponding to the synthesis subtask; this is only an example. Illustratively, FIG. 3 is a schematic diagram of processing priorities according to an embodiment of the disclosure. FIG. 3 shows the speech synthesis requests of three target texts: speech synthesis request A (marked 31), speech synthesis request B (marked 32) and speech synthesis request C (marked 33). The speech synthesis task of each request is divided into 3 synthesis subtasks. The delivery time of synthesis subtask A1 is the earliest, so its processing priority is the highest; the processing time of synthesis subtask A2 is longer, so its priority is lower than that of synthesis subtask B1, whose delivery time is earlier; and the delivery times of synthesis subtasks A2, B2 and C1 are the same, so the processing priorities of A2, B2 and C1 may be the same. Therefore, among all the synthesis subtasks, A1 has the highest processing priority, followed by B1, then A2, B2 and C1, then A3 and B3, then C2, and C3 has the lowest processing priority.
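The delivery-time arithmetic just described can be made concrete with a short sketch: delivery time = request time + offset time (the summed processing times of all preceding sub-texts in the same target text) + the subtask's own processing time, with processing time = data amount / processing speed. Measuring the data amount in characters and the speed in characters per second is an assumption made here for illustration.

    import heapq

    def processing_time(sub_text, chars_per_sec):
        return len(sub_text) / chars_per_sec

    def delivery_time(request_time, preceding, sub_text, chars_per_sec):
        offset = sum(processing_time(s, chars_per_sec) for s in preceding)
        return request_time + offset + processing_time(sub_text, chars_per_sec)

    def prioritize(requests, chars_per_sec=5.0):
        """requests: list of (request_time, [sub_text, ...]), one per target
        text. Earlier delivery time means higher processing priority, so a
        min-heap keyed on delivery time pops subtasks in an order that
        mirrors FIG. 3 (A1, then B1, then A2/B2/C1, ...)."""
        heap = []
        for req_id, (request_time, sub_texts) in enumerate(requests):
            for i, sub_text in enumerate(sub_texts):
                due = delivery_time(request_time, sub_texts[:i], sub_text, chars_per_sec)
                heapq.heappush(heap, (due, req_id, i, sub_text))
        return [heapq.heappop(heap) for _ in range(len(heap))]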
104. Execute the synthesis subtasks based on the processing priorities to obtain the audio data corresponding to the synthesis subtasks.
Optionally, in combination with the specific embodiment of step 103 in which the processing priority is determined according to the delivery time, in some embodiments of the present disclosure the method further includes: when a synthesis subtask is completed and its audio data is obtained, outputting the audio data based on the delivery time of the synthesis subtask. It should be noted that outputting the audio data may mean sending the audio data to the requesting party or directly playing the audio data. Since the processing priority in this embodiment is determined based on the delivery time, the audio data can be output at the delivery time, so that each synthesis subtask delivers and outputs its audio data on time, improving the real-time performance of speech synthesis. With reference to the description of FIG. 3, one synthesis subtask may generate one piece of audio data: the audio data of A1 can be output once A1 is delivered, and while the audio data of A1 is being output, the synthesis subtasks B1, A2 and so on are executed, after which the audio data of B1 and then of A2 can be output. There is no need to wait until the entire target text corresponding to A has been synthesized into one synthesized audio before outputting, which improves the real-time performance of speech synthesis.
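A minimal sketch of delivery-time-driven output follows; send_to_requester is a hypothetical hook standing in for sending the segment to the requesting party or playing it, a choice the disclosure leaves open.

    import time

    def output_at_delivery(audio, deliver_at, send_to_requester):
        now = time.time()
        if now < deliver_at:
            time.sleep(deliver_at - now)  # synthesis finished early: hold until due
        send_to_requester(audio)          # stream this segment without waiting for the rest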
Optionally, in other embodiments of the present disclosure, the audio data corresponding to the sub-texts may instead be spliced into a synthesized audio before being output. In such embodiments, after the synthesis subtasks are executed based on the processing priorities and the audio data corresponding to the synthesis subtasks is obtained, the method further includes: among the at least two synthesis subtasks of the target text, after at least one synthesis subtask is completed and at least one piece of audio data is obtained, performing audio splicing on the at least one piece of audio data of the target text to obtain a synthesized audio, and outputting the synthesized audio. It should be noted that the at least one piece of audio data of the target text here refers to the audio data corresponding to all, or to part, of the sub-texts contained in one target text; in other embodiments, at least two pieces of audio data of multiple target texts may also be spliced into a synthesized audio and output. With reference to the embodiment in step 102, part of the audio data corresponding to the target text may be obtained directly from the cache database and part by performing speech synthesis through the synthesis subtasks; or all of it may be obtained from the cache database; or all of it may be obtained by performing speech synthesis through the synthesis subtasks. Illustratively, the format of the audio data is explained here: as shown in the following table, the audio data may include a Resource Interchange File Format (RIFF) part, a format (fmt) part and a data part.
Part    Fields
RIFF    ID, Size, Type
fmt     ID, Size, Audio Format, Num Channels, Sample Rate, Byte Rate, Block Align, Bits Per Sample
data    ID, Size, data
The RIFF part contains an ID (identification), a Size and a Type. The ID of RIFF indicates that the current part is RIFF, the Size of RIFF indicates the total number of bytes excluding the ID and Size fields, and the Type of RIFF indicates the format of the audio data; for example, the audio format may be the lossless WAV format.
The fmt part may contain an ID, a Size, the Audio Format, the number of channels (Num Channels), the Sample Rate, the number of bytes per second (Byte Rate), the block alignment (Block Align) and the sample width (Bits Per Sample). The ID of fmt indicates that the current part is fmt, and the Size of fmt indicates the number of bytes excluding the ID and Size fields.
The data part may contain an ID, a Size and the data. The ID of the data part indicates that the current part is the data part, the Size of the data part indicates the number of bytes excluding the ID and Size fields, and the data is the audio content.
In the splicing process, the header (bytes 0-44) of the first piece of audio data and the data parts (from byte 44 onward) of the remaining pieces of audio data are taken, and the Size of the RIFF part and the Size of the data part are then recalculated, thereby splicing the audio data. This is intended to be exemplary only.
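The splicing rule reads directly as code for plain PCM WAV files with a single 44-byte header and one data chunk (an assumption; real WAV files may carry additional chunks): keep the first file's header, append the data parts of the rest, then rewrite the RIFF Size and the data Size.

    import struct

    def splice_wav(wav_blobs):
        header = bytearray(wav_blobs[0][:44])              # header of the first file
        data = b"".join(blob[44:] for blob in wav_blobs)   # data parts of all files
        struct.pack_into("<I", header, 4, 36 + len(data))  # RIFF Size = 36 + data bytes
        struct.pack_into("<I", header, 40, len(data))      # data Size = audio content bytes
        return bytes(header) + data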
It should be further noted that, when executing the synthesis subtasks, multiple synthesis subtasks may be processed in parallel by multiple computing nodes to improve processing efficiency. A specific example is described here:
optionally, in the first example, executing the synthesis subtask based on the processing priority to obtain the audio data corresponding to the synthesis subtask includes: and respectively issuing the at least two synthesis subtasks to the at least two computing nodes based on the processing priority, so that the at least two computing nodes execute the synthesis subtasks in parallel and receive the audio data sent by the computing nodes. Exemplarily, as shown in fig. 4, fig. 4 is a schematic diagram illustrating an effect of parallel processing according to an embodiment of the present disclosure, and a synthesis subtask 41 is allocated to 3 computing nodes for parallel processing according to processing priorities, so that efficiency of speech synthesis is improved, and instantaneity of speech synthesis is further improved.
The speech synthesis method described in steps 101-104 above can be deployed on the architecture shown in FIG. 5, which is an architecture diagram of a speech synthesis method according to an embodiment of the disclosure. FIG. 5 shows a calling end 501, a forwarding server 502 and a speech synthesis end 503.
It should be noted that the calling end 501, the forwarding server 502 and the speech synthesis end 503 may be integrated on one device or may be multiple independent devices; the present application does not limit the specific hardware implementation form of the architecture.
The calling end 501 may perform data interaction with a terminal device and may receive a speech synthesis request and a target text sent by the terminal device; the calling end 501 may transmit the target text to the forwarding server 502 according to the speech synthesis request.
The forwarding server 502 may perform word segmentation on the target text to obtain at least two phrases and then recombine the at least two phrases to obtain at least two sub-texts. It searches the cache (i.e., the cache database) for the audio data corresponding to each sub-text; if the audio data already exists, it is obtained directly from the cache without performing speech synthesis and is placed in an audio pool. For each sub-text whose corresponding audio data is not found in the cache, a synthesis subtask is generated, so that at least two synthesis subtasks are obtained and added to a task pool. The task pool may contain the synthesis subtasks of multiple target texts, which are prioritized together for processing. The at least two synthesis subtasks are then issued to the speech synthesis end 503 according to their processing priorities.
In one implementation, the speech synthesis end 503 processes the synthesis subtasks in parallel using multiple computing nodes and transmits the audio data obtained by the synthesis subtasks to the audio pool of the forwarding server 502. The forwarding server 502 splices the audio data according to the synthesis subtasks to obtain the audio data corresponding to the sub-texts, and transmits the corresponding audio data to the calling end 501 according to the delivery time of each synthesis subtask. The calling end 501 transmits the audio data to the terminal device in real time, so that the terminal device can play the audio data in real time; alternatively, after all the audio data of a target text has been transmitted, the terminal device splices the at least one piece of audio data of the target text into a synthesized audio and plays the synthesized audio. In another implementation, the speech synthesis end 503 waits until the audio data of all the sub-texts of a target text has been obtained and splices the at least one piece of audio data of the target text into a synthesized audio, and the calling end 501 transmits the synthesized audio to the terminal device.
With the speech synthesis method of the embodiments of the present disclosure, in response to a received target text, the target text is segmented to obtain at least two sub-texts corresponding to the target text; at least two synthesis subtasks are generated based on the at least two sub-texts, where each synthesis subtask is used to indicate that the corresponding sub-text is to be subjected to speech synthesis to obtain corresponding audio data, and one synthesis subtask corresponds to one sub-text; the processing priority of each of the at least two synthesis subtasks is determined based on the request time of the target text and the order of the sub-texts in the target text; and the synthesis subtasks are executed based on the processing priorities to obtain the audio data corresponding to the synthesis subtasks. Because the target text is segmented into at least two sub-texts and at least two synthesis subtasks are generated, one target text can be delivered in multiple parts according to the synthesis subtasks: as soon as any one synthesis subtask is completed, its corresponding segment of audio data can be output, which improves the real-time performance of speech synthesis.
The embodiment of the present disclosure provides a speech synthesis apparatus for performing the speech synthesis method described in the above embodiment, and as shown in fig. 6, the speech synthesis apparatus 60 includes:
the segmentation module 601 is configured to respond to the received target text and perform segmentation processing on the target text to obtain at least two sub-texts corresponding to the target text;
a task module 602 configured to generate at least two synthesis subtasks based on the at least two sub-texts, where each synthesis subtask is used to instruct that the corresponding sub-text is subjected to speech synthesis to obtain corresponding audio data, and one synthesis subtask corresponds to one sub-text;
a priority module 603 configured to determine the processing priority of each of the at least two synthesis subtasks based on the request time of the target text and the order of the sub-texts in the target text;
and the speech synthesis module 604 is configured to execute the synthesis subtask based on the processing priority to obtain the audio data corresponding to the synthesis subtask.
Optionally, in some embodiments of the present disclosure, the number of target texts is at least two, and the priority module 603 is configured to obtain the delivery time of each of the at least two synthesis subtasks according to the request time of the target text to which the synthesis subtask belongs and the order of the sub-texts in the target text, where the delivery time of each synthesis subtask indicates the time by which the synthesis subtask should be completed; and to determine the processing priorities of the at least two synthesis subtasks based on the delivery times.
Optionally, in some embodiments of the present disclosure, the priority module 603 is configured to, for each synthesis subtask, determine the offset time of the sub-text corresponding to the synthesis subtask according to the order of that sub-text in the target text, where the offset time indicates the offset of the time at which the synthesis subtask starts to execute relative to the request time of the target text to which the synthesis subtask belongs; and to determine the delivery time of the synthesis subtask according to the request time of the target text to which the synthesis subtask belongs, the offset time of the sub-text corresponding to the synthesis subtask, and the processing speed of speech synthesis.
Optionally, in some embodiments of the present disclosure, the speech synthesis module 604 is further configured to output the audio data based on a delivery time of the synthesis subtask when the synthesis subtask is completed and the audio data of the synthesis subtask is obtained.
Optionally, in some embodiments of the present disclosure, the segmentation process includes word segmentation, and the task module 602 is configured to perform word segmentation on the target text to obtain m phrases, where m is an integer greater than 0, and to determine at least two sub-texts corresponding to the target text based on the m phrases.
Optionally, in some embodiments of the present disclosure, the segmentation process further includes recombination, and the task module 602 is configured to recombine the m phrases based on the processing speed of speech synthesis to obtain n sub-texts, where n is an integer greater than 0 and not greater than m.
Optionally, in some embodiments of the present disclosure, the task module 602 is configured to, in response to determining that the cache database contains audio data corresponding to the sub-text, obtain the audio data from the cache database; and in response to determining that the cache database does not contain audio data corresponding to the sub-text, generating a synthesis subtask for the sub-text; at least two composition subtasks are obtained.
Optionally, in some embodiments of the present disclosure, the speech synthesis module 604 is further configured to store the audio data corresponding to the synthesis subtasks in a cache database.
Optionally, in some embodiments of the present disclosure, the speech synthesis module 604 is configured to, after at least one of the at least two synthesis subtasks of the target text is completed to obtain at least one piece of audio data, perform audio splicing on the at least one piece of audio data of the target text to obtain a synthesized audio, and output the synthesized audio.
Optionally, in some embodiments of the present disclosure, the speech synthesis module 604 is configured to issue the at least two synthesis subtasks to the at least two computing nodes, respectively, based on the processing priorities, so that the at least two computing nodes execute the synthesis subtasks in parallel, and receive the audio data sent by the computing nodes.
With the speech synthesis apparatus of the embodiments of the present disclosure, in response to a received target text, the target text is segmented to obtain at least two sub-texts corresponding to the target text; at least two synthesis subtasks are generated based on the at least two sub-texts, where each synthesis subtask is used to indicate that the corresponding sub-text is to be subjected to speech synthesis to obtain corresponding audio data, and one synthesis subtask corresponds to one sub-text; the processing priority of each of the at least two synthesis subtasks is determined based on the request time of the target text and the order of the sub-texts in the target text; and the synthesis subtasks are executed based on the processing priorities to obtain the audio data corresponding to the synthesis subtasks. Because the target text is segmented into at least two sub-texts and at least two synthesis subtasks are generated, one target text can be delivered in multiple parts according to the synthesis subtasks: as soon as any one synthesis subtask is completed, its corresponding segment of audio data can be output, which improves the real-time performance of speech synthesis.
Based on the speech synthesis method described in the foregoing embodiments, an embodiment of the present disclosure provides an electronic device for executing the speech synthesis method. As shown in FIG. 7, the electronic device 70 includes: at least one processor 702, a memory 704, a bus 706 and a communication interface 708.
Wherein:
the processor 702, communication interface 708, and memory 704 communicate with one another via a communication bus 706.
A communication interface 708 for communicating with other devices.
The processor 702 is configured to execute the program 710, and may specifically execute the relevant steps in the method described in the foregoing embodiment.
In particular, the program 710 may include program code that includes computer operating instructions.
The processor 702 may be a central processing unit (CPU), an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present disclosure. The electronic device comprises one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
The memory 704 is used for storing a program 710. The memory 704 may comprise high-speed RAM memory, and may also include non-volatile memory, such as at least one disk memory.
With the electronic device of the embodiments of the present disclosure, in response to a received target text, the target text is segmented to obtain at least two sub-texts corresponding to the target text; at least two synthesis subtasks are generated based on the at least two sub-texts, where each synthesis subtask is used to indicate that the corresponding sub-text is to be subjected to speech synthesis to obtain corresponding audio data, and one synthesis subtask corresponds to one sub-text; the processing priority of each of the at least two synthesis subtasks is determined based on the request time of the target text and the order of the sub-texts in the target text; and the synthesis subtasks are executed based on the processing priorities to obtain the audio data corresponding to the synthesis subtasks. Because the target text is segmented into at least two sub-texts and at least two synthesis subtasks are generated, one target text can be delivered in multiple parts according to the synthesis subtasks: as soon as any one synthesis subtask is completed, its corresponding segment of audio data can be output, which improves the real-time performance of speech synthesis.
Based on the speech synthesis method described in the above embodiments, the embodiments of the present disclosure provide a computer-readable storage medium having stored thereon a computer program, which, when executed by a processor, causes the processor to implement the speech synthesis method as described in the embodiments of the present disclosure.
The speech synthesis apparatus of the embodiments of the present disclosure exists in various forms, including but not limited to:
(1) A mobile communication device: such a device is characterized by mobile communication capabilities and is primarily aimed at providing voice and data communications. Such terminals include: smart phones (e.g., iPhones), multimedia phones, functional phones, low-end phones, and the like.
(2) An ultra-mobile personal computer device: such a device belongs to the category of personal computers, has computing and processing functions, and generally also has mobile internet access. Such terminals include: PDA, MID and UMPC devices, etc., such as iPads.
(3) A portable entertainment device: such a device can display and play multimedia content. This type of device includes: audio and video players (e.g., iPods), handheld game consoles, electronic books, smart toys and portable car navigation devices.
(4) And other electronic equipment with data interaction function.
Thus, particular embodiments of the present subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may be advantageous.
The apparatuses and modules illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. The functionality of the various elements may be implemented in the same one or more software and/or hardware implementations in practicing the disclosure.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The disclosure may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The present disclosure may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only an example of the present disclosure and is not intended to limit the present disclosure. Various modifications and variations of this disclosure will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure should be included in the scope of the claims of the present disclosure.

Claims (13)

1. A method of speech synthesis, comprising:
in response to a received target text, performing segmentation processing on the target text to obtain at least two sub-texts corresponding to the target text;
generating at least two synthesis subtasks based on the at least two sub-texts, wherein each synthesis subtask is used for indicating that the corresponding sub-text is subjected to speech synthesis to obtain corresponding audio data, and one synthesis subtask corresponds to one sub-text;
determining a processing priority for each of the at least two synthesis subtasks based on a request time of the target text and an order of the sub-texts in the target text;
and executing the synthesis subtask based on the processing priority to obtain audio data corresponding to the synthesis subtask.
2. The method of claim 1, wherein the number of target texts is at least two, and wherein determining the processing priority of each of the at least two synthesis subtasks based on the request time of the target text and the order of the sub-texts in the target text comprises:
obtaining a delivery time of each of the at least two synthesis subtasks according to the request time of the target text to which the synthesis subtask belongs and the order of the sub-texts in the target text, wherein the delivery time of each synthesis subtask is used for indicating the time for completing the synthesis subtask;
determining the processing priorities of the at least two synthesis subtasks based on the delivery times.
3. The method of claim 2, wherein obtaining the delivery time of each of the at least two synthesis subtasks according to the request time of the target text to which the synthesis subtask belongs and the order of the sub-texts in the target text comprises:
for each synthesis subtask, determining an offset time of the sub-text corresponding to the synthesis subtask according to the order of the sub-text in the target text, wherein the offset time is used to indicate an offset of the time at which the synthesis subtask starts to execute relative to the request time of the target text to which the synthesis subtask belongs;
and determining the delivery time of the synthesis subtask according to the request time of the target text to which the synthesis subtask belongs, the offset time of the sub-text corresponding to the synthesis subtask, and the processing speed of the speech synthesis.
4. The method of claim 2, further comprising:
and outputting the audio data based on the delivery time of the synthesis subtask when the synthesis subtask is completed and the audio data of the synthesis subtask is obtained.
5. The method of claim 1, wherein the segmentation process comprises word segmentation,
and wherein segmenting the target text to obtain at least two sub-texts corresponding to the target text comprises:
performing word segmentation on the target text to obtain m phrases, wherein m is an integer greater than 0;
and determining at least two sub-texts corresponding to the target text based on the m phrases.
6. The method of claim 5, wherein said segmentation process further comprises recombination,
and wherein determining at least two sub-texts corresponding to the target text based on the m phrases comprises:
recombining the m phrases based on the processing speed of the speech synthesis to obtain n sub-texts, wherein n is an integer greater than 0 and not greater than m.
7. The method of claim 1, wherein generating at least two synthesis subtasks based on the at least two sub-texts comprises:
in response to determining that a cache database contains the audio data corresponding to the sub-text, obtaining the audio data from the cache database; and,
in response to determining that the cache database does not contain the audio data corresponding to the sub-text, generating a synthesis subtask for the sub-text; and obtaining the at least two synthesis subtasks.
8. The method of claim 7, wherein after the synthesis subtask is executed based on the processing priority and the audio data corresponding to the synthesis subtask is obtained, the method further comprises:
and storing the audio data corresponding to the synthesis subtasks in the cache database.
9. The method of claim 1, wherein after the synthesis subtask is executed based on the processing priority and the audio data corresponding to the synthesis subtask is obtained, the method further comprises:
among the at least two synthesis subtasks of the target text, after at least one synthesis subtask is completed to obtain at least one piece of audio data, performing audio splicing on the at least one piece of audio data of the target text to obtain a synthesized audio, and outputting the synthesized audio.
10. The method according to any of claims 1-9, wherein said executing the synthesis subtask based on the processing priority to obtain the audio data corresponding to the synthesis subtask comprises:
issuing the at least two synthesis subtasks to at least two computing nodes respectively based on the processing priority, so that the at least two computing nodes execute the synthesis subtasks in parallel, and receiving the audio data sent by the computing nodes.
11. A speech synthesis apparatus, comprising:
the segmentation module is configured to respond to a received target text, and segment the target text to obtain at least two sub-texts corresponding to the target text;
the task module is configured to generate at least two synthesis subtasks based on the at least two sub-texts, wherein each synthesis subtask is used for indicating that the corresponding sub-text is subjected to speech synthesis to obtain corresponding audio data, and one synthesis subtask corresponds to one sub-text;
a priority module configured to determine a processing priority of each of the at least two synthesis subtasks based on a request time of the target text and an order of the sub-texts in the target text;
and the voice synthesis module is configured to execute the synthesis subtask based on the processing priority to obtain audio data corresponding to the synthesis subtask.
12. An electronic device, comprising:
at least one processor; and
a memory,
wherein the memory stores at least one program that, when executed by the at least one processor, causes the at least one processor to implement the speech synthesis method of any one of claims 1-10.
13. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, causes the processor to carry out the speech synthesis method according to any one of claims 1-10.
CN202110640245.0A 2021-06-09 2021-06-09 Speech synthesis method, apparatus and computer readable storage medium Active CN113096637B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110640245.0A CN113096637B (en) 2021-06-09 2021-06-09 Speech synthesis method, apparatus and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110640245.0A CN113096637B (en) 2021-06-09 2021-06-09 Speech synthesis method, apparatus and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN113096637A true CN113096637A (en) 2021-07-09
CN113096637B CN113096637B (en) 2021-11-02

Family

ID=76664563

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110640245.0A Active CN113096637B (en) 2021-06-09 2021-06-09 Speech synthesis method, apparatus and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113096637B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070106513A1 (en) * 2005-11-10 2007-05-10 Boillot Marc A Method for facilitating text to speech synthesis using a differential vocoder
US10546573B1 (en) * 2014-03-21 2020-01-28 Amazon Technologies, Inc. Text-to-speech task scheduling
CN112581934A (en) * 2019-09-30 2021-03-30 北京声智科技有限公司 Voice synthesis method, device and system
CN111883100A (en) * 2020-07-22 2020-11-03 马上消费金融股份有限公司 Voice conversion method, device and server
CN112863479A (en) * 2021-01-05 2021-05-28 杭州海康威视数字技术股份有限公司 TTS voice processing method, device, equipment and system

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114898733A (en) * 2022-05-06 2022-08-12 深圳妙月科技有限公司 AI voice data analysis processing method and system

Also Published As

Publication number Publication date
CN113096637B (en) 2021-11-02

Similar Documents

Publication Publication Date Title
WO2015098306A1 (en) Response control device and control program
JP6971292B2 (en) Methods, devices, servers, computer-readable storage media and computer programs for aligning paragraphs and images
US10149239B2 (en) System and method for the reception of content items
CN108877800A (en) Voice interactive method, device, electronic equipment and readable storage medium storing program for executing
CN113096637B (en) Speech synthesis method, apparatus and computer readable storage medium
CN112509562A (en) Method, apparatus, electronic device and medium for text post-processing
WO2015106646A1 (en) Method and computer system for performing audio search on social networking platform
CN112735372A (en) Outbound voice output method, device and equipment
CN113407767A (en) Method and device for determining text relevance, readable medium and electronic equipment
CN113079201B (en) Information processing system, method, device and equipment
CN111508478A (en) Speech recognition method and device
CN110909527A (en) Text processing model operation method and device, electronic equipment and storage medium
CN112446208A (en) Method, device and equipment for generating advertisement title and storage medium
JP2021108095A (en) Method for outputting information on analysis abnormality in speech comprehension
CN111460211A (en) Audio information playing method and device and electronic equipment
CN113191257B (en) Order of strokes detection method and device and electronic equipment
CN115422928A (en) Message generation method and device, storage medium and electronic equipment
CN111652002B (en) Text division method, device, equipment and computer readable medium
CN114566173A (en) Audio mixing method, device, equipment and storage medium
KR20220137939A (en) Unsupervised Singing Speech Through a Pitch Hostile Network
CN109814916B (en) IVR flow configuration method, device, storage medium and server
CN113360704A (en) Voice playing method and device and electronic equipment
CN112820280A (en) Generation method and device of regular language model
CN108734149B (en) Text data scanning method and device
CN112017685A (en) Voice generation method, device, equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant