CN111105779A - Text playing method and device for mobile client - Google Patents

Text playing method and device for mobile client

Info

Publication number
CN111105779A
Authority
CN
China
Prior art keywords
voice
playing
text
played
synthesized
Prior art date
Legal status
Granted
Application number
CN202010000741.5A
Other languages
Chinese (zh)
Other versions
CN111105779B (en)
Inventor
胡帅君
李世龙
林喜
闫腾
李明辉
Current Assignee
Beibei (Qingdao) Technology Co.,Ltd.
Original Assignee
Databaker Beijng Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Databaker Beijng Technology Co ltd
Priority to CN202010000741.5A
Publication of CN111105779A
Application granted
Publication of CN111105779B
Status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047: Architecture of speech synthesisers

Abstract

An embodiment of the invention provides a text playing method and apparatus for a mobile client, a mobile client, and a storage medium, where the text includes a plurality of text sentences. The method comprises the following steps: playing a first played voice determined according to a first synthesized voice synthesized by a server in real time, where the first synthesized voice corresponds to a first text sentence in a text play request; when the first played voice starts to be played, sending the next text sentence after the first text sentence to the server so that the server synthesizes in real time a second synthesized voice corresponding to that sentence; receiving the second synthesized voice returned by the server; determining a second played voice based on the second synthesized voice; storing the second played voice into a playlist, where the playlist is used for storing played voices in order; and playing the second played voice once all the played voices determined according to the first synthesized voice have been played. The scheme achieves real-time, uninterrupted playback while a long text is being synthesized into speech.

Description

Text playing method and device for mobile client
Technical Field
The present invention relates to the technical field of text-to-speech (TTS), and more particularly, to a text playing method and apparatus for a mobile client, and a storage medium.
Background
Text-to-speech technology converts text information into voice information and can provide speech synthesis services to large numbers of users and third-party applications. Building on such services, practical application scenarios have gradually emerged in the market; for example, speech synthesis lets a user tell stories to a child in the user's own voice, or hear navigation prompts read in the user's own voice.
Given the limitations of mobile clients in network connectivity, storage resources, and the like, most existing scenarios in which a mobile client plays speech synthesized in real time involve single sentences, for example, human-machine interaction with intelligent voice assistants such as Siri or XiaoAI. Such interaction usually takes the form of question answering: each time a query is received, the assistant retrieves the corresponding text sentence from a database, converts it into synthesized speech through real-time speech synthesis, and then plays it. When an entire long text needs to be played on a mobile client, the prior art generally first obtains the synthesized speech for the whole text and then plays it in response to a play request.
For a long text, how to play speech on the mobile client in real time and without interruption while speech synthesis is still in progress therefore becomes a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The present invention has been made in view of the above problems.
According to an aspect of the present invention, there is provided a text playing method for a mobile client, where the text includes a plurality of text sentences, the method including:
S100, playing a first played voice determined according to a first synthesized voice synthesized by a server in real time, where the first synthesized voice corresponds to a first text sentence in a text play request;
S200, when the first played voice starts to be played, sending the next text sentence after the first text sentence to the server, so that the server synthesizes in real time a second synthesized voice corresponding to the next text sentence;
S300, receiving the second synthesized voice returned by the server;
S400, determining a second played voice based on the second synthesized voice;
S500, storing the second played voice into a playlist, where the playlist is used for storing played voices in order;
S600, playing the second played voice once all the played voices determined according to the first synthesized voice have been played.
Exemplarily, the step S400 includes: converting the second synthesized voice into one or more second played voices having a fixed duration;
the step S500 includes: sequentially storing the one or more second played voices into corresponding positions of the playlist;
the method further comprises the following steps:
calculating the played duration according to the position, in the playlist, of the second played voice currently being played;
calculating the buffered duration according to all the played voices stored in the playlist;
determining the total playing duration of the text;
determining the playing progress of the text based on the ratio of the played duration to the total playing duration;
determining the buffering progress of the text based on the ratio of the buffered duration to the total playing duration;
and displaying the playing progress and the buffering progress while playing the second played voice.
Illustratively, the converting the second synthesized speech into one or more second played speeches having a fixed duration comprises:
acquiring the sampling frequency of the second synthesized voice;
calculating the data volume of the voice with the fixed duration based on the sampling frequency;
and dividing the second synthesized voice according to the data volume to obtain a second played voice.
Illustratively, the dividing the second synthesized speech according to the data amount to obtain a second played speech includes:
segmenting the second synthesized voice according to the data volume from the initial position of the second synthesized voice until the remaining voice is less than the data volume;
adding a mute section to the remaining speech to extend its duration to the fixed duration, and using the padded remainder as one of the second played voices.
Illustratively, the determining the total playing time length of the text comprises:
judging whether the synthesized voice corresponding to all text sentences of the text is received at present;
in the case where the synthesized speech corresponding to all text sentences has not been received at present,
acquiring the total word count of the text and the speech rate of the received synthesized voice;
multiplying the total word count by the speech rate to obtain the total playing duration of the text;
in the case where synthesized speech corresponding to all text sentences has been currently received,
and determining the total playing time length of the text according to the time lengths of the synthesized voice corresponding to all the text sentences.
Illustratively, displaying the playing progress while playing the second played voice further comprises: updating the playing progress at a fixed frequency.
Illustratively, the method further comprises: and responding to the instruction for adjusting the playing progress, and re-determining and playing the playing voice to be played currently.
Illustratively, the re-determining and playing the playing voice to be played currently in response to the instruction for adjusting the playing progress includes:
acquiring the adjustment progress in the instruction for adjusting the playing progress;
judging whether the play list contains play voice meeting the adjustment progress or not;
if so, determining the playing voice meeting the adjustment progress as the playing voice to be played currently and playing;
if not, the currently played playing voice is not changed.
Illustratively, the method further comprises:
acquiring background sounds corresponding to background objects in the text playing request;
segmenting the background sound into background units with the fixed duration;
the converting the second synthesized voice into one or more second played voices having a fixed duration comprises:
segmenting the second synthesized speech into one or more synthesis units having the fixed duration;
and combining the synthesis unit with the corresponding background unit to generate the second playing voice.
Illustratively, said merging the synthesis unit with the corresponding background unit to generate the second played speech comprises:
determining respective weights of the synthesis unit and the corresponding background unit;
and weighting and summing the synthesis unit and the corresponding background unit according to the respective weights to obtain the second played voice.
Illustratively, the second played back voice is data in a PCM encoding format, and the fixed duration of the second played back voice is 1 to 2 seconds.
Illustratively, the steps S100 to S600 are executed when the play type specified in the text play request is personalized play;
the method further comprises the following steps:
and when the play type specified in the text play request is original-sound play, playing a standard voice that corresponds to the text and has been synthesized in advance by the server.
According to another aspect of the present invention, there is also provided a text playing apparatus for a mobile client, including:
the first playing unit is used for playing first playing voice determined according to first synthesized voice synthesized by the server in real time, wherein the first synthesized voice corresponds to a first text sentence in the text playing request;
a text sentence sending unit, configured to send a next text sentence after the first text sentence to the server while starting to play the first played voice, so that the server synthesizes, in real time, a second synthesized voice corresponding to the next text sentence;
a voice receiving unit, configured to receive the second synthesized voice returned by the server;
a voice determination unit configured to determine a second played voice based on the second synthesized voice;
a storage unit, configured to store the second played voice in a playlist, where the playlist is configured to store the played voices in sequence;
and the second playing unit is used for playing the second playing voice under the condition that all the playing voices determined according to the first synthesized voice are played.
According to still another aspect of the present invention, there is also provided a mobile client, including: a processor and a memory, wherein the memory has stored therein computer program instructions for executing the text playback method described above when executed by the processor.
According to yet another aspect of the present invention, there is also provided a storage medium having stored thereon program instructions for executing the above-described text playback method when executed.
According to the technical solution of the embodiments of the present invention, real-time, uninterrupted playback is achieved while a long text is being synthesized into speech. By sending the text sentences of the text to the server one by one for real-time speech synthesis, the long user wait that synthesizing a large text in one pass would cause is avoided. Continuous playback is achieved by sending the next text sentence to the server for synthesis at the moment the voice of the previous sentence starts playing, and storing the resulting voice in the playlist.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in more detail embodiments of the present invention with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 shows a schematic flow diagram of a text playback method for a mobile client according to one embodiment of the present invention;
FIG. 2 shows a schematic flow diagram for providing a play progress of text according to one embodiment of the invention;
FIG. 3 shows a schematic flow diagram for converting a second synthesized voice into a second played voice according to one embodiment of the present invention;
FIG. 4 shows a schematic flow chart diagram for determining a total length of play according to one embodiment of the present invention;
FIG. 5 shows a schematic flow diagram for adjusting playback progress according to one embodiment of the invention;
FIG. 6 is a diagram illustrating conversion of a second synthesized voice into a second played voice according to still another embodiment of the present invention;
FIG. 7 shows a schematic flow chart diagram of generating a second playback voice according to yet another embodiment of the present invention;
FIG. 8 shows an architectural diagram of a player according to one embodiment of the invention;
fig. 9 shows a schematic block diagram of a text playback apparatus according to an embodiment of the present invention;
fig. 10 shows a schematic block diagram of a mobile client according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, exemplary embodiments according to the present invention will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of embodiments of the invention and not all embodiments of the invention, with the understanding that the invention is not limited to the example embodiments described herein. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the invention described herein without inventive step, shall fall within the scope of protection of the invention.
The method and apparatus are suitable for scenarios in which a mobile client plays text that is synthesized into speech in real time. The mobile client can be any of various intelligent devices, such as a smartphone, a tablet computer, or a notebook computer. The mobile client may communicate with the server over a wired or wireless connection and, in response to a user's text play request, plays the corresponding speech synthesized by the server. The text may contain a plurality of text sentences. A text sentence may be a sentence ending with a period, a portion of a sentence ending with a punctuation mark such as a comma or semicolon, or several sentences together. In short, the text is not a single short sentence but a passage of some length, such as a story, a poem, or a book.
Fig. 1 shows a schematic flow diagram of a text playback method for a mobile client according to one embodiment of the present invention. As shown in fig. 1, the text playback method includes the following steps.
And S100, playing first playing voice determined according to the first synthesized voice synthesized by the server in real time, wherein the first synthesized voice corresponds to a first text sentence in the text playing request.
As described above, the mobile client is configured to play the voice corresponding to the text in response to a text play request issued by the user. The text playing request may include information of a text to be played. And the mobile client sends a plurality of text sentences contained in the text to a remote server one by one in sequence for real-time voice synthesis based on the information of the text in the text playing request. The remote server performs speech synthesis in units of text sentences.
The first text sentence may be any text sentence except the last text sentence in the text, such as the first text sentence, the second text sentence, or the nth text sentence in the text. The first synthesized voice may be obtained by performing real-time voice synthesis on the first text sentence by the remote server, and corresponds to the first text sentence. The first played voice determined according to the first synthesized voice may be the first synthesized voice itself, or may be a voice obtained by processing the first synthesized voice.
In step S100, the mobile client plays a first played voice, which is determined according to a first synthesized voice received from the server and synthesized in real time.
And S200, when the first playing voice starts to be played, sending a next text sentence after the first text sentence to the server so as to enable the server to synthesize a second synthesized voice corresponding to the next text sentence in real time.
The next text sentence is a text sentence that is ordered after and adjacent to the first text sentence in the text. For example, when a first text sentence is a first sentence in text, the next text sentence is a second sentence in text; when the first text sentence is the second sentence in the text, the next text sentence is the third sentence in the text; by analogy, when the first text sentence is the nth text sentence in the text, the next text sentence is the (n + 1) th text sentence.
Those skilled in the art understand that the server needs a certain synthesis time to synthesize a text sentence into the corresponding voice in real time, and the mobile client needs a certain playing time to play the synthesized voice. For the same text sentence, the synthesis time is far shorter than the playing time; even when text sentences differ in length, the synthesis time can generally be guaranteed to be far shorter than the playing time. Therefore, by sending the next text sentence to the server for synthesis at the moment the first played voice starts playing, the server can finish synthesizing the next sentence before the played voices determined from the first synthesized voice finish playing. As a result, for a text containing a plurality of text sentences, the user only waits once, before the first sentence of the text is played, for the remote server to synthesize the corresponding speech in real time. Once playback has started, because the synthesis time of each subsequent sentence is shorter than the playing time of the previous one, real-time synthesis of the next sentence completes while the previous sentence is still playing, and no further waiting occurs.
And S300, receiving the second synthesized voice returned by the server. The second synthesized voice is the voice synthesized by the server in real time and corresponding to the next text sentence sent in step S200.
And S400, determining a second playing voice based on the second synthesized voice. Illustratively, the second played speech is the second synthesized speech itself. Alternatively, the second played speech is speech obtained after processing the second synthesized speech. The processing operation may include a process of performing a filtering operation on the second synthesized speech, and the present invention is not limited to a specific processing operation.
And S500, storing the second playing voice into a play list, wherein the play list is used for storing the playing voice in sequence.
The above-described playlist may be used to sequentially buffer the played voices that the mobile client has determined. Illustratively, the playlist may operate on a first-in-first-out basis, ordered by the time at which each played voice was determined. For example, if the first played voice is determined from the first synthesized voice before the second played voice is determined from the second synthesized voice, the second played voice is stored after the first played voice in the playlist.
S600, under the condition that all the played voices determined according to the first synthesized voice are played, the second played voice is played.
Since the voices in the play list are stored in order, they can be played in order when played. That is, in the case where all the played voices determined based on the first synthesized voice are played, the second played voice determined based on the second synthesized voice is played next. It will be appreciated that continuous uninterrupted play can be achieved as long as there is unplayed speech in the playlist.
Through the above embodiment, the invention achieves real-time, uninterrupted playback while a long text is being synthesized into speech. By sending the text sentences of the text to the server one by one for real-time synthesis, the long user wait that synthesizing a large text in one pass would cause is avoided. Continuous playback is achieved by sending the next text sentence to the server for synthesis at the moment the voice of the previous sentence starts playing, and storing the resulting voice in the playlist.
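The patent text gives no code, but the flow of steps S100 to S600 can be illustrated with a minimal Kotlin sketch. The SynthesisClient and PcmPlayer interfaces below are hypothetical stand-ins for the server API and the audio output, not part of the patent; the sketch shows the producer/consumer structure in which the next sentence is synthesized while the previous voice plays.

```kotlin
import java.util.concurrent.LinkedBlockingQueue
import kotlin.concurrent.thread

// Hypothetical server API: synthesize() blocks until PCM bytes are returned.
interface SynthesisClient { fun synthesize(sentence: String): ByteArray }

// Hypothetical audio sink: write() blocks while the chunk is being played.
interface PcmPlayer { fun write(chunk: ByteArray) }

class StreamingTextPlayer(
    private val client: SynthesisClient,
    private val player: PcmPlayer
) {
    // The playlist: an ordered buffer of played voices (S500).
    private val playlist = LinkedBlockingQueue<ByteArray>()

    fun play(sentences: List<String>) {
        // Producer: send sentences one by one for real-time synthesis
        // (S200-S300). Because synthesis is far faster than playback,
        // the queue stays ahead of the player.
        thread {
            for (sentence in sentences) {
                playlist.put(client.synthesize(sentence)) // S400/S500
            }
        }
        // Consumer: play the buffered voices strictly in stored order
        // (S100/S600); take() blocks only if the buffer runs dry.
        thread {
            repeat(sentences.size) {
                player.write(playlist.take())
            }
        }
    }
}
```

Because each synthesized sentence is enqueued before the previous one finishes playing, the consumer thread never blocks after the initial sentence, which is exactly the uninterrupted-playback property described above.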
In the above step S400, the second played voice is determined based on the second synthesized voice. In one example, the above steps include: s410, converting the second synthesized voice into one or more second playing voices with fixed time length.
In this example, the second synthesized voice may be padded, divided, and so on, to obtain one or more second played voices of fixed duration. The fixed duration may be any duration set as needed, for example 1 second, 2 seconds, or 5 seconds. It will be appreciated that the second synthesized voice has a particular utterance duration related to the word count of the corresponding text sentence. When the utterance duration of the second synthesized voice is shorter than the fixed duration, silence can be appended so that the total duration of the second synthesized voice plus the appended silence equals the fixed duration. When the utterance duration is longer than the fixed duration, the second synthesized voice can be divided into second played voices each having the fixed duration. For example, if the utterance duration of the second synthesized voice is 10 seconds, it may be divided into five played voices of 2 seconds each, or into ten played voices of 1 second each.
On the basis of step S410, the step S500 storing the second played voice in a playlist includes: s510, storing the one or more second playing voices in the corresponding positions of the play list in sequence.
As described above, the playlist stores played voices in order. Accordingly, when the second synthesized voice is converted into one or more second played voices, those voices may be stored into the corresponding positions of the playlist in the order of their positions within the second synthesized voice. For example, if the second played voices converted from the second synthesized voice are S1, S2, S3, and S4, they may be stored, in that order, into the empty positions at the current end of the playlist.
In a text playing method according to another embodiment of the present invention, a playing progress of a text can be provided to a user. Fig. 2 shows a schematic flowchart for providing a progress of a text in a text playing method according to another embodiment of the present invention. On the basis of step S410 and step S510, as shown in fig. 2, providing the progress of the playing of the text includes the following steps.
And S710, calculating the played time length according to the position of the currently played second played voice in the playlist.
The played duration refers to how much of the current text's speech has been played, counted from the first sentence. Since the played voices in the playlist are stored in order, the played duration can be calculated from the position, in the playlist, of the second played voice currently being played. For example, suppose the played voices in the playlist are numbered from 1 by storage position and each corresponds to a fixed duration. Then the sequence number of the currently playing second played voice multiplied by the fixed duration gives the played duration. For instance, if the sequence number of the currently playing voice is 50 and the fixed duration of each played voice is 2 s, the played duration is 50 × 2 = 100 s.
In addition, before step S710, the buffered duration may be calculated from all the played voices stored in the playlist. For example, if 100 played voices are currently stored in the playlist and the fixed duration of each is 2 s, the current buffered duration is 100 × 2 = 200 s.
S720, determining the total playing time length of the text.
The total playing duration of the text refers to the duration required to play the complete text. It will be appreciated that the total duration of play is related to the number of words of the text. The more the number of words, the longer the total playing time; the fewer the number of words, the shorter the total duration of the play.
And S730, determining the playing progress of the text based on the ratio of the played time length to the total playing time length. The ratio represents the ratio of the duration of the played speech to the duration of the speech corresponding to the entire text, and therefore, it can represent the current playing progress of the text.
In addition, the buffering progress of the text can be determined based on the ratio of the buffered duration to the total playing duration. This ratio represents the fraction of the whole text's speech that has already been determined and buffered, and can therefore represent the current buffering progress of the text.
And S740, displaying the playing progress while playing the second playing voice. Illustratively, the progress of the play may be displayed using an operable control of the human-machine interface, e.g., a slider bar. In addition, the playing progress and the buffering progress can be displayed while the second playing voice is played. Illustratively, the buffering progress may be displayed with a slider bar of a different color than the playing progress.
It is to be understood that although the above operation is performed for the second played voice, in practice, the above operation may be performed for each synthesized voice. By converting the synthesized speech into one or more played speeches having a fixed time length and sequentially storing the played speeches in a play list, it is advantageous to accurately and quickly calculate the play time length. Because the duration of each playing voice is fixed, the playing time period corresponding to each playing voice can be directly calculated as long as the position of the playing voice in the play list is obtained. Compared with directly storing the synthesized voice, storing the played voice with fixed time length can more simply and rapidly acquire the played time length information. Furthermore, the playing progress of the text is displayed based on the playing duration, so that a visual display effect can be provided for a user, and the user experience is improved.
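As a minimal sketch of this bookkeeping (the class and member names are illustrative, not from the patent), the played and buffered durations follow directly from chunk positions when every chunk has the same fixed duration:

```kotlin
// Progress bookkeeping per steps S710-S740, assuming equal-length chunks.
class ProgressTracker(private val chunkSeconds: Int) {
    var bufferedChunks = 0  // number of played voices stored in the playlist
    var currentIndex = 0    // 1-based sequence number of the chunk now playing

    fun playedSeconds() = currentIndex * chunkSeconds      // S710
    fun bufferedSeconds() = bufferedChunks * chunkSeconds

    // totalSeconds is supplied by step S720 (estimated or exact).
    fun playProgress(totalSeconds: Int) =
        playedSeconds().toDouble() / totalSeconds          // S730
    fun cacheProgress(totalSeconds: Int) =
        bufferedSeconds().toDouble() / totalSeconds
}
```

With 2-second chunks and the 50th chunk playing, playedSeconds() yields the 100 s of the example above.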
As described above, step S410 converts the second synthesized voice into one or more second played voices having a fixed duration. FIG. 3 shows a schematic flow diagram for converting a second synthesized voice to a second played voice according to one embodiment of the present invention. The converting the second synthesized voice into one or more second played voices having a fixed duration includes the following steps.
S411, acquiring the sampling frequency of the second synthesized voice.
The sampling frequency is determined by the parameters the server uses for speech synthesis and can be obtained directly from the server. For example, 16k mono audio data in PCM format has a sampling frequency of 16 kHz.
And S412, calculating the data volume of the voice with the fixed duration based on the sampling frequency.
The sampling frequency may be multiplied by the fixed duration to obtain the data amount of speech for that duration; for two-channel audio data, the product is further multiplied by 2. For example, if the data is two-channel audio, the single-channel sampling frequency is 16 kHz, the fixed duration is 1 second, and each quantized sample occupies one byte (an 8-bit binary code), then the data amount of each played voice is 16000 × 2 × 1 = 32000 bytes.
And S413, dividing the second synthesized voice according to the data volume to obtain a second played voice.
The data size of the speech with fixed duration, that is, the data size of each second played speech, may be known, and the second synthesized speech may be segmented according to the data size, so as to obtain one or more second played speech with fixed duration.
Calculating the data amount of each second played voice from the sampling frequency and the fixed duration, and segmenting the second synthesized voice by that data amount, is easy to implement, improves accuracy, and ensures that each segmented second played voice strictly matches the preset fixed duration.
It is understood that the utterance duration of the second synthesized voice is not necessarily an integral multiple of the fixed duration, in which case the second synthesized voice cannot be divided exactly into played voices of the fixed duration. For example, suppose the utterance duration of the second synthesized voice is 10 seconds and the fixed duration is set to 3 seconds. The second synthesized voice then yields three second played voices of 3 seconds each plus a final segment of 1 second, which falls short of the fixed duration. In this case, step S413 may include the following steps.
First, the second synthesized voice is segmented by the chunk data amount, starting from its beginning, until the remaining voice is smaller than that data amount; each segmentation operation yields one second played voice. In the example above, the 10-second synthesized voice is segmented three times, each time producing one played voice of the fixed 3-second duration, leaving a remainder of 1 second whose data amount is less than that of a 3-second chunk.
Then, a mute section is appended to the remaining voice to bring its duration up to the fixed duration, and the padded remainder is used as one of the second played voices. In the example above, a 2-second mute section is appended to the final 1-second segment to extend it to 3 seconds, and the padded voice becomes the last second played voice of the 10-second synthesized voice.
In this way, a second synthesized voice with a 10-second utterance duration is converted into four second played voices of 3 seconds each, the last of which contains 2 seconds of silence. Padding with silence lets every synthesized voice be converted into an exact number of fixed-duration played voices, which simplifies subsequent bookkeeping of the played voices, reduces the amount of computation, and improves response speed.
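A minimal sketch of steps S411 to S413 with silence padding might look as follows. The parameter defaults mirror the 16 kHz, one-byte-per-sample example above, and zero-valued bytes are assumed to represent silence for the PCM variant in use:

```kotlin
// Split a synthesized PCM byte stream into fixed-duration chunks (S411-S413),
// padding the final remainder with silence.
fun splitIntoChunks(
    pcm: ByteArray,
    sampleRate: Int = 16_000,  // S411: sampling frequency from the server
    channels: Int = 1,
    bytesPerSample: Int = 1,
    fixedSeconds: Int = 1
): List<ByteArray> {
    // S412: data amount of one fixed-duration chunk.
    val chunkBytes = sampleRate * channels * bytesPerSample * fixedSeconds
    val chunks = mutableListOf<ByteArray>()
    var offset = 0
    // S413: cut full chunks until less than one chunk of data remains.
    while (pcm.size - offset >= chunkBytes) {
        chunks += pcm.copyOfRange(offset, offset + chunkBytes)
        offset += chunkBytes
    }
    if (offset < pcm.size) {
        // Pad the remainder up to the fixed duration; the tail of the
        // zero-initialized array serves as the mute section.
        val padded = ByteArray(chunkBytes)
        pcm.copyInto(padded, destinationOffset = 0, startIndex = offset, endIndex = pcm.size)
        chunks += padded
    }
    return chunks
}
```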
As described above, the total play duration of the text is determined in step S720. Fig. 4 shows a schematic flow chart for determining the total length of play according to an embodiment of the present invention. As shown in fig. 4, step S720 includes:
and S721, judging whether the synthesized voice corresponding to all the text sentences of the text is received currently.
For example, the text includes 10 text sentences, and this step is used to determine whether the mobile client has received the synthesized speech corresponding to the 10 text sentences respectively returned by the server.
S722, when the synthesized voices corresponding to all the text sentences have not yet been received, acquiring the total word count of the text and the speech rate of the received synthesized voice, and proceeding to step S723.
S723, multiplying the total word count by the speech rate (i.e., the duration per word) to obtain the estimated total playing duration of the text.
If the synthesized voices for all text sentences have not been received, the server has not finished synthesizing the whole text, and the total playing duration can be estimated from the text's total word count and the speech rate of the synthesized voice. The total word count can be counted directly from the text information in the text play request, and the speech rate can be read directly from a parameter returned by the server.
Multiplying the speech rate by the total word count gives the estimated total playing duration of the text. It will be appreciated that a total duration calculated this way is not exact: because of prosody and punctuation pauses in the text, the speech rate of the synthesized voice is not constant, so the estimate obtained by multiplying rate by word count may deviate from the actual total playing duration.
S724, under the condition that the synthesized voices corresponding to all the text sentences are received currently, determining the total playing time length of the text according to the time lengths of the synthesized voices corresponding to all the text sentences.
If the synthesized voices corresponding to all text sentences have been received, the total duration of all the synthesized voices can be determined. The total play time is determined, for example, based on the number of all played voices stored in the play list. The total duration of the playback obtained at this time is more accurate.
The above steps provide a total playing duration of reference value before the synthesized voices for all text sentences have been received, and an exact total playing duration once they have. The user can thus see the text's duration information at any time without waiting, and can make better choices based on the total playing duration, improving the listening experience.
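A condensed sketch of step S720's two branches follows, under the assumption that the speech rate is expressed as seconds per word (the patent multiplies word count by the rate to obtain a duration, which implies this unit); all names are illustrative:

```kotlin
// Total playing duration (S720): estimate until all synthesized voices
// have arrived (S722/S723), then use the exact buffered length (S724).
fun totalSeconds(
    allSynthesisReceived: Boolean, // S721
    totalWords: Int,               // counted from the text in the play request
    secondsPerWord: Double,        // speech-rate parameter from the server
    bufferedChunks: Int,           // playlist size once synthesis completes
    chunkSeconds: Int
): Double =
    if (allSynthesisReceived)
        (bufferedChunks * chunkSeconds).toDouble() // S724: exact
    else
        totalWords * secondsPerWord                // S722/S723: estimate
```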
In one example, step S740, displaying the playing progress and the buffering progress while playing the second played voice, further includes: updating the playing progress and the buffering progress at a fixed frequency, for example once per second. In this way, the estimated total duration can be replaced by the exact total duration as it becomes available, the position of the progress bar can be adjusted in time according to the played voice currently playing, and the position of the buffer bar can be adjusted in time according to all the played voices currently buffered, so that the user always has an accurate view of the current playing and buffering progress.
As described above, the present invention displays the progress of the play while playing the voice. Based on this, the invention can also comprise a step of adjusting the playing progress. In one example, the text playing method further includes:
and S800, responding to the instruction for adjusting the playing progress, and re-determining and playing the playing voice to be played currently.
Therefore, the user can flexibly set the playing position according to the actual requirement, for example, the important part of the voice is repeatedly played or the unimportant part of the voice is skipped to save time, and the user is given more independent selection rights.
Fig. 5 shows an illustrative flow diagram for adjusting the progress of the playback according to one embodiment of the invention. As shown in fig. 5, step S800 includes:
and S810, acquiring the adjustment progress in the adjustment playing progress instruction.
The adjustment progress of the adjustment progress instruction may be represented by a playing time, which may be a time manually input by the user or a time corresponding to a progress block on the dragged progress bar.
S820, judging whether the play list contains play voice meeting the adjustment progress.
As described above, played voices of a fixed duration are stored in the playlist in order, so the playing time corresponding to each played voice can be calculated from its position in the playlist and the fixed duration. For example, for a played voice stored at sequence number 20 with a fixed duration of 1 second, the corresponding playing time is 20 × 1 = 20 seconds; that is, it is played starting from the 20th second.
Judging whether the playlist contains a played voice satisfying the adjustment progress can amount to judging whether the playlist stores a played voice corresponding to the requested playing time. For example, suppose the requested playing time is the 30th second and each voice again has a fixed duration of 1 second. If 35 played voices are already stored in the playlist, there is necessarily one whose playing time is 30 seconds, i.e., the playlist contains a played voice satisfying the adjustment progress. If only 20 played voices are stored, the later voices are still being synthesized or transmitted and have not yet been stored, and the playlist does not contain a played voice satisfying the adjustment progress.
And S830, if the play list contains the play voice meeting the adjustment progress, determining the play voice meeting the adjustment progress as the play voice to be played currently and playing the play voice.
In this case, the playing voice meeting the adjustment progress, for example, the 30 th second voice in the above text, can be directly played, so that the purpose of adjusting the currently played playing voice according to the instruction of adjusting the playing progress issued by the user is achieved.
And S840, if the play list does not contain the play voice meeting the adjustment progress, the play voice played currently is not changed.
If no played voice satisfying the adjustment progress exists in the playlist, the currently playing voice is left unchanged so as not to disrupt playback. For example, suppose the voice currently playing corresponds to the 5th second of the text and the requested position is the 30th second. If no played voice with a playing time of 30 seconds is stored in the playlist, the 5th-second voice continues to play, preserving the fluency of the currently played content.
In the example, the playing voice to be played currently is determined according to whether the playing voice meeting the playing progress is stored in the playlist, so that on one hand, the playing content can be skipped according to the user instruction under the condition of meeting the condition, and the listening experience of the user is improved; on the other hand, the fluency of the played voice is still kept under the condition that the condition is not met, and unnecessary interference is avoided.
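With fixed-duration chunks, the seek logic of steps S810 to S840 reduces to an index check. A minimal sketch, with illustrative names:

```kotlin
// Adjust the playing position (S800): jump only if the target chunk is
// already buffered; otherwise keep playing the current chunk (S840).
class SeekController(private val chunkSeconds: Int) {
    val playlist = mutableListOf<ByteArray>() // played voices in order
    var currentIndex = 0                      // 0-based index of current chunk

    fun seekTo(targetSeconds: Int) {
        val targetIndex = targetSeconds / chunkSeconds // S810: map time to index
        if (targetIndex < playlist.size) {             // S820
            currentIndex = targetIndex                 // S830: buffered, so jump
        }
        // S840: not yet synthesized/transmitted; leave currentIndex unchanged.
    }
}
```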
In one example, the played voice is data in PCM encoding format, and its fixed duration is 1 to 2 seconds. PCM is the highest-fidelity audio encoding commonly used in computer applications. Unlike data in MP3 format, PCM-encoded data undergoes no container packaging or compression, so the decompression step can be skipped during playback, improving playing efficiency.
Setting the fixed duration of the played voice to 1 to 2 seconds further simplifies calculation and bookkeeping. In particular, when the fixed duration is 1 second, the sequence number of a played voice in the playlist equals its playing time in seconds, making statistics easier when displaying or adjusting the playing progress. For example, on receiving a progress adjustment instruction, the client can check directly, from the playing time carried in the instruction, whether the voice with that sequence number is stored in the playlist, improving detection efficiency and shortening response time. A 1-second fixed duration also reduces the granularity of the voice units and improves positioning accuracy during progress adjustment. Each adjustment starts playback from the beginning of the played voice corresponding to the requested position, so the fixed duration determines the minimum granularity of adjustment: with a 2-second duration the adjusted playing time is a multiple of 2; with 5 seconds, a multiple of 5; and so on. With a 1-second duration, the voice unit for any given second can be played, making positioning precise enough to meet demanding requirements, for example locating a particular character or word.
In an example, the text playing request sent by the user may further include a background object, and a background sound corresponding to the background object may be added to the playing speech, so that the background sound is played while the playing speech of the text is played. According to an embodiment of the present invention, the method for playing the text may further include the following steps.
And S900, acquiring background sounds corresponding to the background objects in the text playing request.
The background object may be descriptive information about a background sound, such as the sound of rain, a cry, or "music 1", and the background sound may be an audio file corresponding to the background object, stored on the server or on the mobile client. Acquiring the corresponding background sound in this step means fetching the corresponding audio file from its storage address.
And S1000, segmenting the background sound into background units with the fixed duration.
Similar to steps S411 to S413, dividing the background sound into background units may include: acquiring the sampling frequency of the background sound; calculating, based on that sampling frequency, the data amount of a background unit of the fixed duration; and segmenting the background sound by that data amount to obtain the background units. The details are not repeated here.
On the basis of S900 and S1000, the aforementioned step S410 may include the following steps.
The second synthesized speech is segmented into one or more synthesis units having the fixed duration S416.
This step is similar to steps S411 to S413 and will not be described in detail here.
S417, combining the synthesis unit with the corresponding background unit to generate the second playing voice.
The merging in this step may superimpose the synthesis unit and the background unit so that the second played voice generated by the superposition contains both the background sound and the second synthesized voice. To ensure that the synthesis unit and the background unit can be merged smoothly, their fixed durations can be set equal, that is, the fixed duration of the synthesis unit equals that of the background unit. It can be understood that when the second played voice is played, the corresponding background unit and synthesis unit are played simultaneously, achieving the effect of adding background sound on top of the second synthesized voice.
FIG. 6 illustrates the conversion of synthesized speech into played speech according to one embodiment of the present invention. As shown in FIG. 6, the third row represents background units obtained by dividing the background sound, the second row represents synthesis units obtained by dividing the synthesized speech, and the first row represents the played speech generated from the background units and synthesis units. The fixed duration of both the background units and the synthesis units is 1 second, corresponding to a data amount of 32000 bytes. As FIG. 6 shows, combining a background unit with a synthesis unit yields a played voice. In the example of FIG. 6, the played speech also begins with a segment of opening speech, which may include a product introduction, greeting phrases, and promotional copy set by the vendor, although the invention is not limited in this respect.
By combining the background sound and the synthesized voice, the background sound can be played while the synthesized voice corresponding to the text is played, so that the playing effect of the text is richer and more vivid. The user can also add different background sounds to the synthesized voice according to weather, mood, personal preference and the like, so that personalized playing is performed, and the listening experience of the user is improved.
As described above, in step S417, the synthesis unit is merged with the corresponding background unit to generate the played voice. FIG. 7 shows a schematic flow diagram for generating playback speech according to one embodiment of the present invention. As shown in fig. 7, step S417 includes:
s4171, determining respective weights of the synthesis unit and the corresponding background unit.
The weights of the synthesis unit and the background unit may be set according to different requirements. The weight of the synthesis unit and the weight of the background unit may be set to equal values, i.e. 0.5 respectively; or, the weight of the synthesis unit is greater than that of the background unit, so as to emphasize the synthesized voice, for example, the weight of the synthesis unit is 0.6, and the weight of the background unit is 0.4; in some special application scenarios, the weight of the synthesis unit may be smaller than the weight of the background unit, for example, the weight of the synthesis unit is 0.3, and the weight of the background unit is 0.7. The invention is not limited in this regard.
S4172, according to the respective weights, the synthesis unit and the corresponding background unit are weighted and summed to obtain the played voice.
For example, let the synthesis unit be denoted T1 with weight a1, and the corresponding background unit B1 with weight a2. The corresponding second played voice can then be expressed as T1 × a1 + B1 × a2.
Weighted summation of the synthesis unit and the background unit with different weights is easy to implement, allows the proportions of background sound and synthesized voice to be controlled quantitatively, and ensures the playing quality of the resulting played voice.
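A minimal sketch of the weighted merge in steps S4171 and S4172 follows, assuming 16-bit PCM samples (the sample width is an illustrative choice; the patent's byte-count example uses one byte per sample):

```kotlin
// Per-sample weighted sum of a synthesis unit and a background unit:
// played = T1 * a1 + B1 * a2, clamped to the valid sample range.
fun mixUnits(
    synthesis: ShortArray, background: ShortArray,
    synthesisWeight: Double = 0.6, backgroundWeight: Double = 0.4
): ShortArray {
    require(synthesis.size == background.size) // equal fixed durations, see above
    return ShortArray(synthesis.size) { i ->
        val mixed = synthesis[i] * synthesisWeight + background[i] * backgroundWeight
        // Clamp to avoid overflow distortion when the weights sum above 1.
        mixed.toInt()
            .coerceIn(Short.MIN_VALUE.toInt(), Short.MAX_VALUE.toInt())
            .toShort()
    }
}
```

Choosing weights that sum to 1 (e.g., 0.6 and 0.4 as in the text) keeps the mixed signal within range without relying on the clamp.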
It will be appreciated by those skilled in the art that existing speech synthesis techniques can generate synthesized speech with personalized voices, such as a deep male voice, a refined female voice, a little boy's voice, the voices of specific characters such as Donald Duck or Chibi Maruko-chan, or even the user's own voice. Steps S100 to S600 of the text playing method proposed above may be used when the play type specified by the user in the text play request is personalized play. That is, when a user requests personalized playback of a text with a personalized voice, the text is played based on online speech synthesis, which synthesizes speech meeting the personalized demand in real time.
Besides synthesizing personalized speech corresponding to the text in real time, standard speech corresponding to the text may be synthesized in advance. The timbre of the standard speech may be a fairly common male or female voice, referred to here as the original sound for brevity. Thus, in addition to personalized playback, embodiments of the present invention support original-sound playback. The played sound is the pre-synthesized standard speech, whose URL address is provided to the mobile client by the server. If the play type specified by the user in the text play request is original-sound play, the mobile client fetches the pre-synthesized standard speech from the received URL address corresponding to the text and plays it. In one example, the pre-synthesized standard speech is data in MP3 format.
Playing pre-synthesized standard speech requires no online speech synthesis, which gives fast response and smooth playback and makes the text playing process more stable and reliable.
The text playing method can be realized by using a player APP on the mobile client. Fig. 8 shows an architecture diagram of a player according to an embodiment of the invention. As shown, the player may include the following modules.
And the sound switching module is used for offering the user different playback sound options for a text, such as original-sound play or personalized play. Original-sound play uses a standard voice synthesized in advance in MP3 format (the MP3 original sound); personalized play uses a personalized voice in PCM format (the PCM TTS synthesized voice) different from the standard voice. When the user selects personalized play, online real-time speech synthesis is performed by the remote server.
And the playlist module is used for storing the URL address of the MP3 original sound file to be played, or the PCM played voices determined after real-time synthesis by the server.
And the control module is used for interacting with the remote server and maintaining play-text switching, play-sound switching, playing progress information, and the like.
And the graphical user interface is used for displaying the playing state, playing progress, and other corresponding images and content.
The MP3 playing module is used for playing standard speech in MP3 format, and the PCM playing module is used for playing the played voices determined from the PCM speech synthesized in real time.
The workflow of each module is explained as follows:
the sound switching module receives the playback sound type selected by the user; if the type is the MP3 original sound, the URL address of the corresponding MP3 original sound file is looked up in the playlist, and the file is played through the MP3 playing module based on that URL address;
if the user selects the PCM TTS synthesized voice, the control module sends a first text sentence in the text to the remote server so that the remote server can synthesize a first synthesized voice corresponding to the first text sentence in real time;
the control module receives the first synthesized voice, determines one or more first played voices according to the first synthesized voice, stores the first played voices to a play list in sequence, and starts to play the first played voices through the PCM playing module;
when the first playing voice starts to be played, the control module sends a second text sentence to the server so that the server can synthesize a second synthesized voice corresponding to the second text sentence in real time;
the control module receives the second synthesized voice returned by the server;
the control module determines a second playing voice based on the second synthesized voice;
the control module stores the second playing voice to a play list;
and the PCM playing module plays the second playing voice under the condition that all the playing voices determined according to the first synthesized voice are played.
According to another aspect of the present invention, there is also provided a text playing apparatus for a mobile client. Fig. 9 shows a schematic block diagram of a text playing apparatus according to an embodiment of the present invention.
As shown in fig. 9, the apparatus 900 includes a first playing unit 910, a text sentence sending unit 920, a voice receiving unit 930, a voice determination unit 940, a storage unit 950, and a second playing unit 960.
These units respectively perform the corresponding steps/functions of the text playing method for the mobile client described above. Only the main functions of the units of the apparatus 900 are described below; details already described above are omitted.
A first playing unit 910, configured to play a first playing voice determined according to a first synthesized voice synthesized by a server in real time, where the first synthesized voice corresponds to a first text sentence in a text playing request;
a text sentence sending unit 920, configured to send a next text sentence after the first text sentence to the server while starting to play the first played voice, so that the server synthesizes, in real time, a second synthesized voice corresponding to the next text sentence;
a voice receiving unit 930, configured to receive the second synthesized voice returned by the server;
a voice determination unit 940 for determining a second played voice based on the second synthesized voice;
a storage unit 950 for storing the second played voice into a playlist, wherein the playlist is used for storing played voices in order;
a second playing unit 960, configured to play the second played voice when all the played voices determined according to the first synthesized voice are played.
According to still another aspect of the present invention, there is also provided a mobile client, comprising a processor and a memory, wherein the memory stores computer program instructions which, when executed by the processor, perform the text playing method described above.
Fig. 10 shows a schematic block diagram of a mobile client 1000 according to one embodiment of the present invention. As shown in fig. 10, the mobile client 1000 includes an input device 1010, a storage device 1020, a processor 1030, and an output device 1040.
The input device 1010 is used for receiving an operation instruction input by a user and collecting data. The input device 1010 may include one or more of a keyboard, a mouse, a microphone, a touch screen, an image capture device, and the like.
The storage 1020 stores computer program instructions for implementing the corresponding steps in the text playing method according to an embodiment of the present invention.
The processor 1030 is configured to run the computer program instructions stored in the storage 1020 to perform the corresponding steps of the text playing method according to the embodiment of the present invention, and to implement the first playing unit 910, the text sentence sending unit 920, the voice receiving unit 930, the voice determination unit 940, the storage unit 950, and the second playing unit 960 in the text playing apparatus according to the embodiment of the present invention.
The output device 1040 is used to output various information (e.g., images and/or sounds) to an external (e.g., user), and may include one or more of a display, a speaker, etc.
In one embodiment, the computer program instructions, when executed by the processor 1030, cause the mobile client 1000 to perform the steps of:
playing a first playing voice determined according to a first synthesized voice synthesized by a server in real time, wherein the first synthesized voice corresponds to a first text sentence in a text playing request;
when the first playing voice starts to be played, sending a next text sentence after the first text sentence to the server so that the server can synthesize a second synthesized voice corresponding to the next text sentence in real time;
receiving the second synthesized voice returned by the server;
determining a second played voice based on the second synthesized voice;
storing the second played voice to a playlist, wherein the playlist is used for storing the played voices in sequence;
and playing the second playing voice under the condition that all the playing voices determined according to the first synthesized voice are played.
Furthermore, according to still another aspect of the present invention, there is also provided a storage medium on which program instructions are stored, which when executed by a computer or a processor cause the computer or the processor to execute the respective steps of the above-mentioned text playing method according to an embodiment of the present invention, and are used to implement the respective modules in the above-mentioned text playing apparatus for a mobile client or the respective modules in the above-mentioned mobile client according to an embodiment of the present invention. The storage medium may include, for example, a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), a USB memory, or any combination of the above storage media. The computer-readable storage medium may be any combination of one or more computer-readable storage media.
In one embodiment, the computer program instructions, when executed by a computer or processor, cause the computer or processor to perform the steps of:
playing a first playing voice determined according to a first synthesized voice synthesized by a server in real time, wherein the first synthesized voice corresponds to a first text sentence in a text playing request;
when the first playing voice starts to be played, sending a next text sentence after the first text sentence to the server so that the server can synthesize a second synthesized voice corresponding to the next text sentence in real time;
receiving the second synthesized voice returned by the server;
determining a second played voice based on the second synthesized voice;
storing the second played voice to a playlist, wherein the playlist is used for storing the played voices in sequence;
and playing the second playing voice under the condition that all the playing voices determined according to the first synthesized voice are played.
By reading the above detailed description of the text playing method for the mobile client, a person skilled in the art can understand the specific implementations, components, and beneficial effects of the text playing apparatus, the mobile client, and the storage medium; for brevity, they are not described again here. This technical scheme for playing text on a mobile client achieves real-time, uninterrupted playback when performing speech synthesis on longer texts.
Although the illustrative embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the foregoing illustrative embodiments are merely exemplary and are not intended to limit the scope of the invention thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the scope or spirit of the present invention. All such changes and modifications are intended to be included within the scope of the present invention as set forth in the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another device, or some features may be omitted, or not executed.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the method of the present invention should not be construed to reflect the intent: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where such features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some of the modules in a playback apparatus for text according to embodiments of the present invention. The present invention may also be embodied as apparatus programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The words first, second, third, and so on do not indicate any ordering; they may be interpreted as names.
The above description concerns only specific embodiments of the present invention, and the protection scope of the present invention is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed by the present invention, and such changes or substitutions shall fall within the protection scope of the present invention, which shall be subject to the protection scope of the claims.

Claims (10)

1. A method for playing text on a mobile client, wherein the text comprises a plurality of text sentences, the method comprising:
S100, playing a first playing voice determined according to a first synthesized voice synthesized by a server in real time, wherein the first synthesized voice corresponds to a first text sentence in a text playing request;
S200, when the first playing voice starts to be played, sending a next text sentence after the first text sentence to the server so that the server synthesizes, in real time, a second synthesized voice corresponding to the next text sentence;
S300, receiving the second synthesized voice returned by the server;
S400, determining a second playing voice based on the second synthesized voice;
S500, storing the second playing voice into a playlist, wherein the playlist is used for storing playing voices in sequence;
S600, playing the second playing voice under the condition that all the playing voices determined according to the first synthesized voice have been played.
2. The text playing method according to claim 1, wherein
step S400 comprises:
converting the second synthesized voice into one or more second playing voices of a fixed duration;
step S500 comprises:
storing the one or more second playing voices sequentially into corresponding positions of the playlist;
and the method further comprises:
calculating a buffered duration according to all the playing voices stored in the playlist;
calculating a played duration according to the position, in the playlist, of the second playing voice currently being played;
determining a total playing duration of the text;
determining a playing progress of the text based on the ratio of the played duration to the total playing duration;
determining a buffering progress of the text based on the ratio of the buffered duration to the total playing duration;
and displaying the playing progress and the buffering progress while playing the second playing voice.
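Since every playing voice in the playlist has the same fixed duration, the bookkeeping in claim 2 reduces to simple arithmetic on chunk counts. An illustrative sketch (the class and field names are assumptions, not claim language):

```kotlin
// Progress bookkeeping for fixed-duration chunks, with illustrative names.
class PlaybackProgress(private val chunkSeconds: Double, private val totalSeconds: Double) {
    var bufferedChunks = 0 // playing voices stored in the playlist so far
    var playedIndex = 0    // position of the playing voice currently played

    val bufferedSeconds get() = bufferedChunks * chunkSeconds
    val playedSeconds get() = playedIndex * chunkSeconds
    val playingProgress get() = playedSeconds / totalSeconds     // ratio shown on the UI
    val bufferingProgress get() = bufferedSeconds / totalSeconds // cache progress bar
}
```

For example, with one-second chunks and a 120-second text, 30 buffered chunks and a playing position at index 12 give a buffering progress of 25% and a playing progress of 10%.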
3. The method according to claim 2, wherein converting the second synthesized voice into one or more second playing voices of a fixed duration comprises:
acquiring the sampling frequency of the second synthesized voice;
calculating, based on the sampling frequency, the data amount of speech of the fixed duration;
and segmenting the second synthesized voice according to the data amount to obtain the second playing voices.
4. The method according to claim 3, wherein segmenting the second synthesized voice according to the data amount to obtain the second playing voices comprises:
segmenting the second synthesized voice according to the data amount, from its starting position until the remaining speech is less than the data amount, each segmentation operation yielding one second playing voice;
and adding a mute section to the remaining speech to pad its duration up to the fixed duration, and using the padded remaining speech as one of the second playing voices.
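As a worked example of claims 3 and 4: for 16-bit mono PCM at a sampling frequency of 16 kHz, one second of speech occupies 16,000 × 2 = 32,000 bytes, so a fixed duration of one second means cutting the synthesized voice every 32,000 bytes, and zero bytes act as silence for the final padding. An illustrative sketch under those assumptions:

```kotlin
// Split synthesized PCM into fixed-duration chunks and pad the remainder
// with a mute section (zero bytes are silence in signed 16-bit PCM).
// Assumes 16-bit mono PCM; sample rate and chunk length are parameters.
fun splitIntoChunks(pcm: ByteArray, sampleRate: Int = 16_000, chunkSeconds: Int = 1): List<ByteArray> {
    val bytesPerChunk = sampleRate * 2 * chunkSeconds // 2 bytes per 16-bit sample
    val chunks = mutableListOf<ByteArray>()
    var offset = 0
    while (pcm.size - offset >= bytesPerChunk) {      // one segmentation per full chunk
        chunks.add(pcm.copyOfRange(offset, offset + bytesPerChunk))
        offset += bytesPerChunk
    }
    if (offset < pcm.size) {                          // remaining speech is shorter than a chunk:
        val padded = ByteArray(bytesPerChunk)         // zero-filled, i.e. silence
        pcm.copyInto(padded, 0, offset, pcm.size)     // keep the speech, pad to fixed duration
        chunks.add(padded)
    }
    return chunks
}
```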
5. The method according to claim 2, wherein determining the total playing duration of the text comprises:
judging whether the synthesized voices corresponding to all text sentences of the text have been received;
if the synthesized voices corresponding to all text sentences have not yet been received,
acquiring the total word count of the text and the speech rate of the received synthesized voice,
and taking the product of the total word count and the speech rate as the total playing duration of the text;
if the synthesized voices corresponding to all text sentences have been received,
determining the total playing duration of the text according to the durations of the synthesized voices corresponding to all the text sentences.
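An illustrative reading of claim 5, interpreting the speech rate as the observed duration per character of the audio received so far (this interpretation, and all names below, are assumptions):

```kotlin
// Estimate the total playing duration of the text, illustrative sketch.
fun totalDurationSeconds(
    totalChars: Int,                 // total word/character count of the text
    receivedDurations: List<Double>, // durations of the synthesized voices received so far
    receivedChars: Int,              // character count of the sentences they cover
    allReceived: Boolean             // have all text sentences been synthesized yet?
): Double =
    if (allReceived) {
        receivedDurations.sum()      // exact: sum over all synthesized voices
    } else {
        val secondsPerChar = receivedDurations.sum() / receivedChars // observed rate
        totalChars * secondsPerChar  // word count times rate, as in the claim
    }
```

For instance, if the first two sentences (40 characters) produced 8 seconds of audio, the rate is 0.2 seconds per character, so a 500-character text is estimated at 100 seconds until all synthesized voices have arrived.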
6. The method according to claim 5, wherein displaying the playing progress while playing the second playing voice further comprises:
updating the playing progress at a fixed frequency.
7. The text playing method according to claim 2, wherein the method further comprises:
in response to an instruction to adjust the playing progress, re-determining and playing the playing voice currently to be played.
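With fixed-duration chunks, re-determining the playing voice after a progress adjustment (claim 7) is a simple index computation into the playlist. An illustrative sketch (names are assumptions):

```kotlin
// Map a progress-adjustment target to the playlist entry to resume from.
// Returns the chunk index, clamped to what has actually been buffered.
fun chunkIndexForSeek(targetSeconds: Double, chunkSeconds: Double, bufferedChunks: Int): Int {
    val index = (targetSeconds / chunkSeconds).toInt()
    return index.coerceIn(0, maxOf(bufferedChunks - 1, 0))
}
```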
8. A text playing apparatus for a mobile client, comprising:
a first playing unit, configured to play a first playing voice determined according to a first synthesized voice synthesized by a server in real time, wherein the first synthesized voice corresponds to a first text sentence in a text playing request;
a text sentence sending unit, configured to send a next text sentence after the first text sentence to the server when the first playing voice starts to be played, so that the server synthesizes, in real time, a second synthesized voice corresponding to the next text sentence;
a voice receiving unit, configured to receive the second synthesized voice returned by the server;
a voice determination unit, configured to determine a second playing voice based on the second synthesized voice;
a storage unit, configured to store the second playing voice into a playlist, wherein the playlist is used for storing playing voices in sequence;
and a second playing unit, configured to play the second playing voice under the condition that all the playing voices determined according to the first synthesized voice have been played.
9. A mobile client, comprising a processor and a memory, wherein the memory stores computer program instructions which, when executed by the processor, perform the text playing method according to any one of claims 1 to 7.
10. A storage medium having program instructions stored thereon, wherein the program instructions, when executed, perform the text playing method according to any one of claims 1 to 7.
CN202010000741.5A 2020-01-02 2020-01-02 Text playing method and device for mobile client Active CN111105779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010000741.5A CN111105779B (en) 2020-01-02 2020-01-02 Text playing method and device for mobile client

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010000741.5A CN111105779B (en) 2020-01-02 2020-01-02 Text playing method and device for mobile client

Publications (2)

Publication Number Publication Date
CN111105779A true CN111105779A (en) 2020-05-05
CN111105779B CN111105779B (en) 2022-07-08

Family

ID=70425787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010000741.5A Active CN111105779B (en) 2020-01-02 2020-01-02 Text playing method and device for mobile client

Country Status (1)

Country Link
CN (1) CN111105779B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100287905B1 (en) * 1998-12-31 2001-05-02 서평원 Real time voice playback system and voice playback method using the same
CN102543068A (en) * 2010-12-31 2012-07-04 北大方正集团有限公司 Method and device for speech broadcast of text information
CN102831888A (en) * 2011-06-15 2012-12-19 镇江佳得信息技术有限公司 Speech synthesis method of mobile communication terminal
US20130144624A1 (en) * 2011-12-01 2013-06-06 At&T Intellectual Property I, L.P. System and method for low-latency web-based text-to-speech without plugins

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Shi Xiaolin, Wang Ju, Cao Yuanda: "Design and Implementation of a Speech Synthesis Server", Journal of Beijing Institute of Technology *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111916055A (en) * 2020-06-20 2020-11-10 中国建设银行股份有限公司 Speech synthesis method, platform, server and medium for outbound system
CN112562638A (en) * 2020-11-26 2021-03-26 北京达佳互联信息技术有限公司 Voice preview method and device and electronic equipment
CN112860943A (en) * 2021-01-04 2021-05-28 浙江诺诺网络科技有限公司 Teaching video auditing method, device, equipment and medium
CN112765397A (en) * 2021-01-29 2021-05-07 北京字节跳动网络技术有限公司 Audio conversion method, audio playing method and device
CN112765397B (en) * 2021-01-29 2023-04-21 抖音视界有限公司 Audio conversion method, audio playing method and device
CN113380220A (en) * 2021-06-10 2021-09-10 深圳市同行者科技有限公司 Speech synthesis coding method and device
CN113380220B (en) * 2021-06-10 2024-05-14 深圳市同行者科技有限公司 Speech synthesis coding method and device

Also Published As

Publication number Publication date
CN111105779B (en) 2022-07-08

Similar Documents

Publication Publication Date Title
CN111105779B (en) Text playing method and device for mobile client
CN106898340B (en) Song synthesis method and terminal
CN111402842B (en) Method, apparatus, device and medium for generating audio
CN111402843B (en) Rap music generation method and device, readable medium and electronic equipment
CN111599343B (en) Method, apparatus, device and medium for generating audio
JP6078964B2 (en) Spoken dialogue system and program
WO2021083071A1 (en) Method, device, and medium for speech conversion, file generation, broadcasting, and voice processing
CN111369971B (en) Speech synthesis method, device, storage medium and electronic equipment
US20210335364A1 (en) Computer program, server, terminal, and speech signal processing method
CN111798821B (en) Sound conversion method, device, readable storage medium and electronic equipment
CN110211556B (en) Music file processing method, device, terminal and storage medium
TWI731382B (en) Method, device and equipment for speech synthesis
CN110867177A (en) Voice playing system with selectable timbre, playing method thereof and readable recording medium
CN111782576B (en) Background music generation method and device, readable medium and electronic equipment
CN110992926B (en) Speech synthesis method, apparatus, system and storage medium
CN110782869A (en) Speech synthesis method, apparatus, system and storage medium
CN110322760A (en) Voice data generation method, device, terminal and storage medium
CN105957515A (en) Voice Synthesis Method, Voice Synthesis Device, Medium for Storing Voice Synthesis Program
KR20200045852A (en) Speech and image service platform and method for providing advertisement service
CN111477210A (en) Speech synthesis method and device
JP2014119716A (en) Interaction control method and computer program for interaction control
CN112383721B (en) Method, apparatus, device and medium for generating video
WO2021169825A1 (en) Speech synthesis method and apparatus, device and storage medium
CN111862933A (en) Method, apparatus, device and medium for generating synthesized speech
WO2023087932A1 (en) Virtual concert processing method and apparatus, and device, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address
Address after: Room 1201, Building B, Phase 1, Innovation Park, No. 1 Keyuan Weiyi Road, Laoshan District, Qingdao City, Shandong Province, 266101
Patentee after: Beibei (Qingdao) Technology Co.,Ltd.
Address before: 100192 a203a, 2 / F, building B-2, Dongsheng Science Park, Zhongguancun, 66 xixiaokou Road, Haidian District, Beijing
Patentee before: DATABAKER (BEIJNG) TECHNOLOGY Co.,Ltd.