CN113963680A - Audio playing method, device and equipment

Info

Publication number: CN113963680A
Application number: CN202111216144.7A
Authority: CN
Inventors: 林彦伊; 周冰; 孙一波
Original assignee: Migu Cultural Technology Co Ltd; China Mobile Communications Group Co Ltd
Current assignee: Migu Cultural Technology Co Ltd; China Mobile Communications Group Co Ltd
Priority date: 2021-10-19
Filing date: 2021-10-19
Publication date: 2022-01-21

Abstract

The invention discloses an audio playing method, apparatus and device. The method comprises: sending an audio pre-synthesis request to a server; receiving at least one audio file of pre-synthesized audio sent by the server in response to the audio pre-synthesis request; and, according to a playing delay parameter, playing the at least one cached correct audio file one by one in sequence within the corresponding timing duration until the timing duration ends, wherein the timing duration corresponding to the playing delay parameter is determined according to the estimated playing duration of the pre-synthesized audio. The method can improve the fluency of audio playing on the client.

Description

Audio playing method, device and equipment

Technical Field

The invention relates to the technical field of audio processing, in particular to an audio playing method, device and equipment.

Background

Speech synthesis converts arbitrary text into standard, fluent speech in real time, which is equivalent to fitting a machine with an artificial mouth. It draws on acoustics, linguistics, digital signal processing, computer science and other disciplines, and is a leading-edge technology in the field of Chinese information processing. The main problem it solves is how to convert text into audible sound, that is, how to make a machine speak like a person.

What is commonly called "letting a machine speak like a person" is essentially different from a conventional sound playback apparatus (system). Conventional sound playback devices (systems), such as tape recorders, "let the machine speak" by prerecording sound and then playing it back. This approach has significant limitations in content, storage, transmission, convenience and timeliness. Computer speech synthesis, by contrast, can convert any text into highly natural speech at any time, so that the machine can speak like a human.

Current speech synthesis methods, such as implementations of TTS (Text To Speech), have at least the following problem: when the network is unstable, playback stutters.

Disclosure of Invention

In view of the above, embodiments of the present invention are proposed to provide an audio playing method, apparatus and device that overcome the above problems or at least partially solve the above problems.

According to an aspect of the embodiments of the present invention, there is provided an audio playing method applied to a client, the method including:

sending an audio pre-synthesis request to a server;

receiving at least one audio file of pre-synthesized audio sent by the server according to the audio pre-synthesis request;

and, according to a playing delay parameter, playing the at least one cached correct audio file one by one in sequence within the corresponding timing duration until the timing duration ends, wherein the timing duration corresponding to the playing delay parameter is determined according to the estimated playing duration of the pre-synthesized audio.

According to another aspect of the embodiments of the present invention, there is provided an audio playing apparatus applied to a client, the apparatus including:

the receiving and sending module is used for sending an audio pre-synthesis request to the server; receiving at least one audio file of pre-synthesized audio sent by the server according to the audio pre-synthesis request;

and the processing module is used for playing the at least one cached correct audio file one by one in sequence within the corresponding timing duration according to the playing delay parameter until the timing duration is ended, and the timing duration corresponding to the playing delay parameter is determined according to the estimated playing duration of the pre-synthesized audio.

According to still another aspect of an embodiment of the present invention, there is provided a computing device including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;

the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the audio playing method.

According to a further aspect of the embodiments of the present invention, there is provided a computer storage medium, in which at least one executable instruction is stored, and the executable instruction causes a processor to perform an operation corresponding to the audio playing method.

According to the scheme provided by the embodiments of the present invention, the audio playing method is applied to a client: an audio pre-synthesis request is sent to a server; at least one audio file of pre-synthesized audio sent by the server according to the request is received; and, according to the playing delay parameter, the at least one cached correct audio file is played one by one in sequence within the corresponding timing duration until the timing duration ends, the timing duration being determined according to the estimated playing duration of the pre-synthesized audio. In this way, when an abnormality occurs, the cached audio files continue to be played within the timing duration corresponding to the playing delay parameter, which solves the problem of speech synthesis playback stalling under network anomalies and improves the fluency of audio playing on the client.

The foregoing is only an overview of the technical solutions of the embodiments of the present invention. So that the technical means of the embodiments can be understood more clearly and implemented according to the content of the description, and so that the foregoing and other objects, features and advantages of the embodiments are more readily apparent, specific embodiments of the present invention are described below.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the embodiments of the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

fig. 1 shows a flow chart of an audio playing method provided by an embodiment of the invention;

fig. 2 is a flowchart illustrating an audio playing method according to another embodiment of the present invention;

fig. 3 is a schematic structural diagram of an audio playing apparatus provided in an embodiment of the present invention;

fig. 4 is a schematic structural diagram of a computing device provided in an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

Prior-art speech synthesis methods cannot guarantee fluent audio playing when the network is abnormal. To address this problem, the following embodiments of the present invention provide an audio playing method in which, when an audio file is abnormal, the cached audio files are played, according to a playing delay parameter, within the timing duration corresponding to that parameter, so as to guarantee fluent audio playing and avoid stalling.

Fig. 1 shows a flowchart of an audio playing method provided by an embodiment of the present invention. As shown in fig. 1, the method comprises the steps of:

step 11, sending an audio pre-synthesis request to a server;

step 12, receiving at least one audio file of pre-synthesized audio sent by the server according to the audio pre-synthesis request; here, the pre-synthesized audio is the audio data requested by one audio pre-synthesis request; for example, if the audio data requested by one audio pre-synthesis request is a paragraph of an electronic book, each audio file is a sentence in that paragraph;

and step 13, according to a playing delay parameter, playing the at least one cached correct audio file one by one in sequence within the corresponding timing duration until the timing duration ends, wherein the timing duration corresponding to the playing delay parameter is determined according to the estimated playing duration of the pre-synthesized audio.

In the embodiment, the audio pre-synthesis request is sent to the server; receiving at least one audio file of pre-synthesized audio sent by the server according to the audio pre-synthesis request; and according to the playing time delay parameter, playing the at least one cached correct audio file one by one in sequence within the corresponding timing duration until the timing duration is ended, thereby improving the fluency of audio playing of the client.

In an optional embodiment of the present invention, after step 12, the audio playing method may further include:

step 121, storing the at least one audio file into a cache according to the receiving sequence.

In one specific implementation example, the client first sends the server an audio pre-synthesis request carrying text; for example, under the current synthesis capability standard, each request carries at most 500 characters.

After receiving the first synthesis request from the client, the server synthesizes the text into a plurality of audio files, one per sentence or paragraph of the text, and sends the audio files to the client in sequence.

The client receives, in sequence, the audio files produced by one synthesis and stores them in order in its local cache.

Based on the cached data of the received audio files, the playing speed set in the client's player, and other parameters, the client calculates the time (for example, in milliseconds) needed to play all audio files of one audio pre-synthesis request.
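The duration calculation described above can be illustrated with a minimal Python sketch. The `AudioFile` type, its `duration_ms` field (the nominal length of a file at normal speed) and the function name are assumptions made for illustration; the description does not specify how the client obtains each file's length or which other parameters enter the calculation.

```python
from dataclasses import dataclass


@dataclass
class AudioFile:
    seq: int          # sequence number within one pre-synthesis request
    duration_ms: int  # assumed: nominal length of the file at 1.0x speed


def estimated_play_duration_ms(files, playback_rate=1.0):
    """Estimate how long all files of one request take to play, in ms."""
    if playback_rate <= 0:
        raise ValueError("playback_rate must be positive")
    nominal_total = sum(f.duration_ms for f in files)
    # A faster playing speed shortens the wall-clock playing time.
    return round(nominal_total / playback_rate)


cached = [AudioFile(0, 3000), AudioFile(1, 4500), AudioFile(2, 1500)]
print(estimated_play_duration_ms(cached, playback_rate=1.5))
```

In this sketch the result would serve as the timing duration of the delay parameter timer for the request.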

In an alternative embodiment of the present invention, step 13 may include:

step 131, obtaining the playing time delay parameter according to the estimated playing time of the pre-synthesized audio;

in a specific implementation, the time at which the first audio file stored in the cache was received is taken as the start playing time of the pre-synthesized audio, and the total playing duration of the at least one audio file is estimated according to the playing speed of the player to obtain the end playing time of the pre-synthesized audio;

the estimated playing duration of the pre-synthesized audio is obtained from its start playing time and end playing time, and is taken as the timing duration corresponding to the playing delay parameter;

step 132, according to the playing delay parameter, within the corresponding timing duration, sequentially playing the at least one cached audio file one by one.

In a specific implementation, the at least one correctly cached audio file is played continuously within the timing duration corresponding to the playing delay parameter until that duration ends.
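Steps 131 and 132 can be sketched as follows, under stated assumptions. The sketch uses a simulated clock instead of a real timer, represents each cached file as a hypothetical `(sequence number, nominal duration in ms)` pair, and takes a caller-supplied `play_fn` in place of a real player; none of these names come from the description.

```python
def play_within_timer(files, playback_rate, first_received_ms, play_fn):
    """Play cached files in order until the delay-parameter timer expires.

    files: list of (seq, duration_ms) pairs in receiving order (assumed shape).
    first_received_ms: receiving time of the first cached file, used as the
        start playing time of the pre-synthesized audio.
    """
    # Step 131: timing duration = estimated playing duration of the request.
    timer_ms = round(sum(duration for _, duration in files) / playback_rate)
    end_ms = first_received_ms + timer_ms

    # Step 132: play the files one by one while the timer is still running.
    clock_ms = first_received_ms  # simulated clock, no real waiting
    played = []
    for seq, duration in files:
        if clock_ms >= end_ms:
            break
        play_fn(seq)
        clock_ms += round(duration / playback_rate)
        played.append(seq)
    return played, timer_ms, end_ms


log = []
print(play_within_timer([(0, 2000), (1, 4000)], 2.0, 1000, log.append))
```

A real client would replace the simulated clock with the player's own progress callbacks or a system timer.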

In this embodiment, the client starts playing as soon as it receives the first audio file. At the same time it starts a delay parameter timer whose timing duration is the duration corresponding to the playing delay parameter, and, if playback becomes abnormal, it starts an asynchronous message processing mechanism. Here, the timing duration of the delay parameter timer equals the total playing duration of all audio files of one audio pre-synthesis request at the current playing rate. Field information such as the sequence number, play start time, play end time and play duration of each pre-synthesized audio segment is exposed externally.

When a playback abnormality is caused by a network abnormality, pre-synthesized audio received in response to several pre-synthesis requests may coexist in the client cache. The client then starts the asynchronous message processing mechanism to handle the pre-synthesized audio in the cache, mainly as follows. When a network exception prompt is received or an error code is generated, the exception information or error code is not processed immediately but is stored in a cache, preferably an in-memory cache. Meanwhile, the current pre-synthesized audio continues to play until the delay parameter timer finishes, after which the error codes, network exception information and playing progress information are reported. Based on the text, audio sequence number, audio duration and other information of the played pre-synthesized audio, together with the data related to the delay parameter, the client screens out the correct pre-synthesized audio from the cache and plays it, and releases all other cached pre-synthesized audio.

Network anomalies may be detected with existing means, such as the exception detection and distribution mechanism of an SDK (software development kit), network exception prompts, or generated error codes.

Each pre-synthesized audio is played until the delay parameter timer finishes. When the network is abnormal and the client has started the asynchronous message processing mechanism, the exception or error information is only cached, and each pre-synthesized audio must remain in the playing state until the delay parameter timer finishes.
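The deferred handling of errors described above might look like the following minimal sketch. The class name, method names and report fields are hypothetical; the description only states that errors are cached rather than processed, and that they are reported together with the playing progress once the delay parameter timer finishes.

```python
from collections import deque


class DeferredErrorHandler:
    """Caches errors during playback and reports them when the timer ends."""

    def __init__(self):
        self.pending = deque()  # in-memory cache of unprocessed errors
        self.active = False     # whether the asynchronous mechanism is on

    def on_error(self, code, message):
        # Playback is not interrupted here; the error is only recorded.
        self.active = True
        self.pending.append({"code": code, "message": message})

    def on_timer_finished(self, progress_ms):
        # Called when the delay parameter timer ends: build the report
        # that the client would send to the server.
        report = {"progress_ms": progress_ms, "errors": list(self.pending)}
        self.pending.clear()
        return report


handler = DeferredErrorHandler()
handler.on_error(504, "network timeout")
print(handler.on_timer_finished(progress_ms=6000))
```

Keeping error handling out of the playback path is what lets the current pre-synthesized audio stay in the playing state until the timer finishes.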

In an optional embodiment of the present invention, when the duration corresponding to the play delay parameter is ended, the method may further include:

step 14, obtaining the playing progress, abnormal playing information and error codes of the pre-synthesized audio;

and step 15, reporting the playing progress, abnormal playing information and error codes of the pre-synthesized audio to the server.

In this embodiment, after each pre-synthesized audio has been played and the delay parameter timer has finished, the error code and the data on the current playing progress are reported. At the same time, based on the text, audio sequence number, audio duration and other information of the played pre-synthesized audio, together with its play start time, play end time and play duration, the client screens out the correct pre-synthesized audio from the memory cache and plays it. Once that pre-synthesized audio starts playing, the client deletes all remaining pre-synthesized audio in memory and initiates the next pre-synthesis request according to the text content of the currently playing pre-synthesized audio. This continues until no network exception is reported; after the current pre-synthesized audio finishes playing, the asynchronous message processing mechanism is closed and the normal synthesis flow resumes.

In an optional embodiment of the present invention, the audio playing method may further include:

and step 16, releasing the cached erroneous audio files, so that the memory can go on to cache the audio files of the next audio pre-synthesis request.

In an optional embodiment of the present invention, in the audio playing method, after releasing the cached incorrect audio file, the method may further include:

step 17, determining the starting position of the next audio pre-synthesis request according to the playing progress of the pre-synthesized audio;

step 18, initiating a next audio pre-synthesis request to the server according to the starting position of the next audio pre-synthesis request;

and step 19, receiving at least one audio file of the next pre-synthesized audio sent by the server according to the next audio pre-synthesis request.

In this embodiment, the client determines the start position of the next pre-synthesis request from the text content corresponding to the pre-synthesized audio played this time, and initiates the next audio pre-synthesis request. The asynchronous message processing mechanism is closed once the network abnormality no longer exists.
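One possible way to derive the start position of the next request from the playing progress is sketched below. It assumes the client keeps, for each audio file, the character span of the sentence it was synthesized from (`sentence_spans`), and it reuses the 500-character limit mentioned earlier as an example; both the bookkeeping and the function name are illustrative assumptions rather than part of the described method.

```python
def next_request_text(full_text, sentence_spans, last_played_seq, max_chars=500):
    """Return (start offset, text) for the next audio pre-synthesis request.

    sentence_spans[i] is the assumed (start, end) character span in
    full_text of the sentence behind audio file i.
    last_played_seq is the sequence number of the last file played to the
    end, or -1 if nothing has been played yet.
    """
    if last_played_seq < 0:
        start = 0
    else:
        # Resume right after the last sentence that was fully played.
        start = sentence_spans[last_played_seq][1]
    return start, full_text[start:start + max_chars]


text = "One. Two. Three."
spans = [(0, 5), (5, 10), (10, 16)]
print(next_request_text(text, spans, last_played_seq=1))
```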

The following describes a specific implementation flow of the method with reference to the flow shown in fig. 2:

the server issues pre-synthesized audio according to the pre-synthesis request of the client;

when the client starts playing the first received audio file, it immediately sends another audio synthesis request to the server so that the next pre-synthesized audio is synthesized in advance; after synthesis finishes, the server sends the audio files synthesized for that request to the client in sequence, and the client caches them in order in its memory.

Under normal conditions, the client starts playing once the first audio file is received and simultaneously starts the timer for the duration corresponding to the delay parameter; when the whole pre-synthesized audio has finished playing, the next pre-synthesized audio is played.

When an abnormality occurs, the client must wait until the timing of the delay parameter for the affected audio finishes (the cached audio files continue to play in the meantime), and then reports the error codes and playing progress information.

When reporting the error codes, the client selects the correct next pre-synthesized audio from the cached data, according to the request text, the audio sequence number, the playing progress information and the delay parameter, and plays it.

All erroneous cached data are then released with reference to the correct duration of the delay parameter of the next audio segment, and the pre-synthesis of the following audio segment proceeds according to the text-to-speech (TTS) synthesis mechanism.
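The screening and release steps at the end of this flow can be sketched as follows. The sketch assumes each cached pre-synthesized audio is a dictionary carrying a hypothetical `text_start` field (the text offset its request began at) and treats the entry matching the expected offset as the correct one; the description names the request text, audio sequence number, playing progress and delay parameter as screening inputs but does not give the exact rule.

```python
def screen_cache(cache, expected_text_start):
    """Pick the correct next pre-synthesized audio and release the rest.

    cache: list of dicts with an assumed "text_start" key.
    Returns the first entry whose request started at expected_text_start,
    or None if no cached entry matches.
    """
    correct = next(
        (entry for entry in cache if entry["text_start"] == expected_text_start),
        None,
    )
    # Release every cached entry; the chosen one stays alive through the
    # returned reference, so memory is free for the next request's files.
    cache.clear()
    return correct


cache = [
    {"text_start": 0, "request_id": "a"},
    {"text_start": 10, "request_id": "b"},
    {"text_start": 10, "request_id": "c"},
]
print(screen_cache(cache, expected_text_start=10))
```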

The embodiments of the present invention solve the problem of audio playback stalling during speech synthesis when the network is abnormal, and improve the fluency of audio playing on the client. The scheme can greatly improve the user experience, especially in scenarios where speech synthesis is used for long periods, such as listening to books or news being read aloud.

Fig. 3 shows a schematic structural diagram of an audio playing apparatus 30 according to an embodiment of the present invention. As shown in fig. 3, the apparatus 30 is applied to a client, and includes:

a transceiver module 31, configured to send an audio pre-synthesis request to a server; receiving at least one audio file of pre-synthesized audio sent by the server according to the audio pre-synthesis request;

and the processing module 32 is configured to play the at least one cached correct audio file one by one in sequence within the corresponding timing duration according to the play delay parameter until the timing duration is ended, where the timing duration corresponding to the play delay parameter is determined according to the estimated play duration of the pre-synthesized audio.

Optionally, the transceiver module 31 is further configured to store the at least one audio file in a buffer according to a receiving sequence.

Optionally, playing the at least one cached audio file one by one in sequence within a corresponding timing duration according to the playing delay parameter, including:

obtaining the playing time delay parameter according to the estimated playing time of the pre-synthesized audio;

and according to the playing time delay parameter, within the corresponding timing duration, sequentially playing the at least one cached audio file one by one.

Optionally, obtaining the play delay parameter according to the estimated play duration of the pre-synthesized audio includes: estimating the total playing time of the at least one audio file according to the playing speed of a player to obtain the playing ending time of the pre-synthesized audio;

and obtaining the estimated playing time of the pre-synthesized audio according to the initial playing time and the ending playing time of the pre-synthesized audio, and taking the estimated playing time of the pre-synthesized audio as the playing time delay parameter.

Optionally, when the timing duration corresponding to the play delay parameter is ended, the processing module 32 is further configured to: obtaining the playing progress, abnormal playing information and error codes of the pre-synthesized audio; and reporting the playing progress, abnormal playing information and error codes of the pre-synthesized audio to the server.

Optionally, the processing module 32 is further configured to release the cached erroneous audio file.

Optionally, the transceiver module 31 is further configured to: determining the initial position of the next audio pre-synthesis request according to the playing progress of the pre-synthesized audio;

initiating a next audio pre-synthesis request to the server according to the starting position of the next audio pre-synthesis request;

and receiving at least one audio file of the next pre-synthesized audio sent by the server according to the next audio pre-synthesis request.

It should be noted that the apparatus is an apparatus corresponding to the above method, and all the implementations in the above method embodiment are applicable to the embodiment of the apparatus, and the same technical effects can be achieved.

An embodiment of the present invention provides a non-volatile computer storage medium storing at least one executable instruction, where the executable instruction can cause a processor to execute the audio playing method in any of the above method embodiments.

Fig. 4 is a schematic structural diagram of a computing device according to an embodiment of the present invention, and the specific embodiment of the present invention does not limit the specific implementation of the computing device.

As shown in fig. 4, the computing device may include: a processor, a communications interface, a memory, and a communications bus.

Wherein: the processor, the communications interface and the memory communicate with each other via the communications bus. The communications interface is used for communicating with network elements of other devices, such as clients or other servers. The processor is used for executing the program, and in particular may execute the relevant steps of the above audio playing method embodiments for the computing device.

In particular, the program may include program code comprising computer operating instructions.

The processor may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The computing device includes one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.

The memory is used for storing the program. The memory may comprise high-speed RAM and may also include non-volatile memory, such as at least one disk memory.

The program may specifically be adapted to cause a processor to execute the audio playing method in any of the above-described method embodiments. For specific implementation of each step in the program, reference may be made to corresponding steps and corresponding descriptions in units in the foregoing embodiments of the audio playing method, which are not described herein again. It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described devices and modules may refer to the corresponding process descriptions in the foregoing method embodiments, and are not described herein again.

The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of embodiments of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best modes of embodiments of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that is, the claimed embodiments of the invention require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components according to embodiments of the present invention. Embodiments of the invention may also be implemented as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing embodiments of the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Embodiments of the invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specified otherwise.

Claims

1. An audio playing method applied to a client, the method comprising:

sending an audio pre-synthesis request to a server;

2. The audio playing method according to claim 1, wherein after receiving at least one audio file of pre-synthesized audio delivered by the server according to the audio pre-synthesis request, the method further comprises:

and storing the at least one audio file into a cache according to the receiving sequence.

3. The audio playing method according to claim 1, wherein playing the at least one buffered audio file one by one in sequence within a corresponding timing duration according to the playing delay parameter comprises:

4. The audio playing method according to claim 3, wherein obtaining the playing delay parameter according to the estimated playing duration of the pre-synthesized audio comprises:

estimating the total playing time of the at least one audio file according to the playing speed of a player to obtain the playing ending time of the pre-synthesized audio;

5. The audio playing method according to claim 1, wherein, when the timing duration ends, the method further comprises:

obtaining the playing progress, abnormal playing information and error codes of the pre-synthesized audio;

and reporting the playing progress, abnormal playing information and error codes of the pre-synthesized audio to the server.

6. The audio playing method according to claim 5, further comprising:

and releasing the cached erroneous audio file.

7. The audio playing method of claim 6, further comprising, after releasing the buffered erroneous audio file:

determining the initial position of the next audio pre-synthesis request according to the playing progress of the pre-synthesized audio;

8. An audio playing apparatus, applied to a client, the apparatus comprising:

9. A computing device, comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;

the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation corresponding to the audio playing method according to any one of claims 1-7.

10. A computer storage medium having at least one executable instruction stored therein, the executable instruction causing a processor to perform operations corresponding to the audio playback method according to any one of claims 1 to 7.