CN111459445A

CN111459445A - Webpage end audio generation method and device and storage medium

Info

Publication number: CN111459445A
Application number: CN202010127254.5A
Authority: CN
Inventors: 郁霖; 雷欣; 李志飞
Original assignee: Wenwen Intelligent Information Technology Co ltd
Current assignee: Wenwen Intelligent Information Technology Co ltd
Priority date: 2020-02-28
Filing date: 2020-02-28
Publication date: 2020-07-28

Abstract

The invention relates to the technical field of audio processing and discloses a webpage-side audio generation method for converting text from a webpage into audio that can be played on the webpage. The method comprises: receiving text information and sending the text information to a text-to-speech server; receiving a plurality of segmented audio streams that correspond to the text information and are returned by the text-to-speech server segment by segment; constructing an audio output stream; and writing waveform audio file format (wav) header information into the audio output stream and then writing the plurality of segmented audio streams in sequence. After the text information from the webpage is received, a new audio output stream is created, the wav header information is written into it, and the segmented audio streams produced by the text-to-speech server are written in sequence, so that the audio stream can be played directly on the webpage as wav-format audio. A webpage user can therefore listen to high-quality audio, the waiting time for audio generation is reduced, and no PCM player needs to be deployed on the webpage.

Description

Webpage end audio generation method and device and storage medium

Technical Field

The invention relates to the technical field of audio processing, in particular to a webpage end audio generation method, a webpage end audio generation device and a webpage end audio generation storage medium.

Background

TTS (Text To Speech) technology is widely used for online speech generation and playback, and demand for it ranges from phrase generation to article reading, for example converting webpage text into audio for playback. The application of TTS to phrase generation is mature. For a long article, however, TTS must finish converting the whole article from text to audio before the generated audio can be transmitted to the webpage, so the time a webpage user spends waiting online for the audio to be generated has to be considered.

Disclosure of Invention

In order to solve or at least partially solve the technical problem, embodiments of the present invention provide a method and an apparatus for generating a webpage-side audio.

According to a first aspect of the embodiments of the present invention, there is provided a method for generating a webpage-side audio, which is used to convert a text of a webpage side into an audio that can be played at the webpage side, the method including: receiving text information and sending the text information to a text-to-speech server; receiving a plurality of segmented audio streams corresponding to the text information and fed back by the text-to-speech server in a segmented manner; constructing an audio output stream; and inputting waveform audio file format wav header information into the audio output stream, and sequentially inputting a plurality of the segmented audio streams.

Preferably, both receiving the text information and sending the text information to the text-to-speech server use the hypertext transfer protocol (HTTP).

Preferably, said inputting waveform audio file format wav header information in said audio output stream includes: monitoring whether the segmented audio stream is received for the first time; and inputting waveform audio file format wav header information in the audio output stream when the segmented audio stream is received for the first time.

Preferably, the monitoring whether the segmented audio stream is received for the first time includes: monitoring whether the current state of the audio output stream is empty; and confirming that the segmented audio stream is received for the first time when the current state of the audio output stream is empty.

Preferably, the segmented audio stream is an audio stream in a pulse code modulation, PCM, format.

According to a second aspect of the embodiments of the present invention, there is also provided a web page end audio generating apparatus, including: the text transmission module is used for receiving text information and sending the text information to the text-to-speech server; the receiving module is used for receiving a plurality of segmented audio streams which are returned by the text-to-speech server in a segmented mode and correspond to the text information; the building module is used for building an audio output stream; and the audio transmission module is used for inputting waveform audio file format wav header information in the audio output stream and sequentially inputting a plurality of the segmented audio streams.

Preferably, the text transmission module includes: the text receiving submodule is used for receiving text information by adopting a hypertext transfer protocol (HTTP); and the text sending submodule is used for sending the text information to a text-to-speech server by adopting a hypertext transfer protocol (HTTP).

Preferably, the audio transmission module includes: a monitoring submodule for monitoring whether the segmented audio stream is received for the first time; and the transmission submodule is used for inputting waveform audio file format wav header information into the audio output stream when the segmented audio stream is received for the first time.

According to a third aspect of the embodiments of the present invention, there is further provided a machine-readable storage medium on which instructions are stored, the instructions being configured to cause a machine to execute the above-mentioned webpage-side audio generating method.

According to a fourth aspect of the embodiments of the present invention, there is also provided an apparatus, including at least one processor, and at least one memory and a bus connected to the processor; the processor and the memory complete mutual communication through the bus; the processor is configured to call program instructions in the memory to perform the web page side audio generation method of any one of claims 1-5.

Through the above technical scheme, after the text information from the webpage is received, a new audio output stream is created, the wav header information is written into the audio output stream, and the segmented audio streams produced by the text-to-speech server are written in sequence, so that the audio stream can be played directly on the webpage as wav-format audio. A webpage user can therefore listen to high-quality audio, the waiting time for audio generation is reduced, and no PCM player needs to be deployed on the webpage.

Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.

Drawings

The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting them. In the drawings:

Fig. 1 is a flowchart illustrating a method for generating webpage-side audio according to an embodiment of the present invention;

Fig. 2 is a flowchart illustrating a specific application example of the method for generating webpage-side audio according to the embodiment of the present invention;

Fig. 3 is a schematic diagram illustrating a component structure of a web page-side audio generating apparatus according to an embodiment of the present invention;

Fig. 4 is a schematic diagram illustrating a composition structure of a text transmission module according to an embodiment of the present invention;

Fig. 5 is a schematic diagram illustrating a structure of an audio transmission module according to an embodiment of the present invention.

Description of the reference numerals

301: text transmission module

302: receiving module

303: construction module

304: audio transmission module

3011: text receiving submodule

3012: text sending submodule

3041: monitoring submodule

3042: transmission submodule

Detailed Description

The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given only to enable those skilled in the art to better understand and to implement the present invention, and do not limit the scope of the present invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

The technical solution of the present invention is further elaborated below with reference to the drawings and the specific embodiments.

Fig. 1 shows a flowchart of a method for generating webpage-side audio according to an embodiment of the present invention.

Referring to fig. 1, a method for generating a webpage end audio according to an embodiment of the present invention is used to convert a text of a webpage end into an audio that can be played at the webpage end, and may include the following steps:

S100, receiving the text information and sending the text information to a text-to-speech server.

Specifically, the web page back-end server receives the Text information sent by the web page and sends the received Text information To a TTS (Text To Speech) server.

In the embodiment of the invention, both receiving the text information and sending it to the text-to-speech server use the hypertext transfer protocol (HTTP), which is a stateless protocol. The client sends a request, the server receives and processes the request and returns a response to the client, and the connection between the client and the server is then closed. For example, when the web page client sends the text information to the web page back-end server over HTTP, the web page client waits for the web page back-end server's response to the text information. Likewise, the web page back-end server sends the text information to the TTS server and waits for the TTS server's response to the text.
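The request that the web page back-end server forwards to the TTS server could be assembled as in the minimal sketch below. The source states only that HTTP is used, so the POST method, the headers, the function name build_tts_request, and the host and path arguments are all assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Formats an HTTP/1.1 request carrying `text` as a plain-text body.
 * The POST method, the header set, and the host/path values are assumptions
 * of this sketch; the source only says that HTTP is used.
 * Returns the request length, or -1 if `buf` is too small. */
int build_tts_request(char *buf, size_t buf_size, const char *host,
                      const char *path, const char *text)
{
    int n = snprintf(buf, buf_size,
                     "POST %s HTTP/1.1\r\n"
                     "Host: %s\r\n"
                     "Content-Type: text/plain; charset=utf-8\r\n"
                     "Content-Length: %zu\r\n"
                     "Connection: close\r\n"
                     "\r\n"
                     "%s",
                     path, host, strlen(text), text);
    if (n < 0 || (size_t)n >= buf_size)
        return -1; /* formatting error or truncated output */
    return n;
}
```

The "Connection: close" header mirrors the request, response, disconnect cycle described above; a deployment that reuses connections would omit it.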

S200, receiving a plurality of segmented audio streams corresponding to the text information, which are segmented and returned by the text-to-speech server.

After the web page back-end server sends the text information to the TTS server, the TTS server converts the text information into audio. If the whole text were converted into audio before being transmitted to the web page back-end server and then sent to the webpage for playing, the webpage user would face a long wait. Therefore, after receiving the text information, the TTS server returns the converted audio to the web page back-end server each time it finishes converting a portion of text of a set length.

For example, in the case of a long article, the TTS server may transmit the converted audio to the web page back-end server each time it finishes converting one paragraph of the article. Alternatively, it may transmit the converted audio to the web page back-end server each time a set number of characters has been converted into audio.
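The two segmentation rules just described (one paragraph at a time, or a set number of characters) can be combined in a small sketch. The function names next_segment_len and count_segments are hypothetical, a newline is assumed to mark the end of a paragraph, and the limit is counted in bytes rather than characters, so a real implementation handling multi-byte text would need to cut on character boundaries.

```c
#include <stddef.h>
#include <string.h>

/* Returns the length in bytes of the next segment of `text`.
 * The segment ends just after the first newline found within `max_len`
 * bytes (assumed paragraph break); otherwise it is cut at `max_len` bytes
 * or at the end of the text, whichever comes first.
 * Returns 0 when `text` is empty. `max_len` must be greater than 0. */
size_t next_segment_len(const char *text, size_t max_len)
{
    size_t total = strlen(text);
    size_t limit = total < max_len ? total : max_len;
    const char *nl = memchr(text, '\n', limit);
    if (nl != NULL)
        return (size_t)(nl - text) + 1; /* keep the newline in the segment */
    return limit;
}

/* Counts how many segments the whole text would be split into. */
size_t count_segments(const char *text, size_t max_len)
{
    size_t n = 0;
    size_t len;
    while ((len = next_segment_len(text, max_len)) > 0) {
        text += len; /* advance past the segment just measured */
        n++;
    }
    return n;
}
```

Each segment found this way would be synthesized and returned on its own, so the first audio can reach the web page back-end server before the rest of the article has been converted.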

S300, constructing an audio output stream.

Specifically, the web page back-end server may construct the audio output stream while sending the text information to the TTS server. The audio output stream may also be constructed when the segmented audio stream returned by the TTS server is received for the first time.

S400, inputting waveform audio file format wav header information in the audio output stream.

Specifically, the web page side currently supports playing only mp3 and wav format audio. However, each wav audio carries its own header information, so a plurality of separate wav audio files cannot be played as one continuous stream. If a plurality of audio segments are played back to back in mp3 format, excess silent audio is introduced, which causes jitter at the joins between the audio streams and a poor playing effect. For a better user experience, PCM audio is therefore the preferred choice. However, the web page side does not support direct playing of audio in PCM format, and a dedicated PCM audio player would need to be deployed. To overcome these problems, the invention writes wav header information into the newly built audio output stream, so that the webpage plays the audio output stream as audio in the wav format.

In one embodiment of the invention, inputting waveform audio file format wav header information in said audio output stream is achieved by: first, it is monitored whether a segmented audio stream is received from a TTS server for the first time, and waveform audio file format wav header information is input in an audio output stream when the segmented audio stream is received for the first time.

In particular, it is possible to monitor whether the current status of the audio output stream is empty; and when the current state of the audio output stream is empty, confirming that the webpage back-end server receives the segmented audio stream returned by the TTS server for the first time.
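The empty-stream check can be sketched as follows, assuming the audio output stream is modeled as an in-memory buffer. The type audio_out, the function audio_out_append, and the fixed capacity are hypothetical names and values chosen for this sketch; the 44-byte header is supplied by the caller.

```c
#include <stddef.h>
#include <string.h>

#define WAV_HEADER_SIZE 44
#define OUT_CAPACITY 4096 /* hypothetical fixed capacity for this sketch */

typedef struct {
    unsigned char data[OUT_CAPACITY];
    size_t len; /* 0 means the stream is still empty */
} audio_out;

/* Appends one PCM segment to the output stream. An empty stream means this
 * is the first segment received, so the 44-byte WAV header is written first.
 * Returns 0 on success, -1 if the buffer would overflow. */
int audio_out_append(audio_out *out,
                     const unsigned char header[WAV_HEADER_SIZE],
                     const unsigned char *pcm, size_t pcm_len)
{
    size_t need = pcm_len + (out->len == 0 ? WAV_HEADER_SIZE : 0);
    if (need > OUT_CAPACITY - out->len)
        return -1;
    if (out->len == 0) {
        /* First receipt: the header goes in exactly once. */
        memcpy(out->data, header, WAV_HEADER_SIZE);
        out->len = WAV_HEADER_SIZE;
    }
    memcpy(out->data + out->len, pcm, pcm_len);
    out->len += pcm_len;
    return 0;
}
```

Using the stream's own emptiness as the first-receipt signal avoids keeping a separate flag, at the cost of assuming nothing else writes to the stream beforehand.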

For example, the wav header information consists of a fixed 44 bytes of data, described field by field below:

1: bytes 00-03 (4 bytes), 'RIFF' resource interchange file mark

header[0] = 'R';
header[1] = 'I';
header[2] = 'F';
header[3] = 'F';

2: bytes 04-07 (4 bytes), file size minus 8 bytes (total number of bytes from the next byte to the end of the file)

header[4] = (char)((file_size-8)&0xff);
header[5] = (char)(((file_size-8)>>8)&0xff);
header[6] = (char)(((file_size-8)>>16)&0xff);
header[7] = (char)(((file_size-8)>>24)&0xff);

3: bytes 08-11 (4 bytes), 'WAVE' wav file mark

header[8] = 'W';
header[9] = 'A';
header[10] = 'V';
header[11] = 'E';

4: bytes 12-15 (4 bytes), 'fmt ' waveform format mark, whose last character is a space

header[12] = 'f';
header[13] = 'm';
header[14] = 't';
header[15] = ' ';

5: bytes 16-19 (4 bytes), size of the format block that follows (in general 00000010H)

header[16] = 16;
header[17] = 0;
header[18] = 0;
header[19] = 0;

6: bytes 20-21 (2 bytes), format type (when the value is 1, the data is linear PCM code)

header[20] = 1;
header[21] = 0;

7: bytes 22-23 (2 bytes), number of channels (1 for single channel, 2 for double channel)

header[22] = (char)channel;
header[23] = 0;

8: bytes 24-27 (4 bytes), sampling rate

header[24] = (char)(sample_rate&0xff);
header[25] = (char)((sample_rate>>8)&0xff);
header[26] = (char)((sample_rate>>16)&0xff);
header[27] = (char)((sample_rate>>24)&0xff);

9: bytes 28-31 (4 bytes), byte rate (sampling frequency × number of channels × bits per sample / 8); the variable bit_rate holds this byte rate

header[28] = (char)(bit_rate&0xff);
header[29] = (char)((bit_rate>>8)&0xff);
header[30] = (char)((bit_rate>>16)&0xff);
header[31] = (char)((bit_rate>>24)&0xff);

10: bytes 32-33 (2 bytes), data block length (bytes per sample frame = number of channels × bits per sample / 8)

header[32] = (char)(channel*sample_bit/8);
header[33] = 0;

11: bytes 34-35 (2 bytes), bits per sample

header[34] = (char)sample_bit;
header[35] = 0;

12: bytes 36-39 (4 bytes), 'data' data identifier

header[36] = 'd';
header[37] = 'a';
header[38] = 't';
header[39] = 'a';

13: bytes 40-43 (4 bytes), PCM audio data size

header[40] = (char)(data_size&0xff);
header[41] = (char)((data_size>>8)&0xff);
header[42] = (char)((data_size>>16)&0xff);
header[43] = (char)((data_size>>24)&0xff);
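The field-by-field listing above can be gathered into one function, sketched below. The parameter names sample_rate, channel, sample_bit and data_size follow the listing; the function name build_wav_header and the helpers put_le16 and put_le32 are hypothetical. The sketch assumes the file consists of only the 44-byte header followed by PCM data, so that file_size - 8 equals data_size + 36. The source does not say which values are written into the two size fields when the total audio length is not yet known, so that choice is left to the caller.

```c
#include <stdint.h>
#include <string.h>

/* Stores a 32-bit value in little-endian byte order. */
static void put_le32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xff);
    p[1] = (unsigned char)((v >> 8) & 0xff);
    p[2] = (unsigned char)((v >> 16) & 0xff);
    p[3] = (unsigned char)((v >> 24) & 0xff);
}

/* Stores a 16-bit value in little-endian byte order. */
static void put_le16(unsigned char *p, uint16_t v)
{
    p[0] = (unsigned char)(v & 0xff);
    p[1] = (unsigned char)((v >> 8) & 0xff);
}

/* Fills the 44-byte header for linear PCM audio. */
void build_wav_header(unsigned char header[44], uint32_t sample_rate,
                      uint16_t channel, uint16_t sample_bit,
                      uint32_t data_size)
{
    uint16_t block_align = (uint16_t)(channel * sample_bit / 8);
    uint32_t byte_rate = sample_rate * block_align;

    memcpy(header, "RIFF", 4);            /* bytes 00-03 */
    put_le32(header + 4, data_size + 36); /* bytes 04-07: file_size - 8 */
    memcpy(header + 8, "WAVE", 4);        /* bytes 08-11 */
    memcpy(header + 12, "fmt ", 4);       /* bytes 12-15, trailing space */
    put_le32(header + 16, 16);            /* bytes 16-19: format block size */
    put_le16(header + 20, 1);             /* bytes 20-21: 1 = linear PCM */
    put_le16(header + 22, channel);       /* bytes 22-23 */
    put_le32(header + 24, sample_rate);   /* bytes 24-27 */
    put_le32(header + 28, byte_rate);     /* bytes 28-31 */
    put_le16(header + 32, block_align);   /* bytes 32-33 */
    put_le16(header + 34, sample_bit);    /* bytes 34-35 */
    memcpy(header + 36, "data", 4);       /* bytes 36-39 */
    put_le32(header + 40, data_size);     /* bytes 40-43 */
}
```

As a worked example, for 16 kHz, single-channel, 16-bit audio the data block length is 1 × 16 / 8 = 2 bytes and the byte rate is 16000 × 2 = 32000 bytes per second.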

S500, sequentially inputting a plurality of segmented audio streams.

Specifically, the TTS server converts the text information into a plurality of segmented audio streams, and then sequentially transmits the audio streams to the web page backend server, and the web page backend server sequentially inputs the received audio streams into an audio output stream to be played at the web page.

The time the TTS server needs to convert the text information into audio is far shorter than the time the webpage needs to play that audio. Therefore, once the wav header information and the first segment of the audio stream have been written into the audio output stream, the webpage can start playing the first segment of audio, and before the first segment finishes playing, the TTS server can finish converting the second segment and transmit it to the web page back-end server. By analogy, before each segment of audio finishes playing, the next adjacent segment has been converted and transmitted to the web page back-end server. The webpage can therefore play the audio output stream as one complete wav-format audio without jitter or discontinuity.
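The timing argument can be checked numerically with a small sketch. It assumes that segments are synthesized one after another and that playback starts as soon as the first segment arrives; the function names and the per-segment timing arrays are hypothetical and serve only to illustrate the reasoning.

```c
#include <stdint.h>

/* Playback duration in milliseconds of `pcm_bytes` bytes of linear PCM. */
uint64_t pcm_duration_ms(uint64_t pcm_bytes, uint32_t sample_rate,
                         uint16_t channel, uint16_t sample_bit)
{
    uint64_t byte_rate = (uint64_t)sample_rate * channel * sample_bit / 8;
    if (byte_rate == 0)
        return 0;
    return pcm_bytes * 1000 / byte_rate;
}

/* Returns 1 if every segment after the first arrives before the previous
 * one finishes playing, 0 otherwise.
 * synth_ms[i]: assumed time to synthesize and deliver segment i
 * play_ms[i]:  playback duration of segment i */
int plays_without_gap(const uint64_t *synth_ms, const uint64_t *play_ms, int n)
{
    uint64_t arrival = 0;  /* time at which the current segment arrives */
    uint64_t play_end = 0; /* time at which buffered audio runs out */
    for (int i = 0; i < n; i++) {
        arrival += synth_ms[i];
        if (i > 0 && arrival > play_end)
            return 0; /* the player ran dry before segment i arrived */
        if (i == 0)
            play_end = arrival; /* playback starts with the first segment */
        play_end += play_ms[i];
    }
    return 1;
}
```

For instance, 64000 bytes of 16 kHz, single-channel, 16-bit PCM last 64000 × 1000 / 32000 = 2000 ms, so a following segment that takes a few hundred milliseconds to synthesize arrives well before playback of the current one ends.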

In the embodiment of the present invention, the segmented audio stream uses an audio stream in a pulse code modulation PCM format, but the present invention is not limited thereto, and the segmented audio stream may use any suitable audio format.

It should be noted that, in the embodiment of the present invention, the web page back-end server may implement the webpage-side audio generation method in one or more of the following programming languages: Java, Python, C++, C#, PHP. However, the present invention is not limited to these, and the web page back-end server may implement the webpage-side audio generation method in any other suitable language.

Fig. 2 is a flowchart illustrating a specific application example of the method for generating webpage-side audio according to the embodiment of the present invention.

Referring to fig. 2, the method for generating webpage-side audio in this application example may include the following steps:

S1, the webpage end sends text information to the web page back-end server;

S2, the web page back-end server sends the text information to the TTS server;

S3, the web page back-end server constructs an audio output stream;

S4, the TTS server returns the PCM audio stream segment by segment;

S5, the web page back-end server monitors whether a PCM audio stream is being returned for the first time;

S6, when the segmented audio stream returned by the TTS server is received for the first time, waveform audio file format wav header information is input in the audio output stream;

S7, the segmented PCM audio continues to be input in the audio output stream.

Based on the foregoing method for generating a webpage-side audio, an embodiment of the present invention further provides a device for generating a webpage-side audio, and fig. 3 shows a schematic diagram of a structure of the device for generating a webpage-side audio according to the embodiment of the present invention, and as shown in fig. 3, the device for generating a webpage-side audio may include: the text transmission module 301 is configured to receive text information and send the text information to a text-to-speech server; a receiving module 302, configured to receive a plurality of segmented audio streams corresponding to text information and returned by a text-to-speech server in a segmented manner; a construction module 303 for constructing an audio output stream; and an audio transmission module 304, configured to input waveform audio file format wav header information in an audio output stream, and sequentially input a plurality of segmented audio streams.

Fig. 4 is a schematic diagram illustrating a composition structure of a text transmission module according to an embodiment of the present invention, and referring to fig. 4, according to an embodiment of the present invention, a text transmission module 301 according to an embodiment of the present invention includes: the text receiving submodule 3011 is configured to receive text information using a hypertext transfer protocol HTTP; and a text sending sub-module 3012, configured to send the text information to the text-to-speech server by using a hypertext transfer protocol HTTP.

Fig. 5 is a schematic diagram illustrating a composition structure of an audio transmission module according to an embodiment of the present invention, and referring to fig. 5, according to an embodiment of the present invention, an audio transmission module 304 according to an embodiment of the present invention includes: a monitoring submodule 3041 for monitoring whether the segmented audio stream is received for the first time; and a transmitting submodule 3042 for inputting waveform audio file format wav header information in the audio output stream when the segmented audio stream is received for the first time.

For other specific implementation details and beneficial effects of the web-side audio generating device, reference is made to the above-mentioned webpage-side audio generation method. For technical details not disclosed in the device embodiment, refer to the description of the method embodiment shown in figs. 1 to 2 of the present invention; for brevity, they are not repeated here.

Although the embodiments of the present invention have been described in detail with reference to the accompanying drawings, the embodiments of the present invention are not limited to the details of the above embodiments, and various simple modifications can be made to the technical solutions of the embodiments of the present invention within the technical idea of the embodiments of the present invention, and the simple modifications all belong to the protection scope of the embodiments of the present invention.

It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. In order to avoid unnecessary repetition, the embodiments of the present invention do not describe every possible combination.

The webpage-side audio generating device comprises a processor and a memory, wherein the text transmission module, the receiving module, the constructing module, the audio transmission module and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.

The processor comprises a core, and the core calls the corresponding program unit from the memory. One or more cores may be provided, and the technical problem to be solved by the application is addressed by adjusting the core parameters.

The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.

An embodiment of the present invention provides a storage medium on which a program is stored, where the program, when executed by a processor, implements a web page side audio generation method.

The embodiment of the invention provides equipment, which comprises a processor, a memory and a program which is stored on the memory and can run on the processor, wherein the processor realizes the webpage end audio generation method when executing the program. The device herein may be a server, a PC, a PAD, a mobile phone, etc.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A webpage-side audio generating method, configured to convert a text of a webpage side into audio that can be played on the webpage side, the method comprising:

receiving text information and sending the text information to a text-to-speech server;

receiving a plurality of segmented audio streams corresponding to the text information and fed back by the text-to-speech server in a segmented manner;

constructing an audio output stream; and

inputting waveform audio file format wav header information into the audio output stream, and sequentially inputting a plurality of the segmented audio streams.

2. The method for generating webpage-side audio according to claim 1, wherein both the receiving of the text information and the sending of the text information to the text-to-speech server use the hypertext transfer protocol HTTP.

3. The method for generating audio at a web site according to claim 1, wherein said inputting waveform audio file format wav header information in said audio output stream comprises:

monitoring whether the segmented audio stream is received for the first time; and

inputting waveform audio file format wav header information in the audio output stream when the segmented audio stream is received for the first time.

4. The method for generating webpage side audio according to claim 3, wherein the monitoring whether the segmented audio stream is received for the first time comprises:

monitoring whether the current state of the audio output stream is empty; and

when the current state of the audio output stream is empty, confirming that the segmented audio stream is received for the first time.

5. The method for generating audio on a web page side according to claim 1, wherein the segmented audio stream is an audio stream in a Pulse Code Modulation (PCM) format.

6. A web page end audio generating device, wherein the web page end audio generating device comprises:

the text transmission module is used for receiving text information and sending the text information to the text-to-speech server;

the receiving module is used for receiving a plurality of segmented audio streams which are returned by the text-to-speech server in a segmented mode and correspond to the text information;

the building module is used for building an audio output stream; and

the audio transmission module is used for inputting waveform audio file format wav header information into the audio output stream and sequentially inputting a plurality of the segmented audio streams.

7. The apparatus for generating audio on web page side according to claim 5, wherein the text transmission module comprises:

the text receiving submodule is used for receiving text information by adopting a hypertext transfer protocol (HTTP); and

the text sending submodule is used for sending the text information to a text-to-speech server by adopting a hypertext transfer protocol (HTTP).

8. The apparatus for generating audio on web page side according to claim 5, wherein the audio transmission module comprises:

a monitoring submodule for monitoring whether the segmented audio stream is received for the first time; and

the transmission submodule is used for inputting waveform audio file format wav header information into the audio output stream when the segmented audio stream is received for the first time.

9. A machine-readable storage medium having stored thereon instructions for causing a machine to execute the method for webpage-side audio generation according to any one of claims 1-5.

10. An apparatus comprising at least one processor, and at least one memory and a bus connected to the processor; the processor and the memory complete mutual communication through the bus; the processor is configured to call program instructions in the memory to perform the web page side audio generation method of any one of claims 1-5.