CN113851106B - Audio playing method and device, electronic equipment and readable storage medium - Google Patents

Audio playing method and device, electronic equipment and readable storage medium

Info

Publication number
CN113851106B
CN113851106B (application number CN202110942080.2A)
Authority
CN
China
Prior art keywords
played
sentence
emotion
voice
sound effect
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110942080.2A
Other languages
Chinese (zh)
Other versions
CN113851106A (en)
Inventor
高聪
崔璐
白洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110942080.2A
Publication of CN113851106A
Application granted
Publication of CN113851106B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/027 - Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides an audio playing method and apparatus, an electronic device, and a readable storage medium, and relates to artificial intelligence technologies such as speech processing and deep learning. The audio playing method comprises the following steps: acquiring a sentence to be played; obtaining the speech emotion, voice timbre, scene sound effect, and background music of the sentence to be played according to its text content; generating target audio of the sentence to be played using the speech emotion and the voice timbre; and playing the scene sound effect and the background music while playing the target audio. The method and apparatus can improve the user's listening experience during audio playback and make the played audio more realistic and vivid.

Description

Audio playing method and device, electronic equipment and readable storage medium
Technical Field
The present disclosure relates to the field of computer technology, and more particularly to the field of artificial intelligence techniques such as speech processing and deep learning. An audio playing method, an audio playing device, an electronic device and a readable storage medium are provided.
Background
At present, there are a large number of apps and software products on the market with audio playing functions, such as audiobook reading and storytelling applications. The audio resources in such apps or software are synthesized from the corresponding text using speech synthesis technology. However, the audio generated by existing speech synthesis technology is stiff and monotonous, which results in a poor listening experience for the user.
Disclosure of Invention
According to a first aspect of the present disclosure, there is provided an audio playing method, including: obtaining a sentence to be played; obtaining the speech emotion, voice timbre, scene sound effect, and background music of the sentence to be played according to the text content of the sentence to be played; generating target audio of the sentence to be played by using the speech emotion and the voice timbre; and playing the scene sound effect and the background music while playing the target audio.
According to a second aspect of the present disclosure, there is provided an audio playback apparatus including: the acquisition unit is used for acquiring sentences to be played; the processing unit is used for obtaining the voice emotion, the voice tone, the scene sound effect and the background music of the sentence to be played according to the text content of the sentence to be played; the generating unit is used for generating the target audio of the sentence to be played by using the voice emotion and the voice tone; and the playing unit is used for playing the target audio and playing the scene sound effect and the background music.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method as described above.
According to the technical solution of the present disclosure, the user's listening experience during audio playback can be improved, and the realism and vividness of the played audio can be enhanced.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
fig. 3 is a block diagram of an electronic device for implementing an audio playing method according to an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. As shown in fig. 1, the audio playing method of this embodiment may specifically include the following steps:
s101, obtaining a sentence to be played;
s102, obtaining voice emotion, voice tone, scene sound effect and background music of the sentence to be played according to the text content of the sentence to be played;
s103, generating a target audio of the sentence to be played by using the voice emotion and the voice tone;
and S104, playing the target audio and playing the scene sound effect and the background music.
According to the audio playing method of this embodiment, after the sentence to be played is obtained, the speech emotion, voice timbre, scene sound effect, and background music of the sentence are first obtained according to its text content; the target audio of the sentence is then generated using the speech emotion and the voice timbre; finally, the obtained scene sound effect and background music are played while the target audio is played.
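By way of illustration only, the overall S101-S104 flow described above can be sketched in Python as follows. Every helper in this sketch is a simplified, hypothetical stand-in for the components explained in the embodiments below (keyword tables, a default timbre, a placeholder synthesis step); it is not the implementation disclosed by this patent.

# --- Illustrative sketch of the S101-S104 flow (hypothetical helpers) ---
from typing import Optional

def get_speech_emotion(sentence: str) -> str:
    # S102 (emotion): simple keyword lookup; "ordinary" is the assumed default
    for word, emotion in {"happy": "happy", "sad": "sad", "angry": "angry"}.items():
        if word in sentence:
            return emotion
    return "ordinary"

def get_voice_timbre(sentence: str) -> str:
    # S102 (timbre): default "voice-over" timbre when no speaking role is found
    return "voice-over"

def get_scene_sound_effect(sentence: str) -> Optional[str]:
    # S102 (scene effect): scene-word lookup, detailed further below
    return "sfx_thunder.wav" if "thunder" in sentence else None

def get_background_music(emotion: str) -> Optional[str]:
    # S102 (music): emotion-to-music lookup, detailed further below
    return {"sad": "bgm_sad.mp3", "happy": "bgm_happy.mp3"}.get(emotion)

def generate_target_audio(sentence: str, emotion: str, timbre: str) -> bytes:
    # S103: placeholder for the pre-trained audio generation model
    return f"<audio|{timbre}|{emotion}|{sentence}>".encode("utf-8")

def play_sentence(sentence: str) -> None:
    emotion = get_speech_emotion(sentence)                      # S102
    timbre = get_voice_timbre(sentence)
    effect = get_scene_sound_effect(sentence)
    music = get_background_music(emotion)
    audio = generate_target_audio(sentence, emotion, timbre)    # S103
    print("play:", audio, "| effect:", effect, "| music:", music)  # S104 (stub)

play_sentence("A clap of thunder rolled outside, and she felt sad.")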
In this embodiment, the sentence to be played obtained in S101 may be a dialogue sentence and/or a voice-over (narration) sentence in a text to be played, and the text to be played may be any of various types of text, such as the text of a novel. The sentence to be played obtained in S101 may include one or more sentences. Playing of the text to be played can therefore be completed by sequentially acquiring each sentence to be played in the text.
In this embodiment, after the to-be-played sentence is acquired in step S101, step S102 is executed to obtain the speech emotion, the speech timbre, the scene sound effect and the background music of the to-be-played sentence according to the text content of the to-be-played sentence.
It can be understood that, if the to-be-played sentence obtained by executing S101 in the present embodiment includes a plurality of sentences, the voice emotion, voice timbre, scene sound effect, and background music of each sentence can be obtained by executing S102 in the present embodiment.
Here, the speech emotion of the sentence to be played obtained in S102 is used to represent the emotion type of the sentence, and the emotion type may be, for example, ordinary (neutral), happy, worried, angry, doubtful, sarcastic, and the like; the voice timbre of the sentence to be played is used to represent the speaking role in the sentence, which may be a voice-over (narrator) role or a specific character role; the scene sound effect of the sentence to be played is used to represent a scene appearing in the sentence, such as rain, wind, or thunder; and the background music of the sentence to be played is used to represent music corresponding to the emotion type of the sentence, such as music corresponding to 'sad' or music corresponding to 'happy'.
The voice timbre of the to-be-played sentence obtained by executing S102 in this embodiment may correspond to a specific character in the to-be-played sentence, for example, a name of a certain character in a novel; it may also correspond to attribute information of a person in the sentence to be played, for example, to the age, sex, occupation, etc. of the person in the novel.
Specifically, when S102 is executed to obtain the speech emotion of the sentence to be played according to the text content of the sentence to be played in the embodiment, the optional implementation manner that may be adopted is: determining the emotion type of a sentence to be played; and taking the determined emotion type as the voice emotion of the sentence to be played.
In step S102, the emotion type of the sentence to be played may be determined according to the emotion words in the sentence. For example, preset emotion words such as 'happy' and 'unhappy' are extracted from the sentence to be played, and the emotion type 'happy' corresponding to the emotion word 'happy', the emotion type 'unhappy' corresponding to the emotion word 'unhappy', and so on, are obtained.
That is to say, the speech emotion obtained in S102 is the emotion to be reflected when the audio is played; by obtaining the speech emotion of the sentence to be played, the emotional changes of different characters can be reflected more accurately during audio playback.
In this embodiment, when S102 is executed to obtain the speech emotion of the sentence to be played according to its text content, the sentence to be played may be input into an emotion recognition model obtained by pre-training, and the speech emotion of the sentence to be played determined according to the output result of the emotion recognition model, for example by taking the emotion type with the maximum probability value output by the model as the speech emotion of the sentence to be played.
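As an illustration of the two ways of obtaining the speech emotion mentioned above (a preset emotion-word lookup, and taking the highest-probability class output by a pre-trained emotion recognition model), a minimal Python sketch is given below. The emotion-word table, the label set, and the probability values are assumptions made up for the example; the patent does not specify a concrete word list or model.

# --- Illustrative sketch of emotion determination (assumed word list and labels) ---
from typing import Dict, List

EMOTION_WORDS: Dict[str, str] = {
    "happy": "happy", "delighted": "happy",
    "sad": "sad", "cried": "sad",
    "angry": "angry", "furious": "angry",
}

def emotion_by_keywords(sentence: str, default: str = "ordinary") -> str:
    """Rule-based variant: return the emotion type of the first preset
    emotion word found in the sentence, or the default emotion type."""
    for word, emotion in EMOTION_WORDS.items():
        if word in sentence.lower():
            return emotion
    return default

def emotion_by_model(labels: List[str], probs: List[float]) -> str:
    """Model-based variant: given the label set and the probability
    distribution output by the emotion recognition model for a sentence,
    pick the emotion type with the maximum probability value."""
    return labels[max(range(len(probs)), key=probs.__getitem__)]

print(emotion_by_keywords("She was sad all afternoon."))           # sad
print(emotion_by_model(["ordinary", "happy", "sad", "angry"],
                       [0.10, 0.05, 0.15, 0.70]))                  # angry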
Specifically, when executing S102 to obtain the voice timbre of the sentence to be played according to its text content, an optional implementation is as follows: determining the speaking role in the sentence to be played; extracting a role identifier corresponding to the determined speaking role, such as 'Zhang San', 'teacher', 'father', or 'mother'; and obtaining the voice timbre of the sentence to be played according to the extracted role identifier.
In this embodiment, a correspondence table between roles and timbres may be preset, and the timbre corresponding to the extracted role identifier in the correspondence table is used as the voice timbre of the sentence to be played.
In this embodiment, when S102 is executed to determine the speaking role in the sentence to be played, the optional implementation manner that can be adopted is as follows: under the condition that the sentence to be played is determined to contain the specific symbol, determining a speaking role in the sentence to be played from text contents before and/or after the specific symbol in the sentence to be played, for example, taking a role corresponding to words such as 'say', 'shout' and the like in the text contents as the speaking role.
The specific symbol in this embodiment is a symbol that is set in advance and used to identify a spoken (quoted) sentence; the specific symbol may be a quotation mark, such as a single quotation mark (' ') or a double quotation mark (" "), or a bracket, such as a double angle bracket (《 》).
For example, suppose the sentence to be played obtained in S101 is [The principal motioned for us to sit down, and then said to the teacher in a low voice: "Mr. Rorui, I am handing this student over to you; please let him join the fifth-grade class."]. In this embodiment, S102 determines that the sentence to be played contains double quotation marks, and further determines, from the text content before the double quotation marks, that the speaking role in the sentence is the 'principal'.
For another example, suppose the sentence to be played obtained in S101 is [She said: "Yesterday my father bought me a lovely dog. Dad said: 'You must take good care of the dog!' I said: 'OK!'"]. In this embodiment, S102 determines that the sentence to be played contains double quotation marks and single quotation marks, and further determines, from the text content before the double quotation marks and before the single quotation marks, that the speaking roles in the sentence are 'she' and 'dad'.
For another example, suppose the sentence to be played obtained in S101 is [《Please give me one.》 he said to the clerk.]. In this embodiment, S102 determines that the sentence to be played contains the bracket, and further determines, from the text content after the bracket, that the speaking role in the sentence is 'he'.
It can be understood that if S102 determines that the sentence to be played does not contain the specific symbol, or no speaking role can be obtained from the text content before and/or after the specific symbol, this indicates that the current sentence to be played is a voice-over sentence, and its voice timbre is set to a default voice timbre, for example the 'voice-over' (narrator) timbre.
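The speaking-role and timbre logic discussed in this passage can be pictured with the rough Python sketch below: if the sentence contains the specific symbol (double quotation marks are used here), the text outside the quoted part is scanned for a role next to a reporting word such as 'said' or 'shouted'; otherwise, or when no role is found, the default 'voice-over' timbre is used. The regular expression, the reporting-word list, and the role-to-timbre correspondence table are all assumptions introduced for the example.

# --- Illustrative sketch of speaking-role detection and timbre lookup (assumed tables) ---
import re
from typing import Optional, Tuple

ROLE_TO_TIMBRE = {
    "teacher": "calm_female",
    "dad": "deep_male",
    "principal": "elderly_male",
    "voice-over": "narrator_neutral",
}

REPORTING_WORDS = ("said", "says", "shouted", "yelled", "asked")

def find_speaking_role(sentence: str) -> Optional[str]:
    """Return the word preceding a reporting word in the text outside the
    quotation marks, or None when there is no quoted speech or no speaker."""
    match = re.search(r'["\u201c].+?["\u201d]', sentence)   # specific symbol check
    if not match:
        return None
    outside = sentence[:match.start()] + " " + sentence[match.end():]
    for word in REPORTING_WORDS:
        m = re.search(r"(\w+)\s+" + word, outside)
        if m:
            return m.group(1).lower()
    return None

def get_voice_timbre(sentence: str) -> Tuple[str, str]:
    role = find_speaking_role(sentence) or "voice-over"     # default: voice-over
    return role, ROLE_TO_TIMBRE.get(role, ROLE_TO_TIMBRE["voice-over"])

print(get_voice_timbre('The teacher shouted, "Louder!"'))   # ('teacher', 'calm_female')
print(get_voice_timbre("It kept raining all night."))       # ('voice-over', 'narrator_neutral')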
That is to say, the voice timbre obtained in S102 is the timbre to be reflected when the audio is played; by obtaining the voice timbre of the sentence to be played, the switching between different roles can be reflected more accurately during audio playback.
Specifically, when executing S102 to obtain the scene sound effect of the sentence to be played according to its text content, an optional implementation is as follows: extracting scene words from the sentence to be played, for example preset words such as 'thunder', 'wind and rain', 'snow', 'hissing', or 'laughter'; and obtaining the scene sound effect of the sentence to be played according to the extracted scene words. For example, when the scene word contained in the sentence to be played is 'thunder', the obtained scene sound effect is the sound of thunder.
In this embodiment, a corresponding relationship table between scenes and sound effects may be preset, and the sound effect corresponding to the extracted scene word in the corresponding relationship table is used as the scene sound effect of the sentence to be played.
In this embodiment, when the S102 is executed to obtain the scene sound effect of the to-be-played sentence according to the extracted scene word, the extracted scene word may be input into the sound effect generation model obtained by pre-training, so that the output result of the sound effect generation model is used as the scene sound effect of the to-be-played sentence.
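A rough Python sketch of the scene-word extraction and scene-to-sound-effect lookup described above follows; the scene-word table and the sound-effect file names are assumptions made up for the example, and the model-based variant would simply replace the table lookup with a call to the pre-trained sound effect generation model.

# --- Illustrative sketch of scene sound effect selection (assumed scene-word table) ---
from typing import List, Optional

SCENE_TO_EFFECT = {
    "thunder": "sfx_thunder.wav",
    "rain": "sfx_rain.wav",
    "wind": "sfx_wind.wav",
    "snow": "sfx_snowstorm.wav",
    "laughter": "sfx_laughter.wav",
}

def extract_scene_words(sentence: str) -> List[str]:
    """Return the preset scene words that occur in the sentence."""
    return [word for word in SCENE_TO_EFFECT if word in sentence.lower()]

def get_scene_sound_effect(sentence: str) -> Optional[str]:
    """Map the first extracted scene word to its sound effect; return None
    when the sentence contains no scene word."""
    words = extract_scene_words(sentence)
    return SCENE_TO_EFFECT[words[0]] if words else None

print(get_scene_sound_effect("A clap of thunder rolled across the sky."))  # sfx_thunder.wav
print(get_scene_sound_effect("He quietly closed the book."))               # None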
That is to say, the scene sound effect obtained in S102 represents the scene to be reflected when the audio is played; by obtaining the scene sound effect of the sentence to be played, the atmosphere can be set during audio playback, giving the user a stronger sense of being present in the scene.
Specifically, when executing S102 to obtain the background music of the sentence to be played according to its text content, an optional implementation is as follows: determining the emotion type of the sentence to be played; and obtaining the background music of the sentence to be played according to the determined emotion type. For example, when the sentence to be played contains the word 'sad', the obtained background music is sad music.
In this embodiment, a correspondence table between emotion words and music may be preset, and the music corresponding to the extracted emotion words in the correspondence table is used as the background music of the sentence to be played.
In this embodiment, when S102 is executed to obtain the background music of the sentence to be played according to the determined emotion type, the determined emotion type may be input into a music generation model obtained by pre-training, so that the output result of the music generation model is used as the background music of the sentence to be played.
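Similarly, the emotion-to-music mapping described above can be pictured with the short sketch below; the correspondence table and file names are illustrative assumptions, and the model-based variant would replace the lookup with a call to the pre-trained music generation model.

# --- Illustrative sketch of background music selection (assumed emotion-to-music table) ---
from typing import Optional

EMOTION_TO_MUSIC = {
    "sad": "bgm_sad_piano.mp3",
    "happy": "bgm_light_strings.mp3",
    "angry": "bgm_tense_drums.mp3",
}

def get_background_music(emotion_type: str) -> Optional[str]:
    """Choose background music by the emotion type determined for the
    sentence; None means no dedicated background music."""
    return EMOTION_TO_MUSIC.get(emotion_type)

print(get_background_music("sad"))       # bgm_sad_piano.mp3
print(get_background_music("ordinary"))  # None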
That is, the background music obtained in S102 is the music used to reflect the atmosphere when the audio is played; by obtaining the background music of the sentence to be played, the user's listening experience during audio playback can be improved.
It can be understood that, if one or more of the voice emotion, the voice tone, the scene sound effect, and the background music of the to-be-played sentence cannot be obtained by executing S102 in this embodiment, the default voice emotion, the default voice tone, the default scene sound effect, and the default background music may be used as the voice emotion, the voice tone, the scene sound effect, and the background music of the to-be-played sentence.
In this embodiment, after the step S102 is executed to obtain the speech emotion, the speech timbre, the scene sound effect, and the background music of the to-be-played sentence, the step S103 is executed to generate the target audio of the to-be-played sentence by using the determined speech emotion and speech timbre. In this embodiment, the target audio generated by executing S103 has corresponding emotion and timbre.
In this embodiment, when S103 is executed to generate a target audio of a sentence to be played by using the determined speech emotion and speech timbre, an optional implementation manner that may be adopted is: and inputting the sentence to be played, the voice emotion and the voice tone of the sentence to be played into a pre-trained audio generation model, and taking an output result of the audio generation model as a target audio of the sentence to be played.
For example, if S102 determines that the speech emotion corresponding to the sentence to be played is 'angry' and the voice timbre is a 'deep male voice', then S103 generates target audio in which the text content of the sentence to be played is spoken angrily in a deep male voice.
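How the sentence, its speech emotion, and its voice timbre might be handed to the audio generation model can be pictured as follows. The AudioGenerationModel class is only a stub standing in for the pre-trained model; the patent states that the model is obtained by pre-training and that its output is taken as the target audio, but does not disclose its internals.

# --- Illustrative sketch of S103 with a stub audio generation model ---
from dataclasses import dataclass

@dataclass
class AudioGenerationModel:
    """Stub standing in for the pre-trained audio generation (TTS) model."""
    sample_rate: int = 24000

    def synthesize(self, text: str, emotion: str, timbre: str) -> bytes:
        # A real model would return waveform samples; a tagged byte string is
        # returned here only so the S103 flow can be demonstrated end to end.
        return f"<{timbre}|{emotion}|{text}>".encode("utf-8")

def generate_target_audio(sentence: str, emotion: str, timbre: str,
                          model: AudioGenerationModel) -> bytes:
    # S103: feed the sentence plus its speech emotion and voice timbre to the
    # model, and take the model output as the target audio of the sentence.
    return model.synthesize(sentence, emotion, timbre)

model = AudioGenerationModel()
print(generate_target_audio("Get out of my office!", "angry", "deep_male", model))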
After executing S103 to generate the target audio of the sentence to be played, the present embodiment executes S104 to play the generated target audio, and plays the determined scene sound effect and the background music.
In this embodiment, when S104 is executed to play the generated target audio together with the determined scene sound effect and background music, the scene sound effect and the background music may start to play at the same time as the target audio begins to play.
In order to further improve the audio playing effect and enhance the user listening feeling, in this embodiment, when S104 is executed to play the target audio and play the scene sound effect, the optional implementation manners that can be adopted are: and when detecting that the scene words in the target audio are played, playing the scene sound effect.
That is to say, in this embodiment the scene sound effect is played only when the target audio reaches the scene word, which makes the playing of the scene sound effect more accurate and further improves the user's listening experience.
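One plausible way to trigger the scene sound effect only when the scene word is actually reached during playback is to estimate the word's playback time from its character offset in the sentence, as in the sketch below. The proportional-timing heuristic is an assumption added for illustration; the patent only states that the scene sound effect is played when it is detected that the scene word in the target audio is being played.

# --- Illustrative sketch of scheduling the scene sound effect at the scene word ---
def scene_effect_start_time(sentence: str, scene_word: str,
                            audio_duration_s: float) -> float:
    """Estimate when the scene word is spoken, assuming speech progresses
    roughly linearly through the characters of the sentence."""
    index = sentence.find(scene_word)
    if index < 0:
        return 0.0
    return audio_duration_s * index / max(len(sentence), 1)

sentence = "Just then a clap of thunder shook the windows."
offset = scene_effect_start_time(sentence, "thunder", 3.0)
print(round(offset, 2))  # ~1.3: start sfx_thunder.wav about 1.3 s into the target audio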
For example, if the sentence to be played in this embodiment is [Whatever faint sound there was, it was drowned out by the hissing in the class.], the sentence to be played is a voice-over sentence, the target audio is generated using the voice-over timbre, and the 'hissing' scene sound effect is played when the scene word is reached during playback of the target audio. If the sentence to be played in this embodiment is ["Louder!" the teacher shouted, "Speak up!"], the sentence to be played contains both a dialogue part and a voice-over part: the dialogue part of the target audio ("Louder!" and "Speak up!") is generated using the timbre of 'teacher' and the emotion of 'angry', and the voice-over part ("the teacher shouted") is generated using the 'voice-over' timbre and the emotion of 'angry'.
Fig. 2 is a schematic diagram according to a second embodiment of the present disclosure. As shown in fig. 2, the audio playing device 200 of the present embodiment includes:
the obtaining unit 201 is configured to obtain a sentence to be played;
the processing unit 202 is configured to obtain a speech emotion, a speech timbre, a scene sound effect, and background music of the sentence to be played according to the text content of the sentence to be played;
the generating unit 203 is used for generating a target audio of the sentence to be played by using the voice emotion and the voice timbre;
the playing unit 204 is configured to play the target audio and play the scene sound effect and the background music.
The sentence to be played acquired by the acquiring unit 201 may be a dialogue sentence and/or a voice-over sentence in a text to be played, and the text to be played may be any of various types of text; the sentence to be played acquired by the acquiring unit 201 may include one or more sentences. Playing of the text to be played can therefore be completed by sequentially acquiring each sentence to be played in the text.
In this embodiment, after the obtaining unit 201 obtains the sentence to be played, the processing unit 202 obtains the speech emotion, the speech timbre, the scene sound effect and the background music of the sentence to be played according to the text content of the sentence to be played.
The speech emotion of the sentence to be played obtained by the processing unit 202 is used for representing the emotion type of the sentence to be played; the voice tone of the sentence to be played is used for representing the speaking role in the sentence to be played; the scene sound effect of the sentence to be played is used for representing the scene appearing in the sentence to be played; and the background music of the sentence to be played is used for expressing the music corresponding to the emotion type of the sentence to be played.
The voice timbre of the to-be-played sentence obtained by the processing unit 202 may correspond to a specific character in the to-be-played sentence, or may correspond to attribute information of the character in the to-be-played sentence.
Specifically, when the processing unit 202 obtains the speech emotion of the sentence to be played according to the text content of the sentence to be played, the optional implementation manner that can be adopted is as follows: determining the emotion type of a sentence to be played; and taking the determined emotion type as the voice emotion of the sentence to be played.
The processing unit 202 may determine the emotion type of the to-be-played sentence according to the emotion words in the to-be-played sentence, and may also input the to-be-played sentence into an emotion recognition model obtained by pre-training, and determine the speech emotion of the to-be-played sentence according to the output result of the emotion recognition model.
That is to say, the speech emotion obtained by the processing unit 202 is an emotion to be reflected when playing the audio, and by obtaining the speech emotion of the sentence to be played, emotion changes of different characters can be reflected more accurately when playing the audio.
Specifically, when the processing unit 202 obtains the voice timbre of the sentence to be played according to the text content of the sentence to be played, the optional implementation manner that can be adopted is as follows: determining a speaking role in a sentence to be played; extracting a role identification corresponding to the determined speaking role; and obtaining the voice tone of the sentence to be played according to the extracted role identification.
In this embodiment, a correspondence table between roles and timbres may be preset, and the timbre corresponding to the extracted role identifier in the correspondence table is used as the voice timbre of the sentence to be played.
When the processing unit 202 determines the speaking role in the sentence to be played, the optional implementation manners that can be adopted are as follows: and under the condition that the sentence to be played contains the specific symbol, determining the speaking role in the sentence to be played from the text content before and/or after the specific symbol in the sentence to be played.
It is to be understood that, if the processing unit 202 determines that the sentence to be played does not include the specific symbol, or no speaking role can be obtained from the text content before and/or after the specific symbol, this indicates that the current sentence to be played is a voice-over sentence, and the processing unit 202 sets the voice timbre of the sentence to be played to a default voice timbre.
That is to say, the voice timbre obtained by the processing unit 202 is the timbre to be reflected when the audio is played; by obtaining the voice timbre of the sentence to be played, the switching between different roles can be reflected more accurately during audio playback.
Specifically, when the processing unit 202 obtains the scene sound effect of the to-be-played sentence according to the text content of the to-be-played sentence, the optional implementation manner that can be adopted is as follows: extracting scene words from the sentence to be played; and obtaining the scene sound effect of the sentence to be played according to the extracted scene words.
In this embodiment, a corresponding relationship table between scenes and sound effects may be preset, and the sound effect corresponding to the extracted scene word in the corresponding relationship table is used as the scene sound effect of the sentence to be played.
When the processing unit 202 obtains the scene sound effect of the to-be-played sentence according to the extracted scene word, the extracted scene word may be input into a sound effect generation model obtained by pre-training, so that the output result of the sound effect generation model is used as the scene sound effect of the to-be-played sentence.
That is to say, the scene sound effect obtained by the processing unit 202 represents the scene to be reflected when the audio is played; by obtaining the scene sound effect of the sentence to be played, the atmosphere can be set during audio playback, giving the user a stronger sense of being present in the scene.
Specifically, when the processing unit 202 obtains the background music of the sentence to be played according to the text content of the sentence to be played, an optional implementation is as follows: determining the emotion type of the sentence to be played; and obtaining the background music of the sentence to be played according to the determined emotion type.
In this embodiment, a correspondence table between emotion words and music may be preset, and the music corresponding to the extracted emotion words in the correspondence table is used as the background music of the sentence to be played.
When obtaining the background music of the sentence to be played according to the determined emotion type, the processing unit 202 may further input the determined emotion type into a music generation model obtained by pre-training, so as to use an output result of the music generation model as the background music of the sentence to be played.
That is, the background music obtained by the processing unit 202 is the music used to reflect the atmosphere when the audio is played; by obtaining the background music of the sentence to be played, the user's listening experience during audio playback can be improved.
In the embodiment, after the processing unit 202 obtains the speech emotion, the speech timbre, the scene sound effect and the background music of the sentence to be played, the generating unit 203 generates the target audio of the sentence to be played by using the determined speech emotion and speech timbre. The target audio generated by the generating unit 203 has corresponding emotion and tone.
When the generating unit 203 generates the target audio of the sentence to be played by using the determined speech emotion and speech timbre, the following optional implementation manners may be adopted: and inputting the sentence to be played, the voice emotion and the voice tone of the sentence to be played into a pre-trained audio generation model, and taking an output result of the audio generation model as a target audio of the sentence to be played.
In the present embodiment, after the target audio of the sentence to be played is generated by the generating unit 203, the generated target audio is played by the playing unit 204, and the determined scene sound effect and the background music are played.
When playing the generated target audio together with the determined scene sound effect and background music, the playing unit 204 may start playing the scene sound effect and the background music at the same time as the target audio begins to play.
In order to further improve the audio playing effect and enhance the user's listening experience, when the playing unit 204 plays the target audio and plays the scene sound effect, an optional implementation is: playing the scene sound effect when it is detected that a scene word in the target audio is being played.
That is to say, the playing unit 204 plays the scene sound effect only when the target audio reaches the scene word, which makes the playing of the scene sound effect more accurate and further improves the user's listening experience.
In the technical solution of the present disclosure, the acquisition, storage, and application of the personal information of the users involved all comply with relevant laws and regulations, and do not violate public order and good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
Fig. 3 is a block diagram of an electronic device for implementing the audio playing method according to an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 3, the apparatus 300 includes a computing unit 301 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 302 or a computer program loaded from a storage unit 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data required for the operation of the device 300 can also be stored. The computing unit 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
Various components in device 300 are connected to I/O interface 305, including: an input unit 306 such as a keyboard, a mouse, or the like; an output unit 307 such as various types of displays, speakers, and the like; a storage unit 308 such as a magnetic disk, optical disk, or the like; and a communication unit 309 such as a network card, modem, wireless communication transceiver, etc. The communication unit 309 allows the device 300 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Computing unit 301 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 301 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 301 performs the various methods and processes described above, such as an audio playback method. For example, in some embodiments, the audio playback method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 308.
In some embodiments, part or all of the computer program may be loaded onto and/or installed onto the device 300 via the ROM 302 and/or the communication unit 309. When the computer program is loaded into the RAM 303 and executed by the computing unit 301, one or more steps of the audio playing method described above may be performed. Alternatively, in other embodiments, the computing unit 301 may be configured to perform the audio playing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable audio playback device, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and addresses the defects of high management difficulty and weak service scalability in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (6)

1. An audio playback method, comprising:
acquiring a sentence to be played;
obtaining voice emotion, voice tone, scene sound effect and background music of the sentence to be played according to the text content of the sentence to be played; obtaining the voice emotion of the sentence to be played according to the text content of the sentence to be played, wherein the obtaining of the voice emotion of the sentence to be played comprises the following steps: inputting a sentence to be played into an emotion recognition model obtained by pre-training, and taking the emotion type with the maximum probability value output by the emotion recognition model as the voice emotion of the sentence to be played; the obtaining the scene sound effect of the sentence to be played according to the text content of the sentence to be played comprises: extracting scene words from the sentences to be played; inputting the extracted scene words into a sound effect generation model obtained by pre-training, and taking an output result of the sound effect generation model as a scene sound effect of the sentence to be played; obtaining the voice tone of the sentence to be played according to the text content of the sentence to be played, including: determining a speaking role in a sentence to be played; extracting a role identification corresponding to the determined speaking role; extracting the tone corresponding to the character identifier according to a corresponding relation table between the character and the tone to be used as the voice tone of the sentence to be played; determining the speaking role in the sentence to be played comprises the following steps: under the condition that the sentence to be played contains a specific symbol, determining a speaking role in the sentence to be played from text contents positioned before and/or after the specific symbol in the sentence to be played; setting the voice tone of the sentence to be played as a default voice tone under the condition that the sentence to be played does not contain the specific symbol or when a speaking role is not obtained in the text content before and/or after the specific symbol; the obtaining the background music of the sentence to be played according to the text content of the sentence to be played comprises: inputting the determined emotion type into a music generation model obtained by pre-training, and taking an output result of the music generation model as background music of a sentence to be played;
generating the target audio of the sentence to be played by using the voice emotion and the voice tone, wherein the method comprises the following steps: inputting the sentence to be played, the voice emotion and the voice tone of the sentence to be played into an audio generation model obtained by pre-training, and taking an output result of the audio generation model as a target audio of the sentence to be played;
and playing the target audio and the scene sound effect and the background music.
2. The method of claim 1, wherein the playing the target audio and playing the scene sound effect comprises:
and when detecting that the scene words in the target audio are played, playing the scene sound effect.
3. An audio playback apparatus comprising:
the acquisition unit is used for acquiring the sentence to be played;
the processing unit is used for obtaining the voice emotion, the voice tone, the scene sound effect and the background music of the sentence to be played according to the text content of the sentence to be played; when the processing unit obtains the speech emotion of the sentence to be played according to the text content of the sentence to be played, the processing unit specifically executes: inputting a sentence to be played into an emotion recognition model obtained by pre-training, and taking the emotion type with the maximum probability value output by the emotion recognition model as the voice emotion of the sentence to be played; when the processing unit obtains the scene sound effect of the sentence to be played according to the text content of the sentence to be played, the following steps are specifically executed: extracting scene words from the sentence to be played; inputting the extracted scene words into a sound effect generation model obtained by pre-training, and taking an output result of the sound effect generation model as a scene sound effect of the sentence to be played; when the processing unit obtains the voice tone of the sentence to be played according to the text content of the sentence to be played, the processing unit specifically executes: determining a speaking role in a sentence to be played; extracting a role identification corresponding to the determined speaking role; extracting the tone corresponding to the role identification according to a corresponding relation table between the role and the tone to be used as the voice tone of the sentence to be played; when determining the speaking role in the sentence to be played, the processing unit specifically executes: under the condition that the sentence to be played contains a specific symbol, determining a speaking role in the sentence to be played from text contents before and/or after the specific symbol in the sentence to be played; under the condition that the sentence to be played does not contain the specific symbol, or when a speaking role is not obtained in the text content before and/or after the specific symbol, the voice tone of the sentence to be played is set as the default voice tone; when the processing unit obtains the background music of the sentence to be played according to the text content of the sentence to be played, the processing unit specifically executes: inputting the determined emotion type into a music generation model obtained by pre-training, and taking an output result of the music generation model as background music of a sentence to be played;
a generating unit, configured to generate a target audio of the to-be-played sentence by using the speech emotion and the speech timbre, and specifically execute: inputting a sentence to be played, and a voice emotion and a voice tone of the sentence to be played into a pre-trained audio generation model, and taking an output result of the audio generation model as a target audio of the sentence to be played;
and the playing unit is used for playing the target audio and playing the scene sound effect and the background music.
4. The apparatus according to claim 3, wherein when playing the target audio and playing the scene sound effect, the playing unit specifically performs:
and when detecting that the scene words in the target audio are played, playing the scene sound effect.
5. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-2.
6. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-2.
CN202110942080.2A 2021-08-17 2021-08-17 Audio playing method and device, electronic equipment and readable storage medium Active CN113851106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110942080.2A CN113851106B (en) 2021-08-17 2021-08-17 Audio playing method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110942080.2A CN113851106B (en) 2021-08-17 2021-08-17 Audio playing method and device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN113851106A (en) 2021-12-28
CN113851106B (en) 2023-01-06

Family

ID=78975789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110942080.2A Active CN113851106B (en) 2021-08-17 2021-08-17 Audio playing method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113851106B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114512113B (en) * 2022-04-11 2023-04-04 科大讯飞(苏州)科技有限公司 Audio synthesis method and related method and equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device
CN109658917A (en) * 2019-01-17 2019-04-19 深圳壹账通智能科技有限公司 E-book chants method, apparatus, computer equipment and storage medium
CN110534131A (en) * 2019-08-30 2019-12-03 广州华多网络科技有限公司 A kind of audio frequency playing method and system
CN111667811A (en) * 2020-06-15 2020-09-15 北京百度网讯科技有限公司 Speech synthesis method, apparatus, device and medium
CN112270920A (en) * 2020-10-28 2021-01-26 北京百度网讯科技有限公司 Voice synthesis method and device, electronic equipment and readable storage medium
CN113010138A (en) * 2021-03-04 2021-06-22 腾讯科技(深圳)有限公司 Article voice playing method, device and equipment and computer readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10431201B1 (en) * 2018-03-20 2019-10-01 International Business Machines Corporation Analyzing messages with typographic errors due to phonemic spellings using text-to-speech and speech-to-text algorithms
CN110491365A (en) * 2018-05-10 2019-11-22 微软技术许可有限责任公司 Audio is generated for plain text document
WO2020018724A1 (en) * 2018-07-19 2020-01-23 Dolby International Ab Method and system for creating object-based audio content

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device
CN109658917A (en) * 2019-01-17 2019-04-19 深圳壹账通智能科技有限公司 E-book chants method, apparatus, computer equipment and storage medium
CN110534131A (en) * 2019-08-30 2019-12-03 广州华多网络科技有限公司 A kind of audio frequency playing method and system
CN111667811A (en) * 2020-06-15 2020-09-15 北京百度网讯科技有限公司 Speech synthesis method, apparatus, device and medium
CN112270920A (en) * 2020-10-28 2021-01-26 北京百度网讯科技有限公司 Voice synthesis method and device, electronic equipment and readable storage medium
CN113010138A (en) * 2021-03-04 2021-06-22 腾讯科技(深圳)有限公司 Article voice playing method, device and equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN113851106A (en) 2021-12-28

Similar Documents

Publication Publication Date Title
US10614803B2 (en) Wake-on-voice method, terminal and storage medium
JP6633153B2 (en) Method and apparatus for extracting information
CN115309877B (en) Dialogue generation method, dialogue model training method and device
CN112908292B (en) Text voice synthesis method and device, electronic equipment and storage medium
CN113450759A (en) Voice generation method, device, electronic equipment and storage medium
CN112309365A (en) Training method and device of speech synthesis model, storage medium and electronic equipment
TW201327214A (en) Electronic device and language analysis method thereof
US11996084B2 (en) Speech synthesis method and apparatus, device and computer storage medium
CN113658594A (en) Lyric recognition method, device, equipment, storage medium and product
CN113851106B (en) Audio playing method and device, electronic equipment and readable storage medium
CN113658586B (en) Training method of voice recognition model, voice interaction method and device
CN113012683A (en) Speech recognition method and device, equipment and computer readable storage medium
JP7372402B2 (en) Speech synthesis method, device, electronic device and storage medium
CN110781329A (en) Image searching method and device, terminal equipment and storage medium
JP7349523B2 (en) Speech recognition method, speech recognition device, electronic device, storage medium computer program product and computer program
WO2022143349A1 (en) Method and device for determining user intent
CN111966803B (en) Dialogue simulation method and device, storage medium and electronic equipment
CN114049875A (en) TTS (text to speech) broadcasting method, device, equipment and storage medium
CN113221514A (en) Text processing method and device, electronic equipment and storage medium
CN113689866A (en) Training method and device of voice conversion model, electronic equipment and medium
CN112560466A (en) Link entity association method and device, electronic equipment and storage medium
CN113066498B (en) Information processing method, apparatus and medium
CN113763921B (en) Method and device for correcting text
CN114078478B (en) Voice interaction method and device, electronic equipment and storage medium
CN113327577B (en) Speech synthesis method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant