WO2022184055A1

WO2022184055A1 - Speech playing method and apparatus for article, and device, storage medium and program product

Info

Publication number: WO2022184055A1
Application number: PCT/CN2022/078610
Authority: WO
Inventors: 谢映雪
Original assignee: 腾讯科技（深圳）有限公司
Priority date: 2021-03-04
Filing date: 2022-03-01
Publication date: 2022-09-09
Also published as: CN113010138A; CN113010138B

Abstract

Provided in the present application are a speech playing method and apparatus for an article, and a device and a computer-readable storage medium. The method comprises: presenting, in a content interface of an article, text content of the article and a speech playing function item corresponding to the article; receiving a speech playing instruction which is triggered for the article on the basis of the speech playing function item; in response to the speech playing instruction, playing the text content by means of speech; and during the process of playing the text content by means of speech, when the text content involves at least one character, playing the text content, which corresponds to the character, by using a timbre that matches the character features of the character.

Description

The voice playback method, device, equipment, storage medium and program product of the article

CROSS-REFERENCE TO RELATED APPLICATIONS

The embodiments of the present application are based on the Chinese patent application with the application number of 202110241752.7 and the filing date of March 4, 2021, and claim the priority of the Chinese patent application. The entire contents of the Chinese patent application are incorporated into the embodiments of the present application as refer to.

technical field

The present application relates to the field of computer technology, and in particular, to a voice playback method, apparatus, device, computer-readable storage medium, and computer program product of an article.

Background technique

With the development of Internet technology, multimedia information dissemination based on intelligent terminals is becoming more and more common, for example, articles are presented on mobile terminals for users to read.

In the related art, when the user reads the article, the user is provided with a voice playback function, that is, the text content of the article is played through the voice, but in the related art, all the content of the article is read aloud by one voice, so that the user cannot be immersed in the content of the article. in the content of the article.

SUMMARY OF THE INVENTION

Embodiments of the present application provide a voice playback method, device, device, computer-readable storage medium, and computer program product of an article, which can make users feel immersed in the situation when playing text content through voice, and improve the effect of voice playback. of immersion.

The technical solutions of the embodiments of the present application are implemented as follows:

The embodiment of the present application provides a voice playback method of an article, including:

In the content interface of the article, the text content of the article and the voice playback function item corresponding to the article are presented;

Receive a voice play instruction for the article triggered based on the voice play function item;

In response to the voice play instruction, play the text content by voice;

During the process of playing the text content by voice, when the text content includes at least one character, for the text content corresponding to the character, the timbre matching the character characteristics of the character is used for playing.

The embodiment of the present application provides a voice playback device of an article, including:

A presentation module, configured to present the text content of the article and the voice playback function item corresponding to the article in the content interface of the article;

a receiving module, configured to receive a voice play instruction for the article triggered based on the voice play function item;

a first playing module, configured to play the text content by voice in response to the voice play instruction;

The second playing module is configured to, during the process of playing the text content by voice, when the text content includes at least one character, for the text content corresponding to the character, use a character feature that matches the character of the character. sound to play.

Embodiments of the present application provide a computer device, including:

memory for storing executable instructions;

The processor is configured to implement the voice playing method of the article provided by the embodiment of the present application when executing the executable instructions stored in the memory.

Embodiments of the present application provide a computer-readable storage medium storing executable instructions for causing a processor to execute the voice playback method of the article provided by the embodiments of the present application.

The embodiments of the present application provide a computer program product, including computer programs or instructions, which, when executed by a processor, implement the voice playback method of the articles provided by the embodiments of the present application.

The embodiment of the present application has the following beneficial effects:

By applying the embodiment of the present application, in the content interface of the article, the text content of the article and the voice playback function item of the corresponding article are presented; the voice playback instruction for the article triggered based on the voice playback function item is received; in response to the voice playback instruction, Play the text content by voice; in the process of playing the text content by voice, when the text content includes at least one character, for the text content corresponding to the character, the timbre that matches the character characteristics of the character is used to play; When the text content is played, the timbre used is matched with the character characteristics corresponding to the text content, so that the user can be immersed in the scene when listening to the text content being played, and can be more immersed in the content of the article. The immersion brought by voice playback.

Description of drawings

FIG. 1 is a schematic structural diagram of a voice playback system 100 of an article provided by an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a computer device 500 provided by an embodiment of the present application;

3 is a schematic flowchart of a voice playback method of an article provided by an embodiment of the present application;

4 is a schematic diagram of a content interface provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of the presentation of a prompt box provided by an embodiment of the present application;

FIG. 6 is a schematic diagram of the presentation of a prompt box provided by an embodiment of the present application;

7 is a schematic diagram of a content interface provided by an embodiment of the present application;

8 is a schematic diagram of a content interface provided by an embodiment of the present application;

9 is a schematic diagram of a content interface provided by an embodiment of the present application;

10 is a schematic diagram of a content interface provided by an embodiment of the present application;

11 is a schematic diagram of a content interface provided by an embodiment of the present application;

12 is a schematic diagram of an emotion tag provided by an embodiment of the present application;

13 is a schematic diagram of speech parameters provided by an embodiment of the present application;

14 is a schematic diagram of the correspondence between emotions and speech parameters provided by an embodiment of the present application;

15 is a schematic diagram of a content interface provided by an embodiment of the present application;

16 is a schematic diagram of a content interface provided by an embodiment of the present application;

17 is a schematic diagram of an application architecture of a blockchain network provided by an embodiment of the present application;

18 is a schematic structural diagram of a blockchain in a blockchain network 600 provided by an embodiment of the present application;

FIG. 19 is a schematic diagram of a functional architecture of a blockchain network 600 provided by an embodiment of the present application;

FIG. 20 is a schematic flowchart of technical side implementation provided by an embodiment of the present application;

21A is a schematic diagram of a fundamental frequency point provided by an embodiment of the present application;

FIG. 21B is a diagram of tonal fifths provided by an embodiment of the present application;

22 is a schematic diagram of an acoustic model training process provided by an application embodiment;

23 is a schematic diagram of a construction process of a keyword dictionary provided by an embodiment of the present application;

24 is a schematic diagram of a personality-based emotion classification model provided by an embodiment of the present application;

FIG. 25 is a schematic flowchart of synthesizing audio provided by an embodiment of the present application.

Detailed ways

In order to make the purpose, technical solutions and advantages of the present application clearer, the present application will be described in further detail below with reference to the accompanying drawings. All other embodiments obtained under the premise of creative work fall within the scope of protection of the present application.

In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" can be the same or a different subset of all possible embodiments, and Can be combined with each other without conflict.

In the following description, the term "first\second\third" is only used to distinguish similar objects, and does not represent a specific ordering of objects. It is understood that "first\second\third" is used in Where permitted, the specific order or sequence may be interchanged to enable the embodiments of the application described herein to be practiced in sequences other than those illustrated or described herein.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the technical field to which this application belongs. The terms used herein are only for the purpose of describing the embodiments of the present application, and are not intended to limit the present application.

Before further describing the embodiments of the present application in detail, the terms and terms involved in the embodiments of the present application are described, and the terms and terms involved in the embodiments of the present application are suitable for the following explanations.

1) Character characteristics, which are used to characterize the characteristics of the characters corresponding to the characters, and can also be understood as the character portrait characteristics of the characters, and the whole information of the tagged characters abstracted according to the basic information of the characters, such as the gender, age, and identity of the characters; For example, the character characteristics may include age characteristics, identity characteristics, gender characteristics, personality characteristics, health status characteristics, and the like.

2) Transaction (Transaction), which is equivalent to the computer term "transaction". Transaction includes operations that need to be submitted to the blockchain network for execution, not just transactions in a business context. In view of the conventional use of blockchain technology The term "transaction" is used in the embodiments of the present application.

3) Blockchain is a storage structure of encrypted and chained transactions formed by blocks.

4) Blockchain Network, a set of nodes that incorporate new blocks into the blockchain through consensus.

5) Ledger is a general term for blockchain (also known as ledger data) and a state database synchronized with the blockchain.

6) Smart Contracts, also known as Chaincode or application code, are programs deployed in the nodes of the blockchain network, and the nodes execute the smart contracts called in the received transactions to update the state database. Key-value operations to update or query data.

7) Consensus is a process in the blockchain network used to reach agreement on the transactions in the block among the multiple nodes involved, and the agreed block will be appended to the end of the blockchain , the mechanisms for achieving consensus include Proof of Work (PoW, Proof of Work), Proof of Stake (PoS, Proof of Stake), Proof of Share Authorization (DPoS, Delegated Proof-of-Stake), Proof of Elapsed Time (PoET, Proof of Elapsed Time), etc.

Referring to FIG. 1, FIG. 1 is a schematic diagram of the architecture of the voice playback system 100 of the article provided by the embodiment of the present application. In order to support an exemplary application, a terminal (exemplarily shows a terminal 400-1 and a terminal 400-2) through the network 300 is connected to the server 200, and the network 300 may be a wide area network or a local area network, or a combination of the two.

a terminal, used for presenting the text content of the article and the voice playback function item corresponding to the article in the content interface of the article; receiving a voice playback instruction for the article triggered based on the voice playback function item; the voice acquisition request of the text content is sent to the server;

The server 200 is configured to generate the voice of the text content in response to the voice acquisition request, and send the generated voice of the text content to the terminal;

The terminal is used to play the text content through the voice according to the received voice, and in the process of playing the text content through the voice, when the text content includes at least one character, for the text content corresponding to the character, use the character feature of the character. match the tone to play.

In some embodiments, the server 200 may be an independent physical server, or a server cluster or a distributed system composed of multiple physical servers, or may provide cloud services, cloud databases, cloud computing, cloud functions, cloud storage, Cloud servers for basic cloud computing services such as network services, cloud communications, middleware services, domain name services, security services, Content Delivery Network (CDN, Content Delivery Network), and big data and artificial intelligence platforms. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in this embodiment of the present application.

Referring to FIG. 2, FIG. 2 is a schematic structural diagram of a computer device 500 provided by an embodiment of the present application. In practical applications, the computer device 500 may be the terminal or server 200 in FIG. 1, and the computer device is the terminal shown in FIG. 1. For example, a computer device that implements the voice playback method of the article in the embodiment of the present application will be described. The computer device 500 shown in FIG. 2 includes: at least one processor 510 , memory 550 , at least one network interface 520 and user interface 530 . The various components in electronic device 500 are coupled together by bus system 540 . It will be appreciated that the bus system 540 is configured to enable connection communication between these components. In addition to the data bus, the bus system 540 also includes a power bus, a control bus and a status signal bus. However, for clarity of illustration, the various buses are labeled as bus system 540 in FIG. 2 .

The processor 510 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., where a general-purpose processor may be a microprocessor or any conventional processor or the like.

User interface 530 includes one or more output devices 531 that enable presentation of media content, including one or more speakers and/or one or more visual display screens. User interface 530 also includes one or more input devices 532, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, and other input buttons and controls.

Memory 550 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 550 optionally includes one or more storage devices that are physically remote from processor 510 .

Memory 550 includes volatile memory or non-volatile memory, and may also include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM, Read Only Memory), and the volatile memory may be a random access memory (RAM, Random Access Memory). The memory 550 described in the embodiments of the present application is intended to include any suitable type of memory.

In some embodiments, memory 550 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.

The operating system 551 includes system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;

A network communication module 552 for reaching other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 including: Bluetooth, Wireless Compatibility (WiFi), and Universal Serial Bus ( USB, Universal Serial Bus), etc.;

A presentation module 553 for enabling presentation of information (eg, a user interface for operating peripherals and displaying content and information) via one or more output devices 531 associated with the user interface 530 (eg, a display screen, speakers, etc.) );

An input processing module 554 for detecting one or more user inputs or interactions from one of the one or more input devices 532 and translating the detected inputs or interactions.

In some embodiments, the voice playback device for articles provided by the embodiments of the present application may be implemented in software. FIG. 2 shows the voice playback device 555 for articles stored in the memory 550, which may be in the form of programs and plug-ins. The software includes the following software modules: presentation module 5551, receiving module 5552, first playing module 5553 and second playing module 5554. These modules are logical, so any combination or further splitting can be performed according to the realized functions. The function of each module will be explained below.

In other embodiments, the voice playback device of the article provided by the embodiment of the present application may be implemented in hardware. As an example, the voice playback device of the article provided by the embodiment of the present application may be a processor in the form of a hardware decoding processor , which is programmed to execute the voice playback method of the article provided in the embodiment of the present application, for example, the processor in the form of a hardware decoding processor may adopt one or more application specific integrated circuits (ASIC, Application Specific Integrated Circuit), DSP, Programmable Logic Device (PLD, Programmable Logic Device), Complex Programmable Logic Device (CPLD, Complex Programmable Logic Device), Field Programmable Gate Array (FPGA, Field-Programmable Gate Array) or other electronic components.

Next, the voice playing method of the article provided by the embodiment of the present application will be described. In actual implementation, the voice playing method of the article provided by the embodiment of the present application may be implemented by the terminal alone, or by the server and the terminal collaboratively.

Referring to FIG. 3 , FIG. 3 is a schematic flowchart of a voice playback method of an article provided by an embodiment of the present application, which will be described with reference to the steps shown in FIG. 3 .

Step 301: The terminal presents the text content of the article and the voice playback function item corresponding to the article in the content interface of the article.

In actual implementation, the terminal is provided with a client, such as a reading client, an instant messaging client, etc., and the terminal can present the text content of the article through the client. Here, the articles can be novels, prose, popular science articles, etc. The text content refers to the expression of written language, and refers to one or more characters with specific meanings. For example, the text content can be words, words, Phrases, sentences, paragraphs or articles.

Here, while presenting the text content of the article, the terminal may also present a voice play function item corresponding to the article, and the voice play function item is used to play the text content by voice when a trigger operation is received.

As an example, FIG. 4 is a schematic diagram of a content interface provided by an embodiment of the present application. Referring to FIG. 4 , in the content interface of the article, the text content 401 of the article and the playback function item 402 of the corresponding article are presented.

Step 302: Receive a voice play instruction for the article triggered based on the voice play function item.

In actual implementation, when the user reads the text content of the presented article, the voice playback instruction for the article can be triggered based on the voice playback function item. Here, the voice playback instruction for the article can be triggered based on the trigger operation for the voice playback function item, The trigger operation includes, but is not limited to, a click operation, a double-click operation, a slide operation, and the like, and the embodiment of the present application does not limit the trigger operation. For example, when the user clicks the voice play function item 402 in FIG. 4 , the voice play instruction for the article can be triggered.

Step 303: In response to the voice play instruction, play the text content by voice.

In actual implementation, when the terminal receives the voice play instruction, it acquires voice data corresponding to the text content, and plays the voice data, so as to play the text content through voice.

Here, the voice data is generated based on the text content, and the process of generating the voice data may be performed on the terminal or on the server. For example, the terminal may generate and send the voice for the article in response to the voice playback instruction. The playback request is sent to the server, wherein the voice playback request carries the identifier of the article, and the server obtains the text content of the corresponding article based on the identifier of the article carried by the voice playback request, and generates voice data based on the text content, and returns the generated voice data to the The terminal plays the voice data. It should be noted that the voice data played in this application is generated intelligently, rather than pre-generated through voice recording of articles.

In some embodiments, when the terminal receives a voice play instruction, it starts to play the text content by voice, and during the process of playing the text content by voice, prompt information may be presented to prompt the user that the text content is being played by voice.

Here, the prompt information can be in a variety of forms, for example, the prompt information can be in the form of text, or in the form of images. In addition, there may be various ways of presenting the prompt information. For example, the prompt information can be presented in a floating form, or the prompt information can be presented in a certain presentation area in the content interface, such as the prompt information is presented at the top of the content interface. The embodiment does not limit the presentation form of the prompt information.

In some embodiments, when the prompt information is in the form of text, during the process of playing the text content through voice, the terminal presents a prompt box in a floating form, and presents the text prompt information in the prompt box; wherein the text prompt information is used for Indicates that text content is being played by voice.

In actual implementation, the presentation form of the prompt box is a floating form, that is, the prompt box is independent of the content interface and is suspended above the content interface. As an example, FIG. 5 is a schematic diagram of the presentation of a prompt box provided by an embodiment of the present application. Referring to FIG. 5 , a prompt box 501 is presented in a floating form, and a text prompt message “You are listening to an intelligent recognition audiobook” is presented in the prompt box 501 .

Here, since the prompt box is presented in a floating form, the prompt box is movable, that is, the user can trigger the moving operation for the floating box. Correspondingly, the prompt information moves with the movement of the prompt box; in this way, when the prompt box blocks the content that the user wants to browse, the prompt box can be moved to avoid the prompt box from blocking the content that the user wants to browse, thereby improving the user experience. reading experience.

In practical applications, the presentation time of the prompt box may be the same as the start time of playing the text content by voice, that is, the prompt box is presented while the text content is played by voice. The presentation duration of the prompt box may be preset, that is, when the presentation duration of the prompt box reaches the preset duration, the prompt box will be cancelled; the presentation duration of the prompt box may also be the same as the duration of playing the text content by voice Consistent, that is, the prompt box is always presented in the process of playing the text content by voice, and when the text content is stopped by voice, the prompt box is canceled; the presentation time of the prompt box can also be controlled by the user, or That is, when the user triggers the close operation for the prompt box, the prompt box is canceled.

In some embodiments, in the process of presenting the prompt box, the presentation style of the prompt box and/or the presentation content in the prompt box may be adjusted, wherein the presentation style of the prompt box includes the shape, size, and presentation position of the prompt box. Wait.

In some embodiments, when the presentation duration of the text prompt information reaches the duration threshold, the terminal shrinks the prompt box, and switches the text prompt information in the prompt box to a play icon, wherein the play icon is used to indicate that the text content is being played by voice .

In actual implementation, the duration threshold can be preset, such as system settings, user settings, etc., when the text prompt information is presented, start timing to determine the presentation duration of the text prompt information, and adjust when the presentation duration reaches the duration threshold. The presentation style and content of the prompt box, that is, shrink the prompt box to reduce the size of the prompt box, and switch the presented text prompt information to the play icon. Here, the size of the shrunk prompt box is adapted to the content presented in the prompt box.

As an example, FIG. 6 is a schematic diagram of the presentation of a prompt box provided by an embodiment of the present application. Referring to FIG. 6 , assuming that the duration threshold is 10 seconds, when the presentation duration of the text prompt information in FIG. 5 reaches 10 seconds, the The text prompt message "You are listening to an audiobook with intelligent recognition" is switched to the play icon 61 in FIG. 6 , and the prompt box is shrunk so that the size of the prompt box matches the size of the content in the prompt box.

In the embodiment of the present application, when the presentation duration of the text prompt information reaches the duration threshold, the prompt box is shrunk, and the text prompt information in the prompt box is switched to a play icon indicating that the text content is being played by voice, so as to avoid the problems caused by the content of the text prompt information. Too much, the prompt box covers too much text content for a long time, and then affects the reading experience of the text content.

Step 304: During the process of playing the text content by voice, when the text content includes at least one character, for the text content corresponding to the character, use the timbre matching the character characteristics of the character to play.

Here, the text content corresponding to the character refers to the text content associated with the character, such as the character's dialogue content, inner monologue, description content, etc.; the character feature can be a label abstracted from at least two basic information of the character, Corresponding to the basic information portrait of the character, for example, the character characteristics may include age information, gender information, identity information (such as a domineering president), gender information, personality information, health status information, abstract age characteristics, identity characteristics, Gender characteristics, personality characteristics, health status characteristics.

In actual implementation, the number of characters included in the text content may be one or more, wherein the number of characters is two or more. When the text content includes multiple characters, the characters and the timbres are in a one-to-one correspondence.

In practical applications, the text content of each character is played with the timbre that matches the character's character characteristics, that is, the character characteristics of multiple characters are obtained, and then the character characteristics of each character are matched with the timbre respectively to determine the The timbre that matches the character characteristics of each character; through the acquired timbre, the text content corresponding to the corresponding character is played.

Here, when the character characteristics of each character are matched with the timbre, the character characteristics of each character are matched with the character characteristics corresponding to the timbre; in some embodiments, the character characteristics can be performed by using corresponding tags (ie, character tags). Identification, for example, an age tag is used for identification, and an identity tag is used for identification. Correspondingly, the role characteristics of a specific role in this application include at least two types, that is, the role characteristics of a specific role can have at least two types. kind of labels. In actual implementation, multiple (that is, at least two) timbres can be pre-stored, and each timbre corresponds to at least two labels. During character feature matching, at least two labels corresponding to the character can be associated with the respective timbres. The corresponding tags are matched to determine the timbre that matches the character's character traits.

In practical applications, when there are at least two timbres matching the character characteristics of a character, one of the at least two timbres obtained by matching can be randomly selected as the target timbre, and the selected target timbre corresponding to the character can be used. The text content is played; it is also possible to obtain the matching degree of each timbre and the character characteristics. According to the matching degree, select the timbre with the highest matching degree with the character characteristics as the target timbre, and use the selected target timbre to play the text content corresponding to the character. ; It can also be to present options corresponding to at least two timbres obtained by matching for the user to select, take the timbre selected by the user as the target timbre, and use the selected target timbre to play the text content corresponding to the character.

In some embodiments, in order to use the target timbre to play the text content corresponding to the corresponding character, it is possible to first determine how to pronounce each word in the dialogue content, and then add the timbre feature of the target timbre, so that based on the timbre feature of the target timbre, Generates speech for text content, and then plays the generated speech.

In some embodiments, the terminal may present at least two timbre options corresponding to the target content in response to a selected operation on the target content in the text content; wherein each timbre option corresponds to a timbre; in response to the at least two timbre options The timbre selection operation triggered by the option takes the selected target timbre as the timbre of the character corresponding to the target content, so that in the process of playing the text content through voice, the target timbre is used for the text content corresponding to the character corresponding to the target content. play.

In actual implementation, the user can select the timbre of a character by himself, so that when the terminal plays the text content corresponding to the character, the timbre selected by the user is used for playing. First, the user selects the character whose timbre needs to be selected based on the presented text content. Here, the character is selected by selecting the text content, that is, the character corresponding to the selected target content is taken as the selected character. Then, after the target content is determined, at least two timbre options corresponding to the target content are presented. Here, when the timbre options are presented, according to the degree of matching between each timbre and the character characteristics of the character corresponding to the target content, the timbre options For example, the timbre option corresponding to the timbre with a higher degree of matching with the character characteristics of the character corresponding to the target content is presented at the front. Next, the user selects the timbre to be selected based on the presented at least two timbre options. The selection operation here may be a click operation on the timbre option corresponding to the target timbre, or a pressing operation on the timbre option corresponding to the target timbre , the trigger form of the selected operation is not limited here.

In practical applications, the at least two timbre options corresponding to the target content may be presented in the form of a drop-down list, an icon, or an image. The presentation forms of the at least two timbre options are not limited here. Here, at least two timbre options can be presented directly in the content interface, or a floating layer independent of the content interface can be presented, and at least two timbre options can be presented in the floating layer.

It should be noted that, the above-mentioned selection operation for the target content and tone selection operation may be performed before the text content is played by voice, and may also be performed during the process of playing the text content by voice.

As an example, FIG. 7 is a schematic diagram of a content interface provided by an embodiment of the present application. Referring to FIG. 7 , the user selects the target content based on the presented text content. Here, the target content can be selected by clicking on the text, that is, when the user's click operation is received , take the statement presented at the clicked position as the target content, and present a floating layer in which at least two timbre options 701 are presented. An image of the character, and a textual description of the timbre presented, such as Silly White Sweet.

In some embodiments, before selecting a timbre, the user can audition each timbre, that is, after the terminal presents at least two timbre options corresponding to the target content, it can also present at least two timbre options; in response to the target timbre For the trigger operation of the corresponding audition function item, the target content is played by using the target timbre corresponding to the audition function item.

In actual implementation, each timbre option may correspond to an audition function item. After the user triggers a certain audition function item, the target timbre corresponding to the audition function item is determined, and then the target content is played based on the target timbre.

As an example, FIG. 8 is a schematic diagram of a content interface provided by an embodiment of the present application. Referring to FIG. 8 , when the user selects the target content based on the presented text content, the target content can be selected by clicking on the text, that is, when the user's click is received Operation, take the statement presented at the clicked position as the target content, and present a floating layer, and present at least two timbre options 801 in the floating layer. An image of a cartoon character, and a textual description of the timbre, such as silly white sweet type; and an audition function item 802 is presented under each timbre option, and the audition function items correspond to the timbre options one-to-one. For example, when the user clicks on the When the audition function item under the sweet tone option is used, the target content, that is, the selected sentence, is played with the silly white sweet tone.

In some embodiments, the terminal may present at least two timbre options corresponding to the target content and a determination function item in response to a selection operation on the target content in the presented dialogue content; wherein each timbre option corresponds to a timbre; the response In the timbre selection operation triggered based on at least two timbre options, the selected target timbre is used to play the target content; in response to the triggering operation for determining the function item, the target timbre is used as the timbre of the role corresponding to the target content, so as to pass the voice During the process of playing the text content, the dialogue content of the character corresponding to the target content is played using the target timbre.

In practical applications, the user can switch the selected timbre before triggering the confirmation function item, and after the media selects the timbre, the selected timbre will be used to play the target content. In this way, the user can determine whether to select the timbre according to the playing sound. It avoids the need to re-select after selection error, and improves the efficiency of human-computer interaction.

In some embodiments, in the content interface of the article, a timbre selection function item is presented; in response to a trigger operation for the timbre selection function item, at least two characters in the article are presented; in response to a target character in the at least two characters The selection operation presents at least two timbres corresponding to the target roles; in response to the timbre selection operations triggered based on the at least two timbres, the selected target timbres are used as the timbres of the target roles, so that in the process of playing the text content by voice, For the dialogue content of the target character, use the selected target voice to play.

In practical applications, after receiving the timbre selection function item, the terminal can also present at least two characters in the article. Here, all characters in the article can be presented, or only some characters in the article can be presented. For example, only the current presentation can be presented. The role that appears in the chapter in which the text content is located. After presenting at least two characters in the article, the user can select one of them as the target character to select the target timbre of the target character. Here, after the target timbre is selected for one role, other roles may be selected from at least two roles to select timbres for other roles.

In this way, the user can not only select the timbre of the character corresponding to the session content in the current content interface, but also can select the timbre of the character corresponding to the unpresented session content, so that by triggering the timbre selection function item once, it is possible to select the timbre of the multiple presented timbres. The timbre of each character is selected, which improves the efficiency of human-computer interaction.

As an example, FIG. 9 is a schematic diagram of a content interface provided by an embodiment of the present application. Referring to FIG. 9 , a timbre selection function item 901 is presented on the content interface. The selection interface presents all characters 902 in the article, such as character A, character B, and character C; when the user clicks on a character, such as clicking on "character A", multiple timbres matching the character characteristics of "character A" are presented 903, the user may select a timbre from the multiple presented timbres as the target timbre.

In some embodiments, the terminal may also present a timbre switching button for the text content during the process of playing the text content by voice; when receiving a trigger operation for the timbre switching button, the terminal may change the timbre corresponding to the currently playing content from the first One tone switches to the second tone.

In actual implementation, the embodiment of the present application provides a button for quickly switching the timbre, that is, the timbre switching button. During the voice playback process, the timbre switching button is used to switch the timbre corresponding to the sentence currently being played, wherein the first The timbre is the currently playing timbre, and the second timbre is the recommended timbre for switching, where the first timbre is different from the second timbre.

In practical applications, the second timbre corresponds to the currently playing sentence, and the second timbres corresponding to different sentences may be the same or different. Here, both the first timbre and the second timbre may be timbres that match the character characteristics of the character corresponding to the currently playing content. For example, when a certain dialogue content is played, the character of the character corresponding to the dialogue content is obtained. multiple timbres with matching characteristics, and then select one of the multiple timbres as the first timbre, and select one as the second timbre, first use the first timbre to play the dialogue content, and after receiving the trigger operation for the switch button, The first timbre is switched to the second timbre, that is, the dialogue content is played using the second timbre after switching.

Here, after the timbre corresponding to the currently playing content is switched from the first timbre to the second timbre, the content belonging to the same role as the currently playing content is played using the second timbre.

In some embodiments, after the timbre corresponding to the currently playing content is switched from the first timbre to the second timbre, the timbre switching button may be triggered again, and after receiving the trigger operation for the timbre switching button, the second timbre Switch to the third timbre, where the first timbre can be the same as the third timbre, or it can be different.

In some embodiments, during the process of playing the text content through voice, the terminal presents recommended timbre information for the target text content in the text content; wherein the recommended timbre information is used to indicate, based on the recommended timbre information, what information about the target text content is to be determined. The timbre of the corresponding character is switched.

In actual implementation, a timbre may be recommended for the user, and the target text content here may be the currently playing text content, or may be the text content in which the character characteristics of any corresponding character match the recommended timbre information. For example, according to the currently playing dialogue content, obtain a timbre that matches the character characteristics of the character in the current dialogue content, and generate recommended timbre information based on the timbre obtained by matching, for example, based on the timbre with the highest degree of matching, generate recommended timbre information; or, When a certain timbre is to be recommended, it is judged whether there is a character matching the timbre in the article, and if there is, the corresponding recommended timbre information is presented.

As an example, FIG. 10 is a schematic diagram of a content interface provided by an embodiment of the present application. Referring to FIG. 10 , when it is recognized that the character characteristics of a certain character match the timbre of a certain star, recommended timbre information 1001 is presented, such as “Lin xx The voice of the fifth sister matches the voice of the fifth sister very well" to prompt the user to switch the tone of the fifth sister to the tone of Lin xx.

In some embodiments, when the recommended timbre information is presented, a timbre switching button matching the recommended timbre information is presented, and after receiving the user's trigger operation on the timbre switching button, the timbre corresponding to the corresponding dialogue content is switched to the recommended timbre The Voice indicated by the Voice information.

As an example, FIG. 11 is a schematic diagram of a content interface provided by an embodiment of the present application. Referring to FIG. 11 , when the character characteristics of a character in the article match the timbre of a certain star, recommended timbre information 1101 is presented, such as “Lin xx The voice of xx matches the voice of the fifth junior sister very well", and the timbre switching button 1102 is displayed at the same time. When the user clicks the tone switching button 1102, the text content corresponding to the fifth junior sister is played using Lin xx's voice, such as the dialogue content of the fifth junior sister.

In some embodiments, when there is text content corresponding to the environment description information in the text content, when playing the text content corresponding to the environment description information, the terminal may use the environment music that matches the environment description information as the background music, and play background music.

In actual implementation, when there is text content corresponding to the environment description information in the text content, the environment description information in the text content is obtained. Here, a key dictionary of the environment description information can be preset, and the key dictionary stores various environment description information. Describe the keywords corresponding to the information, and then match the text content with the keywords in the key dictionary. When the text content contains text content that matches the keywords in the key dictionary, it is determined that there is text content corresponding to the environmental description information, And extract the text content that matches the keywords in the key dictionary, and match the text content with each ambient music to obtain ambient music that matches the environment description information.

As an example, when the environmental description information contained in the text content is a rainy night, ambient music that matches the rain can be obtained, and when the text content corresponding to the environmental description information is played, the music matching the rain can be obtained. Ambient music is played as background music.

In the present application, by adding ambient music as background music, users can integrate into the scene described by the text content, and further enhance the immersion brought by voice playback.

In some embodiments, the terminal can also play the text content in the following ways: determine the emotional color corresponding to each statement in the text content; Color; play the generated voice corresponding to each sentence.

In actual implementation, each sentence in the text content has a corresponding emotional color, especially for the dialogue content in the text content, the characters in the text are emotional when they speak, such as sadness, happiness, etc. . In the present application, by acquiring the emotional color corresponding to each sentence, the generated voice carries emotional color, so that the user can have an immersive feeling when hearing the voice.

In practical applications, the emotional color corresponding to each sentence is not only based on the sentence itself, but also needs to be combined with the context of the sentence to improve the accuracy of emotional color determination. For example, it is only possible to judge that the current character is crying based on "she said with tears at this time", but it is impossible to judge whether the emotional color corresponding to the sentence is crying with joy or crying, which needs to be judged in conjunction with the context.

In some embodiments, the terminal may determine the emotional color corresponding to each sentence in the text content by: extracting the emotional label of each sentence in the text content to obtain the emotional label corresponding to each sentence; using the extracted emotional label corresponding to each sentence , indicating the emotional color corresponding to the corresponding sentence; the terminal can generate the voice corresponding to each sentence based on the emotional color corresponding to each sentence in the following ways: respectively determine the voice parameters that match each emotional label, and the voice parameters include sound quality, rhythm, etc. At least one; based on each speech parameter, the speech of each sentence is generated.

In actual implementation, since the emotional color corresponding to the sentence is not only determined by the text information, but also affected by the environment in which the character is located in the article and the basic information of the character, the emotional label here includes at least one of the following: basic information, recognition Knowledge evaluation, psychological feelings.

FIG. 12 is a schematic diagram of an emotion tag provided by an embodiment of the present application. Referring to FIG. 12 , the emotion tag includes basic information, cognitive evaluation, and psychological feeling, wherein the cognitive evaluation includes discourse tendency and discourse style. For example, the discourse tendency may be Negative or affirmative, indifferent or enthusiastic; basic information includes age information (such as children, young people, etc.), gender information, identity information (such as domineering president); psychological feelings include positive feelings (such as comfort, sympathy, etc.) and negative feelings (such as grief, panic).

Here, the acquired emotional tags of a sentence may be one or more. After the emotional tags are acquired, the voice parameters that match the emotional tags can be determined directly based on the correspondence between the emotional tags and the voice parameters; It is possible to first perform emotion prediction based on multiple emotion tags, and then obtain speech parameters matching the emotion tags according to the correspondence between the predicted emotions and speech parameters. After acquiring the speech parameters, the speech of the corresponding sentence is generated based on the speech parameters.

The emotion parameter is explained here. FIG. 13 is a schematic diagram of speech parameters provided by an embodiment of the present application. Referring to FIG. 13 , the speech parameters include sound quality and rhythm, wherein the sound quality includes brightness, saturation, etc., and the rhythm includes pitch, speech rate, syllable interval, rhythm, intonation, etc.

Fig. 14 is a schematic diagram of the correspondence between emotions and speech parameters provided by an embodiment of the present application. Referring to Fig. 14, different emotions correspond to different speech parameters. For example, when the emotion is joy, the speech rate is brisk, but sometimes slower; when the emotion is anger , speak a little faster.

In some embodiments, the terminal may also present a cartoon character when playing the dialogue content in the text content, and play an animation in which the cartoon character uses a timbre to read the dialogue content; wherein the cartoon character and the character corresponding to the dialogue content characteristics match.

In actual implementation, the terminal can also obtain a cartoon character matching the character characteristic according to the character characteristic of the character corresponding to the dialogue content, and play an animation in which the cartoon character reads the dialogue content aloud using the timbre of the character characteristic, so, Users are able to integrate into the scene described in the article from both hearing and vision, bringing users a better sense of immersion.

As an example, FIG. 15 is a schematic diagram of a content interface provided by an embodiment of the present application. Referring to FIG. 15 , the character corresponding to the dialogue content here is a child, and a cartoon character 1501 in the image of a child is presented on the content interface, and the cartoon character 1501 is played to Animation of children's voice reading the content of the dialogue.

In some embodiments, the dialogue content in the text content is played using a timbre that matches the character characteristics of the character corresponding to the dialogue content: from the content of the article, the basic information of the character corresponding to the dialogue content is extracted; The timbre that matches the basic information; the acquired timbre is used to play the dialogue content in the text content.

The basic information includes at least one of the following: age information, gender information, and identity information. In actual implementation, the basic information of the role corresponding to the dialogue content is extracted from the content of the article, which can be extracted from the presented text content or from the unpresented text content. It is understandable that here is the Combine all the text content describing the role in the article to extract the basic information corresponding to the role.

In some embodiments, during the process of playing the text content by voice, the terminal can also display the currently playing sentence differently; as the voice playing progresses, the text content of the article is scrolled and presented, so that the presented text content is different from the voice. match the progress of the playback.

In actual implementation, the user can listen and watch, that is, while listening to the voice to play the text content, he can browse the presented text content. In order to remind the user of the specific content to play, the currently playing sentence The user can quickly find the currently playing sentence. As an example, FIG. 16 is a schematic diagram of a content interface provided by an embodiment of the present application. Referring to FIG. 16 , a gray background color is used to present the currently playing sentence 1601 to distinguish it from other sentences.

Here, as the voice playback progresses, the text content of the article can be scrolled and presented, so that the currently playing sentence is always in the middle of the screen.

In some embodiments, during the process of playing the text content by voice, the terminal can also display the currently playing sentence differently; as the voice playing progresses, turn pages to present the text content of the article, so that the presented text content is different from the text content of the article. Match the progress of the voice playback.

In actual implementation, after the currently presented text content is played, the page turning process can be performed to present the text content of the next page of the article, and continue to play the text content of the next page by voice, so that the presented text content and the voice play progress to match.

In some embodiments, the terminal can also obtain the character characteristics of each character from the content of the article, and store the character characteristics of each character in the blockchain network; in this way, when other terminals need to play the text content of the article by voice, The character characteristics of each character in the article can be obtained directly from the blockchain.

Here, the embodiment of the present application can also combine the blockchain technology, after the terminal obtains the character characteristics of each character and obtains the character characteristics of each character, generates a transaction for storing the character characteristics of each character, and submits the generated transaction to the blockchain. The node of the network, so that the node can store the role characteristics of each role to the blockchain network after consensus on the transaction; before storing to the blockchain network, the terminal can also hash the role characteristics of each role to obtain the corresponding role characteristics. Summary information of character characteristics; store the obtained summary information of character characteristics corresponding to each character to the blockchain network. In the above manner, the character characteristics of each character are prevented from being tampered with, and the security of the character characteristics of each character is improved.

Referring to FIG. 17, FIG. 17 is a schematic diagram of an application architecture of a blockchain network provided by an embodiment of the present application, including a business entity 400 and a blockchain network 600 (exemplarily showing a consensus node 610-1 to a consensus node 610-3) , and the authentication center 700, which will be described separately below.

The type of the blockchain network 600 is flexible and diverse, for example, it can be any one of a public chain, a private chain or a consortium chain. Taking the public chain as an example, the electronic equipment of any business entity, such as user terminals and servers, can access the blockchain network 600 without authorization; taking the alliance chain as an example, the business entity will govern after obtaining authorization. The computer equipment (for example, a terminal/server) can access the blockchain network 600, and at this time, for example, become a client node in the blockchain network 600.

In some embodiments, the client node may only serve as an observer of the blockchain network 600, that is, provide the function of supporting business entities to initiate transactions (for example, for storing data on the chain or querying data on the chain), for the blockchain network The functions of the consensus node 610 of 600, such as ordering function, consensus service and ledger function, etc., can be implemented by the client node by default or selectively (eg, depending on the specific business needs of the business entity). Therefore, the data and business processing logic of the business subject can be migrated to the blockchain network 600 to the greatest extent, and the trustworthiness and traceability of the data and business processing process can be realized through the blockchain network 600 .

The consensus node in the blockchain network 600 receives the transaction submitted by the client node of the business entity 400, executes the transaction to update the ledger or query the ledger, and various intermediate or final results of the executed transaction can be returned to the client node of the business entity. show.

For example, the client node 410 can subscribe to events of interest in the blockchain network 600, such as transactions occurring in a specific organization/channel in the blockchain network 600, and the consensus node 610 pushes corresponding transaction notifications to the client node 410 , thereby triggering the corresponding business logic in the client node 410 .

The following describes an exemplary application of the blockchain by taking the business entity accessing the blockchain network to realize the voice playback of the article as an example.

Referring to FIG. 17 , the business subject 400 involved in the voice playback of the article registers and obtains a digital certificate from the certification center 700. The digital certificate includes the public key of the business subject, and the public key and identity information of the business subject signed by the certification center 700. The digital signature is used to attach to the transaction together with the digital signature of the business subject for the transaction, and is sent to the blockchain network for the blockchain network to extract the digital certificate and signature from the transaction to verify the reliability of the message (ie Whether it has not been tampered with) and the identity information of the business subject sending the message, the blockchain network will verify it according to the identity, such as whether it has the authority to initiate transactions. Clients running on computer equipment (such as terminals or servers) under the jurisdiction of the business entity can request access to the blockchain network 600 to become client nodes.

The client node 410 of the business entity 400 is used to play the text content by voice. For example, in the content interface of the article, the text content of the article and the voice playback function item of the corresponding article are presented; The voice play instruction; in response to the voice play instruction, the text content is played by voice; in the process of playing the text content by voice, when the text content includes at least one character, for the text content corresponding to the character, the character characteristics of the character are used. The matching Voice is played. Here, the terminal acquires the character characteristics of each character in the article, and sends the character characteristics of each character to the blockchain network 600 .

Among them, the operation of sending the role characteristics of each role to the blockchain network 600 can set business logic in the client node 410 in advance. When the terminal obtains the role characteristics of each role in the article, the client node 410 sends the role of each role The features are automatically sent to the blockchain network 600 , or the business personnel of the business entity 400 can log in in the client node 410 , manually package the role features of each role, and send them to the blockchain network 600 . When sending, the client node 410 generates a transaction corresponding to the storage operation according to the role characteristics of each role, and specifies the smart contract to be called to realize the storage operation and the parameters passed to the smart contract in the transaction, and the transaction also carries the client node. 410, the signed digital signature (for example, obtained by encrypting the transaction digest using the private key in the digital certificate of the client node 410), and broadcasting the transaction to the consensus nodes in the blockchain network 600 (such as Consensus node 610-1, consensus node 610-2, consensus node 610-3).

When the consensus node in the blockchain network 600 receives the transaction, it verifies the digital certificate and digital signature carried in the transaction. After the verification is successful, it is confirmed whether the business subject 400 has the transaction status according to the identity of the business subject 400 carried in the transaction. Any one of authority, digital signature and authority verification will cause the transaction to fail. After the verification is successful, the consensus node's own digital signature (for example, obtained by encrypting the transaction digest with the private key of the consensus node 610-1), continues to broadcast in the blockchain network 600.

After receiving the successfully verified transaction, the consensus node in the blockchain network 600 fills the transaction into a new block and broadcasts it. When a consensus node in the blockchain network 600 broadcasts a new block, it will perform a consensus process on the new block. If the consensus is successful, the new block will be appended to the end of the blockchain stored by itself, and the status will be updated according to the transaction result. Database, execute transactions in the new block: For transactions that submit and update the character characteristics of each character, add the character characteristics of each character in the state database.

As an example of a blockchain, see FIG. 18 , which is a schematic structural diagram of a blockchain in a blockchain network 600 provided by this embodiment of the application. The header of each block may include hashes of all transactions in the block. It also contains the hash value of all transactions in the previous block. The record of the newly generated transaction is filled into the block and after the consensus of the nodes in the blockchain network, it will be appended to the end of the blockchain. The chain-like growth is formed, and the chain-like structure based on the hash value between the blocks ensures the tamper-proof and anti-forgery of the transactions in the blocks.

The following describes an exemplary functional architecture of the blockchain network provided by the embodiment of the present application. Referring to FIG. 19 , FIG. 19 is a schematic diagram of the functional architecture of the blockchain network 600 provided by the embodiment of the present application. The blockchain network includes an application layer 601 , the consensus layer 602, the network layer 603, the data layer 604 and the resource layer 605, which will be described separately below.

The resource layer 605 encapsulates the computing resources, storage resources and communication resources for realizing each consensus node in the blockchain network 600 .

The data layer 604 encapsulates various data structures that implement the ledger, including a blockchain implemented as files in a file system, a key-value state database, and proofs of existence (eg, a hash tree of transactions in blocks).

The network layer 603 encapsulates the functions of point-to-point (P2P, Point to Point) network protocol, data dissemination mechanism and data verification mechanism, access authentication mechanism and business subject identity management.

Among them, the P2P network protocol realizes the communication between consensus nodes in the blockchain network 600, the data dissemination mechanism ensures the dissemination of transactions in the blockchain network 600, and the data verification mechanism is used based on cryptographic methods (such as digital certificates, digital signature, public/private key pair) to achieve the reliability of data transmission between consensus nodes; the access authentication mechanism is used to authenticate the identity of the business entity joining the blockchain network 600 according to the actual business scenario, and when the authentication is passed The business entity is given the permission to access the blockchain network 600; the business entity identity management is used to store the identity of the business entity allowed to access the blockchain network 600, as well as the permission (for example, the type of transaction that can be initiated).

The consensus layer 602 encapsulates a mechanism (ie, a consensus mechanism) for consensus nodes in the blockchain network 600 to reach consensus on blocks, and functions of transaction management and ledger management. The consensus mechanism includes consensus algorithms such as POS, POW, and DPOS, and supports the pluggability of consensus algorithms.

Transaction management is used to verify the digital signature carried in the transaction received by the consensus node, verify the identity information of the business entity, and determine whether it has the authority to conduct transactions according to the identity information (read relevant information from the business entity identity management); For authorized business entities accessing the blockchain network 600, they all have digital certificates issued by the certification center. The business entities use the private key in their digital certificates to sign the submitted transactions, thereby declaring their legal identity.

Ledger management is used to maintain the blockchain and state database. For the consensus block, append it to the end of the blockchain; execute the transaction in the consensus block, update the key-value pair in the state database when the transaction includes an update operation, and query the state database when the transaction includes a query operation and returns the query result to the client node of the business principal. Supports query operations in various dimensions of the state database, including: querying blocks according to block serial numbers (such as transaction hash values); querying blocks according to block hash values; querying blocks according to transaction serial numbers; Query transactions by transaction serial number; query the account data of the business entity according to the account number (serial number) of the business entity; query the blockchain in the channel according to the channel name.

The application layer 601 encapsulates various services that the blockchain network can implement, including transaction traceability, certificate storage, and verification.

Applying the above embodiment, by presenting the text content of the article and the voice playback function item of the corresponding article in the content interface of the article; receiving a voice playback instruction for the article triggered based on the voice playback function item; in response to the voice playback instruction, by The text content is played by voice; in the process of playing the text content by voice, when the text content includes at least one character, for the text content corresponding to the character, the timbre matching the character is used to play; in this way, since the text content is played When playing, the timbre used matches the character corresponding to the text content, so that the user can be immersed in the situation when listening to the text content being played, and can be more immersed in the content of the article, which improves the effect of voice playback. immersion.

Below, an exemplary application of the embodiments of the present application in a practical application scenario will be described. Taking the text content corresponding to the role as the dialogue content as an example, in actual implementation, the terminal presents the text content of the article, and the user browses the presented text content. During the browsing process, the listening function can be enabled, for example, the user clicks to play After the function item, the text content of the article is played by voice; during the playback process, when it is recognized that there is dialogue content in the article, the timbre of the character characteristics of the character corresponding to the dialogue content is obtained, and the voice of the character corresponding to the dialogue content is used. The timbre of the character features generates the voice of the dialogue content, and adds emotional color to the voice according to the emotional color corresponding to the dialogue content; when it is recognized that there is environmental description information in the article, for the text content containing the environmental description information, in the corresponding In the speech of the text content, ambient music that matches the environment description information is added as background music.

As an example, referring to FIGS. 4-6 , in the content interface of the article, the text content 401 of the article and the play function item 402 of the corresponding article are presented; when the user clicks the play function item 402, the terminal starts to play the text content of the article by voice , the prompt box 501 is presented in a floating form, and the text prompt information "You are listening to the intelligent recognition audiobook" is presented in the prompt box 501; when the presentation duration of the text prompt information in FIG. 5 reaches the duration threshold, the The text prompt information is switched to the play icon 61 in FIG. 6 , and the prompt box is shrunk so that the size of the prompt box matches the size of the content in the prompt box.

In practical applications, the user can independently select the timbre for the character in the article, that is, the user can independently select the timbre according to their own preferences. For example, first, the user selects a character whose timbre needs to be selected based on the presented text content. Here, the character is selected by selecting the text content, that is, the character corresponding to the selected target content is taken as the selected character. Then, after the target content is determined, at least two timbre options corresponding to the target content are presented; then, the user selects the timbre to be selected based on the presented at least two timbre options.

For example, referring to FIG. 7 , the user selects the target content based on the presented text content. Here, the target content can be selected by clicking on the text. That is, when the user's click operation is received, the sentence presented at the click position is used as the target content, and a floating layer, at least two timbre options 701 are presented in the floating layer, where the timbre options are presented in the form of a combination of graphics and text, that is, images containing cartoon characters matching the timbres are presented, and textual descriptions of the timbres are presented, such as silly white Sweet type, user can make timbre selection based on presented timbre options.

Here, in the timbre selection process, the user can audition each timbre to be selected, that is, the user can trigger the audition operation for the timbre, the terminal determines the timbre that the user wants to audition, and plays the selected target through the timbre to be auditioned. content, to realize the audition of the timbre. In this way, the user can select the timbre according to the auditioned voice, which is more in line with the real scene and improves the user experience.

In some embodiments, when the terminal recognizes that the character characteristics of a character in the article match the recommended timbre, a floating layer may pop up, and the recommended timbre information will be displayed in the floating layer, and the information that matches the recommended timbre will be displayed. When receiving the trigger operation of the timbre switching button by the user, the timbre corresponding to the currently playing dialogue content can be switched to the timbre indicated by the recommended timbre information.

For example, referring to FIG. 11 , when it is recognized that the character characteristics of a character in the article match the timbre of a certain star, the recommended timbre information 1101 is presented, such as “Lin xx’s voice matches the voice of the fifth junior sister very well”, and the timbre is presented at the same time Switch button 1102, when the user clicks the tone switch button 1102, the terminal responds to the click operation and switches the currently used tone to Lin xx's tone, that is, Lin xx's voice is used to play the dialogue content of the fifth sister after the switch.

The technical implementation process of the present application will be described below. FIG. 20 is a schematic flowchart of the technical side implementation provided by the embodiment of the present application. Referring to FIG. 20 , the voice playback method of the article provided by the embodiment of the present application includes:

Step 2001: The terminal collects audio data.

In actual implementation, the terminal first starts recording and collects the required audio data to build an emotional corpus. Here, emotional corpus is an important basis for the research on emotional speech synthesis. In the process of collecting audio data, it is necessary to analyze the collected audio data. Screening, for example, after starting the recording, the terminal performs decibel detection on the collected audio data. If the background sound in the collected audio data is noisy, the collected audio data is filtered out and re-recorded until the screen meets the requirements (no audio is present). quality issues) audio data. It should be noted that the recording can be recorded segment by segment. After the audio data corresponding to the recording of each segment is collected, the collected audio data can be uploaded to the server for detection. When the audio data is detected If there is an audio quality problem, re-record.

When recording, you can record voices with different emotions in different scenarios, such as declarative sentences, interrogative sentences and exclamation sentences. The recorded audio data needs to be marked by the praat tool, such as the fundamental frequency, syllable boundary, paralinguistic information, etc. of the audio data. These information are used to add emotional state labels and emotional keys when training the model later. Annotation information for word attributes.

As an example, FIG. 21A is a schematic diagram of a fundamental frequency point provided by an embodiment of the present application. Referring to FIG. 21A, the figure shows a graph of the fundamental frequency points of “Ma” and “Ma”, wherein the tone of “Ma” is Yin-ping , its corresponding curve is a curve that is close to the level, the tone of "ma" is positive, and the corresponding curve is a curve that changes from bottom to top; Figure 21B is a five-degree value diagram of the tone provided by the embodiment of the application, see Figure 21B , the graph is the same as the curve in the fundamental frequency point diagram. It is understandable that even if there is no speech, it is possible to know when to pronounce "mama" and when to pronounce "ma" according to the fundamental frequency point and the tone fifth value map.

Step 2002: Train the acoustic model.

After the terminal obtains the audio data, it preprocesses the audio data. The preprocessing here includes pre-emphasis, framing and other processing. The purpose of these operations is to eliminate the confusion caused by the human vocal organ itself and the device that collects the voice signal. In order to make the signal obtained by subsequent speech processing more uniform and smooth, it provides high-quality parameters for the extraction of signal parameters to improve the quality of speech processing. After the preprocessing is completed, the preprocessed audio data is stored in the database, and the acoustic model is trained based on the stored audio data. For example, the acoustic model can learn how to pronounce each pronunciation and the timbre characteristics, so as to obtain the required acoustic model. .

To add emotional color to speech, an acoustic model can be trained. When training an acoustic model, the audio data is first subjected to acoustic analysis. Here, since the prosody of Chinese is mostly processed by syllables, the prosodic features of syllables play a very important role in the prosody analysis of toned syllables, and the speech parameters can be divided into: tone quality and rhythm. Wherein, the sound quality may include brightness and saturation; the temperament may include pitch, speech rate, syllable interval, and the like. For example, when a person expresses excitement, he speaks at a fast rate, with a high pitch, and may have a certain breathing sound. In this way, information such as fundamental frequency parameters and spectral parameters under the basic emotional color can be obtained.

Then the acoustic model is trained, and the acoustic model here adopts a Hidden Markov Model (HMM, Hidden Markov Model). Figure 22 is a schematic diagram of the training process of the acoustic model provided by the application embodiment. The fundamental frequency parameter is extracted to obtain the fundamental frequency parameter, and the spectral parameter is extracted for the speech signal in the speech corpus, and then the hidden Markov model is trained according to the fundamental frequency parameter and the spectral parameter. The speech corpus here is constructed based on the above-mentioned audio data stored in the library.

The function of the spectral parameters and fundamental frequency parameters here is to make the synthesized sentences more smooth and natural. The spectral parameters are represented by the Mel Frequency Ceptrum Coefficient (MFCC, Mel Frequency Ceptrum Coefficient) and its first-order and second-order delta coefficients. The fundamental frequency parameters are represented by The fundamental frequency F0 and its first-order second-order delta coefficients are represented.

The Mel cepstral coefficient is a classic speech feature, which is a feature parameter extracted based on the characteristics of the human auditory domain, and is an engineering simulation of the human auditory characteristics. In addition to the perception of pitch, human auditory perception also includes the perception of loudness. The human ear's perception of loudness is related to the sound frequency band. Transforming the spectrum of the speech signal into the perceptual frequency domain can better simulate the human hearing process. The meaning of Mel frequency is that 1 Mel is 1/1000 of the degree of pitch perception at 1000 Hz. The fundamental frequency F0 is the lowest frequency of the filter application range.

Step 2003: Synthesize audio.

In actual implementation, first input the text of the article, preprocess the text of the article, first segment the text, convert the text into a sentence composed of words, and then label the sentence at the phoneme level, syllable level, and word level. Synthesize helpful information.

Here, the text needs to be analyzed step by step, such as word, sentence, chapter, book. There are n consecutive grams, gram is the word that we have filtered through a specific threshold), and keyword extraction is performed; Keywords related to emotional tags, such as character, mood, scene, gender, etc., are filtered out of the dictionary.

FIG. 23 is a schematic diagram of the construction process of the keyword dictionary provided by the embodiment of the present application. Referring to FIG. 23, a large-scale text corpus is first constructed to train a word vector model; etc., since the novel tags and general databases have been screened, a seed dictionary is constructed based on the novel tags and general databases; then, model training is performed based on the word vector model and the seed dictionary to predict new words based on the model obtained by training; The obtained new words are added to the keyword dictionary to construct the keyword dictionary.

Further, emotion classification can be carried out based on the character of the character in the article through the emotion classification model. Figure 24 is a schematic diagram of the character-based emotion classification model provided by the embodiment of the present application, and the emotion label related to the character of the character in the article can be extracted in the following manner: The word vector representation of the words in the text is obtained by Word2Vec (a tool for training word vector models), and then the word vector matrix in the paragraph or chapter is obtained, and the word vector matrix is input into the character-based text analyzer 2401 to obtain different types of For text groups, different types of text groups are input into corresponding types of classifiers 2402, and finally the output results of each classifier are fused to obtain the final classification result. Among them, C, A, and E refer to the three dimensions of extroversion, pleasantness and responsibility, respectively, and H and L respectively refer to the level of the personality value of each dimension. For example, HA means high agreeableness, HC means more extroversion, and LE means low responsibility type. Wait.

Through the above algorithm, emotional tags required for speech synthesis can be obtained, such as novel tags, basic information (character, identity, age, gender), and cognitive evaluation (environment, emotion). Then, based on these emotional labels, emotion prediction is performed to predict the emotional color attached to the person when they say the corresponding sentence.

Because emotional color is not only determined by text information, but also affected by the environment and status of the characters in the article. Based on this, the present application infers the emotional color of the character from the context of the text, so that the correct speech can be synthesized smoothly. For example, "she said in tears at this time", at this time we have to predict whether her emotional color is crying with joy or crying with sadness.

After the emotional color is predicted, the audio is synthesized by combining the emotional color. Here, the key to synthesizing speech with emotional colors is to obtain fundamental frequency parameters. The reason why people can distinguish different emotional colors from speech is that speech contains differences in fundamental frequency parameters that can reflect emotions. FIG. 25 is a schematic flowchart of a synthesized audio provided by an embodiment of the present application. Referring to FIG. 25 , the process of synthesizing audio includes:

Step 2501: Parse the text.

Here, parsing text includes syntactic parsing and semantic parsing, wherein syntactic parsing includes part-of-speech tagging, word parsing, and pronunciation parsing.

Step 2502: Emotion tag extraction.

Here, the extracted emotional tags include novel tags, basic information (character, identity, age, gender), and cognitive evaluation (environment, emotion).

Step 2503: Label the speech.

In actual implementation, the speech is annotated by the extracted emotional labels. Here, the labeling logic is the same as when training the acoustic model, that is, adjusting the fundamental frequency parameters and other information. In actual implementation, the fundamental frequency parameters output by the HMM model are obtained, and the fundamental frequency parameters output by the HMM model are adjusted based on the emotion labels to obtain the final fundamental frequency parameters.

Step 2504: Synthesize audio.

Through synthesis filters, audio is synthesized based on fundamental frequency parameters and spectral parameters output by the HMM model.

The application of the above-mentioned embodiment enables the user to be immersed in the scene while listening to the book, and can more immersely enter the scene of the novel, thereby improving the user experience and usage time.

The following will continue to describe the exemplary structure of the voice playback device 555 of the article provided by the embodiment of the present application implemented as a software module. In some embodiments, as shown in FIG. Software modules can include:

The presentation module 5551 is configured to present the text content of the article and the voice playback function item corresponding to the article in the content interface of the article;

A receiving module 5552, configured to receive a voice play instruction for the article triggered based on the voice play function item;

The first playing module 5553 is configured to play the text content by voice in response to the voice play instruction;

The second playing module 5554 is configured to, during the process of playing the text content by voice, when the text content includes at least one character, for the text content corresponding to the character, use the text content that matches the character. sound to play.

In some embodiments, the presentation module is further configured to present a prompt box in a floating form during the process of playing the text content by voice, and

Present text prompt information in the prompt box;

The text prompt information is used to prompt that the text content is being played by voice.

In some embodiments, the presentation module is further configured to shrink the prompt box when the presentation duration of the text prompt information reaches a duration threshold, and

The text prompt information in the prompt box is switched to a play icon; wherein the play icon is used to indicate that the text content is being played by voice.

In some embodiments, the second playing module is further configured to, in response to a selected operation on target content in the text content, present at least two timbre options corresponding to the target content; wherein each of the The timbre option corresponds to a timbre;

In response to the timbre selection operation triggered based on the at least two timbre options, the selected target timbre is used as the timbre of the character corresponding to the target content, to

In the process of playing the text content by voice, the text content corresponding to the character corresponding to the target content is played using the target timbre.

In some embodiments, the first playing module is further configured to present the audition function items of the at least two timbres;

In response to the triggering operation for the audition function item corresponding to the target tone color, the target content is played by using the target tone color corresponding to the audition function item.

In some embodiments, the first playback module is also configured to present a timbre selection function item in the content interface of the article;

In response to a trigger operation for the timbre selection function item, present at least two characters in the article;

In response to a selection operation for a target character among the at least two characters, presenting at least two timbres corresponding to the target character;

In response to the timbre selection operation triggered based on the at least two timbres, the selected target timbre is used as the timbre of the target character, to

During the process of playing the text content by voice, the text content corresponding to the target character is played using the target timbre.

In some embodiments, the first playing module is further configured to present a tone switching button for the text content during the process of playing the text content by voice;

When receiving a trigger operation for the timbre switching button, the timbre corresponding to the text content is switched from the first timbre to the second timbre.

In some embodiments, the first playing module is further configured to, during the process of playing the text content by voice, when playing the dialogue content in the text content, present the target text in the text content Recommended tone information for the content;

The recommended timbre information is used to instruct to switch the timbre of the character corresponding to the target text content based on the recommended timbre information.

In some embodiments, the first playing module is further configured to, when there is text content corresponding to the environment description information in the text content, when the text content corresponding to the environment description information is played, the text content corresponding to the environment description information is played. The ambient music that matches the environment description information is used as the background music, and the background music is played.

In some embodiments, the first playing module is further configured to determine the emotional color corresponding to each sentence in the text content;

Based on the emotional color corresponding to each sentence, the voice corresponding to each of the sentences is respectively generated, so that the voice carries the corresponding emotional color;

Play the generated voice corresponding to each of the sentences.

In some embodiments, the first playback module is further configured to perform emotional tag extraction on each sentence in the text content to obtain emotional tags corresponding to each of the sentences, where the emotional tags include at least one of the following: basic information, cognitive evaluation, psychological feelings;

Using the emotion labels corresponding to the extracted sentences to represent the emotional colors corresponding to the corresponding sentences;

Respectively determine the voice parameters that match each of the emotional tags, and the voice parameters include at least one of sound quality and melody;

Based on each of the speech parameters, the speech of each of the sentences is generated.

In some embodiments, the first playing module is further configured to present a cartoon character when playing the dialogue content in the text content, and play an animation in which the cartoon character uses the timbre to read the dialogue content aloud ;

Wherein, the cartoon characters match the character characteristics of the characters in the dialogue content.

In some embodiments, the first playback module is further configured to extract, from the content of the article, portrait information of the character corresponding to the content of the dialogue;

Obtain the timbre that matches the profile information;

The dialogue content in the text content is played by using the acquired timbre adapted to the portrait information.

In some embodiments, the first playing module is further configured to differentiate and display the currently playing sentences during the process of playing the text content by voice;

As the voice playback progresses, the text content of the article is scrolled and presented so that the presented text content matches the progress of the voice playback.

In some embodiments, the first playback module is also configured to display the currently played sentences differently in the process of playing the text content by voice;

As the voice playback progresses, the text content of the article is presented by page turning, so that the presented text content matches the progress of the voice playback.

Embodiments of the present application provide a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium. The processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the voice playback method of the above-mentioned article in the embodiment of the present application.

The embodiments of the present application provide a computer-readable storage medium storing executable instructions, wherein the executable instructions are stored, and when the executable instructions are executed by a processor, the processor will cause the processor to execute the method provided by the embodiments of the present application, for example , as shown in Figure 3.

In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; it may also include one or any combination of the foregoing memories Various equipment.

In some embodiments, executable instructions may take the form of programs, software, software modules, scripts, or code, written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and which Deployment may be in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

As an example, executable instructions may, but do not necessarily correspond to files in a file system, may be stored as part of a file that holds other programs or data, for example, a Hyper Text Markup Language (HTML, Hyper Text Markup Language) document One or more scripts in , stored in a single file dedicated to the program in question, or in multiple cooperating files (eg, files that store one or more modules, subroutines, or code sections).

As an example, executable instructions may be deployed to be executed on one computing device, or on multiple computing devices located at one site, or alternatively, distributed across multiple sites and interconnected by a communication network execute on.

The above descriptions are merely examples of the present application, and are not intended to limit the protection scope of the present application. Any modifications, equivalent replacements and improvements made within the spirit and scope of this application are included within the protection scope of this application.

Claims

A voice playback method of an article, the method is executed by a computer device, and the method comprises:

In the content interface of the article, the text content of the article and the voice playback function item corresponding to the article are presented;

Receive a voice play instruction for the article triggered based on the voice play function item;

In response to the voice play instruction, play the text content by voice;

During the process of playing the text content by voice, when the text content includes at least one character, for the text content corresponding to the character, the timbre matching the character characteristics of the character is used for playing.
The method of claim 1, wherein the method further comprises:

During the process of playing the text content by voice, a prompt box is presented in a floating form, and text prompt information is presented in the prompt box;

The text prompt information is used to prompt that the text content is being played by voice.
The method of claim 2, wherein after the text prompt information is presented in the prompt box, the method further comprises:

When the presentation duration of the text prompt information reaches the duration threshold, shrink the prompt box, and switch the text prompt information in the prompt box to a play icon;

The play icon is used to indicate that the text content is being played by voice.
The method of claim 1, wherein the method further comprises:

In response to the selection operation for the target content in the text content, present at least two timbre options corresponding to the target content; wherein each of the timbre options corresponds to a timbre;

In response to the timbre selection operation triggered based on the at least two timbre options, the selected target timbre is used as the timbre of the character corresponding to the target content, to

In the process of playing the text content by voice, the text content corresponding to the character corresponding to the target content is played using the target timbre.
The method of claim 4, wherein after the presenting at least two timbre options corresponding to the target content, the method further comprises:

presenting the audition function items of the at least two timbres;

In response to the triggering operation for the audition function item corresponding to the target tone color, the target content is played by using the target tone color corresponding to the audition function item.
The method of claim 1, wherein the method further comprises:

In the content interface of the article, the timbre selection function item is presented;

In response to a trigger operation for the timbre selection function item, present at least two characters in the article;

In response to a selection operation for a target character among the at least two characters, presenting at least two timbres corresponding to the target character;

In response to the timbre selection operation triggered based on the at least two timbres, the selected target timbre is used as the timbre of the target character, to

During the process of playing the text content by voice, the text content corresponding to the target character is played using the target timbre.
The method of claim 1, wherein the method further comprises:

During the process of playing the text content by voice, presenting a timbre switching key for the text content;

When receiving a trigger operation for the timbre switching button, the timbre corresponding to the currently playing content is switched from the first timbre to the second timbre.
The method of claim 1, wherein the method further comprises:

In the process of playing the text content by voice, presenting the recommended timbre information for the target text content in the text content;

The recommended timbre information is used to instruct to switch the timbre of the character corresponding to the target text content based on the recommended timbre information.
The method of claim 1, wherein the method further comprises:

When there is text content corresponding to the environmental description information in the text content, when the text content corresponding to the environmental description information is played, the environmental music that matches the environmental description information is used as the background music, and all background music.
The method of claim 1, wherein the playing the text content by voice comprises:

determining the emotional color corresponding to each sentence in the text content;

Based on the emotional color corresponding to each sentence, the voice corresponding to each of the sentences is respectively generated, so that the voice carries the corresponding emotional color;

Play the generated voice corresponding to each of the sentences.
The method of claim 10, wherein the determining the emotional color corresponding to each sentence in the text content comprises:

Extracting emotion tags for each statement in the text content, to obtain emotion tags corresponding to each statement;

Using the emotion labels corresponding to the extracted sentences to represent the emotional colors corresponding to the corresponding sentences;

Based on the emotional color corresponding to each statement, the speech corresponding to each statement is respectively generated, including:

Respectively determine the voice parameters that match each of the emotional tags, and the voice parameters include at least one of sound quality and melody;

Based on each of the speech parameters, the speech of each of the sentences is generated.
The method of claim 1, wherein the method further comprises:

When playing the dialogue content in the text content, present a cartoon character, and play an animation in which the cartoon character uses the timbre to read the dialogue content aloud;

Wherein, the cartoon characters match the character characteristics of the characters in the dialogue content.
A voice playback device of an article, the device comprising:

A presentation module, configured to present the text content of the article and the voice playback function item corresponding to the article in the content interface of the article;

a receiving module, configured to receive a voice play instruction for the article triggered based on the voice play function item;

a first playing module, configured to play the text content by voice in response to the voice play instruction;

The second playing module is configured to, during the process of playing the text content by voice, when the text content includes at least one character, for the content corresponding to the character, use the content that matches the character characteristics of the character. sound to play.
A computer device comprising:

memory for storing executable instructions;

The processor is configured to implement the voice playing method of the article according to any one of claims 1 to 12 when executing the executable instructions stored in the memory.
A computer-readable storage medium, wherein executable instructions are stored, and when executed by a processor, implement the voice playing method of the article according to any one of claims 1 to 12.
A computer program product, comprising a computer program or an instruction, when the computer program or instruction is executed by a processor, the voice playback method of the article of any one of claims 1 to 12 is implemented.