WO2022184055A1 - Speech playing method and apparatus for article, and device, storage medium and program product - Google Patents

Speech playing method and apparatus for article, and device, storage medium and program product

Info

Publication number
WO2022184055A1
WO2022184055A1 PCT/CN2022/078610
Authority
WO
WIPO (PCT)
Prior art keywords
timbre
voice
content
text content
character
Prior art date
Application number
PCT/CN2022/078610
Other languages
French (fr)
Chinese (zh)
Inventor
谢映雪 (Xie Yingxue)
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited (腾讯科技(深圳)有限公司)
Publication of WO2022184055A1 publication Critical patent/WO2022184055A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3343Query execution using phonetics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Definitions

  • the present application relates to the field of computer technology, and in particular, to a voice playback method, apparatus, device, computer-readable storage medium, and computer program product of an article.
  • When a user reads an article, a voice playback function may be provided to the user, that is, the text content of the article is played by voice. In the related art, however, all of the article's content is read aloud in a single voice, so that the user cannot become immersed in the content of the article.
  • Embodiments of the present application provide a voice playback method, apparatus, and device for an article, as well as a computer-readable storage medium and a computer program product, which can make users feel immersed in the scene when text content is played by voice, improving the sense of immersion brought by voice playback.
  • the embodiment of the present application provides a voice playback method of an article, including:
  • presenting, in a content interface of the article, the text content of the article and the voice playback function item corresponding to the article;
  • receiving a voice playback instruction for the article triggered based on the voice playback function item;
  • playing the text content by voice in response to the voice playback instruction;
  • in the process of playing the text content by voice, when the text content includes at least one character, playing the text content corresponding to the character with the timbre matching the character characteristics of the character.
  • the embodiment of the present application provides a voice playback device of an article, including:
  • a presentation module configured to present the text content of the article and the voice playback function item corresponding to the article in the content interface of the article;
  • a receiving module configured to receive a voice play instruction for the article triggered based on the voice play function item
  • a first playing module configured to play the text content by voice in response to the voice play instruction
  • a second playing module configured to, in the process of playing the text content by voice, when the text content includes at least one character, play the text content corresponding to the character with the timbre matching the character characteristics of the character.
  • Embodiments of the present application provide a computer device, including:
  • a memory configured to store executable instructions; and
  • a processor configured to implement the voice playing method of the article provided by the embodiments of the present application when executing the executable instructions stored in the memory.
  • Embodiments of the present application provide a computer-readable storage medium storing executable instructions for causing a processor to execute the voice playback method of the article provided by the embodiments of the present application.
  • the embodiments of the present application provide a computer program product, including computer programs or instructions, which, when executed by a processor, implement the voice playback method of the articles provided by the embodiments of the present application.
  • In the embodiments of the present application, the text content of the article and the voice playback function item corresponding to the article are presented; a voice playback instruction for the article, triggered based on the voice playback function item, is received; in response to the voice playback instruction, the text content is played by voice; and in the process of playing the text content by voice, when the text content includes at least one character, the text content corresponding to the character is played with the timbre matching the character characteristics of the character. In this way, the timbre used to play the text content matches the character characteristics corresponding to that content, so that the user feels immersed in the scene while listening and can become more absorbed in the content of the article, improving the immersion brought by voice playback.
  • FIG. 1 is a schematic structural diagram of a voice playback system 100 of an article provided by an embodiment of the present application;
  • FIG. 2 is a schematic structural diagram of a computer device 500 provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a voice playback method of an article provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a content interface provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the presentation of a prompt box provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of the presentation of a prompt box provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a content interface provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a content interface provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a content interface provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a content interface provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a content interface provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of an emotion tag provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of speech parameters provided by an embodiment of the present application.
  • FIG. 15 is a schematic diagram of a content interface provided by an embodiment of the present application.
  • FIG. 16 is a schematic diagram of a content interface provided by an embodiment of the present application.
  • FIG. 17 is a schematic diagram of an application architecture of a blockchain network provided by an embodiment of the present application.
  • FIG. 18 is a schematic structural diagram of a blockchain in a blockchain network 600 provided by an embodiment of the present application.
  • FIG. 19 is a schematic diagram of a functional architecture of a blockchain network 600 provided by an embodiment of the present application.
  • FIG. 20 is a schematic flowchart of technical side implementation provided by an embodiment of the present application.
  • FIG. 21A is a schematic diagram of a fundamental frequency point provided by an embodiment of the present application.
  • FIG. 21B is a diagram of tonal fifths provided by an embodiment of the present application.
  • FIG. 22 is a schematic diagram of an acoustic model training process provided by an embodiment of the present application.
  • FIG. 23 is a schematic diagram of a construction process of a keyword dictionary provided by an embodiment of the present application.
  • FIG. 24 is a schematic diagram of a personality-based emotion classification model provided by an embodiment of the present application.
  • FIG. 25 is a schematic flowchart of synthesizing audio provided by an embodiment of the present application.
  • The terms "first/second/third" are only used to distinguish similar objects and do not denote a specific ordering of objects. It is understood that, where permitted, the specific order or sequence of "first/second/third" may be interchanged, so that the embodiments of the application described herein can be practiced in sequences other than those illustrated or described herein.
  • Character characteristics: used to characterize the features of the person corresponding to a character; they can also be understood as the character's portrait features, i.e., tagged overall information abstracted from the character's basic information, such as the character's gender, age, and identity;
  • the character characteristics may include age characteristics, identity characteristics, gender characteristics, personality characteristics, health status characteristics, and the like.
  • Transaction: equivalent to the computer term "transaction". A transaction includes an operation that needs to be submitted to the blockchain network for execution, not just a transaction in a business context; it is in this sense that the term "transaction" is used in the embodiments of the present application.
  • Blockchain is a storage structure of encrypted and chained transactions formed by blocks.
  • Blockchain Network a set of nodes that incorporate new blocks into the blockchain through consensus.
  • Ledger is a general term for blockchain (also known as ledger data) and a state database synchronized with the blockchain.
  • Smart Contracts, also known as chaincode or application code, are programs deployed on the nodes of a blockchain network; the nodes execute the smart contracts called in received transactions, performing key-value operations on the state database to update or query data.
  • Consensus is a process in the blockchain network used to reach agreement on the transactions in a block among the multiple nodes involved; the agreed block is appended to the end of the blockchain. Mechanisms for achieving consensus include Proof of Work (PoW), Proof of Stake (PoS), Delegated Proof of Stake (DPoS), Proof of Elapsed Time (PoET), etc.
  • PoW Proof of Work
  • PoS Proof of Stake
  • DPoS Delegated Proof-of-Stake
  • PoET Proof of Elapsed Time
  • FIG. 1 is a schematic diagram of the architecture of the voice playback system 100 of the article provided by the embodiment of the present application.
  • a terminal exemplarily shows a terminal 400-1 and a terminal 400-2
  • the network 300 may be a wide area network or a local area network, or a combination of the two.
  • a terminal, used for presenting, in the content interface of the article, the text content of the article and the voice playback function item corresponding to the article; receiving a voice playback instruction for the article triggered based on the voice playback function item; and sending a voice acquisition request for the text content to the server;
  • the server 200 is configured to generate the voice of the text content in response to the voice acquisition request, and send the generated voice of the text content to the terminal;
  • the terminal is used to play the text content by voice according to the received voice and, in the process of playing the text content by voice, when the text content includes at least one character, to play the text content corresponding to the character with the timbre matching the character characteristics of the character.
  • The server 200 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms.
  • the terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in this embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a computer device 500 provided by an embodiment of the present application.
  • The computer device 500 may be the terminal or the server 200 in FIG. 1; the following takes the case where the computer device is the terminal shown in FIG. 1 as an example.
  • the computer device 500 shown in FIG. 2 includes: at least one processor 510 , memory 550 , at least one network interface 520 and user interface 530 .
  • the various components in electronic device 500 are coupled together by bus system 540 .
  • the bus system 540 is configured to enable connection communication between these components.
  • the bus system 540 also includes a power bus, a control bus and a status signal bus.
  • the various buses are labeled as bus system 540 in FIG. 2 .
  • the processor 510 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., where a general-purpose processor may be a microprocessor or any conventional processor or the like.
  • DSP Digital Signal Processor
  • User interface 530 includes one or more output devices 531 that enable presentation of media content, including one or more speakers and/or one or more visual display screens.
  • User interface 530 also includes one or more input devices 532, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, and other input buttons and controls.
  • Memory 550 may be removable, non-removable, or a combination thereof.
  • Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like.
  • Memory 550 optionally includes one or more storage devices that are physically remote from processor 510 .
  • Memory 550 includes volatile memory or non-volatile memory, and may also include both volatile and non-volatile memory.
  • the non-volatile memory may be a read-only memory (ROM, Read Only Memory), and the volatile memory may be a random access memory (RAM, Random Access Memory).
  • ROM read-only memory
  • RAM random access memory
  • the memory 550 described in the embodiments of the present application is intended to include any suitable type of memory.
  • memory 550 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
  • the operating system 551 includes system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
  • a presentation module 553 for enabling presentation of information (eg, a user interface for operating peripherals and displaying content and information) via one or more output devices 531 associated with the user interface 530 (eg, a display screen, speakers, etc.) );
  • An input processing module 554 for detecting one or more user inputs or interactions from one of the one or more input devices 532 and translating the detected inputs or interactions.
  • the voice playback device for articles provided by the embodiments of the present application may be implemented in software.
  • FIG. 2 shows the voice playback device 555 for articles stored in the memory 550, which may be in the form of programs and plug-ins.
  • The software includes the following software modules: a presentation module 5551, a receiving module 5552, a first playing module 5553, and a second playing module 5554. These modules are logical and thus can be combined or further split arbitrarily according to the functions to be realized. The function of each module is explained below.
  • the voice playback device of the article provided by the embodiment of the present application may be implemented in hardware.
  • the voice playback device of the article provided by the embodiment of the present application may be a processor in the form of a hardware decoding processor , which is programmed to execute the voice playback method of the article provided in the embodiment of the present application, for example, the processor in the form of a hardware decoding processor may adopt one or more application specific integrated circuits (ASIC, Application Specific Integrated Circuit), DSP, Programmable Logic Device (PLD, Programmable Logic Device), Complex Programmable Logic Device (CPLD, Complex Programmable Logic Device), Field Programmable Gate Array (FPGA, Field-Programmable Gate Array) or other electronic components.
  • ASIC Application Specific Integrated Circuit
  • DSP Digital Signal Processor
  • PLD Programmable Logic Device
  • CPLD Complex Programmable Logic Device
  • FPGA Field-Programmable Gate Array
  • the voice playing method of the article provided by the embodiment of the present application will be described.
  • the voice playing method of the article provided by the embodiment of the present application may be implemented by the terminal alone, or by the server and the terminal collaboratively.
  • FIG. 3 is a schematic flowchart of a voice playback method of an article provided by an embodiment of the present application, which will be described with reference to the steps shown in FIG. 3 .
  • Step 301 The terminal presents the text content of the article and the voice playback function item corresponding to the article in the content interface of the article.
  • the terminal is provided with a client, such as a reading client, an instant messaging client, etc., and the terminal can present the text content of the article through the client.
  • the articles can be novels, prose, popular science articles, etc.
  • Text content refers to the expression of written language, that is, one or more characters with specific meanings; the text content can be a word, a phrase, a sentence, a paragraph, or an entire article.
  • the terminal may also present a voice play function item corresponding to the article, and the voice play function item is used to play the text content by voice when a trigger operation is received.
  • FIG. 4 is a schematic diagram of a content interface provided by an embodiment of the present application.
  • In the content interface, the text content 401 of the article and the voice playback function item 402 corresponding to the article are presented.
  • Step 302 Receive a voice play instruction for the article triggered based on the voice play function item.
  • When the user reads the text content of the presented article, a voice playback instruction for the article can be triggered based on the voice playback function item.
  • the voice playback instruction for the article can be triggered based on the trigger operation for the voice playback function item
  • the trigger operation includes, but is not limited to, a click operation, a double-click operation, a slide operation, and the like, and the embodiment of the present application does not limit the trigger operation.
  • For example, when the user clicks the voice play function item 402 in FIG. 4, the voice play instruction for the article is triggered.
  • Step 303 In response to the voice play instruction, play the text content by voice.
  • When the terminal receives the voice play instruction, it acquires voice data corresponding to the text content and plays the voice data, so as to play the text content by voice.
  • the voice data is generated based on the text content, and the process of generating the voice data may be performed on the terminal or on the server.
  • In some embodiments, in response to the voice playback instruction, the terminal sends a voice playback request for the article to the server, where the voice playback request carries the identifier of the article; the server obtains the text content of the corresponding article based on the identifier carried in the voice playback request, generates voice data based on the text content, and returns the generated voice data to the terminal; the terminal then plays the voice data. It should be noted that the voice data played in this application is generated intelligently, rather than pre-generated by recording a reading of the article.
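The terminal/server exchange described above can be sketched as follows; this is a minimal illustration, and all names (`VoicePlaybackRequest`, `ARTICLES`, `synthesize`, and the sample article identifier) are hypothetical assumptions rather than anything specified in the application:

```python
from dataclasses import dataclass

@dataclass
class VoicePlaybackRequest:
    article_id: str  # identifier of the article carried by the request

# Server-side article store (illustrative data).
ARTICLES = {"a1": "Once upon a time, said the old captain."}

def synthesize(text: str) -> bytes:
    # Placeholder for intelligent speech generation: a real system would
    # call a TTS engine here instead of merely encoding the text.
    return text.encode("utf-8")

def handle_voice_playback(request: VoicePlaybackRequest) -> bytes:
    # Server side: look up the article text by the identifier carried in
    # the request, then generate voice data from that text.
    text = ARTICLES[request.article_id]
    return synthesize(text)

# Terminal side: send the request, receive the voice data, then play it.
voice_data = handle_voice_playback(VoicePlaybackRequest(article_id="a1"))
```

The key design point is that synthesis happens on demand from the article identifier, matching the note above that the voice data is generated rather than pre-recorded.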
  • When the terminal receives a voice play instruction, it starts to play the text content by voice; in the process of playing the text content by voice, prompt information may be presented to prompt the user that the text content is being played by voice.
  • the prompt information can be in a variety of forms, for example, the prompt information can be in the form of text, or in the form of images.
  • The prompt information can be presented in a floating form, or in a certain presentation area of the content interface, for example at the top of the content interface; the embodiment of the present application does not limit the presentation form of the prompt information.
  • When the prompt information is in the form of text, in the process of playing the text content by voice, the terminal presents a prompt box in a floating form and presents the text prompt information in the prompt box, where the text prompt information is used to indicate that the text content is being played by voice.
  • the presentation form of the prompt box is a floating form, that is, the prompt box is independent of the content interface and is suspended above the content interface.
  • FIG. 5 is a schematic diagram of the presentation of a prompt box provided by an embodiment of the present application. Referring to FIG. 5 , a prompt box 501 is presented in a floating form, and a text prompt message “You are listening to an intelligent recognition audiobook” is presented in the prompt box 501 .
  • The prompt box is movable, that is, the user can trigger a moving operation for the floating prompt box; the prompt information moves with the prompt box. In this way, when the prompt box blocks content that the user wants to browse, it can be moved out of the way, improving the user's reading experience.
  • the presentation time of the prompt box may be the same as the start time of playing the text content by voice, that is, the prompt box is presented while the text content is played by voice.
  • The presentation duration of the prompt box may be preset, that is, when the presentation duration reaches a preset duration, the prompt box is cancelled; the presentation duration may also be consistent with the duration of playing the text content by voice, that is, the prompt box remains presented while the text content is being played by voice and is cancelled when voice playback of the text content stops; the presentation duration may also be controlled by the user, that is, the prompt box is cancelled when the user triggers a close operation for the prompt box.
  • The presentation style of the prompt box and/or the content presented in the prompt box may be adjusted, where the presentation style of the prompt box includes the shape, size, presentation position of the prompt box, and the like.
  • In some embodiments, when the presentation duration of the text prompt information reaches a duration threshold, the terminal shrinks the prompt box and switches the text prompt information in the prompt box to a play icon, where the play icon is used to indicate that the text content is being played by voice.
  • The duration threshold can be preset, for example by the system or by the user. When the text prompt information is presented, timing starts in order to determine its presentation duration; when the presentation duration reaches the duration threshold, the presentation style and content of the prompt box are adjusted, that is, the prompt box is shrunk to reduce its size and the presented text prompt information is switched to the play icon.
  • the size of the shrunk prompt box is adapted to the content presented in the prompt box.
  • FIG. 6 is a schematic diagram of the presentation of a prompt box provided by an embodiment of the present application.
  • For example, assuming the duration threshold is 10 seconds, when the presentation duration of the text prompt information in FIG. 5 reaches 10 seconds, the text prompt message "You are listening to an intelligent recognition audiobook" is switched to the play icon 61 in FIG. 6, and the prompt box is shrunk so that its size matches the size of the content in the prompt box.
  • In this way, when the presentation duration of the text prompt information reaches the duration threshold, the prompt box is shrunk and the text prompt information in it is switched to a play icon indicating that the text content is being played by voice; this prevents the prompt box from covering too much of the text content for a long time when the text prompt information is lengthy, which would otherwise affect the reading experience.
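As a rough sketch of the shrink-after-threshold behaviour described above (the `PromptBox` class, its fields, and the icon string are illustrative assumptions, not part of the application):

```python
DURATION_THRESHOLD_S = 10  # e.g. 10 seconds, as in the example above

class PromptBox:
    def __init__(self, text: str):
        self.content = text   # starts out showing the text prompt
        self.shrunk = False   # full-size box at first

    def tick(self, presented_for_s: float) -> None:
        # Called periodically with how long the prompt has been presented;
        # once the threshold is reached, switch the text prompt to a play
        # icon and shrink the box to fit the icon.
        if not self.shrunk and presented_for_s >= DURATION_THRESHOLD_S:
            self.content = "▶"
            self.shrunk = True

box = PromptBox("You are listening to an intelligent recognition audiobook")
box.tick(5)   # below threshold: text prompt still shown, box full size
box.tick(10)  # threshold reached: play icon shown, box shrunk
```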
  • Step 304 During the process of playing the text content by voice, when the text content includes at least one character, for the text content corresponding to the character, use the timbre matching the character characteristics of the character to play.
  • the text content corresponding to the character refers to the text content associated with the character, such as the character's dialogue content, inner monologue, description content, etc.;
  • The character feature can be a label abstracted from at least two items of the character's basic information; for example, age characteristics, identity characteristics, gender characteristics, personality characteristics, and health status characteristics may be abstracted from the character's age information, identity information (such as "domineering CEO"), gender information, personality information, and health status information.
  • The number of characters included in the text content may be one or more; when the number of characters is two or more, the characters and the timbres are in a one-to-one correspondence.
  • The text content of each character is played with the timbre that matches that character's character characteristics; that is, the character characteristics of the multiple characters are obtained, the character characteristics of each character are matched against the timbres to determine the timbre matching each character, and the text content corresponding to each character is then played with the obtained timbre.
  • When the character characteristics of each character are matched with the timbres, in some embodiments the character characteristics can be identified by corresponding tags (i.e., character tags); for example, an age tag and an identity tag may be used for identification. In this application, the character characteristics of a given character include at least two types, that is, a given character can have at least two kinds of tags.
  • Multiple (that is, at least two) timbres can be pre-stored, each corresponding to at least two tags; the at least two tags corresponding to a character are then matched against the tags of the respective timbres to determine the timbre that matches the character's character characteristics.
  • When at least two timbres are obtained by matching, one of them can be randomly selected as the target timbre, and the text content corresponding to the character is played with the selected target timbre; alternatively, the matching degree between each timbre and the character characteristics can be obtained, and the timbre with the highest matching degree is selected as the target timbre and used to play the text content corresponding to the character; or options corresponding to the at least two matched timbres can be presented for the user to select, with the user-selected timbre taken as the target timbre and used to play the text content corresponding to the character.
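The tag-matching and highest-matching-degree selection described above can be sketched as follows; the tag vocabulary, the timbre identifiers, and the use of shared-tag count as the matching degree are all illustrative assumptions:

```python
# Pre-stored timbres, each carrying at least two tags (illustrative data).
TIMBRES = {
    "timbre_young_female": {"gender:female", "age:young", "personality:lively"},
    "timbre_old_male":     {"gender:male", "age:old", "identity:captain"},
    "timbre_boy":          {"gender:male", "age:child"},
}

def matching_degree(character_tags: set, timbre_tags: set) -> int:
    # Simplest possible measure of matching degree: the number of tags
    # the character and the timbre share.
    return len(character_tags & timbre_tags)

def select_target_timbre(character_tags: set) -> str:
    # Return the id of the timbre with the highest matching degree, as in
    # the "highest matching degree" option described above.
    return max(TIMBRES, key=lambda t: matching_degree(character_tags, TIMBRES[t]))

captain_tags = {"gender:male", "age:old", "identity:captain"}
print(select_target_timbre(captain_tags))  # → timbre_old_male
```

A real system would combine this with the random-selection and user-selection alternatives mentioned above; only the matching-degree variant is shown here.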
• To play the text content corresponding to a character with the target timbre, the pronunciation of each word in the dialogue content can first be determined and the timbre features of the target timbre then added, so that speech for the text content is generated based on those timbre features; the generated speech is then played.
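The tag-matching and target-timbre selection described above can be sketched as follows. This is a minimal illustration; the tag names, the timbre library, and the matching-degree definition (number of shared tag values) are hypothetical assumptions, not part of the embodiment.

```python
# Each pre-stored timbre is associated with at least two tags (hypothetical data).
TIMBRE_LIBRARY = {
    "timbre_a": {"age": "child", "identity": "student"},
    "timbre_b": {"age": "young", "identity": "president"},
    "timbre_c": {"age": "young", "identity": "student"},
}

def match_timbres(character_tags):
    """Return (timbre, matching_degree) pairs sorted by matching degree,
    where the matching degree counts shared tag values."""
    scored = []
    for name, tags in TIMBRE_LIBRARY.items():
        degree = sum(1 for key, value in character_tags.items() if tags.get(key) == value)
        if degree > 0:
            scored.append((name, degree))
    # Present the timbre that best matches the character first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored

def select_target_timbre(character_tags):
    """Pick the timbre with the highest matching degree as the target timbre."""
    candidates = match_timbres(character_tags)
    return candidates[0][0] if candidates else None
```

For example, a character tagged `{"age": "young", "identity": "student"}` matches `timbre_c` on both tags and is selected over the single-tag matches.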
• In some embodiments, in response to a selection operation on target content in the text content, the terminal may present at least two timbre options corresponding to the target content, each timbre option corresponding to one timbre. In response to a timbre selection operation triggered based on the at least two timbre options, the selected target timbre is taken as the timbre of the character corresponding to the target content, so that in the process of playing the text content by voice, the text content of that character is played with the target timbre.
  • the user can select the timbre of a character by himself, so that when the terminal plays the text content corresponding to the character, the timbre selected by the user is used for playing.
  • the user selects the character whose timbre needs to be selected based on the presented text content.
  • the character is selected by selecting the text content, that is, the character corresponding to the selected target content is taken as the selected character.
• After the target content is determined, at least two timbre options corresponding to the target content are presented.
• When the timbre options are presented, they can be sorted according to the degree of matching between each timbre and the characteristics of the character corresponding to the target content; for example, the option for the timbre with a higher matching degree is presented nearer the front.
• The user selects a timbre based on the at least two presented timbre options.
• The selection operation here may be a click operation or a press operation on the timbre option corresponding to the target timbre; the trigger form of the selection operation is not limited here.
  • the at least two timbre options corresponding to the target content may be presented in the form of a drop-down list, an icon, or an image.
  • the presentation forms of the at least two timbre options are not limited here.
  • at least two timbre options can be presented directly in the content interface, or a floating layer independent of the content interface can be presented, and at least two timbre options can be presented in the floating layer.
• The above selection operation on the target content and the timbre selection operation may be performed before the text content is played by voice, or during the process of playing it.
  • FIG. 7 is a schematic diagram of a content interface provided by an embodiment of the present application.
  • the user selects the target content based on the presented text content.
• The target content can be selected by clicking on the text; that is, when the user's click operation is received, the sentence presented at the clicked position is taken as the target content, and a floating layer is presented in which at least two timbre options 701 are shown.
• Before selecting a timbre, the user can audition each timbre. That is, after presenting the at least two timbre options corresponding to the target content, the terminal can also present corresponding audition function items; in response to a trigger operation on the audition function item corresponding to a target timbre, the target content is played using that target timbre.
  • each timbre option may correspond to an audition function item. After the user triggers a certain audition function item, the target timbre corresponding to the audition function item is determined, and then the target content is played based on the target timbre.
  • FIG. 8 is a schematic diagram of a content interface provided by an embodiment of the present application.
• Referring to FIG. 8, the target content can be selected by clicking on the text; when the user's click operation is received, the sentence presented at the clicked position is taken as the target content and a floating layer is presented, in which at least two timbre options 801 are shown. Each timbre option may carry an image of a cartoon character and a textual description of the timbre, such as the "silly white sweet" type; an audition function item 802 is presented under each timbre option, the audition function items corresponding one-to-one to the timbre options. When the user triggers the audition function item under the "silly white sweet" option, the target content (i.e., the selected sentence) is played with the "silly white sweet" timbre.
• In some embodiments, in response to a selection operation on target content in the presented dialogue content, the terminal may present at least two timbre options corresponding to the target content together with a confirmation function item, each timbre option corresponding to one timbre. In response to a timbre selection operation triggered based on the at least two timbre options, the target content is played with the selected target timbre; in response to a trigger operation on the confirmation function item, the target timbre is taken as the timbre of the character corresponding to the target content, so that the dialogue content of that character is played by voice using the target timbre.
• The user can switch among the timbres before triggering the confirmation function item; each time a timbre is selected, the target content is played with it. In this way, the user can decide from the played sound whether to keep the selected timbre, which avoids having to re-select after a wrong choice and improves the efficiency of human-computer interaction.
• In some embodiments, a timbre selection function item is presented in the content interface of the article. In response to a trigger operation on the timbre selection function item, at least two characters in the article are presented; in response to a selection operation on a target character among them, at least two timbres corresponding to the target character are presented; and in response to a timbre selection operation triggered based on the at least two timbres, the selected target timbre is taken as the timbre of the target character, so that in the process of playing the text content by voice, the dialogue content of the target character is played with the selected target timbre.
• Here, after receiving a trigger operation on the timbre selection function item, the terminal can present at least two characters in the article. All characters in the article may be presented, or only some of them, for example only the characters corresponding to the currently presented text content.
• After at least two characters in the article are presented, the user can select one of them as the target character and select the target timbre for it. After the target timbre has been selected for one character, other characters may be selected from the at least two characters so that timbres can be selected for them as well.
• In this way, the user can select not only the timbre of a character whose dialogue content appears in the current content interface, but also the timbre of a character whose dialogue content has not yet been presented. Thus, by triggering the timbre selection function item once, the timbre of each of multiple characters can be selected, which improves the efficiency of human-computer interaction.
  • FIG. 9 is a schematic diagram of a content interface provided by an embodiment of the present application.
  • a timbre selection function item 901 is presented on the content interface.
• The selection interface presents all characters 902 in the article, such as character A, character B, and character C. When the user clicks a character, for example "character A", multiple timbres 903 matching the character characteristics of "character A" are presented, and the user may select one of them as the target timbre.
• In some embodiments, during the process of playing the text content by voice, the terminal may also present a timbre switching button for the text content; upon receiving a trigger operation on the timbre switching button, the terminal switches the timbre corresponding to the currently playing content from a first timbre to a second timbre.
  • the embodiment of the present application provides a button for quickly switching the timbre, that is, the timbre switching button.
• The timbre switching button is used to switch the timbre corresponding to the sentence currently being played, where the first timbre is the currently playing timbre and the second timbre is the recommended timbre to switch to, the first timbre being different from the second timbre. The second timbre corresponds to the currently playing sentence, and the second timbres corresponding to different sentences may be the same or different.
• Both the first timbre and the second timbre may be timbres that match the character characteristics of the character corresponding to the currently playing content. For example, when certain dialogue content is played, multiple timbres matching the characteristics of the corresponding character are obtained; one of them is selected as the first timbre and another as the second timbre. The dialogue content is first played with the first timbre, and after a trigger operation on the switching button is received, the first timbre is switched to the second timbre; that is, the dialogue content is subsequently played with the second timbre.
• After the timbre corresponding to the currently playing content is switched from the first timbre to the second timbre, content belonging to the same character as the currently playing content is played with the second timbre.
• The timbre switching button may be triggered again; after the trigger operation is received, the second timbre is switched to a third timbre, where the third timbre may be the same as or different from the first timbre.
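The quick-switch behavior above (first timbre, then second timbre, then a third that may equal the first) can be sketched as a simple cycle over the timbres matched to the character; the class and timbre names below are hypothetical.

```python
class TimbreSwitcher:
    """Minimal sketch of the timbre switching button: the currently playing
    timbre is the 'first timbre', and each trigger operation switches to the
    next recommended timbre among those matched to the character."""

    def __init__(self, matched_timbres):
        if not matched_timbres:
            raise ValueError("need at least one matched timbre")
        self._timbres = list(matched_timbres)
        self._index = 0  # index of the currently playing timbre

    @property
    def current(self):
        return self._timbres[self._index]

    def switch(self):
        """Handle a trigger operation on the timbre switching button."""
        self._index = (self._index + 1) % len(self._timbres)
        return self.current
```

With two matched timbres, a second trigger returns to the first timbre, matching the note that the third timbre may equal the first.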
• In some embodiments, during the process of playing the text content by voice, the terminal presents recommended timbre information for target text content in the text content, where the recommended timbre information is used to indicate the timbre to which the character corresponding to the target text content is to be switched.
• Here, a timbre may be recommended to the user. The target text content may be the currently playing text content, or any text content whose corresponding character's characteristics match the recommended timbre information.
• In actual implementation, a timbre matching the character characteristics of the character in the current dialogue content is obtained, and recommended timbre information is generated based on the matched timbre, for example based on the timbre with the highest matching degree.
  • FIG. 10 is a schematic diagram of a content interface provided by an embodiment of the present application.
• Recommended timbre information 1001 is presented, such as "Lin xx's voice matches the fifth junior sister very well", to prompt the user to switch the fifth junior sister's timbre to Lin xx's voice.
• In actual implementation, a timbre switching button matching the recommended timbre information is presented; after the user's trigger operation on the timbre switching button is received, the timbre corresponding to the relevant dialogue content is switched to the timbre indicated by the recommended timbre information.
  • FIG. 11 is a schematic diagram of a content interface provided by an embodiment of the present application.
• Recommended timbre information 1101 is presented, such as "Lin xx's voice matches the fifth junior sister very well", and a timbre switching button 1102 is displayed at the same time.
• After the user triggers the timbre switching button 1102, the text content corresponding to the fifth junior sister, such as her dialogue content, is played using Lin xx's voice.
• In some embodiments, when the text content contains text corresponding to environment description information, the terminal may, while playing that text, use the ambient music that matches the environment description information as background music and play it.
• In actual implementation, the environment description information in the text content is obtained first. A keyword dictionary for environment description information can be preset, storing the keywords corresponding to various kinds of environment description information; the text content is then matched against the keywords in the dictionary. When the text content contains text matching a keyword in the dictionary, it is determined that text corresponding to environment description information exists; that text is extracted and matched against the available ambient music to obtain the ambient music that matches the environment description information.
• For example, if the environment description information contained in the text content is a rainy night, ambient music that matches rain can be obtained, and when the text content corresponding to the environment description information is played, that ambient music is played as the background music.
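The keyword-dictionary matching described above can be sketched as follows; the dictionary entries, keywords, and music file names are hypothetical examples.

```python
# Hypothetical keyword dictionary: environment description -> trigger keywords.
KEYWORD_DICTIONARY = {
    "rainy night": ["rain", "drizzle", "downpour"],
    "forest": ["forest", "woods"],
}

# Hypothetical mapping from environment descriptions to ambient music tracks.
AMBIENT_MUSIC = {
    "rainy night": "rain_ambience.mp3",
    "forest": "forest_ambience.mp3",
}

def find_background_music(text_content):
    """Match the text content against the keyword dictionary; if an
    environment description is found, return the matching ambient music."""
    lowered = text_content.lower()
    for environment, keywords in KEYWORD_DICTIONARY.items():
        if any(keyword in lowered for keyword in keywords):
            return AMBIENT_MUSIC[environment]
    return None  # no environment description information in the text content
```

Text describing a rainy night thus resolves to the rain ambience, which would be played as background music while that passage is read.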
• In some embodiments, the terminal can also play the text content in the following way: determine the emotional color corresponding to each sentence in the text content; generate, based on the emotional color of each sentence, a voice for that sentence carrying the corresponding emotional color; and play the generated voice for each sentence.
• Here, each sentence in the text content has a corresponding emotional color; this applies especially to dialogue content, since the characters in the text speak with emotion, such as sadness or happiness.
  • the generated voice carries emotional color, so that the user can have an immersive feeling when hearing the voice.
• In actual implementation, the emotional color of each sentence is determined not only from the sentence itself but also in combination with its context, which improves the accuracy of the determination. For example, from "she said with tears at this time" alone it can only be judged that the character is crying, not whether she is crying for joy or from grief; this must be judged in conjunction with the context.
• In some embodiments, the terminal may determine the emotional color of each sentence in the text content as follows: extract the emotion label of each sentence to obtain the emotion label corresponding to each sentence, the extracted emotion label indicating the emotional color of the corresponding sentence. The terminal can then generate the voice for each sentence based on its emotional color as follows: determine the speech parameters matching each emotion label, the speech parameters including at least one of sound quality and rhythm, and generate the speech of each sentence based on those speech parameters.
• The emotion label here includes at least one of the following: basic information, cognitive evaluation, and psychological feeling.
  • FIG. 12 is a schematic diagram of an emotion tag provided by an embodiment of the present application.
  • the emotion tag includes basic information, cognitive evaluation, and psychological feeling, wherein the cognitive evaluation includes discourse tendency and discourse style.
• The discourse tendency may be negative or affirmative, indifferent or enthusiastic. Basic information includes age information (such as children or young people), gender information, and identity information (such as a domineering president); psychological feelings include positive feelings (such as comfort and sympathy) and negative feelings (such as grief and panic).
• A sentence may have one or more acquired emotion labels. When there is one emotion label, the matching speech parameters can be determined directly from the correspondence between emotion labels and speech parameters. When there are multiple emotion labels, emotion prediction can first be performed based on the multiple labels, and the speech parameters then obtained from the correspondence between the predicted emotion and speech parameters. After the speech parameters are acquired, the speech of the corresponding sentence is generated based on them.
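The correspondence between emotion labels and speech parameters can be sketched as a lookup table. The labels and parameter values below are hypothetical, and the multi-label case is simplified to a first-known-label fallback, whereas the embodiment describes predicting a single emotion from all labels first.

```python
# Hypothetical correspondence between emotions and speech parameters
# (rhythm values), in the spirit of Fig. 14.
EMOTION_TO_PARAMETERS = {
    "joy": {"speech_rate": "brisk", "pitch": "high"},
    "anger": {"speech_rate": "fast", "pitch": "high"},
    "grief": {"speech_rate": "slow", "pitch": "low"},
}

DEFAULT_PARAMETERS = {"speech_rate": "normal", "pitch": "normal"}

def speech_parameters_for(emotion_labels):
    """Determine the speech parameters matching the extracted emotion labels.
    Simplification: take the first known label; a fuller system would first
    predict one emotion from the combination of labels."""
    for label in emotion_labels:
        if label in EMOTION_TO_PARAMETERS:
            return EMOTION_TO_PARAMETERS[label]
    return DEFAULT_PARAMETERS
```

The returned parameters would then drive speech synthesis for the sentence.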
  • FIG. 13 is a schematic diagram of speech parameters provided by an embodiment of the present application.
  • the speech parameters include sound quality and rhythm, wherein the sound quality includes brightness, saturation, etc., and the rhythm includes pitch, speech rate, syllable interval, rhythm, intonation, etc.
  • Fig. 14 is a schematic diagram of the correspondence between emotions and speech parameters provided by an embodiment of the present application.
• Different emotions correspond to different speech parameters. For example, when the emotion is joy, the speech rate is brisk, though sometimes slower; when the emotion is anger, the speech rate is somewhat faster.
• In some embodiments, when playing the dialogue content in the text content, the terminal may also present a cartoon character and play an animation of the cartoon character reading the dialogue content aloud with a timbre, where the cartoon character matches the character characteristics of the character corresponding to the dialogue content.
• The terminal can obtain, according to the character characteristics of the character corresponding to the dialogue content, a cartoon character matching those characteristics, and play an animation of the cartoon character reading the dialogue content aloud with a timbre matching the character characteristics. In this way, users can be drawn into the scene described in the article through both hearing and vision, which brings a better sense of immersion.
  • FIG. 15 is a schematic diagram of a content interface provided by an embodiment of the present application.
• Here, the character corresponding to the dialogue content is a child: a cartoon character 1501 in the image of a child is presented on the content interface, and an animation of the cartoon character 1501 reading the dialogue content in a child's voice is played.
• In some embodiments, the dialogue content in the text content can be played, with a timbre matching the character characteristics of the corresponding character, as follows: extract the basic information of the character corresponding to the dialogue content from the content of the article; acquire the timbre that matches the basic information; and play the dialogue content in the text content with the acquired timbre.
  • the basic information includes at least one of the following: age information, gender information, and identity information.
• Here, the basic information of the character corresponding to the dialogue content is extracted from the content of the article, and may be extracted from presented or not-yet-presented text content. It is understandable that all the text describing the character in the article is combined to extract the basic information corresponding to that character.
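A minimal rule-based sketch of extracting basic information (age, gender) by combining all text describing a character; the cue patterns below are hypothetical, and a real implementation would use a trained information extractor rather than regular expressions.

```python
import re

# Hypothetical cue patterns mapping basic-information values to text cues.
AGE_CUES = {"child": r"\b(child|boy|girl|kid)\b"}
GENDER_CUES = {"male": r"\b(he|his|man|boy)\b", "female": r"\b(she|her|woman|girl)\b"}

def extract_basic_information(sentences_about_character):
    """Combine all text describing the character and extract basic information."""
    combined = " ".join(sentences_about_character).lower()
    info = {}
    for age, pattern in AGE_CUES.items():
        if re.search(pattern, combined):
            info["age"] = age
            break
    for gender, pattern in GENDER_CUES.items():
        if re.search(pattern, combined):
            info["gender"] = gender
            break
    return info
```

The extracted dictionary would then be matched against the pre-stored timbre tags to acquire a fitting timbre.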
• In some embodiments, the terminal can also display the currently playing sentence differentially; as the voice playing progresses, the text content of the article is scrolled and presented so that the presented text content matches the progress of the voice playback.
  • FIG. 16 is a schematic diagram of a content interface provided by an embodiment of the present application. Referring to FIG. 16 , a gray background color is used to present the currently playing sentence 1601 to distinguish it from other sentences.
  • the text content of the article can be scrolled and presented, so that the currently playing sentence is always in the middle of the screen.
• In some embodiments, the terminal can also display the currently playing sentence differentially, and as the voice playing progresses, turn pages to present the text content of the article, so that the presented text content matches the progress of the voice playback.
• When the currently presented page has been played, page turning can be performed to present the text content of the next page of the article, and the next page is then played by voice, so that the presented text content matches the progress of the voice playback.
• In some embodiments, the terminal can also obtain the character characteristics of each character from the content of the article and store them in a blockchain network; in this way, when another terminal needs to play the text content of the article by voice, it can obtain the character characteristics of each character in the article directly from the blockchain.
• The embodiments of the present application can also be combined with blockchain technology. After obtaining the character characteristics of each character, the terminal generates a transaction for storing them and submits the generated transaction to a node of the blockchain network, so that the node stores the character characteristics of each character to the blockchain network after reaching consensus on the transaction. Before storing to the blockchain network, the terminal can also hash the character characteristics of each character to obtain the corresponding summary information, and store the obtained summary information of each character's characteristics to the blockchain network. In this manner, the character characteristics of each character are prevented from being tampered with, and their security is improved.
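Hashing the character characteristics into summary information before on-chain storage can be sketched with a standard hash function; the feature fields below are hypothetical, and deterministic JSON serialization (sorted keys) is an assumption made so that the same features always produce the same digest.

```python
import hashlib
import json

def character_feature_digest(character_features):
    """Hash a character's features to obtain the summary information stored
    to the blockchain network, so that later tampering can be detected."""
    serialized = json.dumps(character_features, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()

def verify_character_features(character_features, stored_digest):
    """Re-hash the features and compare with the digest read from the chain."""
    return character_feature_digest(character_features) == stored_digest
```

Any modification of a stored feature value changes the digest, so a reader comparing against the on-chain summary information detects the tampering.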
• FIG. 17 is a schematic diagram of an application architecture of a blockchain network provided by an embodiment of the present application, including a business entity 400, a blockchain network 600 (exemplarily showing consensus nodes 610-1 to 610-3), and a certification center 700, which are described separately below.
  • the type of the blockchain network 600 is flexible and diverse, for example, it can be any one of a public chain, a private chain or a consortium chain.
• Taking a public chain as an example, the electronic device of any business entity, such as a user terminal or server, can access the blockchain network 600 without authorization. Taking a consortium chain as an example, after a business entity obtains authorization, the computer equipment under its jurisdiction (for example, a terminal or server) can access the blockchain network 600, at which point it becomes a client node in the blockchain network 600.
• In some embodiments, the client node may serve only as an observer of the blockchain network 600, that is, provide the function of supporting the business entity in initiating transactions (for example, for storing data on the chain or querying data on the chain), while the functions of a consensus node 610 of the blockchain network 600, such as the ordering function, consensus service, and ledger function, can be implemented by the client node by default or selectively (for example, depending on the specific business needs of the business entity). Therefore, the data and business processing logic of the business entity can be migrated to the blockchain network 600 to the greatest extent, and the trustworthiness and traceability of the data and the business processing process can be realized through the blockchain network 600.
• The consensus node in the blockchain network 600 receives the transaction submitted by the client node of the business entity 400 and executes the transaction to update or query the ledger; various intermediate or final results of executing the transaction can be returned to the client node of the business entity for display.
• The client node 410 can subscribe to events of interest in the blockchain network 600, such as transactions occurring in a specific organization/channel of the blockchain network 600; the consensus node 610 pushes corresponding transaction notifications to the client node 410, thereby triggering the corresponding business logic in the client node 410.
  • the following describes an exemplary application of the blockchain by taking the business entity accessing the blockchain network to realize the voice playback of the article as an example.
• First, the business entity 400 involved in the voice playback of the article registers with the certification center 700 and obtains a digital certificate. The digital certificate includes the business entity's public key, together with a digital signature issued by the certification center 700 over the business entity's public key and identity information. The digital certificate is attached to a transaction along with the business entity's digital signature for the transaction and sent to the blockchain network, so that the blockchain network can extract the digital certificate and signature from the transaction to verify the reliability of the message (that is, whether it has been tampered with) and the identity information of the business entity sending it; the blockchain network then performs verification according to the identity, for example whether the entity has the authority to initiate transactions.
  • Clients running on computer equipment (such as terminals or servers) under the jurisdiction of the business entity can request access to the blockchain network 600 to become client nodes.
• The client node 410 of the business entity 400 is used to play the text content by voice. For example, in the content interface of the article, the text content of the article and a voice playback function item for the article are presented; in response to a voice playback instruction, the text content is played by voice; and in the process of playing the text content by voice, when the text content involves at least one character, the text content corresponding to each character is played with a timbre matching that character's characteristics.
  • the terminal acquires the character characteristics of each character in the article, and sends the character characteristics of each character to the blockchain network 600 .
• For the operation of sending the character characteristics of each character to the blockchain network 600, business logic can be set in the client node 410 in advance: when the character characteristics of each character are obtained, the client node 410 automatically sends them to the blockchain network 600. Alternatively, business personnel of the business entity 400 can log in to the client node 410, manually package the character characteristics of each character, and send them to the blockchain network 600.
• When sending, the client node 410 generates, according to the character characteristics of each character, a transaction corresponding to the storage operation, and specifies in the transaction the smart contract to be called to realize the storage operation and the parameters passed to the smart contract. The transaction also carries the client node's signed digital signature (for example, obtained by encrypting a digest of the transaction with the private key in the digital certificate of the client node 410). The client node then broadcasts the transaction to the consensus nodes in the blockchain network 600 (such as consensus node 610-1, consensus node 610-2, and consensus node 610-3).
• When a consensus node in the blockchain network 600 receives the transaction, it verifies the digital certificate and digital signature carried in the transaction; after this verification succeeds, it confirms, according to the identity of the business entity 400 carried in the transaction, whether the business entity 400 has transaction authority. Failure of either the digital signature or the authority verification causes the transaction to fail. After verification succeeds, the consensus node appends its own digital signature (for example, obtained by encrypting a digest of the transaction with the private key of consensus node 610-1) and continues to broadcast the transaction in the blockchain network 600.
• After receiving the successfully verified transaction, the consensus node in the blockchain network 600 fills the transaction into a new block and broadcasts it. When broadcasting a new block, the consensus node performs a consensus process on it; if consensus succeeds, the new block is appended to the end of the blockchain the node stores, the state database is updated according to the transaction result, and the transactions in the new block are executed: for a transaction that submits or updates the character characteristics of each character, the character characteristics of each character are added to the state database.
  • FIG. 18 is a schematic structural diagram of a blockchain in a blockchain network 600 provided by this embodiment of the application.
• Referring to FIG. 18, the header of each block may include both the hash value of all the transactions in that block and the hash value of all the transactions in the previous block.
• Records of newly generated transactions are filled into a block, and after consensus among the nodes in the blockchain network, the block is appended to the end of the blockchain to form chain-like growth; the hash-based chain structure between blocks ensures that the transactions in the blocks are tamper-proof and forgery-proof.
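The hash-based chain structure can be sketched as follows: each block's hash covers its own transactions and the previous block's hash, so tampering with any stored transaction breaks a link in the chain. The transaction payloads below are hypothetical.

```python
import hashlib
import json

def block_hash(transactions, previous_hash):
    """Hash covering the block's own transactions and the previous block's
    hash, which is what links the blocks into a tamper-evident chain."""
    payload = json.dumps({"tx": transactions, "prev": previous_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def build_chain(blocks_of_transactions):
    """Append each new block to the end of the blockchain."""
    chain = []
    previous = "0" * 64  # placeholder predecessor for the genesis block
    for transactions in blocks_of_transactions:
        digest = block_hash(transactions, previous)
        chain.append({"tx": transactions, "prev": previous, "hash": digest})
        previous = digest
    return chain

def chain_is_valid(chain):
    """Recompute every hash; any tampered transaction breaks the links."""
    previous = "0" * 64
    for block in chain:
        if block["prev"] != previous or block_hash(block["tx"], previous) != block["hash"]:
            return False
        previous = block["hash"]
    return True
```

Altering a transaction in an early block invalidates that block's hash and, transitively, every later block, which is the anti-tampering property described above.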
  • FIG. 19 is a schematic diagram of the functional architecture of the blockchain network 600 provided by the embodiment of the present application.
• The blockchain network includes an application layer 601, a consensus layer 602, a network layer 603, a data layer 604, and a resource layer 605, which are described separately below.
  • the resource layer 605 encapsulates the computing resources, storage resources and communication resources for realizing each consensus node in the blockchain network 600 .
  • the data layer 604 encapsulates various data structures that implement the ledger, including a blockchain implemented as files in a file system, a key-value state database, and proofs of existence (eg, a hash tree of transactions in blocks).
• The network layer 603 encapsulates the functions of the point-to-point (P2P) network protocol, the data dissemination mechanism, the data verification mechanism, the access authentication mechanism, and business entity identity management.
  • the P2P network protocol realizes the communication between consensus nodes in the blockchain network 600
  • the data dissemination mechanism ensures the dissemination of transactions in the blockchain network 600
  • the data verification mechanism relies on cryptographic methods (such as digital certificates, digital signatures, and public/private key pairs) to ensure reliable data transmission between consensus nodes
  • the access authentication mechanism is used to authenticate the identity of a business entity joining the blockchain network 600 according to the actual business scenario; when authentication passes, the business entity is granted permission to access the blockchain network 600
  • the business entity identity management is used to store the identity of the business entity allowed to access the blockchain network 600, as well as the permission (for example, the type of transaction that can be initiated).
  • the consensus layer 602 encapsulates a mechanism (ie, a consensus mechanism) for consensus nodes in the blockchain network 600 to reach consensus on blocks, and functions of transaction management and ledger management.
  • the consensus mechanism includes consensus algorithms such as POS, POW, and DPOS, and supports the pluggability of consensus algorithms.
  • Transaction management is used to verify the digital signature carried in a transaction received by a consensus node, verify the identity information of the business entity, and determine from that identity information (read from business entity identity management) whether the entity has the authority to conduct the transaction. Every authorized business entity accessing the blockchain network 600 holds a digital certificate issued by a certification authority, and uses the private key of its digital certificate to sign the transactions it submits, thereby declaring its legal identity.
  • Ledger management is used to maintain the blockchain and state database.
  • A block that has passed consensus is appended to the end of the blockchain; the transactions in the block are then executed: when a transaction includes an update operation, the corresponding key-value pair in the state database is updated, and when a transaction includes a query operation, the state database is queried and the query result is returned to the client node of the business entity.
  • Query operations on the state database are supported in various dimensions, including: querying a block by block sequence number; querying a block by block hash value; querying a block by transaction sequence number; querying a transaction by transaction sequence number (such as a transaction hash value); querying the account data of a business entity by the entity's account number (sequence number); and querying the blockchain in a channel by channel name.
  • the application layer 601 encapsulates various services that the blockchain network can implement, including transaction traceability, certificate storage, and verification.
  • the terminal presents the text content of the article, and the user browses the presented text content.
  • the listening function can be enabled; for example, after the user clicks the play function item, the text content of the article is played by voice. During playback, when dialogue content is recognized in the article, the timbre matching the character features of the character corresponding to the dialogue content is obtained, and the voice of the dialogue content is generated using that timbre, with emotional color added to the voice according to the emotional color corresponding to the dialogue content. When environment description information is recognized in the article, for the text content containing the environment description information, ambient music matching the environment description information is added as background music to the speech of the corresponding text content.
  • the text content 401 of the article and the play function item 402 of the corresponding article are presented. When the user clicks the play function item 402, the terminal starts to play the text content of the article by voice; the prompt box 501 is presented in a floating form, and the text prompt "You are listening to the intelligent recognition audiobook" is presented in the prompt box 501. When the presentation duration of the text prompt in FIG. 5 reaches the duration threshold, the text prompt is switched to the play icon 61 in FIG. 6, and the prompt box is shrunk so that its size matches the size of the content in it.
  • the user can independently select the timbre for a character in the article according to their own preferences. For example, first, the user selects the character whose timbre needs to be set, based on the presented text content; here, the character is selected by selecting text content, that is, the character corresponding to the selected target content is taken as the selected character. Then, after the target content is determined, at least two timbre options corresponding to the target content are presented, and the user selects a timbre based on the presented options.
  • the user selects the target content based on the presented text content.
  • the target content can be selected by clicking on the text: when the user's click operation is received, the sentence presented at the click position is taken as the target content and a floating layer is shown, in which at least two timbre options 701 are presented. The timbre options are presented as a combination of graphics and text, that is, an image of a cartoon character matching the timbre is shown together with a textual description of the timbre (for example, a naive-and-sweet type), and the user can make a timbre selection based on the presented options.
  • the user can audition each candidate timbre: when the user triggers an audition operation for a timbre, the terminal determines the timbre the user wants to audition and plays the selected target content using that timbre, thereby realizing the audition.
  • the user can select the timbre according to the auditioned voice, which is more in line with the real scene and improves the user experience.
  • a floating layer may pop up, displaying the recommended timbre information and the content that the recommended timbre matches.
  • the timbre corresponding to the currently playing dialogue content can be switched to the timbre indicated by the recommended timbre information.
  • the recommended timbre information 1101 is presented, such as “Lin xx’s voice matches the voice of the fifth junior sister very well”, and the timbre is presented at the same time Switch button 1102, when the user clicks the tone switch button 1102, the terminal responds to the click operation and switches the currently used tone to Lin xx's tone, that is, Lin xx's voice is used to play the dialogue content of the fifth sister after the switch.
  • FIG. 20 is a schematic flowchart of the technical side implementation provided by the embodiment of the present application.
  • the voice playback method of the article provided by the embodiment of the present application includes:
  • Step 2001 The terminal collects audio data.
  • the terminal first starts recording and collects the required audio data to build an emotional corpus.
  • emotional corpus is an important basis for the research on emotional speech synthesis.
  • In the process of collecting audio data, the terminal needs to screen the collected audio data. For example, after starting recording, the terminal performs decibel detection on the collected audio data; if the background sound in the collected audio data is noisy, the audio data is filtered out and re-recorded, until the screening yields audio data that meets the requirements (that is, has no audio quality problems).
  • the recording can be done segment by segment: after the audio data for each segment is collected, it can be uploaded to the server for detection, and when an audio quality problem is detected in the audio data, the segment is re-recorded.
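The decibel-based screening above might look like the following sketch; the frame size and noise threshold are illustrative assumptions, not values from the application. It estimates the background level of a recording (in dB relative to full scale) from its quietest frame and rejects recordings whose background is still too loud.

```python
import math

def frame_dbfs(frame):
    """Root-mean-square level of one frame, in dBFS (samples in [-1, 1])."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20 * math.log10(max(rms, 1e-10))  # floor avoids log10(0)

def background_noise_db(samples, frame_size=160):
    """Approximate the background level as the level of the quietest frame."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size + 1, frame_size)]
    return min(frame_dbfs(f) for f in frames)

def passes_screening(samples, noise_floor_db=-40.0):
    """Accept a recording only if its quietest frame is below the noise threshold."""
    return background_noise_db(samples) < noise_floor_db

clean = [0.5] * 160 + [0.0005] * 160   # speech frame followed by near-silence
noisy = [0.5] * 160 + [0.05] * 160     # constant audible background hum
```

A recording with near-silent pauses passes; one with a constant hum between utterances is filtered out for re-recording.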
  • the recorded audio data needs to be annotated with the Praat tool, for example with the fundamental frequency, syllable boundaries, and paralinguistic information of the audio data; this information is used later, when training the model, to add annotation for emotional state labels and emotional keyword attributes.
  • FIG. 21A is a schematic diagram of a fundamental frequency point provided by an embodiment of the present application.
  • the figure shows the fundamental frequency point curves of "mā" and "má": the tone of "mā" is the first (level) tone, and its curve is close to level, while the tone of "má" is the second (rising) tone, and its curve rises from bottom to top;
  • Figure 21B is a five-degree tone-value diagram provided by the embodiment of the application; referring to Figure 21B, its curves correspond to those in the fundamental frequency point diagram. It is understandable that, even without hearing the speech, it is possible to know from the fundamental frequency points and the five-degree tone-value diagram when "mā" is pronounced and when "má" is pronounced.
  • Step 2002 Train the acoustic model.
  • After the terminal obtains the audio data, it preprocesses the audio data.
  • the preprocessing here includes pre-emphasis, framing, and similar processing. The purpose of these operations is to eliminate distortion introduced by the human vocal organs themselves and by the device that collects the voice signal, so that the signal obtained by subsequent speech processing is more uniform and smooth, providing high-quality parameters for signal parameter extraction and improving the quality of speech processing.
  • the preprocessed audio data is stored in the database, and the acoustic model is trained based on the stored audio data; for example, the acoustic model can learn how each sound is pronounced and its timbre characteristics, so as to obtain the required acoustic model.
  • an acoustic model can be trained.
  • the audio data is first subjected to acoustic analysis.
  • the prosodic features of syllables play a very important role in the prosody analysis of toned syllables, and the speech parameters can be divided into voice quality and prosody.
  • the voice quality may include brightness and saturation;
  • the prosody may include pitch, speech rate, syllable interval, and the like.
  • when a person expresses excitement, he speaks at a fast rate, with a high pitch, and may have a certain breathy quality. In this way, information such as fundamental frequency parameters and spectral parameters under each basic emotional color can be obtained.
  • FIG. 22 is a schematic diagram of the training process of the acoustic model provided by the application embodiment.
  • fundamental frequency parameters are extracted from the speech signals in the speech corpus;
  • spectral parameters are likewise extracted from the speech signals in the speech corpus;
  • a hidden Markov model is trained on the fundamental frequency parameters and the spectral parameters.
  • the speech corpus here is constructed from the above-mentioned audio data stored in the database.
  • the function of the spectral parameters and fundamental frequency parameters here is to make the synthesized sentences more smooth and natural.
  • the spectral parameters are represented by Mel Frequency Cepstrum Coefficients (MFCC) and their first-order and second-order delta coefficients.
  • the fundamental frequency parameters are represented by the fundamental frequency F0 and its first-order and second-order delta coefficients.
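The first- and second-order delta coefficients mentioned above are conventionally computed with a regression formula over a window of neighboring frames; the sketch below uses the standard formula with an assumed window half-width of n = 2 (the application itself does not specify these details).

```python
def delta(features, n=2):
    """First-order delta (regression) coefficients over a sequence of frame values.

    d_t = sum_{k=1..n} k * (c_{t+k} - c_{t-k}) / (2 * sum_{k=1..n} k^2),
    with the sequence padded at the edges by repeating the end values.
    """
    denom = 2 * sum(k * k for k in range(1, n + 1))
    padded = [features[0]] * n + list(features) + [features[-1]] * n
    out = []
    for t in range(n, n + len(features)):
        out.append(sum(k * (padded[t + k] - padded[t - k])
                       for k in range(1, n + 1)) / denom)
    return out

f0 = [100.0, 110.0, 120.0, 130.0, 140.0]   # a steadily rising F0 track
d1 = delta(f0)                              # first-order deltas
d2 = delta(d1)                              # second-order (delta-delta) coefficients
```

For this linear ramp the interior delta equals the true slope (10 Hz per frame), while the padded edges taper, which is the usual behavior of the regression formula.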
  • the Mel cepstral coefficient is a classic speech feature, which is a feature parameter extracted based on the characteristics of the human auditory domain, and is an engineering simulation of the human auditory characteristics.
  • human auditory perception also includes the perception of loudness.
  • the human ear's perception of loudness is related to the sound frequency band. Transforming the spectrum of the speech signal into the perceptual frequency domain can better simulate the human hearing process.
  • the meaning of the Mel frequency is that 1 Mel is 1/1000 of the perceived pitch of a 1000 Hz tone.
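A common analytic form of this perceptual scale is the widely used O'Shaughnessy formula (an assumption for illustration; the application does not name a specific formula), which is calibrated so that 1000 Hz maps to approximately 1000 Mel:

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to the Mel perceptual pitch scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, from Mel back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

This warping is what lets the Mel filter bank space its filters densely at low frequencies and sparsely at high frequencies, mimicking human pitch perception.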
  • the fundamental frequency F0 is the lowest frequency of the filter application range.
  • Step 2003 Synthesize audio.
  • First, the text of the article is input and preprocessed: the text is segmented, converting it into sentences composed of words, and the sentences are then annotated at the phoneme level, syllable level, and word level to provide information helpful for synthesis.
  • the text needs to be analyzed level by level, for example at the word, sentence, chapter, and book level.
  • here, a word is one that has been filtered through a specific threshold;
  • keyword extraction is then performed: keywords related to emotional tags, such as character, mood, scene, and gender, are filtered out of the dictionary.
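The word-level filtering and keyword extraction just described can be sketched as follows; the dictionary contents and the frequency threshold are illustrative assumptions, not data from the application.

```python
from collections import Counter

# Hypothetical keyword dictionary mapping emotion-related words to tag categories.
KEYWORD_DICT = {
    "tears": "mood", "laughed": "mood", "angry": "mood",
    "forest": "scene", "rain": "scene",
    "sister": "character", "master": "character",
}

def extract_keywords(tokens, min_count=1):
    """Keep tokens that pass a frequency threshold and appear in the dictionary."""
    counts = Counter(tokens)
    return {w: KEYWORD_DICT[w] for w, c in counts.items()
            if c >= min_count and w in KEYWORD_DICT}

tokens = ["she", "said", "in", "tears", "as", "rain", "fell", "on", "the", "forest"]
tags = extract_keywords(tokens)
```

The resulting word-to-category map (mood, scene, character, and so on) is what downstream emotion prediction would consume.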
  • FIG. 23 is a schematic diagram of the construction process of the keyword dictionary provided by the embodiment of the present application.
  • a large-scale text corpus is first constructed to train a word vector model; since the novel tags and general databases have already been screened, a seed dictionary is constructed based on the novel tags and general databases. Then, model training is performed based on the word vector model and the seed dictionary, new words are predicted with the trained model, and the obtained new words are added to the keyword dictionary, thereby constructing the keyword dictionary.
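The seed-dictionary expansion in FIG. 23 can be sketched with toy word vectors; the vectors, vocabulary, and similarity threshold below are fabricated for illustration, whereas in practice the vectors would come from the Word2Vec model trained on the large text corpus.

```python
import math

# Toy 2-d word vectors standing in for a trained Word2Vec model.
VECTORS = {
    "tears": [0.9, 0.1], "sobbing": [0.88, 0.15],
    "weeping": [0.85, 0.2], "table": [0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def expand_dictionary(seed, vocabulary, threshold=0.99):
    """Add any vocabulary word whose vector is close enough to a seed word."""
    new_words = set()
    for word in vocabulary:
        if word in seed:
            continue
        if any(cosine(VECTORS[word], VECTORS[s]) >= threshold for s in seed):
            new_words.add(word)
    return seed | new_words

seed = {"tears"}
expanded = expand_dictionary(seed, VECTORS.keys())
```

Words whose embeddings lie close to a seed word ("sobbing", "weeping") are pulled into the keyword dictionary, while unrelated words ("table") are not.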
  • FIG. 24 is a schematic diagram of the character-based emotion classification model provided by the embodiment of the present application, and the emotion label related to the character of the character in the article can be extracted in the following manner:
  • the word vector representations of the words in the text are obtained by Word2Vec (a tool for training word vector models), yielding the word vector matrix of a paragraph or chapter; the word vector matrix is input into the character-based text analyzer 2401 to obtain different types of text groups; the different types of text groups are input into the corresponding types of classifiers 2402; finally, the output results of the classifiers are fused to obtain the final classification result.
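The final fusion step in FIG. 24 might be implemented as simple probability averaging over the per-type classifiers; the classifier outputs below are stand-in numbers for illustration, not the application's trained models.

```python
def fuse(classifier_outputs):
    """Average per-class probabilities from several classifiers and pick the argmax."""
    labels = classifier_outputs[0].keys()
    fused = {lab: sum(out[lab] for out in classifier_outputs) / len(classifier_outputs)
             for lab in labels}
    return max(fused, key=fused.get), fused

# Hypothetical outputs of three per-type classifiers over personality labels.
outputs = [
    {"HA": 0.6, "HE": 0.3, "LC": 0.1},
    {"HA": 0.5, "HE": 0.4, "LC": 0.1},
    {"HA": 0.7, "HE": 0.2, "LC": 0.1},
]
label, scores = fuse(outputs)
```

Averaging is only one possible fusion rule; weighted voting or a meta-classifier trained on the individual outputs would fit the same architecture.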
  • C, A, and E refer to the three personality dimensions of conscientiousness (responsibility), agreeableness (pleasantness), and extraversion, respectively, and H and L refer to whether the personality value in a dimension is high or low.
  • HA means high agreeableness,
  • HE means high extraversion,
  • LC means low conscientiousness, and so on.
  • emotional tags required for speech synthesis can be obtained, such as novel tags, basic information (character, identity, age, gender), and cognitive evaluation (environment, emotion). Then, based on these emotional labels, emotion prediction is performed to predict the emotional color attached to the person when they say the corresponding sentence.
  • emotional color is not only determined by the text information, but is also affected by the environment and status of the characters in the article. Based on this, the present application infers the emotional color of the character from the context of the text, so that the correct speech can be synthesized smoothly. For example, for "she said in tears at this time", it must be predicted whether her emotional color is crying with joy or crying with sadness.
  • FIG. 25 is a schematic flowchart of a synthesized audio provided by an embodiment of the present application. Referring to FIG. 25 , the process of synthesizing audio includes:
  • Step 2501 Parse the text.
  • parsing text includes syntactic parsing and semantic parsing, wherein syntactic parsing includes part-of-speech tagging, word parsing, and pronunciation parsing.
  • Step 2502 Emotion tag extraction.
  • the extracted emotional tags include novel tags, basic information (character, identity, age, gender), and cognitive evaluation (environment, emotion).
  • Step 2503 Label the speech.
  • the speech is annotated by the extracted emotional labels.
  • the labeling logic is the same as when training the acoustic model, that is, adjusting the fundamental frequency parameters and other information.
  • the fundamental frequency parameters output by the HMM model are obtained, and the fundamental frequency parameters output by the HMM model are adjusted based on the emotion labels to obtain the final fundamental frequency parameters.
  • Step 2504 Synthesize audio.
  • audio is synthesized based on fundamental frequency parameters and spectral parameters output by the HMM model.
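Steps 2503-2504, in which the fundamental frequency track output by the HMM is adjusted according to an emotion label before synthesis, can be sketched as a simple scaling rule; the per-emotion scale factors below are illustrative assumptions, not values from the application.

```python
# Hypothetical per-emotion adjustments to the F0 contour output by the acoustic model.
EMOTION_F0_SCALE = {
    "excited": 1.25,   # faster speech tends to come with higher pitch
    "sad": 0.85,       # lower, flatter pitch
    "neutral": 1.0,
}

def adjust_f0(f0_track, emotion):
    """Scale every voiced F0 value by the emotion's factor; 0.0 marks unvoiced frames."""
    scale = EMOTION_F0_SCALE.get(emotion, 1.0)
    return [f * scale if f > 0 else 0.0 for f in f0_track]

model_f0 = [120.0, 124.0, 0.0, 118.0]   # raw HMM output in Hz, 0.0 = unvoiced frame
final_f0 = adjust_f0(model_f0, "excited")
```

The adjusted F0 track, together with the spectral parameters, would then be passed to the vocoder stage to synthesize the final audio.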
  • applying the above embodiment enables the user to be immersed in the scene while listening to the book and to enter the scene of the novel more immersively, thereby improving the user experience and usage time.
  • Software modules can include:
  • the presentation module 5551 is configured to present the text content of the article and the voice playback function item corresponding to the article in the content interface of the article;
  • a receiving module 5552 configured to receive a voice play instruction for the article triggered based on the voice play function item
  • the first playing module 5553 is configured to play the text content by voice in response to the voice play instruction
  • the second playing module 5554 is configured to, during the process of playing the text content by voice, when the text content includes at least one character, play the text content corresponding to the character using a timbre that matches the character features of the character.
  • the presentation module is further configured to present a prompt box in a floating form during the process of playing the text content by voice
  • the text prompt information is used to prompt that the text content is being played by voice.
  • the presentation module is further configured to shrink the prompt box when the presentation duration of the text prompt information reaches a duration threshold
  • the text prompt information in the prompt box is switched to a play icon; wherein the play icon is used to indicate that the text content is being played by voice.
  • the second playing module is further configured to, in response to a selected operation on target content in the text content, present at least two timbre options corresponding to the target content; wherein each of the The timbre option corresponds to a timbre;
  • the selected target timbre is used as the timbre of the character corresponding to the target content, so that the text content corresponding to that character is played using the target timbre.
  • the first playing module is further configured to present the audition function items of the at least two timbres
  • the target content is played by using the target tone color corresponding to the audition function item.
  • the first playback module is also configured to present a timbre selection function item in the content interface of the article;
  • the selected target timbre is used as the timbre of the target character, so that the text content corresponding to the target character is played using the target timbre.
  • the first playing module is further configured to present a tone switching button for the text content during the process of playing the text content by voice;
  • the timbre corresponding to the text content is switched from the first timbre to the second timbre.
  • the first playing module is further configured to, during the process of playing the text content by voice, when playing the dialogue content in the text content, present the target text in the text content Recommended tone information for the content;
  • the recommended timbre information is used to instruct to switch the timbre of the character corresponding to the target text content based on the recommended timbre information.
  • the first playing module is further configured to, when the text content contains text corresponding to environment description information, use ambient music matching the environment description information as background music, and play the background music while that text content is played by voice.
  • the first playing module is further configured to determine the emotional color corresponding to each sentence in the text content
  • the voice corresponding to each of the sentences is respectively generated, so that the voice carries the corresponding emotional color
  • the first playback module is further configured to perform emotional tag extraction on each sentence in the text content to obtain emotional tags corresponding to each of the sentences, where the emotional tags include at least one of the following: basic information, cognitive evaluation, psychological feelings;
  • the speech of each of the sentences is generated.
  • the first playing module is further configured to present a cartoon character when playing the dialogue content in the text content, and play an animation in which the cartoon character uses the timbre to read the dialogue content aloud ;
  • the cartoon characters match the character characteristics of the characters in the dialogue content.
  • the first playback module is further configured to extract, from the content of the article, portrait information of the character corresponding to the content of the dialogue;
  • the dialogue content in the text content is played by using the acquired timbre adapted to the portrait information.
  • the first playing module is further configured to differentiate and display the currently playing sentences during the process of playing the text content by voice;
  • the text content of the article is scrolled and presented so that the presented text content matches the progress of the voice playback.
  • the first playback module is also configured to display the currently played sentences differently in the process of playing the text content by voice;
  • the text content of the article is presented by page turning, so that the presented text content matches the progress of the voice playback.
  • Embodiments of the present application provide a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the voice playback method of the above-mentioned article in the embodiment of the present application.
  • the embodiments of the present application provide a computer-readable storage medium storing executable instructions; when the executable instructions are executed by a processor, the processor is caused to execute the method provided by the embodiments of the present application, for example, the method shown in FIG. 3.
  • the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disk, or a CD-ROM; it may also be any of various devices including one of, or any combination of, the foregoing memories.
  • executable instructions may take the form of programs, software, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • executable instructions may, but need not, correspond to files in a file system; they may be stored as part of a file that holds other programs or data, for example in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple cooperating files (for example, files that store one or more modules, subroutines, or code sections).
  • executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.


Abstract

Provided in the present application are a speech playing method and apparatus for an article, and a device and a computer-readable storage medium. The method comprises: presenting, in a content interface of an article, text content of the article and a speech playing function item corresponding to the article; receiving a speech playing instruction which is triggered for the article on the basis of the speech playing function item; in response to the speech playing instruction, playing the text content by means of speech; and during the process of playing the text content by means of speech, when the text content involves at least one character, playing the text content, which corresponds to the character, by using a timbre that matches the character features of the character.

Description

文章的语音播放方法、装置、设备、存储介质及程序产品The voice playback method, device, equipment, storage medium and program product of the article
相关申请的交叉引用CROSS-REFERENCE TO RELATED APPLICATIONS
本申请实施例基于申请号为202110241752.7、申请日为2021年03月04日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本申请实施例作为参考。The embodiments of the present application are based on the Chinese patent application with the application number of 202110241752.7 and the filing date of March 4, 2021, and claim the priority of the Chinese patent application. The entire contents of the Chinese patent application are incorporated into the embodiments of the present application as refer to.
技术领域technical field
本申请涉及计算机技术领域,尤其涉及一种文章的语音播放方法、装置、设备、计算机可读存储介质及计算机程序产品。The present application relates to the field of computer technology, and in particular, to a voice playback method, apparatus, device, computer-readable storage medium, and computer program product of an article.
背景技术Background technique
随着互联网技术的发展,基于智能终端的多媒体信息传播也越来越普遍,如,在手机终端呈现文章,供用户阅读。With the development of Internet technology, multimedia information dissemination based on intelligent terminals is becoming more and more common, for example, articles are presented on mobile terminals for users to read.
相关技术中,在用户阅读文章的过程中,为用户提供语音播放功能,也即通过语音播放文章的文本内容,但相关技术中对于文章的所有内容都采用一个声音去朗读,导致用户无法沉浸于文章的内容中。In the related art, when the user reads the article, the user is provided with a voice playback function, that is, the text content of the article is played through the voice, but in the related art, all the content of the article is read aloud by one voice, so that the user cannot be immersed in the content of the article. in the content of the article.
发明内容SUMMARY OF THE INVENTION
本申请实施例提供一种文章的语音播放方法、装置、设备、计算机可读存储介质及计算机程序产品,能够在通过语音播放文本内容时,让用户感觉身临其境,提升语音播放所带来的沉浸感。Embodiments of the present application provide a voice playback method, device, device, computer-readable storage medium, and computer program product of an article, which can make users feel immersed in the situation when playing text content through voice, and improve the effect of voice playback. of immersion.
本申请实施例的技术方案是这样实现的:The technical solutions of the embodiments of the present application are implemented as follows:
本申请实施例提供一种文章的语音播放方法,包括:The embodiment of the present application provides a voice playback method of an article, including:
在文章的内容界面中,呈现文章的文本内容以及对应所述文章的语音播放功能项;In the content interface of the article, the text content of the article and the voice playback function item corresponding to the article are presented;
接收到基于所述语音播放功能项触发的针对所述文章的语音播放指令;Receive a voice play instruction for the article triggered based on the voice play function item;
响应于所述语音播放指令,通过语音播放所述文本内容;In response to the voice play instruction, play the text content by voice;
在通过语音播放所述文本内容的过程中,当所述文本内容包括至少一个角色时,对于与所述角色对应的文本内容,采用与所述角色的角色特征相匹配的音色进行播放。During the process of playing the text content by voice, when the text content includes at least one character, for the text content corresponding to the character, the timbre matching the character characteristics of the character is used for playing.
本申请实施例提供一种文章的语音播放装置,包括:The embodiment of the present application provides a voice playback device of an article, including:
呈现模块,配置为在文章的内容界面中,呈现文章的文本内容以及对应所述文章的语音播放功能项;A presentation module, configured to present the text content of the article and the voice playback function item corresponding to the article in the content interface of the article;
接收模块,配置为接收到基于所述语音播放功能项触发的针对所述文章的语音播放指令;a receiving module, configured to receive a voice play instruction for the article triggered based on the voice play function item;
第一播放模块,配置为响应于所述语音播放指令,通过语音播放所述文本内容;a first playing module, configured to play the text content by voice in response to the voice play instruction;
第二播放模块,配置为在通过语音播放所述文本内容的过程中,当所述文本内容包括至少一个角色时,对于与所述角色对应的文本内容,采用与所述角色的角色特征相匹配的音色进行播放。The second playing module is configured to, during the process of playing the text content by voice, when the text content includes at least one character, for the text content corresponding to the character, use a character feature that matches the character of the character. sound to play.
本申请实施例提供一种计算机设备,包括:Embodiments of the present application provide a computer device, including:
存储器,用于存储可执行指令;memory for storing executable instructions;
处理器,用于执行所述存储器中存储的可执行指令时,实现本申请实施例提供的文章的语音播放方法。The processor is configured to implement the voice playing method of the article provided by the embodiment of the present application when executing the executable instructions stored in the memory.
本申请实施例提供一种计算机可读存储介质,存储有可执行指令,用于引起处理器执行时,实现本申请实施例提供的文章的语音播放方法。Embodiments of the present application provide a computer-readable storage medium storing executable instructions for causing a processor to execute the voice playback method of the article provided by the embodiments of the present application.
本申请实施例提供一种计算机程序产品,包括计算机程序或指令,所述计算机程序或指令被处理器执行时,实现本申请实施例提供的文章的语音播放方法。The embodiments of the present application provide a computer program product, including computer programs or instructions, which, when executed by a processor, implement the voice playback method of the articles provided by the embodiments of the present application.
本申请实施例具有以下有益效果:The embodiment of the present application has the following beneficial effects:
应用本申请实施例，通过在文章的内容界面中，呈现文章的文本内容以及对应文章的语音播放功能项；接收到基于语音播放功能项触发的针对文章的语音播放指令；响应于语音播放指令，通过语音播放文本内容；在通过语音播放文本内容的过程中，当文本内容包括至少一个角色时，对于与角色对应的文本内容，采用与角色的角色特征相匹配的音色进行播放；如此，由于对文本内容进行播放时，所采用的音色是与该文本内容所对应的角色特征相匹配的，使得用户在听到播放的文本内容时能够声临其境，更能够沉浸到文章的内容中，提高了语音播放所带来的沉浸感。By applying the embodiments of the present application, the text content of an article and a voice playback function item corresponding to the article are presented in the content interface of the article; a voice playback instruction for the article triggered based on the voice playback function item is received; in response to the voice playback instruction, the text content is played by voice; and during the process of playing the text content by voice, when the text content includes at least one character, the text content corresponding to the character is played using a timbre that matches the character characteristics of the character. In this way, since the timbre used when playing the text content matches the character characteristics corresponding to that text content, the user can feel present in the scene when listening to the played text content and become more immersed in the content of the article, which improves the sense of immersion brought by voice playback.
附图说明Description of drawings
图1是本申请实施例提供的文章的语音播放系统100的架构示意图;FIG. 1 is a schematic structural diagram of a voice playback system 100 of an article provided by an embodiment of the present application;
图2是本申请实施例提供的计算机设备500的结构示意图;FIG. 2 is a schematic structural diagram of a computer device 500 provided by an embodiment of the present application;
图3是本申请实施例提供的文章的语音播放方法的流程示意图；FIG. 3 is a schematic flowchart of a voice playback method for articles provided by an embodiment of the present application;
图4是本申请实施例提供的内容界面的示意图;4 is a schematic diagram of a content interface provided by an embodiment of the present application;
图5是本申请实施例提供的提示框的呈现示意图;FIG. 5 is a schematic diagram of the presentation of a prompt box provided by an embodiment of the present application;
图6是本申请实施例提供的提示框的呈现示意图;FIG. 6 is a schematic diagram of the presentation of a prompt box provided by an embodiment of the present application;
图7是本申请实施例提供的内容界面的示意图;7 is a schematic diagram of a content interface provided by an embodiment of the present application;
图8是本申请实施例提供的内容界面的示意图;8 is a schematic diagram of a content interface provided by an embodiment of the present application;
图9是本申请实施例提供的内容界面的示意图;9 is a schematic diagram of a content interface provided by an embodiment of the present application;
图10是本申请实施例提供的内容界面的示意图;10 is a schematic diagram of a content interface provided by an embodiment of the present application;
图11是本申请实施例提供的内容界面的示意图;11 is a schematic diagram of a content interface provided by an embodiment of the present application;
图12是本申请实施例提供的情感标签的示意图;12 is a schematic diagram of an emotion tag provided by an embodiment of the present application;
图13是本申请实施例提供的语音参数的示意图；FIG. 13 is a schematic diagram of speech parameters provided by an embodiment of the present application;
图14是本申请实施例提供的情绪与语音参数对应关系的示意图;14 is a schematic diagram of the correspondence between emotions and speech parameters provided by an embodiment of the present application;
图15是本申请实施例提供的内容界面的示意图;15 is a schematic diagram of a content interface provided by an embodiment of the present application;
图16是本申请实施例提供的内容界面的示意图;16 is a schematic diagram of a content interface provided by an embodiment of the present application;
图17是本申请实施例提供的区块链网络的应用架构示意图;17 is a schematic diagram of an application architecture of a blockchain network provided by an embodiment of the present application;
图18为本申请实施例提供的区块链网络600中区块链的结构示意图;18 is a schematic structural diagram of a blockchain in a blockchain network 600 provided by an embodiment of the present application;
图19为本申请实施例提供的区块链网络600的功能架构示意图;FIG. 19 is a schematic diagram of a functional architecture of a blockchain network 600 provided by an embodiment of the present application;
图20是本申请实施例提供的技术侧实现的流程示意图;FIG. 20 is a schematic flowchart of technical side implementation provided by an embodiment of the present application;
图21A是本申请实施例提供的基频点示意图;21A is a schematic diagram of a fundamental frequency point provided by an embodiment of the present application;
图21B是本申请实施例提供的声调五度值图;FIG. 21B is a diagram of tonal fifths provided by an embodiment of the present application;
图22是本申请实施例提供的声学模型训练流程示意图；FIG. 22 is a schematic diagram of an acoustic model training process provided by an embodiment of the present application;
图23是本申请实施例提供的关键字词典的构建过程示意图;23 is a schematic diagram of a construction process of a keyword dictionary provided by an embodiment of the present application;
图24是本申请实施例提供的基于性格的情感分类模型的示意图;24 is a schematic diagram of a personality-based emotion classification model provided by an embodiment of the present application;
图25是本申请实施例提供的合成音频的流程示意图。FIG. 25 is a schematic flowchart of synthesizing audio provided by an embodiment of the present application.
具体实施方式Detailed Description of Embodiments
为了使本申请的目的、技术方案和优点更加清楚，下面将结合附图对本申请作进一步地详细描述，所描述的实施例不应视为对本申请的限制，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例，都属于本申请保护的范围。In order to make the purpose, technical solutions and advantages of the present application clearer, the present application will be described in further detail below with reference to the accompanying drawings. The described embodiments should not be regarded as limiting the present application, and all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
在以下的描述中，涉及到“一些实施例”，其描述了所有可能实施例的子集，但是可以理解，“一些实施例”可以是所有可能实施例的相同子集或不同子集，并且可以在不冲突的情况下相互结合。In the following description, reference is made to "some embodiments", which describe a subset of all possible embodiments; it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
在以下的描述中，所涉及的术语“第一\第二\第三”仅仅是区别类似的对象，不代表针对对象的特定排序，可以理解地，“第一\第二\第三”在允许的情况下可以互换特定的顺序或先后次序，以使这里描述的本申请实施例能够以除了在这里图示或描述的以外的顺序实施。In the following description, the term "first\second\third" is used only to distinguish similar objects and does not represent a specific ordering of the objects. It is understood that, where permitted, the specific order or sequence of "first\second\third" may be interchanged, so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described herein.
除非另有定义,本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同。本文中所使用的术语只是为了描述本申请实施例的目的,不是旨在限制本申请。Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the technical field to which this application belongs. The terms used herein are only for the purpose of describing the embodiments of the present application, and are not intended to limit the present application.
对本申请实施例进行进一步详细说明之前,对本申请实施例中涉及的名词和术语进行说明,本申请实施例中涉及的名词和术语适用于如下的解释。Before further describing the embodiments of the present application in detail, the terms and terms involved in the embodiments of the present application are described, and the terms and terms involved in the embodiments of the present application are suitable for the following explanations.
1）角色特征，用于表征角色所对应的人物特点的特征，也可以理解为角色的人物画像特征，根据角色的性别，年龄、身份等角色基础信息抽象出的标签化的人物的信息全貌；如，角色特征可以包括：年龄特征、身份特征、性别特征、性格特征、健康状况特征等。1) Character characteristics: features used to characterize the persona corresponding to a character, which can also be understood as the character-portrait features of the character, namely a labeled overall profile of the character abstracted from basic character information such as gender, age, and identity; for example, the character characteristics may include age characteristics, identity characteristics, gender characteristics, personality characteristics, health-status characteristics, and the like.
2）交易（Transaction），等同于计算机术语“事务”，交易包括了需要提交到区块链网络执行的操作，并非单指商业语境中的交易，鉴于在区块链技术中约定俗成地使用了“交易”这一术语，本申请实施例遵循了这一习惯。2) Transaction: equivalent to the computer term "transaction". A transaction includes operations that need to be submitted to the blockchain network for execution, and does not refer solely to a transaction in the business context; given that the term "transaction" is conventionally used in blockchain technology, the embodiments of the present application follow this convention.
3)区块链(Blockchain),是由区块(Block)形成的加密的、链式的交易的存储结构。3) Blockchain is a storage structure of encrypted and chained transactions formed by blocks.
4)区块链网络(Blockchain Network),通过共识的方式将新区块纳入区块链的一系列的节点的集合。4) Blockchain Network, a set of nodes that incorporate new blocks into the blockchain through consensus.
5)账本(Ledger),是区块链(也称为账本数据)和与区块链同步的状态数据库的统称。5) Ledger is a general term for blockchain (also known as ledger data) and a state database synchronized with the blockchain.
6)智能合约(Smart Contracts),也称为链码(Chaincode)或应用代码,部署在区块链网络的节点中的程序,节点执行接收的交易中所调用的智能合约,来对状态数据库的键值对数据进行更新或查询的操作。6) Smart Contracts, also known as Chaincode or application code, are programs deployed in the nodes of the blockchain network, and the nodes execute the smart contracts called in the received transactions to update the state database. Key-value operations to update or query data.
7）共识（Consensus），是区块链网络中的一个过程，用于在涉及的多个节点之间对区块中的交易达成一致，达成一致的区块将被追加到区块链的尾部，实现共识的机制包括工作量证明（PoW，Proof of Work）、权益证明（PoS，Proof of Stake）、股份授权证明（DPoS，Delegated Proof-of-Stake）、消逝时间量证明（PoET，Proof of Elapsed Time）等。7) Consensus: a process in the blockchain network used to reach agreement on the transactions in a block among the multiple nodes involved; an agreed block is appended to the tail of the blockchain. Mechanisms for achieving consensus include Proof of Work (PoW), Proof of Stake (PoS), Delegated Proof of Stake (DPoS), Proof of Elapsed Time (PoET), and the like.
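The "character characteristics" defined in item 1) above amount to a set of labels abstracted from several kinds of basic character information. A minimal sketch of such a labeled profile is given below; the class name, field names, and example tag values are illustrative assumptions, not definitions from the present application:

```python
from dataclasses import dataclass

# Hypothetical sketch of a character-feature ("character portrait") record;
# field names and tag values are assumptions for illustration only.
@dataclass
class CharacterFeatures:
    age: str          # e.g. "youth", "elderly"
    gender: str       # e.g. "male", "female"
    identity: str     # e.g. "CEO", "student"
    personality: str  # e.g. "gentle", "domineering"
    health: str       # e.g. "healthy", "frail"

    def as_tags(self) -> set:
        # Flatten the abstracted features into the label set used downstream,
        # e.g. when matching a character against candidate timbres.
        return {self.age, self.gender, self.identity, self.personality, self.health}
```

As the glossary notes, a single character thus carries at least two labels, which is what later makes label-based timbre matching possible.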
参见图1，图1是本申请实施例提供的文章的语音播放系统100的架构示意图，为实现支撑一个示例性应用，终端（示例性示出了终端400-1和终端400-2）通过网络300连接服务器200，网络300可以是广域网或者局域网，又或者是二者的组合。Referring to FIG. 1, FIG. 1 is a schematic architecture diagram of a voice playback system 100 for articles provided by an embodiment of the present application. To support an exemplary application, terminals (terminal 400-1 and terminal 400-2 are shown as examples) are connected to the server 200 via the network 300, and the network 300 may be a wide area network, a local area network, or a combination of the two.
终端，用于在文章的内容界面中，呈现文章的文本内容以及对应文章的语音播放功能项；接收到基于语音播放功能项触发的针对文章的语音播放指令；发送文本内容的语音获取请求至服务器；The terminal is used to present the text content of the article and the voice playback function item corresponding to the article in the content interface of the article, receive a voice playback instruction for the article triggered based on the voice playback function item, and send a voice acquisition request for the text content to the server;
服务器200,用于响应于语音获取请求,生成文本内容的语音,并将生成的文本内容的语音发送至终端;The server 200 is configured to generate the voice of the text content in response to the voice acquisition request, and send the generated voice of the text content to the terminal;
终端,用于根据接收到的语音,通过语音播放文本内容,并在通过语音播放文本内容的过程中,当文本内容包括至少一个角色时,对于与角色对应的文本内容,采用与角色的角色特征相匹配的音色进行播放。The terminal is used to play the text content through the voice according to the received voice, and in the process of playing the text content through the voice, when the text content includes at least one character, for the text content corresponding to the character, use the character feature of the character. match the tone to play.
在一些实施例中，服务器200可以是独立的物理服务器，也可以是多个物理服务器构成的服务器集群或者分布式系统，还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、内容分发网络（CDN，Content Delivery Network）、以及大数据和人工智能平台等基础云计算服务的云服务器。终端以及服务器可以通过有线或无线通信方式进行直接或间接地连接，本申请实施例中不做限制。In some embodiments, the server 200 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiments of the present application.
参见图2，图2是本申请实施例提供的计算机设备500的结构示意图，在实际应用中，计算机设备500可以为图1中的终端或服务器200，以计算机设备为图1所示的终端为例，对实施本申请实施例的文章的语音播放方法的计算机设备进行说明。图2所示的计算机设备500包括：至少一个处理器510、存储器550、至少一个网络接口520和用户接口530。计算机设备500中的各个组件通过总线系统540耦合在一起。可理解，总线系统540配置为实现这些组件之间的连接通信。总线系统540除包括数据总线之外，还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见，在图2中将各种总线都标为总线系统540。Referring to FIG. 2, FIG. 2 is a schematic structural diagram of a computer device 500 provided by an embodiment of the present application. In practical applications, the computer device 500 may be the terminal or the server 200 in FIG. 1. Taking the computer device as the terminal shown in FIG. 1 as an example, the computer device implementing the voice playback method for articles of the embodiments of the present application will be described. The computer device 500 shown in FIG. 2 includes: at least one processor 510, a memory 550, at least one network interface 520, and a user interface 530. The various components in the computer device 500 are coupled together by a bus system 540. It can be understood that the bus system 540 is configured to implement connection and communication between these components. In addition to the data bus, the bus system 540 also includes a power bus, a control bus, and a status signal bus. However, for clarity of illustration, the various buses are all labeled as the bus system 540 in FIG. 2.
处理器510可以是一种集成电路芯片,具有信号的处理能力,例如通用处理器、数字信号处理器(DSP,Digital Signal Processor),或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,其中,通用处理器可以是微处理器或者任何常规的处理器等。The processor 510 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., where a general-purpose processor may be a microprocessor or any conventional processor or the like.
用户接口530包括使得能够呈现媒体内容的一个或多个输出装置531,包括一个或多个扬声器和/或一个或多个视觉显示屏。用户接口530还包括一个或多个输入装置532,包括有助于用户输入的用户接口部件,比如键盘、鼠标、麦克风、触屏显示屏、摄像头、其他输入按钮和控件。User interface 530 includes one or more output devices 531 that enable presentation of media content, including one or more speakers and/or one or more visual display screens. User interface 530 also includes one or more input devices 532, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, and other input buttons and controls.
存储器550可以是可移除的,不可移除的或其组合。示例性的硬件设备包括固态存储器,硬盘驱动器,光盘驱动器等。存储器550可选地包括在物理位置上远离处理器510的一个或多个存储设备。Memory 550 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 550 optionally includes one or more storage devices that are physically remote from processor 510 .
存储器550包括易失性存储器或非易失性存储器,也可包括易失性和非易失性存储器两者。非易失性存储器可以是只读存储器(ROM,Read Only Memory),易失性存储器可以是随机存取存储器(RAM,Random Access Memory)。本申请实施例描述的存储器550旨在包括任意适合类型的存储器。Memory 550 includes volatile memory or non-volatile memory, and may also include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM, Read Only Memory), and the volatile memory may be a random access memory (RAM, Random Access Memory). The memory 550 described in the embodiments of the present application is intended to include any suitable type of memory.
在一些实施例中,存储器550能够存储数据以支持各种操作,这些数据的示例包括程序、模块和数据结构或者其子集或超集,下面示例性说明。In some embodiments, memory 550 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
操作系统551,包括用于处理各种基本系统服务和执行硬件相关任务的系统程序,例如框架层、核心库层、驱动层等,用于实现各种基础业务以及处理基于硬件的任务;The operating system 551 includes system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
网络通信模块552，用于经由一个或多个（有线或无线）网络接口520到达其他计算设备，示例性的网络接口520包括：蓝牙、无线相容性认证（WiFi）、和通用串行总线（USB，Universal Serial Bus）等；A network communication module 552 for reaching other computing devices via one or more (wired or wireless) network interfaces 520; exemplary network interfaces 520 include Bluetooth, Wireless Compatibility Certification (WiFi), Universal Serial Bus (USB), and the like;
呈现模块553，用于经由一个或多个与用户接口530相关联的输出装置531（例如，显示屏、扬声器等）使得能够呈现信息（例如，用于操作外围设备和显示内容和信息的用户接口）；A presentation module 553 for enabling presentation of information (for example, a user interface for operating peripherals and displaying content and information) via one or more output devices 531 (for example, a display screen, speakers, etc.) associated with the user interface 530;
输入处理模块554,用于对一个或多个来自一个或多个输入装置532之一的一个或多个用户输入或互动进行检测以及翻译所检测的输入或互动。An input processing module 554 for detecting one or more user inputs or interactions from one of the one or more input devices 532 and translating the detected inputs or interactions.
在一些实施例中,本申请实施例提供的文章的语音播放装置可以采用软件方式实现,图2示出了存储在存储器550中的文章的语音播放装置555,其可以是程序和插件等形式的软件,包括以下软件模块:呈现模块5551、接收模块5552、第一播放模块5553和第二播放模块5554,这些模块是逻辑上的,因此根据所实现的功能可以进行任意的组合或进一步拆分。将在下文中说明各个模块的功能。In some embodiments, the voice playback device for articles provided by the embodiments of the present application may be implemented in software. FIG. 2 shows the voice playback device 555 for articles stored in the memory 550, which may be in the form of programs and plug-ins. The software includes the following software modules: presentation module 5551, receiving module 5552, first playing module 5553 and second playing module 5554. These modules are logical, so any combination or further splitting can be performed according to the realized functions. The function of each module will be explained below.
在另一些实施例中,本申请实施例提供的文章的语音播放装置可以采用硬件方式实现,作为示例,本申请实施例提供的文章的语音播放装置可以是采用硬件译码处理器形式的处理器,其被编程以执行本申请实施例提供的文章的语音播放方法,例如,硬件译码处理器形式的处理器可以采用一个或多个应用专用集成电路(ASIC,Application Specific Integrated Circuit)、DSP、可编程逻辑器件(PLD,Programmable Logic Device)、复杂可编程逻辑器件(CPLD,Complex Programmable Logic Device)、现场可编程门阵列(FPGA,Field-Programmable Gate Array)或其他电子元件。In other embodiments, the voice playback device of the article provided by the embodiment of the present application may be implemented in hardware. As an example, the voice playback device of the article provided by the embodiment of the present application may be a processor in the form of a hardware decoding processor , which is programmed to execute the voice playback method of the article provided in the embodiment of the present application, for example, the processor in the form of a hardware decoding processor may adopt one or more application specific integrated circuits (ASIC, Application Specific Integrated Circuit), DSP, Programmable Logic Device (PLD, Programmable Logic Device), Complex Programmable Logic Device (CPLD, Complex Programmable Logic Device), Field Programmable Gate Array (FPGA, Field-Programmable Gate Array) or other electronic components.
接下来对本申请实施例的提供的文章的语音播放方法进行说明,在实际实施时,本申请实施例提供的文章的语音播放方法可由终端单独实施,还可由服务器及终端协同实施。Next, the voice playing method of the article provided by the embodiment of the present application will be described. In actual implementation, the voice playing method of the article provided by the embodiment of the present application may be implemented by the terminal alone, or by the server and the terminal collaboratively.
参见图3,图3是本申请实施例提供文章的语音播放方法的流程示意图,将结合图3示出的步骤进行说明。Referring to FIG. 3 , FIG. 3 is a schematic flowchart of a voice playback method of an article provided by an embodiment of the present application, which will be described with reference to the steps shown in FIG. 3 .
步骤301:终端在文章的内容界面中,呈现文章的文本内容以及对应文章的语音播放功能项。Step 301: The terminal presents the text content of the article and the voice playback function item corresponding to the article in the content interface of the article.
在实际实施时，终端上设置有客户端，如阅读客户端、即时通讯客户端等，终端可以通过客户端呈现文章的文本内容。这里，文章可以是小说、散文、科普类文章等，文本内容是指书面语言的表现形式，是指具有特定含义的一个或多个字符，例如，文本内容可以是具有特定含义的字、词、短语、句子、段落或篇章。In actual implementation, the terminal is provided with a client, such as a reading client or an instant messaging client, and the terminal can present the text content of the article through the client. Here, the article may be a novel, an essay, a popular science article, or the like; the text content refers to the expression form of written language, that is, one or more characters with specific meanings. For example, the text content may be a character, word, phrase, sentence, paragraph, or chapter with a specific meaning.
这里,终端在呈现文章的文本内容的同时,还可以呈现对应文章的语音播放功能项,该语音播放功能项,用于在接收到触发操作时,通过语音播放文本内容。Here, while presenting the text content of the article, the terminal may also present a voice play function item corresponding to the article, and the voice play function item is used to play the text content by voice when a trigger operation is received.
作为示例,图4是本申请实施例提供的内容界面的示意图,参见图4,在文章的内容界面中,呈现文章的文本内容401及对应文章的播放功能项402。As an example, FIG. 4 is a schematic diagram of a content interface provided by an embodiment of the present application. Referring to FIG. 4 , in the content interface of the article, the text content 401 of the article and the playback function item 402 of the corresponding article are presented.
步骤302:接收到基于语音播放功能项触发的针对文章的语音播放指令。Step 302: Receive a voice play instruction for the article triggered based on the voice play function item.
在实际实施时，用户在阅读呈现的文章的文本内容时，可以基于语音播放功能项触发针对文章的语音播放指令，这里可以基于针对语音播放功能项的触发操作来触发针对文章的语音播放指令，其中，触发操作包括但不限于：点击操作、双击操作、滑动操作等，本申请实施例并不对触发操作进行限定。例如，当用户点击图4中的语音播放功能项402时，即可触发针对文章的语音播放指令。In actual implementation, when reading the presented text content of the article, the user may trigger a voice playback instruction for the article based on the voice playback function item; here, the voice playback instruction for the article may be triggered based on a trigger operation on the voice playback function item, where the trigger operation includes, but is not limited to, a click operation, a double-click operation, a slide operation, and the like, and the embodiments of the present application do not limit the trigger operation. For example, when the user clicks the voice playback function item 402 in FIG. 4, the voice playback instruction for the article is triggered.
步骤303:响应于语音播放指令,通过语音播放文本内容。Step 303: In response to the voice play instruction, play the text content by voice.
在实际实施时,终端在接收到语音播放指令时,获取对应文本内容的语音数据,对语音数据进行播放,以实现通过语音播放文本内容。In actual implementation, when the terminal receives the voice play instruction, it acquires voice data corresponding to the text content, and plays the voice data, so as to play the text content through voice.
这里，语音数据是基于文本内容生成的，其中，生成语音数据的过程可以是在终端执行的，也可以是在服务器执行的，如可以是终端响应于语音播放指令，生成并发送针对文章的语音播放请求至服务器，其中，语音播放请求携带文章的标识，服务器基于语音播放请求携带的文章的标识，获取相应的文章的文本内容，并基于文本内容生成语音数据，并将生成的语音数据返回给终端，由终端播放该语音数据。需要说明的是，本申请播放的语音数据是智能生成的，而不是预先通过语音录制文章生成的。Here, the voice data is generated based on the text content, and the process of generating the voice data may be performed on the terminal or on the server. For example, in response to the voice playback instruction, the terminal may generate and send a voice playback request for the article to the server, where the voice playback request carries the identifier of the article; based on the identifier of the article carried in the voice playback request, the server obtains the text content of the corresponding article, generates voice data based on the text content, and returns the generated voice data to the terminal, which plays the voice data. It should be noted that the voice data played in the present application is generated intelligently, rather than generated in advance by recording the article as voice.
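The terminal/server interaction described above can be sketched as follows. The function names, the in-memory article store, and the stubbed speech-synthesis step are assumptions for illustration only; a real system would perform the request over the network and call an actual TTS engine:

```python
# Hypothetical article table keyed by the article identifier carried
# in the voice playback request.
ARTICLE_STORE = {"article-1": "Once upon a time..."}

def synthesize_speech(text: str) -> bytes:
    # Placeholder for the intelligent speech-generation step; a real
    # implementation would invoke a TTS engine rather than encode the text.
    return text.encode("utf-8")

def handle_voice_play_request(article_id: str) -> bytes:
    """Server side: obtain the text content of the article identified in
    the voice playback request, and generate voice data from it."""
    text = ARTICLE_STORE[article_id]
    return synthesize_speech(text)

def on_voice_play_instruction(article_id: str) -> bytes:
    """Terminal side: triggered by the voice play function item; sends the
    voice playback request (simulated here as a direct call) and returns
    the voice data to be played back."""
    return handle_voice_play_request(article_id)
```

The same flow also covers the terminal-only variant: `handle_voice_play_request` would simply run locally instead of behind a network boundary.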
在一些实施例中,当终端接收到语音播放指令时,开始通过语音播放文本内容,在通过语音播放文本内容的过程中,可以呈现提示信息,以提示用户正在通过语音播放文本内容。In some embodiments, when the terminal receives a voice play instruction, it starts to play the text content by voice, and during the process of playing the text content by voice, prompt information may be presented to prompt the user that the text content is being played by voice.
这里,提示信息的形式可以有多种,如提示信息可以是文本形式的、可以是图像形式的等。并且,提示信息的呈现方式也可以有多种,例如,可以悬浮形式呈现提示信息,也可以是在内容界面中的某一呈现区域呈现提示信息,如在内容界面的顶部呈现提示信息,本申请实施例并不对提示信息的呈现形式进行限定。Here, the prompt information can be in a variety of forms, for example, the prompt information can be in the form of text, or in the form of images. In addition, there may be various ways of presenting the prompt information. For example, the prompt information can be presented in a floating form, or the prompt information can be presented in a certain presentation area in the content interface, such as the prompt information is presented at the top of the content interface. The embodiment does not limit the presentation form of the prompt information.
在一些实施例中,当提示信息为文本形式时,终端在通过语音播放文本内容的过程中,以悬浮形式呈现提示框,并在提示框中呈现文本提示信息;其中,文本提示信息,用于提示正在通过语音播放文本内容。In some embodiments, when the prompt information is in the form of text, during the process of playing the text content through voice, the terminal presents a prompt box in a floating form, and presents the text prompt information in the prompt box; wherein the text prompt information is used for Indicates that text content is being played by voice.
在实际实施时,提示框的呈现形式为悬浮形式,也即提示框是独立于内容界面的,且悬浮于内容界面之上。作为示例,图5是本申请实施例提供的提示框的呈现示意图,参见图5,以悬浮形式呈现提示框501,并在提示框501中呈现文本提示信息“您收听的是智能识别听书”。In actual implementation, the presentation form of the prompt box is a floating form, that is, the prompt box is independent of the content interface and is suspended above the content interface. As an example, FIG. 5 is a schematic diagram of the presentation of a prompt box provided by an embodiment of the present application. Referring to FIG. 5 , a prompt box 501 is presented in a floating form, and a text prompt message “You are listening to an intelligent recognition audiobook” is presented in the prompt box 501 .
这里，由于提示框是以悬浮形式呈现的，提示框是可移动的，也即用户可以触发针对悬浮框的移动操作，当接收到用户触发的针对提示框的移动操作后，控制提示框移动，相应的，提示信息随着提示框的移动而移动；如此，当提示框遮挡住用户想要浏览的内容时，可以移动该提示框，以避免提示框遮挡用户想要浏览的内容，提高了用户的阅读体验。Here, since the prompt box is presented in a floating form, the prompt box is movable; that is, the user can trigger a move operation on the prompt box, and after the move operation triggered by the user is received, the prompt box is controlled to move. Correspondingly, the prompt information moves with the movement of the prompt box. In this way, when the prompt box blocks the content that the user wants to browse, the prompt box can be moved to prevent it from blocking that content, thereby improving the user's reading experience.
在实际应用中，提示框的呈现时间可以与通过语音播放文本内容的开始时间相同，也即在通过语音播放文本内容的同时呈现提示框。其中，提示框的呈现时长可以是预先设置的，也即，在提示框的呈现时长达到预设时长时，取消显示该提示框；提示框的呈现时长也可以是与通过语音播放文本内容的时长相一致的，也即在通过语音播放文本内容的过程中始终呈现该提示框，当停止通过语音播放文本内容时，取消呈现该提示框；提示框的呈现时长还可以是由用户控制的，也即，用户在触发针对提示框的关闭操作时，取消呈现该提示框。In practical applications, the presentation time of the prompt box may be the same as the start time of playing the text content by voice, that is, the prompt box is presented while the text content starts being played by voice. The presentation duration of the prompt box may be preset, that is, when the presentation duration of the prompt box reaches a preset duration, the prompt box is no longer displayed; the presentation duration of the prompt box may also be consistent with the duration of playing the text content by voice, that is, the prompt box is always presented during the process of playing the text content by voice and is no longer presented when the playing stops; the presentation duration of the prompt box may also be controlled by the user, that is, the prompt box is no longer presented when the user triggers a close operation on the prompt box.
在一些实施例中，在呈现提示框的过程中，可以对提示框的呈现样式和/或提示框中的呈现内容进行调整，其中，提示框的呈现样式包括提示框的形状、尺寸、呈现位置等。In some embodiments, in the process of presenting the prompt box, the presentation style of the prompt box and/or the content presented in the prompt box may be adjusted, where the presentation style of the prompt box includes the shape, size, presentation position, and the like of the prompt box.
在一些实施例中,当文本提示信息的呈现时长达到时长阈值时,终端收缩提示框,并将提示框中的文本提示信息切换为播放图标,其中,播放图标用于指示正在通过语音播放文本内容。In some embodiments, when the presentation duration of the text prompt information reaches the duration threshold, the terminal shrinks the prompt box, and switches the text prompt information in the prompt box to a play icon, wherein the play icon is used to indicate that the text content is being played by voice .
在实际实施时，时长阈值可以是预先设置的，如系统设置的、用户设置的等，当文本提示信息呈现后开始计时，以确定文本提示信息的呈现时长，在呈现时长达到时长阈值时，调整提示框的呈现样式和呈现内容，也即收缩提示框，以缩小提示框的尺寸，并将呈现的文本提示信息切换为播放图标。这里，收缩后的提示框的尺寸与提示框中的呈现内容相适配。In actual implementation, the duration threshold may be preset, for example, set by the system or by the user. Timing starts after the text prompt information is presented, so as to determine the presentation duration of the text prompt information; when the presentation duration reaches the duration threshold, the presentation style and presentation content of the prompt box are adjusted, that is, the prompt box is shrunk to reduce its size, and the presented text prompt information is switched to the play icon. Here, the size of the shrunk prompt box is adapted to the content presented in the prompt box.
作为示例，图6是本申请实施例提供的提示框的呈现示意图，参见图6，假设时长阈值为10秒，当图5中的文本提示信息的呈现时长达到10秒时，将图5中的“您收听的是智能识别听书”这一文本提示信息切换为图6中的播放图标61，同时收缩提示框，以使提示框的尺寸与提示框中的内容尺寸相适配。As an example, FIG. 6 is a schematic diagram of the presentation of a prompt box provided by an embodiment of the present application. Referring to FIG. 6, assuming that the duration threshold is 10 seconds, when the presentation duration of the text prompt information in FIG. 5 reaches 10 seconds, the text prompt message "You are listening to an intelligently recognized audiobook" in FIG. 5 is switched to the play icon 61 in FIG. 6, and the prompt box is shrunk at the same time so that the size of the prompt box matches the size of the content in the prompt box.
本申请实施例通过当文本提示信息的呈现时长达到时长阈值时，收缩提示框，并将提示框中的文本提示信息切换为指示正在通过语音播放文本内容的播放图标，避免了由于文本提示信息内容过多，提示框长时间遮盖过多的文本内容，进而影响针对文本内容的阅读体验的情况发生。In the embodiments of the present application, when the presentation duration of the text prompt information reaches the duration threshold, the prompt box is shrunk and the text prompt information in the prompt box is switched to a play icon indicating that the text content is being played by voice. This avoids the situation where, because the text prompt information is too long, the prompt box covers too much text content for a long time and thus degrades the reading experience of the text content.
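The prompt-box transition described above (a full-size text prompt that collapses into a play icon once the duration threshold is reached) can be sketched as a small state function. The threshold value, state names, and dictionary representation are illustrative assumptions:

```python
# Assumed threshold, matching the 10-second example above; in practice it
# could be set by the system or by the user.
DURATION_THRESHOLD_S = 10

def prompt_box_state(elapsed_s: float) -> dict:
    """Return the presentation style and content of the prompt box for a
    given elapsed presentation time of the text prompt information."""
    if elapsed_s < DURATION_THRESHOLD_S:
        # Before the threshold: full-size box showing the text prompt.
        return {"content": "text_prompt", "size": "full"}
    # At or after the threshold: shrink the box and switch to the play icon.
    return {"content": "play_icon", "size": "collapsed"}
```

The "collapsed" size would then be fitted to the play icon alone, consistent with the requirement that the shrunk box adapts to its presented content.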
步骤304:在通过语音播放文本内容的过程中,当文本内容包括至少一个角色时,对于与角色对应的文本内容,采用与角色的角色特征相匹配的音色进行播放。Step 304: During the process of playing the text content by voice, when the text content includes at least one character, for the text content corresponding to the character, use the timbre matching the character characteristics of the character to play.
这里，与角色对应的文本内容指的是与角色相关联的文本内容，如该角色的对话内容、内心独白、描述内容等；角色特征可以是通过角色的至少两种基础信息抽象得到的标签，与角色的基础信息画像对应，例如，角色特征可以包括对角色的年龄信息、身份信息（如霸道总裁）、性别信息、性格信息、健康状况信息，抽象得到的年龄特征、身份特征、性别特征、性格特征、健康状况特征。Here, the text content corresponding to a character refers to the text content associated with the character, such as the character's dialogue content, inner monologue, and description content; the character characteristics may be labels abstracted from at least two kinds of basic information of the character, corresponding to a basic-information portrait of the character. For example, the character characteristics may include age characteristics, identity characteristics, gender characteristics, personality characteristics, and health-status characteristics abstracted from the character's age information, identity information (such as a domineering CEO), gender information, personality information, and health-status information.
In actual implementation, the text content may include one or more characters, where "multiple" means two or more. When the text content includes multiple characters, the characters and the timbres are in one-to-one correspondence.
In practical applications, the text content of each character is played using the timbre that matches that character's characteristics. That is, the character characteristics of the multiple characters are obtained, each character's characteristics are matched against the timbres to determine the timbre matching each character, and the text content of the corresponding character is played using the obtained timbre.
Here, when matching a character's characteristics with a timbre, the character's characteristics are matched against the character characteristics associated with the timbre. In some embodiments, character characteristics may be identified by corresponding tags (i.e., character tags); for example, an age characteristic is identified by an age tag, and an identity characteristic by an identity tag. Accordingly, in the present application the character characteristics of a specific character include at least two kinds, that is, a specific character may have at least two kinds of tags. In actual implementation, multiple (i.e., at least two) timbres may be pre-stored, each corresponding to at least two tags; during character characteristic matching, the at least two tags of the character are matched against the tags of each timbre to determine the timbre matching the character's characteristics.
In practical applications, when at least two timbres match the character characteristics of a character, one of the matched timbres may be randomly selected as the target timbre, and the text content of that character is played using it. Alternatively, the matching degree between each timbre and the character characteristics may be obtained, and the timbre with the highest matching degree is selected as the target timbre and used to play the text content of that character. It is also possible to present options corresponding to the matched timbres for the user to select, take the user-selected timbre as the target timbre, and play the text content of that character using it.
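The tag-based matching and highest-matching-degree selection described above can be sketched as follows. This is a minimal illustration under assumptions: the tag vocabulary, the pre-stored timbre list, and the overlap-count matching score are all hypothetical, not taken from the application.

```python
# Pre-stored timbres, each associated with at least two tags (illustrative).
TIMBRES = {
    "timbre_girl": {"female", "young", "sweet"},
    "timbre_ceo": {"male", "adult", "domineering"},
    "timbre_elder": {"male", "elderly", "hoarse"},
}

def match_timbres(character_tags):
    """Return {timbre: matching_degree} for timbres whose tags overlap the character's tags."""
    scores = {}
    for timbre, tags in TIMBRES.items():
        degree = len(tags & character_tags)  # matching degree = number of shared tags
        if degree > 0:
            scores[timbre] = degree
    return scores

def pick_target_timbre(character_tags):
    """Select the timbre with the highest matching degree as the target timbre."""
    scores = match_timbres(character_tags)
    return max(scores, key=scores.get) if scores else None

# A "domineering CEO" character abstracted into at least two tags.
print(pick_target_timbre({"male", "adult", "domineering"}))  # timbre_ceo
```

A user-facing variant would instead present the keys of `match_timbres(...)` as options, ordered by matching degree, and take the user's choice as the target timbre.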
In some embodiments, to play the text content of the corresponding character with the target timbre, the pronunciation of each word in the dialogue content may first be determined, and then the timbre features of the target timbre are added, so that the voice of the text content is generated based on the timbre features of the target timbre; the generated voice is then played.
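The two-stage idea above — first decide how each word is pronounced, then apply the target timbre's features during synthesis — can be sketched with stand-in functions. Both stages here are assumptions for illustration only, not a real text-to-speech engine.

```python
def to_phonemes(text):
    """Stand-in grapheme-to-phoneme step: one pseudo-phoneme per word."""
    return [w.upper() for w in text.split()]

def apply_timbre(phonemes, timbre_features):
    """Stand-in synthesis step: attach the target timbre's features to each phoneme."""
    return [(p, timbre_features) for p in phonemes]

phonemes = to_phonemes("hello there")
audio = apply_timbre(phonemes, {"pitch": 1.1, "brightness": "high"})
print(audio[0])  # ('HELLO', {'pitch': 1.1, 'brightness': 'high'})
```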
In some embodiments, in response to a selection operation on target content in the text content, the terminal may present at least two timbre options corresponding to the target content, each timbre option corresponding to one timbre. In response to a timbre selection operation triggered on the at least two timbre options, the selected target timbre is taken as the timbre of the character corresponding to the target content, so that in the process of playing the text content by voice, the text content of the character corresponding to the target content is played with the target timbre.
In actual implementation, the user can select the timbre of a character by himself, so that when the terminal plays the text content corresponding to that character, the user-selected timbre is used. First, the user selects, based on the presented text content, the character for which a timbre is to be chosen; here the character is selected by selecting text content, that is, the character corresponding to the selected target content is taken as the selected character. Then, after the target content is determined, at least two timbre options corresponding to the target content are presented; the timbre options may be ordered by the matching degree between each timbre and the character characteristics of the character corresponding to the target content, for example, the higher the matching degree, the closer to the front the corresponding timbre option is presented. Next, the user selects the desired timbre from the presented options. The selection operation here may be a click operation on the timbre option corresponding to the target timbre, or a press operation on it; the trigger form of the selection operation is not limited here.
In practical applications, the at least two timbre options corresponding to the target content may be presented in the form of a drop-down list, icons, or images; the presentation form of the at least two timbre options is not limited here. The at least two timbre options may be presented directly in the content interface, or a floating layer independent of the content interface may be presented, with the at least two timbre options shown in the floating layer.
It should be noted that the above selection operation on the target content and the timbre selection operation may be performed before the text content is played by voice, or during the process of playing the text content by voice.
As an example, FIG. 7 is a schematic diagram of a content interface provided by an embodiment of the present application. Referring to FIG. 7, the user selects the target content based on the presented text content; here the target content can be selected by clicking on the text, that is, when a click operation is received, the sentence at the clicked position is taken as the target content, and a floating layer is presented in which at least two timbre options 701 are shown. The timbre options are presented in a combined image-and-text form, that is, an image of a cartoon character matching the timbre is shown along with a text description of the timbre, such as "naive and sweet".
In some embodiments, before selecting a timbre, the user can audition each timbre. That is, after presenting the at least two timbre options corresponding to the target content, the terminal may also present audition function items for the at least two timbres; in response to a trigger operation on the audition function item corresponding to a target timbre, the target content is played using that target timbre.
In actual implementation, each timbre option may correspond to one audition function item. When the user triggers an audition function item, the target timbre corresponding to that audition function item is determined, and the target content is then played based on that target timbre.
As an example, FIG. 8 is a schematic diagram of a content interface provided by an embodiment of the present application. Referring to FIG. 8, the user selects the target content based on the presented text content; here the target content can be selected by clicking on the text, that is, when a click operation is received, the sentence at the clicked position is taken as the target content, and a floating layer is presented showing at least two timbre options 801. The timbre options are presented in a combined image-and-text form, that is, an image of a cartoon character matching the timbre is shown along with a text description of the timbre, such as "naive and sweet". An audition function item 802 is presented below each timbre option, with audition function items corresponding one-to-one to the timbre options. For example, when the user clicks the audition function item below the naive-and-sweet timbre option, the target content, i.e., the selected sentence, is played using the naive-and-sweet timbre.
In some embodiments, in response to a selection operation on target content in the presented dialogue content, the terminal may present at least two timbre options corresponding to the target content and a confirmation function item, each timbre option corresponding to one timbre. In response to a timbre selection operation triggered on the at least two timbre options, the target content is played using the selected target timbre; in response to a trigger operation on the confirmation function item, the target timbre is taken as the timbre of the character corresponding to the target content, so that in the process of playing the text content by voice, the dialogue content of that character is played with the target timbre.
In practical applications, the user can switch the selected timbre before triggering the confirmation function item, and each time a timbre is selected, the target content is played using that timbre. In this way, the user can judge from the played sound whether to choose that timbre, avoiding the need to re-select after a wrong selection and improving the efficiency of human-computer interaction.
In some embodiments, a timbre selection function item is presented in the content interface of the article; in response to a trigger operation on the timbre selection function item, at least two characters in the article are presented; in response to a selection operation on a target character among the at least two characters, at least two timbres corresponding to the target character are presented; and in response to a timbre selection operation triggered on the at least two timbres, the selected target timbre is taken as the timbre of the target character, so that in the process of playing the text content by voice, the dialogue content of the target character is played using the selected target timbre.
In practical applications, after receiving a trigger operation on the timbre selection function item, the terminal may present at least two characters in the article. Here, all characters in the article may be presented, or only some of them, for example only the characters appearing in the chapter to which the currently presented text content belongs. After the at least two characters in the article are presented, the user can select one of them as the target character in order to choose the target timbre for that character. After a target timbre is selected for one character, other characters may be selected from the at least two characters, and timbres chosen for them as well.
In this way, the user can select not only the timbre of the characters corresponding to the dialogue content in the current content interface, but also the timbre of characters whose dialogue content has not yet been presented. Thus, by triggering the timbre selection function item once, the timbres of multiple characters can be selected, improving the efficiency of human-computer interaction.
As an example, FIG. 9 is a schematic diagram of a content interface provided by an embodiment of the present application. Referring to FIG. 9, a timbre selection function item 901 is presented in the content interface. When the user clicks the timbre selection function item 901, a timbre selection interface is presented showing all characters 902 in the article, such as character A, character B, and character C. When the user clicks a character, for example "character A", multiple timbres 903 matching the character characteristics of "character A" are presented, and the user can select one of the presented timbres as the target timbre.
In some embodiments, the terminal may also present a timbre switching button for the text content in the process of playing the text content by voice; when a trigger operation on the timbre switching button is received, the timbre corresponding to the currently playing content is switched from a first timbre to a second timbre.
In actual implementation, this embodiment of the present application provides a button for quickly switching the timbre, i.e., the timbre switching button. During voice playback, the timbre switching button is used to switch the timbre corresponding to the sentence currently being played, where the first timbre is the timbre currently being played and the second timbre is a recommended timbre to switch to; the first timbre is different from the second timbre.
In practical applications, the second timbre corresponds to the currently playing sentence, and the second timbres corresponding to different sentences may be the same or different. Here, the first timbre and the second timbre may both be timbres matching the character characteristics of the character corresponding to the currently playing content. For example, when a certain piece of dialogue content is played, multiple timbres matching the character characteristics of the character corresponding to that dialogue content are obtained; one of them is selected as the first timbre and another as the second timbre. The dialogue content is first played using the first timbre, and after a trigger operation on the switching button is received, the first timbre is switched to the second timbre, that is, the dialogue content is played with the second timbre after switching.
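The switching behavior described in this and the following paragraphs (first timbre plays, a button press switches to the recommended second timbre, a later press switches to a third timbre) can be sketched as a small state holder. The class and names here are illustrative assumptions, not part of the application.

```python
class TimbreSwitcher:
    """Sketch of the timbre switching button: `current` is the timbre
    being played, `recommended` is the one the button would switch to."""

    def __init__(self, first_timbre, second_timbre):
        self.current = first_timbre        # timbre used for the current sentence
        self.recommended = second_timbre   # recommended alternative

    def on_switch_pressed(self, next_recommended):
        """Switch playback to the recommended timbre; a subsequent press
        switches again to a third timbre, which may or may not equal the first."""
        self.current, self.recommended = self.recommended, next_recommended
        return self.current

switcher = TimbreSwitcher("timbre_A", "timbre_B")
print(switcher.on_switch_pressed("timbre_C"))  # timbre_B
```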
Here, after the timbre corresponding to the currently playing content is switched from the first timbre to the second timbre, all content belonging to the same character as the currently playing content is played using the second timbre.
In some embodiments, after the timbre corresponding to the currently playing content is switched from the first timbre to the second timbre, the timbre switching button may be triggered again; when a trigger operation on the timbre switching button is received, the second timbre is switched to a third timbre, where the first timbre may be the same as or different from the third timbre.
In some embodiments, in the process of playing the text content by voice, the terminal presents recommended timbre information for target text content in the text content, where the recommended timbre information is used to indicate that the timbre of the character corresponding to the target text content can be switched based on the recommended timbre information.
In actual implementation, a timbre may be recommended to the user. The target text content here may be the currently playing text content, or any text content whose corresponding character's characteristics match the recommended timbre information. For example, based on the currently playing dialogue content, a timbre matching the character characteristics of the character in the current dialogue is obtained, and the recommended timbre information is generated based on the matched timbre, for example based on the timbre with the highest matching degree. Alternatively, when a certain timbre is to be recommended, it is determined whether there is a character in the article matching that timbre, and if so, the corresponding recommended timbre information is presented.
As an example, FIG. 10 is a schematic diagram of a content interface provided by an embodiment of the present application. Referring to FIG. 10, when it is recognized that the character characteristics of a certain character match the timbre of a certain star, recommended timbre information 1001 is presented, such as "Lin xx's voice matches Fifth Junior Sister's voice very well", to prompt the user to switch Fifth Junior Sister's timbre to Lin xx's timbre.
In some embodiments, while the recommended timbre information is presented, a timbre switching button matching the recommended timbre information is presented; after a trigger operation by the user on the timbre switching button is received, the timbre corresponding to the relevant dialogue content is switched to the timbre indicated by the recommended timbre information.
As an example, FIG. 11 is a schematic diagram of a content interface provided by an embodiment of the present application. Referring to FIG. 11, when the character characteristics of a certain character in the article match the timbre of a certain star, recommended timbre information 1101 is presented, such as "Lin xx's voice matches Fifth Junior Sister's voice very well", together with a timbre switching button 1102. When the user clicks the timbre switching button 1102, the text content corresponding to Fifth Junior Sister, such as her dialogue content, is played using Lin xx's voice.
In some embodiments, when there is text content corresponding to environment description information in the text content, the terminal may, when playing that text content, take ambient music matching the environment description information as background music and play the background music.
In actual implementation, when there is text content corresponding to environment description information in the text content, the environment description information in the text content is obtained. Here, a keyword dictionary of environment description information may be preset, storing keywords corresponding to various kinds of environment description information. The text content is then matched against the keywords in the keyword dictionary; when the text content contains text matching a keyword in the dictionary, it is determined that text content corresponding to environment description information exists, the matching text content is extracted, and that text content is matched against the available ambient music to obtain the ambient music matching the environment description information.
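The keyword-dictionary lookup described above can be sketched as follows. The dictionary entries and track names are illustrative assumptions; a real system would use the preset keyword dictionary and music library of the application.

```python
# Preset keyword dictionary: environment type -> keywords (illustrative).
KEYWORD_DICT = {
    "rain": ["rain", "rainy", "drizzle"],
    "wind": ["wind", "gale"],
    "night": ["night", "midnight"],
}

# Ambient music matched to each environment type (hypothetical file names).
AMBIENT_MUSIC = {
    "rain": "ambient_rain.mp3",
    "wind": "ambient_wind.mp3",
    "night": "ambient_night.mp3",
}

def find_background_music(sentence):
    """Return ambient tracks whose environment keywords occur in the sentence."""
    text = sentence.lower()
    matched = [env for env, keywords in KEYWORD_DICT.items()
               if any(kw in text for kw in keywords)]
    return [AMBIENT_MUSIC[env] for env in matched]

print(find_background_music("It was a rainy night."))
# ['ambient_rain.mp3', 'ambient_night.mp3']
```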
As an example, when the environment description information contained in the text content is a rainy night, ambient music matching rain can be obtained, and when the text content corresponding to that environment description information is played, the rain-matching ambient music is played as background music.
In the present application, by adding ambient music as background music, the user can be immersed in the scene described by the text content, further enhancing the sense of immersion brought by voice playback.
In some embodiments, the terminal may also play the text content in the following way: determine the emotional color corresponding to each sentence in the text content; based on the emotional color corresponding to each sentence, generate a voice for each sentence so that the voice carries the corresponding emotional color; and play the generated voice of each sentence.
In actual implementation, each sentence in the text content has a corresponding emotional color; in particular, for the dialogue content in the text content, the characters in the article speak with emotion, such as sadness or happiness. In the present application, by obtaining the emotional color corresponding to each sentence, the generated voice carries that emotional color, so that the user can have an immersive feeling when hearing the voice.
In practical applications, the emotional color corresponding to each sentence is determined not only from the sentence itself but also in combination with the sentence's context, to improve the accuracy of emotional color determination. For example, from "at this moment she said with tears in her eyes" alone, it can only be judged that the current character is crying, but not whether the corresponding emotional color is crying for joy or crying in sorrow; this needs to be judged in combination with the context.
In some embodiments, the terminal may determine the emotional color corresponding to each sentence in the text content in the following way: perform emotion label extraction on each sentence in the text content to obtain the emotion label corresponding to each sentence, the extracted emotion label of each sentence representing the emotional color of the corresponding sentence. The terminal may then generate the voice of each sentence based on its emotional color in the following way: determine the voice parameters matching each emotion label, the voice parameters including at least one of sound quality and prosody; and generate the voice of each sentence based on the voice parameters.
In actual implementation, since the emotional color of a sentence is determined not only by the text information but also by the environment in which the character is located in the article and the character's basic information, the emotion label here includes at least one of the following: basic information, cognitive evaluation, and psychological feeling.
FIG. 12 is a schematic diagram of emotion labels provided by an embodiment of the present application. Referring to FIG. 12, emotion labels include basic information, cognitive evaluation, and psychological feeling. Cognitive evaluation includes discourse tendency and discourse style; for example, discourse tendency may be negative or affirmative, indifferent or enthusiastic. Basic information includes age information (such as child or young person), gender information, and identity information (such as a domineering CEO). Psychological feeling includes positive feelings (such as comfort and sympathy) and negative feelings (such as sorrow and panic).
Here, one or more emotion labels may be obtained for a sentence. After the emotion labels are obtained, the voice parameters matching the emotion labels may be determined directly based on the correspondence between emotion labels and voice parameters; alternatively, emotion prediction may first be performed based on the multiple emotion labels, and the voice parameters matching the emotion labels are then obtained according to the correspondence between the predicted emotion and voice parameters. After the voice parameters are obtained, the voice of the corresponding sentence is generated based on them.
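The direct label-to-parameter lookup described above can be sketched as a simple table. The parameter names and values are illustrative assumptions, loosely following the joy/anger examples of FIG. 14 (brisk vs. slightly faster speech rate), not values taken from the application.

```python
# Correspondence between emotion labels and voice parameters (illustrative).
EMOTION_TO_PARAMS = {
    "joy":    {"speech_rate": 1.2, "pitch": 1.1, "brightness": "high"},
    "anger":  {"speech_rate": 1.1, "pitch": 1.2, "brightness": "high"},
    "sorrow": {"speech_rate": 0.8, "pitch": 0.9, "brightness": "low"},
}
DEFAULT_PARAMS = {"speech_rate": 1.0, "pitch": 1.0, "brightness": "neutral"}

def voice_params_for(emotion_labels):
    """Return the parameters of the first label with a known mapping, else defaults."""
    for label in emotion_labels:
        if label in EMOTION_TO_PARAMS:
            return EMOTION_TO_PARAMS[label]
    return DEFAULT_PARAMS

print(voice_params_for(["joy"]))
# {'speech_rate': 1.2, 'pitch': 1.1, 'brightness': 'high'}
```

The variant that first predicts a single emotion from multiple labels would replace the first-match loop with a classifier whose output indexes the same table.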
The voice parameters are described here. FIG. 13 is a schematic diagram of voice parameters provided by an embodiment of the present application. Referring to FIG. 13, the voice parameters include sound quality and prosody, where sound quality includes brightness, saturation, etc., and prosody includes pitch, speech rate, syllable interval, rhythm, intonation, etc.
FIG. 14 is a schematic diagram of the correspondence between emotions and voice parameters provided by an embodiment of the present application. Referring to FIG. 14, different emotions correspond to different voice parameters; for example, when the emotion is joy, the speech rate is brisk but sometimes slower, and when the emotion is anger, the speech rate is slightly faster.
In some embodiments, when playback reaches dialogue content in the text content, the terminal may also present a cartoon character and play an animation of the cartoon character reading the dialogue content aloud with the timbre, where the cartoon character matches the character characteristics of the character corresponding to the dialogue content.
In actual implementation, the terminal may also obtain, according to the character characteristics of the character corresponding to the dialogue content, a cartoon character matching those characteristics, and play an animation of the cartoon character reading the dialogue content aloud with the timbre matching the character characteristics. In this way, the user can be immersed in the scene described by the article both aurally and visually, bringing a better sense of immersion.
As an example, FIG. 15 is a schematic diagram of a content interface provided by an embodiment of the present application. Referring to FIG. 15, the character corresponding to the dialogue content here is a child; a cartoon character 1501 in the image of a child is presented in the content interface, and an animation of the cartoon character 1501 reading the dialogue content aloud with a child's timbre is played.
In some embodiments, the dialogue content in the text content is played using a timbre matching the character characteristics of the character corresponding to the dialogue content, as follows: extract, from the content of the article, the basic information of the character corresponding to the dialogue content; obtain a timbre adapted to the basic information; and play the dialogue content in the text content using the obtained timbre.
The basic information includes at least one of the following: age information, gender information, and identity information. In actual implementation, the basic information of the character corresponding to the dialogue content is extracted from the content of the article; it may be extracted from the presented text content or from text content not yet presented. It can be understood that the basic information of the character is extracted by combining all the text content in the article that describes the character.
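Extracting a character's basic information from descriptive passages can be sketched with simple keyword rules. The rule patterns below are hypothetical assumptions for illustration; a practical system would apply an NLP model over all passages describing the character, as the paragraph above notes.

```python
import re

# Illustrative keyword rules mapping words to basic-information fields.
RULES = {
    "gender": {"she": "female", "her": "female", "he": "male", "his": "male"},
    "identity": {"ceo": "CEO", "doctor": "doctor", "student": "student"},
    "age": {"child": "child", "young": "young", "elderly": "elderly"},
}

def extract_basic_info(sentences):
    """Combine every sentence describing the character into one basic-info profile."""
    info = {}
    for sentence in sentences:
        for word in re.findall(r"[a-z]+", sentence.lower()):
            for field, mapping in RULES.items():
                if word in mapping and field not in info:
                    info[field] = mapping[word]
    return info

print(extract_basic_info(["He is a young CEO.", "His office was huge."]))
# {'gender': 'male', 'age': 'young', 'identity': 'CEO'}
```

The resulting profile can then be matched against the tags of the pre-stored timbres to obtain an adapted timbre.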
In some embodiments, while the text content is being played by voice, the terminal may also display the currently played sentence distinctively; as voice playback proceeds, the text content of the article is presented with scrolling so that the presented text content matches the progress of voice playback.
In practical implementation, the user can listen while reading, that is, browse the presented text content while listening to the voice playback of that content. To indicate which content is currently being played, the currently played sentence can be displayed distinctively so that the user can quickly locate it. As an example, FIG. 16 is a schematic diagram of a content interface provided by an embodiment of this application. Referring to FIG. 16, a gray background color is used to present the currently played sentence 1601 so as to distinguish it from the other sentences.
Here, as voice playback proceeds, the text content of the article can be presented with scrolling so that the currently played sentence always remains in the middle of the screen.
In some embodiments, while the text content is being played by voice, the terminal may also display the currently played sentence distinctively; as voice playback proceeds, the text content of the article is presented page by page so that the presented text content matches the progress of voice playback.
In practical implementation, after the currently presented text content has finished playing, page turning can be performed to present the text content of the next page of the article, and voice playback continues with the text content of that next page, so that the presented text content matches the progress of voice playback.
In some embodiments, the terminal can also obtain the character characteristics of each character from the content of the article and store them in a blockchain network; in this way, when another terminal needs to play the text content of the article by voice, it can obtain the character characteristics of each character in the article directly from the blockchain.
Here, this embodiment of the application can also be combined with blockchain technology. After the terminal obtains the character characteristics of each character, it generates a transaction for storing those character characteristics and submits the generated transaction to a node of the blockchain network, so that the node stores the character characteristics of each character in the blockchain network after reaching consensus on the transaction. Before storage in the blockchain network, the terminal may also hash the character characteristics of each character to obtain digest information corresponding to each character's characteristics, and store the obtained digest information in the blockchain network. In this manner, the character characteristics of each character are protected from tampering, and their security is improved.
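A minimal sketch of the hashing step described above (the serialization format and field names are illustrative assumptions, not part of the claimed embodiment): the character characteristics are serialized deterministically and hashed, and the digest is what would be stored on-chain for later tamper checks.

```python
import hashlib
import json

def trait_digest(traits: dict) -> str:
    """Serialize the character characteristics deterministically and hash
    them; the hex digest is the on-chain tamper-evidence record."""
    canonical = json.dumps(traits, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

traits = {"name": "Fifth Junior Sister", "age": "young", "gender": "female"}
digest = trait_digest(traits)
print(digest)

# Re-hashing unchanged characteristics yields the same digest;
# any modification produces a different one.
assert trait_digest(dict(traits)) == digest
assert trait_digest({**traits, "age": "old"}) != digest
```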
Referring to FIG. 17, FIG. 17 is a schematic diagram of the application architecture of a blockchain network provided by an embodiment of this application, including a business entity 400, a blockchain network 600 (consensus nodes 610-1 to 610-3 are shown as examples), and a certification center 700, which are each described below.
The type of the blockchain network 600 is flexible and varied; for example, it can be any of a public chain, a private chain, or a consortium chain. Taking a public chain as an example, the electronic devices of any business entity, such as user terminals and servers, can access the blockchain network 600 without authorization; taking a consortium chain as an example, after a business entity obtains authorization, the computer devices under its administration (for example, terminals/servers) can access the blockchain network 600, in which case they become client nodes in the blockchain network 600.
In some embodiments, a client node may act only as an observer of the blockchain network 600, that is, provide the function of supporting the business entity in initiating transactions (for example, for storing data on the chain or querying on-chain data), while the functions of the consensus nodes 610 of the blockchain network 600, such as the ordering function, consensus service, and ledger function, may be implemented by the client node by default or selectively (for example, depending on the specific business needs of the business entity). In this way, the data and business-processing logic of the business entity can be migrated into the blockchain network 600 to the greatest extent, and the trustworthiness and traceability of data and business processing can be achieved through the blockchain network 600.
A consensus node in the blockchain network 600 receives transactions submitted by client nodes of the business entity 400 and executes the transactions to update or query the ledger; the various intermediate or final results of executing a transaction can be returned to and displayed on the client node of the business entity.
For example, the client node 410 can subscribe to events of interest in the blockchain network 600, such as transactions occurring in a particular organization/channel of the blockchain network 600, and the consensus node 610 pushes the corresponding transaction notification to the client node 410, thereby triggering the corresponding business logic in the client node 410.
The following describes an exemplary application of the blockchain, taking as an example a business entity accessing the blockchain network to implement voice playback of an article.
Referring to FIG. 17, the business entity 400 involved in voice playback of an article registers with the certification center 700 and obtains a digital certificate. The digital certificate includes the public key of the business entity and a digital signature by the certification center 700 over the public key and identity information of the business entity. The certificate is attached to a transaction together with the business entity's digital signature over the transaction and is sent to the blockchain network, so that the blockchain network can take the digital certificate and signature out of the transaction, verify the reliability of the message (that is, that it has not been tampered with) and the identity information of the business entity sending it, and perform verification according to the identity, for example whether the entity has permission to initiate transactions. A client running on a computer device under the business entity's administration (for example, a terminal or server) can request access to the blockchain network 600 and become a client node.
The client node 410 of the business entity 400 is used to play text content by voice. For example, in the content interface of an article, the text content of the article and a voice playback function item corresponding to the article are presented; a voice playback instruction for the article, triggered via the voice playback function item, is received; in response to the voice playback instruction, the text content is played by voice; and, during voice playback, when the text content includes at least one character, the text content corresponding to a character is played using a timbre matching that character's characteristics. Here, the terminal obtains the character characteristics of each character in the article and sends them to the blockchain network 600.
For the operation of sending the character characteristics of each character to the blockchain network 600, business logic can be configured in advance on the client node 410, so that when the terminal obtains the character characteristics of each character in the article, the client node 410 sends them to the blockchain network 600 automatically; alternatively, business personnel of the business entity 400 can log in on the client node 410, manually package the character characteristics of each character, and send them to the blockchain network 600. When sending, the client node 410 generates a transaction corresponding to the storage operation according to the character characteristics of each character; the transaction specifies the smart contract that needs to be invoked to perform the storage operation, together with the parameters passed to the smart contract, and also carries the digital certificate of the client node 410 and a signed digital signature (for example, obtained by encrypting the transaction digest with the private key in the digital certificate of the client node 410). The transaction is then broadcast to the consensus nodes in the blockchain network 600 (such as consensus node 610-1, consensus node 610-2, and consensus node 610-3).
When a consensus node in the blockchain network 600 receives a transaction, it verifies the digital certificate and digital signature carried in the transaction. After successful verification, it confirms, according to the identity of the business entity 400 carried in the transaction, whether the business entity 400 has transaction permission; failure of either the digital-signature check or the permission check causes the transaction to fail. After successful verification, the consensus node appends its own digital signature (for example, obtained by encrypting the transaction digest with the private key of consensus node 610-1) and continues to broadcast the transaction in the blockchain network 600.
After receiving a successfully verified transaction, a consensus node in the blockchain network 600 fills the transaction into a new block and broadcasts it. When a consensus node in the blockchain network 600 broadcasts a new block, a consensus process is performed on the new block; if consensus succeeds, the new block is appended to the tail of the blockchain stored by the node, the state database is updated according to the result of the transaction, and the transactions in the new block are executed: for a transaction submitting an update of the character characteristics of each character, the character characteristics of each character are added to the state database.
As an example of a blockchain, see FIG. 18, which is a schematic structural diagram of the blockchain in the blockchain network 600 provided by an embodiment of this application. The header of each block can include the hash value of all transactions in that block, and also contains the hash value of all transactions in the previous block. Records of newly generated transactions are filled into a block and, after consensus among the nodes in the blockchain network, are appended to the tail of the blockchain, forming chained growth; the hash-based chain structure between blocks ensures that the transactions in the blocks are tamper-proof and forgery-proof.
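The hash-chained structure described above can be sketched minimally as follows (a toy illustration of the linking principle only, with hypothetical field names; it omits consensus, signatures, and every other mechanism of FIG. 18): each block records the hash of its predecessor, so altering an earlier block breaks the link to its successor.

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    # Hash a canonical serialization of the block's contents.
    data = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(data).hexdigest()

def append_block(chain: list, transactions: list) -> None:
    # Each new block carries the hash of the previous block's contents.
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev, "transactions": transactions})

chain = []
append_block(chain, [{"store": "traits-A"}])
append_block(chain, [{"store": "traits-B"}])

# The link holds while block 0 is intact ...
assert chain[1]["prev_hash"] == block_hash(chain[0])
# ... and breaks as soon as block 0 is tampered with.
chain[0]["transactions"][0]["store"] = "forged"
assert chain[1]["prev_hash"] != block_hash(chain[0])
```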
The following describes an exemplary functional architecture of the blockchain network provided by an embodiment of this application. Referring to FIG. 19, FIG. 19 is a schematic diagram of the functional architecture of the blockchain network 600 provided by an embodiment of this application; the blockchain network includes an application layer 601, a consensus layer 602, a network layer 603, a data layer 604, and a resource layer 605, which are each described below.
The resource layer 605 encapsulates the computing, storage, and communication resources that implement each consensus node in the blockchain network 600.
The data layer 604 encapsulates the various data structures that implement the ledger, including a blockchain implemented as files in a file system, a key-value state database, and proofs of existence (for example, a hash tree of the transactions in a block).
The network layer 603 encapsulates the functions of the point-to-point (P2P) network protocol, the data dissemination mechanism and data verification mechanism, the access authentication mechanism, and business-entity identity management.
The P2P network protocol implements communication between the consensus nodes in the blockchain network 600; the data dissemination mechanism ensures the propagation of transactions in the blockchain network 600; the data verification mechanism is used to achieve reliable data transmission between consensus nodes based on cryptographic methods (for example, digital certificates, digital signatures, and public/private key pairs); the access authentication mechanism is used to authenticate the identity of a business entity joining the blockchain network 600 according to the actual business scenario and, when authentication passes, to grant the business entity permission to access the blockchain network 600; and business-entity identity management is used to store the identities of the business entities allowed to access the blockchain network 600, together with their permissions (for example, the types of transactions they can initiate).
The consensus layer 602 encapsulates the mechanism by which the consensus nodes in the blockchain network 600 reach agreement on blocks (that is, the consensus mechanism), as well as the functions of transaction management and ledger management. The consensus mechanism includes consensus algorithms such as POS, POW, and DPOS, and supports pluggable consensus algorithms.
Transaction management is used to verify the digital signature carried in a transaction received by a consensus node, verify the identity information of the business entity, and determine, according to the identity information, whether it has permission to conduct the transaction (reading the relevant information from business-entity identity management). Every business entity authorized to access the blockchain network 600 holds a digital certificate issued by the certification center; the business entity signs submitted transactions with the private key in its own digital certificate, thereby declaring its legal identity.
Ledger management is used to maintain the blockchain and the state database. A block on which consensus has been reached is appended to the tail of the blockchain, and the transactions in that block are executed: when a transaction includes an update operation, the key-value pairs in the state database are updated; when a transaction includes a query operation, the key-value pairs in the state database are queried and the query result is returned to the client node of the business entity. Query operations on the state database in multiple dimensions are supported, including: querying a block by block sequence number (for example, the hash value of a transaction); querying a block by block hash value; querying a block by transaction sequence number; querying a transaction by transaction sequence number; querying the account data of a business entity by the account (sequence number) of the business entity; and querying the blockchain in a channel by channel name.
The application layer 601 encapsulates the various services that the blockchain network can implement, including transaction traceability, evidence storage, and verification.
Applying the above embodiments, the text content of an article and a voice playback function item corresponding to the article are presented in the content interface of the article; a voice playback instruction for the article, triggered via the voice playback function item, is received; in response to the voice playback instruction, the text content is played by voice; and, during voice playback, when the text content includes at least one character, the text content corresponding to a character is played using a timbre matching that character. In this way, because the timbre used to play the text content matches the character to which that text content corresponds, the user can feel present in the scene when hearing the played text content and becomes more immersed in the content of the article, improving the sense of immersion brought by voice playback.
An exemplary application of an embodiment of this application in a practical application scenario is described below. Taking dialogue content as the text content corresponding to a character as an example: in practical implementation, the terminal presents the text content of the article and the user browses it; during browsing, a listening function can be enabled, for example after the user taps the play function item, the text content of the article is played by voice. During playback, when dialogue content is recognized in the article, a timbre matching the character characteristics of the character corresponding to the dialogue content is obtained, the speech for the dialogue content is generated with that timbre, and emotional color is added to the speech according to the emotional color corresponding to the dialogue content; when environment description information is recognized in the article, for text content containing environment description information, ambient music matching the environment description information is added to the speech of that text content as background music.
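The step of obtaining a timbre matching the character characteristics could be as simple as scoring candidate timbres by how many characteristic fields they match (a hypothetical sketch only; the embodiment does not specify the matching rule, and the library entries and field names are invented for illustration):

```python
def pick_timbre(traits: dict, timbre_library: list) -> dict:
    """Return the library timbre matching the most characteristic fields."""
    def score(t):
        return sum(1 for k in ("age", "gender") if t.get(k) == traits.get(k))
    return max(timbre_library, key=score)

library = [
    {"name": "child_voice", "age": "child", "gender": None},
    {"name": "adult_female", "age": "adult", "gender": "female"},
    {"name": "adult_male", "age": "adult", "gender": "male"},
]
chosen = pick_timbre({"age": "adult", "gender": "female"}, library)
print(chosen["name"])
```

A real system would likely combine this lookup with the user's own selections and the recommendation flow described later (FIG. 11).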
As an example, referring to FIG. 4 to FIG. 6, in the content interface of an article, the text content 401 of the article and a play function item 402 corresponding to the article are presented. When the user taps the play function item 402, the terminal begins to play the text content of the article by voice, presents a prompt box 501 in floating form, and presents in the prompt box 501 the text prompt "You are listening to the intelligently recognized audiobook." When the presentation duration of the text prompt in FIG. 5 reaches a duration threshold, the text prompt in FIG. 5 is switched to the play icon 61 in FIG. 6, and the prompt box is simultaneously shrunk so that its size fits the size of the content in the prompt box.
In practical applications, the user can independently choose timbres for the characters in an article, that is, choose timbres according to their own preferences. For example, first, the user selects, based on the presented text content, the character for which a timbre is to be chosen; here the character is selected by selecting text content, that is, the character corresponding to the selected target content is taken as the selected character. Then, after the target content is determined, at least two timbre options corresponding to the target content are presented; next, the user chooses the desired timbre based on the presented timbre options.
For example, referring to FIG. 7, the user selects target content based on the presented text content. Here, the target content can be selected by tapping text: when the user's tap operation is received, the sentence presented at the tap position is taken as the target content and a floating layer is presented, in which at least two timbre options 701 are shown. The timbre options are presented as a combination of image and text, that is, an image containing a cartoon figure matching the timbre is presented together with a textual description of the timbre, such as "naive and sweet"; the user can then select a timbre based on the presented timbre options.
Here, during timbre selection, the user can audition each candidate timbre: the user can trigger an audition operation for a timbre, and the terminal determines the timbre the user wants to audition and plays the selected target content with that timbre, implementing timbre audition. In this way, the user can choose a timbre based on the auditioned speech, which better matches real-world scenarios and improves the user experience.
In some embodiments, when the terminal recognizes that the character characteristics of a character in the article match a recommended timbre, a floating layer can pop up presenting recommended-timbre information together with a timbre switch button matching that information; when a trigger operation by the user on the timbre switch button is received, the timbre corresponding to the currently played dialogue content is switched to the timbre indicated by the recommended-timbre information.
For example, referring to FIG. 11, when it is recognized that the character characteristics of a character in the article match the timbre of a certain celebrity, recommended-timbre information 1101 is presented, such as "Lin xx's voice is a great match for Fifth Junior Sister's voice," together with a timbre switch button 1102. When the user taps the timbre switch button 1102, the terminal, in response to the tap, switches the currently used timbre to Lin xx's timbre; that is, after the switch, Lin xx's voice is used to play Fifth Junior Sister's dialogue content.
The technical implementation process of this application is described below. FIG. 20 is a schematic flowchart of the technical-side implementation provided by an embodiment of this application. Referring to FIG. 20, the voice playback method for articles provided by this embodiment of the application includes:
Step 2001: The terminal collects audio data.
In practical implementation, the terminal first starts recording and collects the required audio data to build an emotional corpus. Here, the emotional corpus is an important foundation for research on emotional speech synthesis. During collection, the captured audio data needs to be screened: for example, after recording starts, the terminal performs decibel detection on the captured audio data; if the background sound in the captured audio data is noisy, the captured audio data is filtered out and re-recorded, until audio data meeting the requirements (free of audio quality problems) is obtained. It should be noted that recording can be performed segment by segment: after the audio data corresponding to each segment's recording is collected, the collected audio data can be uploaded to the server for inspection, and when an audio quality problem is detected in the audio data, the segment is re-recorded.
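The decibel check above can be sketched with a root-mean-square level computed per frame (an illustrative sketch; the −50 dBFS threshold and helper names are assumptions, not values given by the embodiment): a frame that should be silent but measures loud indicates background noise, so the segment is rejected and re-recorded.

```python
import math

def rms_db(samples):
    """Root-mean-square level of a frame, in dBFS (0 dB = full scale)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(max(rms, 1e-12))  # guard against log(0)

def frame_is_noisy(silence_frame, threshold_db=-50.0):
    """A lead-in frame that should be silent but exceeds the threshold
    suggests background noise; the segment should be re-recorded."""
    return rms_db(silence_frame) > threshold_db

quiet = [0.00001] * 480   # near-silent frame (30 ms at 16 kHz)
noisy = [0.05] * 480      # audible background hum
print(frame_is_noisy(quiet), frame_is_noisy(noisy))
```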
When recording, speech with different emotions in different scenarios can be recorded, such as declarative, interrogative, and exclamatory sentences. The recorded audio data needs to be annotated with the Praat tool, for example annotating the fundamental frequency, syllable boundaries, and paralinguistic information of the audio data; this information is needed so that emotional-state labels and emotion-keyword attribute annotations can be added later when training the model.
As an example, FIG. 21A is a schematic diagram of fundamental frequency points provided by an embodiment of this application. Referring to FIG. 21A, the figure shows the curves of the fundamental frequency points of "mā" (mother) and "má" (hemp): the tone of "mā" is the high level tone (yinping), and its corresponding curve is close to horizontal; the tone of "má" is the rising tone (yangping), and its corresponding curve rises from low to high. FIG. 21B is a tone five-level-value diagram provided by an embodiment of this application; referring to FIG. 21B, its curves follow the same trends as those in the fundamental frequency diagram. It can be understood that, even without speech, one can tell from the fundamental frequency points and the tone five-level-value diagram when the pronunciation should be "mā" and when it should be "má".
Step 2002: Train the acoustic model.
After obtaining the audio data, the terminal preprocesses it. Preprocessing here includes pre-emphasis, framing, and similar operations, whose purpose is to eliminate aliasing, distortion, and other factors introduced by the human vocal organs themselves and by the device that captured the speech signal, so that the signal obtained by subsequent speech processing is more uniform and smooth, providing high-quality parameters for signal-parameter extraction and improving the quality of speech processing. After preprocessing, the preprocessed audio data is stored in a database, and the acoustic model is trained on the stored audio data, for example so that the acoustic model learns how each sound is actually pronounced as well as the timbre characteristics, yielding the required acoustic model.
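The two preprocessing steps named above are standard in speech processing and can be sketched as follows (frame length and hop are the conventional 25 ms / 10 ms at 16 kHz; the coefficient 0.97 is the customary pre-emphasis value, not one specified by the embodiment):

```python
def pre_emphasize(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]: boosts high frequencies to offset
    the spectral tilt of speech."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_signal(signal, frame_len=400, hop=160):
    """Slice the signal into overlapping frames
    (400 samples = 25 ms, 160-sample hop = 10 ms at 16 kHz)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

x = [float(n % 10) for n in range(1600)]   # stand-in for 0.1 s of samples
frames = frame_signal(pre_emphasize(x))
print(len(frames), len(frames[0]))
```

Each resulting frame is what the later feature extraction (MFCC, F0) operates on.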
To add emotional color to speech, an acoustic model can be trained. When training the acoustic model, acoustic analysis is first performed on the audio data. Here, because Chinese prosody is mostly processed at the syllable level, the prosodic features of syllables play a very important role in this kind of analysis of toned syllables, and the speech parameters can be divided into voice quality and prosody. Voice quality can include brightness and saturation; prosody includes pitch, speech rate, syllable spacing, and the like. For example, when a person expresses excitement, speech tends to be fast and loud, possibly with audible breathing. In this way, information such as the fundamental-frequency parameters and spectral parameters under the basic emotional colors can be obtained.
The acoustic model is then trained; the acoustic model here uses a Hidden Markov Model (HMM). FIG. 22 is a schematic diagram of the acoustic-model training process provided by an embodiment of the present application. Referring to FIG. 22, fundamental-frequency parameters are extracted from the speech signals in a speech corpus, spectral parameters are likewise extracted from those speech signals, and the Hidden Markov Model is then trained on the fundamental-frequency and spectral parameters. The speech corpus here is constructed from the audio data stored in the database as described above.
The role of the spectral parameters and fundamental-frequency parameters here is to make the synthesized sentences smoother and more natural. The spectral parameters are represented by Mel-Frequency Cepstrum Coefficients (MFCC) together with their first- and second-order delta coefficients, and the fundamental-frequency parameters are represented by the fundamental frequency F0 together with its first- and second-order delta coefficients.
The Mel cepstrum coefficient is a classic speech feature: it is a feature parameter extracted based on the characteristics of human auditory perception, an engineering approximation of the human ear. Besides pitch, human auditory perception also includes the perception of loudness, which is related to the frequency band of the sound; transforming the spectrum of the speech signal into this perceptual frequency domain better simulates the human hearing process. The Mel scale is defined such that 1 Mel is 1/1000 of the perceived pitch of a 1000 Hz tone. The fundamental frequency F0 is the lowest frequency in the range over which the filters are applied.
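The first- and second-order delta coefficients mentioned above are computed from the static feature track by the standard regression formula; a minimal sketch follows, in which the regression window size `N=2` is an assumed hyperparameter:

```python
import numpy as np

def deltas(feat, N=2):
    """First-order delta coefficients of a (T, D) feature matrix over time,
    using the standard regression formula with window size N."""
    T = len(feat)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feat, ((N, N), (0, 0)), mode="edge")  # repeat edge frames
    return np.array([
        sum(n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1)) / denom
        for t in range(T)
    ])

def with_deltas(feat):
    """Observation vector as described: static features concatenated with
    their first- and second-order deltas, frame by frame."""
    d1 = deltas(feat)
    d2 = deltas(d1)
    return np.hstack([feat, d1, d2])
```

For a linearly increasing feature track, the interior delta values come out as the slope, which is a quick sanity check on the formula.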
Step 2003: Synthesize audio.
In actual implementation, the text of the article is first input and preprocessed: the text is segmented into words, converted into sentences composed of those words, and the sentences are then annotated with phoneme-level, syllable-level, word-level, and other information helpful to speech synthesis.
Here the text needs to be analyzed level by level, such as word, sentence, chapter, and book. Keywords are extracted using the Term Frequency–Inverse Document Frequency (TF-IDF) algorithm combined with n-grams (sequences of n consecutive grams in the text, where a gram is a word that has passed a specified threshold filter). The extracted keywords are then compared with the words in a keyword dictionary through text-similarity analysis, so as to select from the keyword dictionary the keywords related to emotion labels, such as personality, mood, scene, and gender.
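The TF-IDF scoring step can be illustrated with a minimal unigram sketch. The application combines TF-IDF with n-grams; the formula below is the standard one, and the toy documents in the usage are hypothetical:

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the terms of docs[doc_index] by TF-IDF and return the top_k.
    docs is a list of tokenized documents (lists of words)."""
    # Document frequency: in how many documents each word appears
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    # Term frequency within the target document
    tf = Counter(docs[doc_index])
    total = sum(tf.values())
    # TF-IDF = (count / length) * log(n_docs / doc_freq)
    scores = {w: (c / total) * math.log(n / df[w]) for w, c in tf.items()}
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]
```

Words that occur often in one document but rarely across the corpus score highest, which is exactly the property wanted for keyword extraction.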
FIG. 23 is a schematic diagram of the construction process of the keyword dictionary provided by an embodiment of the present application. Referring to FIG. 23, a large-scale text corpus is first built to train a word-vector model, and data such as on-platform novels, user classifications, novel tags, and general databases are collected. Since the novel tags and general databases have already been screened, a seed dictionary is constructed from them. Model training is then performed based on the word-vector model and the seed dictionary, new words are predicted with the trained model, and the predicted new words are added to the keyword dictionary, thereby constructing the keyword dictionary.
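The new-word prediction step, expanding the seed dictionary with vocabulary words whose trained vectors lie close to a seed word, can be sketched as follows. The cosine-similarity threshold and the toy embeddings in the usage are assumptions for illustration:

```python
import numpy as np

def expand_seed_dictionary(embeddings, seed_words, threshold=0.8):
    """Add to the dictionary every vocabulary word whose cosine similarity
    to at least one seed word meets the (assumed) threshold."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    new_words = set()
    for word, vec in embeddings.items():
        if word in seed_words:
            continue
        if any(cos(vec, embeddings[s]) >= threshold
               for s in seed_words if s in embeddings):
            new_words.add(word)
    return set(seed_words) | new_words
```

With real data the embeddings would come from the trained word-vector model rather than the hand-built dictionary used here.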
Further, emotion classification can be performed based on the personalities of the characters in the article through an emotion classification model. FIG. 24 is a schematic diagram of the personality-based emotion classification model provided by an embodiment of the present application. Emotion labels related to character personality can be extracted as follows: word-vector representations of the words in the text are obtained through Word2Vec (a tool for training word-vector models), yielding a word-vector matrix for a paragraph or chapter; the word-vector matrix is input into a personality-based text analyzer 2401 to obtain text groups of different types; the text groups of each type are input into classifiers 2402 of the corresponding type; and finally the outputs of the classifiers are fused to obtain the final classification result. Here C, A, and E denote the three dimensions of extroversion, pleasantness, and responsibility respectively, while H and L denote high and low values of the personality trait on each dimension; for example, HA denotes high pleasantness, HC denotes more extroverted, and LE denotes low responsibility.
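The fusion of the per-type classifier outputs is not specified in detail; one common realization, shown here as an assumed sketch, averages the class-probability distributions of the individual classifiers and takes the arg-max of the fused distribution as the final label:

```python
import numpy as np

def fuse_classifiers(prob_outputs, weights=None):
    """Fuse classifier outputs by (optionally weighted) probability averaging.
    prob_outputs: list of per-classifier class-probability vectors."""
    probs = np.array(prob_outputs)
    if weights is None:
        weights = np.ones(len(probs)) / len(probs)  # uniform weighting
    fused = np.average(probs, axis=0, weights=weights)
    return fused, int(np.argmax(fused))
```

Weighted averaging lets more reliable classifier types (for example, the one matching the dominant text group) contribute more to the final result.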
Through the above algorithms, the emotion labels required for speech synthesis can be obtained, such as novel tags, basic information (personality, identity, age, gender), and cognitive evaluation (environment, mood). Emotion prediction is then performed on the basis of these labels, so as to predict the emotional color a person would attach when uttering the corresponding sentence.
Emotional color is determined not only by the text itself but is also affected by information such as the environment and social position of the characters in the article. On this basis, the present application infers the emotional color of a character from the context of the text, so that the correct speech can be synthesized. For example, for the sentence "at this moment she said through her tears", it must be predicted whether her emotional color is crying with joy or crying with sadness.
After the emotional color is predicted, audio is synthesized in combination with it. The key to synthesizing speech carrying emotional color lies in obtaining the fundamental-frequency parameters: people can distinguish different emotional colors in speech because speech contains differences in the fundamental-frequency parameters that reflect emotion. FIG. 25 is a schematic flowchart of audio synthesis provided by an embodiment of the present application. Referring to FIG. 25, the audio-synthesis process includes:
Step 2501: Parse the text.
Here, parsing the text includes syntactic parsing and semantic parsing, where syntactic parsing includes part-of-speech tagging, word parsing, and pronunciation parsing.
Step 2502: Extract emotion labels.
Here, the extracted emotion labels include novel tags, basic information (personality, identity, age, gender), and cognitive evaluation (environment, mood).
Step 2503: Label the speech.
In actual implementation, the speech is labeled with the extracted emotion labels. The labeling logic here is the same as when training the acoustic model, that is, information such as the fundamental-frequency parameters is adjusted. Specifically, the fundamental-frequency parameters output by the HMM are obtained and then adjusted based on the emotion labels to obtain the final fundamental-frequency parameters.
Step 2504: Synthesize audio.
Audio is synthesized through a synthesis filter, based on the adjusted fundamental-frequency parameters and the spectral parameters output by the HMM.
Applying the above embodiment allows users to feel immersed while listening to a book and to enter the scenes of the novel more immersively, thereby improving the user experience and the duration of use.
The following continues to describe an exemplary structure, implemented as software modules, of the speech playing apparatus 555 for an article provided by the embodiments of the present application. In some embodiments, as shown in FIG. 2, the software modules of the speech playing apparatus 555 for an article stored in the memory 550 may include:
a presentation module 5551, configured to present, in a content interface of an article, the text content of the article and a speech playing function item corresponding to the article;
a receiving module 5552, configured to receive a speech playing instruction for the article triggered based on the speech playing function item;
a first playing module 5553, configured to play the text content by speech in response to the speech playing instruction; and
a second playing module 5554, configured to, in the process of playing the text content by speech, when the text content includes at least one character, play the text content corresponding to the character using a timbre matching the character.
In some embodiments, the presentation module is further configured to present a prompt box in floating form in the process of playing the text content by speech, and to present text prompt information in the prompt box; the text prompt information is used to indicate that the text content is being played by speech.
In some embodiments, the presentation module is further configured to shrink the prompt box when the presentation duration of the text prompt information reaches a duration threshold, and to switch the text prompt information in the prompt box to a playing icon; the playing icon is used to indicate that the text content is being played by speech.
In some embodiments, the second playing module is further configured to present, in response to a selection operation on target content in the text content, at least two timbre options corresponding to the target content, each timbre option corresponding to one timbre; and, in response to a timbre selection operation triggered based on the at least two timbre options, to take the selected target timbre as the timbre of the character corresponding to the target content, so that in the process of playing the text content by speech, the text content corresponding to that character is played using the target timbre.
In some embodiments, the first playing module is further configured to present audition function items for the at least two timbres, and, in response to a trigger operation on the audition function item corresponding to a target timbre, to play the target content using the target timbre corresponding to that audition function item.
In some embodiments, the first playing module is further configured to present a timbre selection function item in the content interface of the article; to present at least two characters in the article in response to a trigger operation on the timbre selection function item; to present, in response to a selection operation on a target character among the at least two characters, at least two timbres corresponding to the target character; and, in response to a timbre selection operation triggered based on the at least two timbres, to take the selected target timbre as the timbre of the target character, so that in the process of playing the text content by speech, the text content corresponding to the target character is played using the target timbre.
In some embodiments, the first playing module is further configured to present a timbre switching key for the text content in the process of playing the text content by speech, and, when a trigger operation on the timbre switching key is received, to switch the timbre corresponding to the text content from a first timbre to a second timbre.
In some embodiments, the first playing module is further configured to present, in the process of playing the text content by speech and when the dialogue content in the text content is reached, recommended timbre information for target text content in the text content; the recommended timbre information is used to indicate that the timbre of the character corresponding to the target text content is to be switched based on the recommended timbre information.
In some embodiments, the first playing module is further configured to, when text content corresponding to environment description information exists in the text content, take the ambient music matching the environment description information as background music and play that background music while the text content corresponding to the environment description information is being played.
In some embodiments, the first playing module is further configured to determine the emotional color corresponding to each sentence in the text content; to generate, based on the emotional color corresponding to each sentence, the speech corresponding to that sentence, so that the speech carries the corresponding emotional color; and to play the generated speech corresponding to each sentence.
In some embodiments, the first playing module is further configured to perform emotion-label extraction on each sentence in the text content to obtain the emotion label corresponding to each sentence, the emotion label including at least one of basic information, cognitive evaluation, and psychological feeling; to use the extracted emotion label of each sentence to represent the emotional color of that sentence; to determine the speech parameters matching each emotion label, the speech parameters including at least one of sound quality and prosody; and to generate the speech of each sentence based on the speech parameters.
In some embodiments, the first playing module is further configured to present a cartoon character when the dialogue content in the text content is reached, and to play an animation of the cartoon character reading the dialogue content aloud in the timbre; the cartoon character matches the character features of the character to whom the dialogue content belongs.
In some embodiments, the first playing module is further configured to extract, from the content of the article, portrait information of the character corresponding to the dialogue content; to obtain a timbre adapted to the portrait information; and to play the dialogue content in the text content using the obtained timbre adapted to the portrait information.
In some embodiments, the first playing module is further configured to display the currently played sentence distinctively in the process of playing the text content by speech, and, as speech playing proceeds, to present the text content of the article by scrolling, so that the presented text content matches the progress of speech playing.
In some embodiments, the first playing module is further configured to display the currently played sentence distinctively in the process of playing the text content by speech, and, as speech playing proceeds, to present the text content of the article by page turning, so that the presented text content matches the progress of speech playing.
An embodiment of the present application provides a computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the speech playing method for an article described above in the embodiments of the present application.
An embodiment of the present application provides a computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to perform the method provided by the embodiments of the present application, for example the method shown in FIG. 3.
In some embodiments, the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disc, or CD-ROM, or may be any device including one or any combination of the foregoing memories.
In some embodiments, the executable instructions may take the form of a program, software, software module, script, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, for example in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple cooperating files (for example, files storing one or more modules, subroutines, or code sections).
As an example, the executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
The above descriptions are merely embodiments of the present application and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and scope of the present application shall fall within its scope of protection.

Claims (16)

  1. A speech playing method for an article, the method being executed by a computer device and comprising:
    presenting, in a content interface of an article, the text content of the article and a speech playing function item corresponding to the article;
    receiving a speech playing instruction for the article triggered based on the speech playing function item;
    playing the text content by speech in response to the speech playing instruction; and
    in the process of playing the text content by speech, when the text content includes at least one character, playing the text content corresponding to the character using a timbre matching the character features of the character.
  2. The method of claim 1, further comprising:
    presenting a prompt box in floating form in the process of playing the text content by speech, and presenting text prompt information in the prompt box;
    wherein the text prompt information is used to indicate that the text content is being played by speech.
  3. The method of claim 2, wherein after the text prompt information is presented in the prompt box, the method further comprises:
    shrinking the prompt box when the presentation duration of the text prompt information reaches a duration threshold, and switching the text prompt information in the prompt box to a playing icon;
    wherein the playing icon is used to indicate that the text content is being played by speech.
  4. The method of claim 1, further comprising:
    presenting, in response to a selection operation on target content in the text content, at least two timbre options corresponding to the target content, wherein each timbre option corresponds to one timbre; and
    in response to a timbre selection operation triggered based on the at least two timbre options, taking the selected target timbre as the timbre of the character corresponding to the target content, so that
    in the process of playing the text content by speech, the text content corresponding to the character corresponding to the target content is played using the target timbre.
  5. The method of claim 4, wherein after the at least two timbre options corresponding to the target content are presented, the method further comprises:
    presenting audition function items for the at least two timbres; and
    in response to a trigger operation on the audition function item corresponding to a target timbre, playing the target content using the target timbre corresponding to the audition function item.
  6. The method of claim 1, further comprising:
    presenting a timbre selection function item in the content interface of the article;
    presenting at least two characters in the article in response to a trigger operation on the timbre selection function item;
    presenting, in response to a selection operation on a target character among the at least two characters, at least two timbres corresponding to the target character; and
    in response to a timbre selection operation triggered based on the at least two timbres, taking the selected target timbre as the timbre of the target character, so that
    in the process of playing the text content by speech, the text content corresponding to the target character is played using the target timbre.
  7. The method of claim 1, further comprising:
    presenting a timbre switching key for the text content in the process of playing the text content by speech; and
    when a trigger operation on the timbre switching key is received, switching the timbre corresponding to the currently played content from a first timbre to a second timbre.
  8. The method of claim 1, further comprising:
    presenting, in the process of playing the text content by speech, recommended timbre information for target text content in the text content;
    wherein the recommended timbre information is used to indicate that the timbre of the character corresponding to the target text content is to be switched based on the recommended timbre information.
  9. The method of claim 1, further comprising:
    when text content corresponding to environment description information exists in the text content, taking the ambient music matching the environment description information as background music and playing the background music while the text content corresponding to the environment description information is being played.
  10. The method of claim 1, wherein playing the text content by speech comprises:
    determining the emotional color corresponding to each sentence in the text content;
    generating, based on the emotional color corresponding to each sentence, the speech corresponding to that sentence, so that the speech carries the corresponding emotional color; and
    playing the generated speech corresponding to each sentence.
  11. The method of claim 10, wherein determining the emotional color corresponding to each sentence in the text content comprises:
    performing emotion-label extraction on each sentence in the text content to obtain the emotion label corresponding to each sentence; and
    using the extracted emotion label of each sentence to represent the emotional color of that sentence;
    and wherein generating, based on the emotional color corresponding to each sentence, the speech corresponding to that sentence comprises:
    determining the speech parameters matching each emotion label, the speech parameters including at least one of sound quality and prosody; and
    generating the speech of each sentence based on the speech parameters.
  12. The method of claim 1, further comprising:
    presenting a cartoon character when the dialogue content in the text content is reached, and playing an animation of the cartoon character reading the dialogue content aloud in the timbre;
    wherein the cartoon character matches the character features of the character to whom the dialogue content belongs.
  13. A speech playing apparatus for an article, the apparatus comprising:
    a presentation module, configured to present, in a content interface of an article, the text content of the article and a speech playing function item corresponding to the article;
    a receiving module, configured to receive a speech playing instruction for the article triggered based on the speech playing function item;
    a first playing module, configured to play the text content by speech in response to the speech playing instruction; and
    a second playing module, configured to, in the process of playing the text content by speech, when the text content includes at least one character, play the content corresponding to the character using a timbre matching the character features of the character.
  14. A computer device, comprising:
    a memory, configured to store executable instructions; and
    a processor, configured to implement, when executing the executable instructions stored in the memory, the speech playing method for an article of any one of claims 1 to 12.
  15. A computer-readable storage medium storing executable instructions which, when executed by a processor, implement the speech playing method for an article of any one of claims 1 to 12.
  16. A computer program product comprising a computer program or instructions which, when executed by a processor, implement the speech playing method for an article of any one of claims 1 to 12.
PCT/CN2022/078610 2021-03-04 2022-03-01 Speech playing method and apparatus for article, and device, storage medium and program product WO2022184055A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110241752.7 2021-03-04
CN202110241752.7A CN113010138B (en) 2021-03-04 2021-03-04 Article voice playing method, device and equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2022184055A1 true WO2022184055A1 (en) 2022-09-09

Family

ID=76405700

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/078610 WO2022184055A1 (en) 2021-03-04 2022-03-01 Speech playing method and apparatus for article, and device, storage medium and program product

Country Status (2)

Country Link
CN (1) CN113010138B (en)
WO (1) WO2022184055A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115220608A (en) * 2022-09-20 2022-10-21 深圳市人马互动科技有限公司 Method and device for processing multimedia data in interactive novel
CN115499401A (en) * 2022-10-18 2022-12-20 康键信息技术(深圳)有限公司 Method, system, computer equipment and medium for playing voice data

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010138B (en) * 2021-03-04 2023-04-07 腾讯科技(深圳)有限公司 Article voice playing method, device and equipment and computer readable storage medium
CN113851106B (en) * 2021-08-17 2023-01-06 北京百度网讯科技有限公司 Audio playing method and device, electronic equipment and readable storage medium
CN115238111B (en) * 2022-06-15 2023-11-14 荣耀终端有限公司 Picture display method and electronic equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6476815B1 (en) * 1998-10-19 2002-11-05 Canon Kabushiki Kaisha Information processing apparatus and method and information transmission system
CN103020105A (en) * 2011-09-27 2013-04-03 株式会社东芝 Document reading-out support apparatus and method
CN103546763A (en) * 2012-07-12 2014-01-29 三星电子株式会社 Method for providing contents information and broadcast receiving apparatus
CN105955609A (en) * 2016-04-25 2016-09-21 乐视控股(北京)有限公司 Voice reading method and apparatus
CN111367490A (en) * 2020-02-28 2020-07-03 广州华多网络科技有限公司 Voice playing method and device and electronic equipment
WO2020209647A1 (en) * 2019-04-09 2020-10-15 네오사피엔스 주식회사 Method and system for generating synthetic speech for text through user interface
CN111813301A (en) * 2020-06-03 2020-10-23 维沃移动通信有限公司 Content playing method and device, electronic equipment and readable storage medium
CN112765971A (en) * 2019-11-05 2021-05-07 北京火山引擎科技有限公司 Text-to-speech conversion method and device, electronic equipment and storage medium
CN113010138A (en) * 2021-03-04 2021-06-22 腾讯科技(深圳)有限公司 Article voice playing method, device and equipment and computer readable storage medium

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4192936B2 (en) * 2005-10-11 2008-12-10 ヤマハ株式会社 Automatic performance device
KR20190100428A (en) * 2016-07-19 2019-08-28 게이트박스 가부시키가이샤 Image display apparatus, topic selection method, topic selection program, image display method and image display program
CN109979430B (en) * 2017-12-28 2021-04-20 深圳市优必选科技有限公司 Robot story telling method and device, robot and storage medium
CN108962219B (en) * 2018-06-29 2019-12-13 百度在线网络技术(北京)有限公司 method and device for processing text
TWI685835B (en) * 2018-10-26 2020-02-21 財團法人資訊工業策進會 Audio playback device and audio playback method thereof
CN109523988B (en) * 2018-11-26 2021-11-05 安徽淘云科技股份有限公司 Text deduction method and device
CN109410913B (en) * 2018-12-13 2022-08-05 百度在线网络技术(北京)有限公司 Voice synthesis method, device, equipment and storage medium
CN109658916B (en) * 2018-12-19 2021-03-09 腾讯科技(深圳)有限公司 Speech synthesis method, speech synthesis device, storage medium and computer equipment
CN109523986B (en) * 2018-12-20 2022-03-08 百度在线网络技术(北京)有限公司 Speech synthesis method, apparatus, device and storage medium
US10834029B2 (en) * 2019-01-29 2020-11-10 International Business Machines Corporation Automatic modification of message signatures using contextual data
CN110459200A (en) * 2019-07-05 2019-11-15 深圳壹账通智能科技有限公司 Phoneme synthesizing method, device, computer equipment and storage medium
CN111158630B (en) * 2019-12-25 2023-06-23 网易(杭州)网络有限公司 Playing control method and device
CN111341318B (en) * 2020-01-22 2021-02-12 北京世纪好未来教育科技有限公司 Speaker role determination method, device, equipment and storage medium
CN111524501B (en) * 2020-03-03 2023-09-26 北京声智科技有限公司 Voice playing method, device, computer equipment and computer readable storage medium
CN111415650A (en) * 2020-03-25 2020-07-14 广州酷狗计算机科技有限公司 Text-to-speech method, device, equipment and storage medium
CN111667811B (en) * 2020-06-15 2021-09-07 北京百度网讯科技有限公司 Speech synthesis method, apparatus, device and medium
CN111785246A (en) * 2020-06-30 2020-10-16 联想(北京)有限公司 Virtual character voice processing method and device and computer equipment
CN112164407A (en) * 2020-09-22 2021-01-01 腾讯音乐娱乐科技(深圳)有限公司 Tone conversion method and device

Also Published As

Publication number Publication date
CN113010138A (en) 2021-06-22
CN113010138B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
WO2022184055A1 (en) Speech playing method and apparatus for article, and device, storage medium and program product
US10902841B2 (en) Personalized custom synthetic speech
CN108806656B (en) Automatic generation of songs
JP6876752B2 (en) Response method and equipment
CN108806655B (en) Automatic generation of songs
US20200126566A1 (en) Method and apparatus for voice interaction
JP2018537727A5 (en)
WO2022121181A1 (en) Intelligent news broadcasting method, apparatus and device, and storage medium
US20100318363A1 (en) Systems and methods for processing indicia for document narration
WO2007043679A1 (en) Information processing device, and program
US20200058288A1 (en) Timbre-selectable human voice playback system, playback method thereof and computer-readable recording medium
JP2015517684A (en) Content customization
CN110517689A (en) A kind of voice data processing method, device and storage medium
US20020072900A1 (en) System and method of templating specific human voices
US20140019137A1 (en) Method, system and server for speech synthesis
JP2019091014A (en) Method and apparatus for reproducing multimedia
Schuller et al. Synthesized speech for model training in cross-corpus recognition of human emotion
CN112750187A (en) Animation generation method, device and equipment and computer readable storage medium
CN109102800A (en) A kind of method and apparatus that the determining lyrics show data
WO2021149929A1 (en) System for providing customized video producing service using cloud-based voice combining
Pauletto et al. Exploring expressivity and emotion with artificial voice and speech technologies
WO2007069512A1 (en) Information processing device, and program
US20050108011A1 (en) System and method of templating specific human voices
CN110741430A (en) Singing synthesis method and singing synthesis system
WO2022156479A1 (en) Custom tone and vocal synthesis method and apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22762520

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 31/01/2024)