CN111428079B - Text content processing method, device, computer equipment and storage medium


Info

Publication number
CN111428079B
Authority
CN
China
Prior art keywords
text content
target
target text
music
audio resource
Prior art date
Legal status
Active
Application number
CN202010209314.8A
Other languages
Chinese (zh)
Other versions
CN111428079A (en)
Inventor
罗忠岚
Current Assignee
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Kugou Computer Technology Co Ltd
Priority to CN202010209314.8A
Publication of CN111428079A
Application granted
Publication of CN111428079B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval of audio data
    • G06F 16/63 Querying
    • G06F 16/635 Filtering based on additional data, e.g. user or group profiles
    • G06F 16/65 Clustering; Classification
    • G06F 16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/683 Retrieval using metadata automatically derived from the content
    • G06F 16/685 Retrieval using an automatically derived transcript of audio data, e.g. lyrics
    • G06F 16/686 Retrieval using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES (ICT)
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a text content processing method and apparatus, a computer device, and a storage medium, and belongs to the technical field of computers. The method includes the following steps: receiving a target audio resource acquisition request for target text content, where the target text content includes the text content of at least one role; acquiring a target audio resource for the target text content based on the tone types corresponding to the different roles in the target text content and the text content corresponding to those roles; and storing the target audio resource of the target text content. The target audio resource not only preserves the target text content but also expresses the text content of the different roles with different tone types, making the target text content more lifelike.

Description

Text content processing method, device, computer equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a text content processing method, a text content processing device, computer equipment and a storage medium.
Background
With the rapid development of computer technology, the pace of daily life has gradually accelerated, and more and more people have begun to record the small moments of their lives. For example, the recording may be done in text form or in video form.
In the related art, an application program supporting text input is installed and run on a computer device, and a user can enter text content in the application program. This text-based recording method records only the text content itself and cannot accurately capture the situation in which the text content occurred. The user can also record video with a camera device; video recording can accurately capture that situation, but the user must spend a great deal of time and energy shooting the video, and unnecessary content may appear in the shot video, which easily wastes resources and time.
Therefore, a text content processing method is needed that preserves the situation in which the text content occurred while saving time and improving processing efficiency.
Disclosure of Invention
The embodiments of the application provide a text content processing method and apparatus, a computer device, and a storage medium, which can be used to solve the problems in the related art. The technical solution is as follows:
In one aspect, an embodiment of the present application provides a text content processing method, where the method includes:
receiving a target audio resource acquisition request of target text content, wherein the target text content comprises text content of at least one role;
acquiring target audio resources of the target text content based on tone types corresponding to different roles in the target text content and text contents corresponding to different roles in the target text content, wherein the target audio resources adopt different tone types to represent the text contents corresponding to different roles;
and storing the target audio resource of the target text content.
In one possible implementation manner, the obtaining, based on tone types corresponding to different roles in the target text content and text content corresponding to different roles in the target text content, a target audio resource of the target text content includes:
responding to the target audio resource acquisition request, and sending a target audio resource acquisition instruction to a target server, wherein the target audio resource acquisition instruction carries an identifier of the target text content;
and receiving a target audio resource returned by the target server, wherein the target audio resource is generated based on tone types corresponding to different roles in the target text content and text contents corresponding to different roles in the target text content, and the target audio resource is an audio resource corresponding to the identification of the target text content.
In one possible implementation manner, the obtaining, based on tone types corresponding to different roles in the target text content and text content corresponding to different roles in the target text content, a target audio resource of the target text content includes:
determining a phoneme sequence corresponding to the target text content;
identifying text contents corresponding to different roles in the target text contents, and determining phoneme sequences corresponding to the text contents corresponding to the different roles in the target text contents;
synthesizing phoneme sequences corresponding to text contents corresponding to different roles in the target text contents based on tone types corresponding to the different roles in the target text contents to obtain audio resources corresponding to the different roles;
and obtaining the target audio resources corresponding to the target text content according to the audio resources corresponding to the different roles.
In one possible implementation manner, the identifying text content corresponding to different roles in the target text content includes:
identifying character names and keywords corresponding to different characters in the target text content, wherein the keywords are used for identifying the starting positions of the text content corresponding to the character names;
and determining the text content corresponding to the character name and the keyword as the text content corresponding to the character name.
In one possible implementation manner, before the obtaining the target audio resource of the target text content based on the tone type corresponding to the different roles in the target text content and the text content corresponding to the different roles in the target text content, the method further includes:
acquiring an audio configuration file, wherein the audio configuration file comprises a plurality of tone models, and each tone model comprises a plurality of tone types;
and determining tone types corresponding to each role in the target text content according to the audio configuration file.
In one possible implementation, the storing the target audio resource of the target text content includes:
determining the music meeting the target conditions in the music library as background music;
synthesizing the target audio resource and the background music to obtain a synthesized target audio resource;
and storing the synthesized target audio resource.
In one possible implementation manner, the determining the music meeting the target condition in the music library as the background music includes:
calculating the matching degree between the music in the music library and the target text content, and determining the music with the matching degree meeting the first target condition as background music;
Or, acquiring a ranking index of the music in the music library, and determining the music with the ranking index meeting the second target condition as background music.
In one possible implementation, the method further includes:
receiving a play request of the target audio resource, wherein the play request carries an identifier of the target text content;
acquiring the target text content and a target audio resource of the target text content based on the identification of the target text content;
displaying the target text content;
and playing the target audio resource of the target text content.
In another aspect, an embodiment of the present application provides a text content processing apparatus, including:
the receiving module is used for receiving a target audio resource acquisition request of target text content, wherein the target text content comprises text content of at least one role;
the acquisition module is used for acquiring target audio resources of the target text content based on tone types corresponding to different roles in the target text content and text contents corresponding to different roles in the target text content, wherein the target audio resources adopt different tone types to represent the text contents corresponding to different roles;
and the storage module is used for storing the target audio resources of the target text content.
In one possible implementation manner, the acquiring module is configured to send a target audio resource acquiring instruction to a target server in response to the target audio resource acquiring request, where the target audio resource acquiring instruction carries an identifier of the target text content;
and receiving a target audio resource returned by the target server, wherein the target audio resource is generated based on tone types corresponding to different roles in the target text content and text contents corresponding to different roles in the target text content, and the target audio resource is an audio resource corresponding to the identification of the target text content.
In one possible implementation manner, the obtaining module is configured to determine a phoneme sequence corresponding to the target text content;
identifying text contents corresponding to different roles in the target text contents, and determining phoneme sequences corresponding to the text contents corresponding to the different roles in the target text contents;
synthesizing phoneme sequences corresponding to text contents corresponding to different roles in the target text contents based on tone types corresponding to the different roles in the target text contents to obtain audio resources corresponding to the different roles;
and obtaining the target audio resources corresponding to the target text content according to the audio resources corresponding to the different roles.
In one possible implementation manner, the obtaining module is configured to identify a character name and a keyword corresponding to different characters in the target text content, where the keyword is used to identify a starting position of the text content corresponding to the character name;
and determining the text content corresponding to the character name and the keyword as the text content corresponding to the character name.
In one possible implementation, the obtaining module is further configured to obtain an audio configuration file, where the audio configuration file includes a plurality of tone models, and each tone model includes a plurality of tone types;
the apparatus further comprises:
and the determining module is used for determining tone types corresponding to each role in the target text content according to the audio configuration file.
In one possible implementation manner, the storage module is used for determining the music meeting the target condition in the music library as background music;
synthesizing the target audio resource and the background music to obtain a synthesized target audio resource;
and storing the synthesized target audio resource.
In a possible implementation manner, the storing module is configured to calculate a matching degree between the music in the music library and the target text content, and determine the music with the matching degree meeting the first target condition as background music;
Or, acquiring a ranking index of the music in the music library, and determining the music with the ranking index meeting the second target condition as background music.
In a possible implementation manner, the receiving module is further configured to receive a play request of the target audio resource, where the play request carries an identifier of the target text content;
the acquisition module is further used for acquiring the target text content and the target audio resource of the target text content based on the identification of the target text content;
the apparatus further comprises:
the display module is used for displaying the target text content;
and the playing module is used for playing the target audio resource of the target text content.
In another aspect, a computer device is provided that includes a processor and a memory having stored therein at least one piece of program code that is loaded and executed by the processor to implement any of the text content processing methods described above.
In another aspect, there is also provided a computer readable storage medium having stored therein at least one program code loaded and executed by a processor to implement any of the above text content processing methods.
The technical solutions provided by the embodiments of the application have at least the following beneficial effects:
According to the technical solutions provided by the embodiments of the application, the target audio resource of the target text content is acquired based on the tone types of the different roles in the target text content and the text content corresponding to those roles. The target audio resource not only preserves the target text content but also represents the text content of the different roles with different tone types, making the target text content more lifelike. Moreover, the processing of the target text content requires no user participation, which saves processing time and improves processing efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for describing the embodiments are briefly introduced below. The drawings described below are clearly only some embodiments of the present application, and a person skilled in the art may derive other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of an implementation environment of a text content processing method according to an embodiment of the present application;
FIG. 2 is a flowchart of a text content processing method according to an embodiment of the present application;
FIG. 3 is a flowchart of a text content processing method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a text content processing device according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a computer device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a target server according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
The relevant terms related to the present application are explained below:
the voice synthesis technology comprises the following steps: is a technique for generating artificial voice by a mechanical and electronic method. Text To Speech (TTS) technology is a subject of Speech synthesis technology, which converts Text into Speech output, allowing the machine To speak.
Natural Language Processing (NLP): an important direction in the fields of computer science and artificial intelligence. Natural language processing covers the theories and methods for enabling effective communication between humans and computers in natural language, that is, for enabling a machine to understand the meaning of what is said.
Fig. 1 is a schematic diagram of an implementation environment of a text content processing method according to an embodiment of the present application, referring to fig. 1, the implementation environment includes: a computer device 101 and a target server 102.
The computer device 101 may be at least one of a smart phone, a game console, a desktop computer, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, and a laptop portable computer. The computer device 101 may receive a target audio resource acquisition request triggered by a user, acquire the target audio resource of the target text content based on the tone types corresponding to the different roles in the target text content and the text content corresponding to those roles, and store the target audio resource of the target text content. The computer device 101 may also communicate with the target server 102 via a wired or wireless network, respond to a target audio resource acquisition request by sending a target audio resource acquisition instruction to the target server, and receive the target audio resource returned by the target server.
The computer device 101 may refer broadly to one of a plurality of computer devices, the present embodiment being illustrated by way of example only with the computer device 101. Those skilled in the art will appreciate that the number of computer devices described above may be greater or lesser. For example, the number of the computer devices may be only one, or the number of the computer devices may be tens or hundreds, or more, and the number and the device type of the computer devices are not limited in the embodiment of the present application.
The target server 102 may be one server, multiple servers, or at least one of a cloud computing platform and a virtualization center. The target server 102 may communicate with the computer device 101 through a wired network or a wireless network, and the target server 102 receives an audio resource acquisition instruction sent by the computer device 101, and responds to the audio resource acquisition instruction to acquire a target audio resource of the target text content, where the target audio resource is generated based on tone types corresponding to different roles in the target text content and text contents corresponding to different roles in the target text content. The target server 102 may also send the target audio resource of the target text content to the computer device 101. Alternatively, the number of the target servers may be greater or less, which is not limited by the embodiment of the present application. Of course, the target server 102 may also include other functional servers to provide more comprehensive and diverse services.
Based on the above implementation environment, the embodiment of the present application provides a text content processing method, taking the flowchart of the text content processing method provided by the embodiment of the present application shown in fig. 2 as an example, the method may be executed by the computer device 101 in fig. 1. As shown in fig. 2, the method comprises the steps of:
In step 201, a target audio resource acquisition request for target text content is received, the target text content comprising the text content of at least one role.
In the embodiment of the application, a plurality of text contents can be stored in the computer equipment, each text content can be dialogue contents among a plurality of roles or be a monologue of one role, and the number of the roles in the plurality of text contents stored in the computer equipment is not limited.
In one possible implementation, the user may click on one of the plurality of text contents, and the computer device determines the text content clicked by the user as the target text content in response to the click operation of the user. The computer device may also jump to a display interface of the target text content, where text content and play buttons corresponding to each character in the target text content may be displayed.
In a possible implementation manner, a user may click a play button corresponding to a target text content in a display interface of the target text content, and when detecting the click operation, the computer device may generate a target audio resource acquisition request of the target text content, where the audio resource acquisition request may carry an identifier of the target text content, and the identifier of the target text content may be a name of the target text content or a number of the target text content, which is not limited by the embodiment of the present application.
For example, three text contents, i.e., text content 1, text content 2, and text content 3, are stored in the computer device. The user clicks the text content 1 among the three text contents, and the computer device determines the text content 1 as the target text content in response to the clicking operation of the user. The user can click a play button in the display interface of the text content 1, and the computer equipment generates an audio resource acquisition request corresponding to the text content 1 according to the click operation, that is, the audio resource acquisition request for acquiring the target text content.
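For illustration, the following minimal Python sketch shows one way the computer device might represent the request generated by such a click operation. The class name AudioResourceRequest and the field text_id are assumptions made for this sketch; the embodiment only requires that the request carry the identifier (name or number) of the target text content.

```python
from dataclasses import dataclass

@dataclass
class AudioResourceRequest:
    # Identifier of the target text content (its name or number).
    text_id: str

def on_play_button_clicked(selected_text_id: str) -> AudioResourceRequest:
    # The click on the play button in the display interface of the target
    # text content triggers generation of the acquisition request.
    return AudioResourceRequest(text_id=selected_text_id)

request = on_play_button_clicked("text_content_1")
print(request)  # AudioResourceRequest(text_id='text_content_1')
```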
In step 202, a target audio resource of the target text content is obtained based on tone types corresponding to different roles in the target text content and text contents corresponding to different roles in the target text content, where the text contents corresponding to different roles are represented by different tone types in the target audio resource.
In an exemplary embodiment of the present application, acquiring the target audio resource of the target text content based on the tone types corresponding to the different roles in the target text content and the text content corresponding to those roles may include the following steps:
step 2021, determining a phoneme sequence corresponding to the target text content.
In the embodiment of the application, the computer device can convert text content generated by the computer device or input by a user into intelligible, fluent spoken output through TTS technology. By way of example, speech synthesis by TTS technology can be divided into three stages: text-to-phoneme conversion, frequency prediction, and audio synthesis. A phoneme is the smallest phonetic unit divided according to the natural properties of speech. From an acoustic standpoint, a phoneme is the smallest unit of speech distinguished by sound quality. From a physiological standpoint, one pronunciation action forms one phoneme. For example, [wo] contains the two pronunciation actions [w] and [o], i.e., two phonemes.
In one possible implementation, the correspondence between each word and its phonemes is stored in a phoneme dictionary. The computer device can obtain the target text content and look up the phonemes corresponding to each word of the target text content in the phoneme dictionary, thereby obtaining the phoneme sequence corresponding to the target text content.
It should be noted that, in the process of determining the phoneme sequence corresponding to the target text content, if the phoneme corresponding to a certain word cannot be found in the phoneme dictionary, the phoneme corresponding to that word may be obtained in other ways, which is not limited by the embodiment of the present application.
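As a minimal sketch of step 2021, the following Python fragment looks up each word of the target text content in a toy phoneme dictionary; the dictionary contents and the fallback hook are assumptions, and a real system would use a full pronunciation lexicon with, for example, a grapheme-to-phoneme model as the fallback.

```python
# Toy phoneme dictionary: word -> phonemes (illustrative entries only).
PHONEME_DICT = {
    "wo": ["w", "o"],   # [wo] consists of the two phonemes [w] and [o]
    "ni": ["n", "i"],
}

def text_to_phonemes(words, fallback=None):
    sequence = []
    for word in words:
        if word in PHONEME_DICT:
            sequence.extend(PHONEME_DICT[word])
        elif fallback is not None:
            # Word missing from the dictionary: query it another way,
            # as noted above (e.g. grapheme-to-phoneme prediction).
            sequence.extend(fallback(word))
    return sequence

print(text_to_phonemes(["wo", "ni"]))  # ['w', 'o', 'n', 'i']
```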
Step 2022, identifying text contents corresponding to different roles in the target text content, and determining phoneme sequences corresponding to the text contents corresponding to different roles in the target text content.
In one possible implementation manner, the process of identifying text content corresponding to different roles in the target text content is as follows: identifying character names and keywords corresponding to different characters in the target text content, wherein the keywords are used for identifying the starting positions of the text content corresponding to the character names; and determining the text content corresponding to the character name and the keyword as the text content corresponding to the character name.
In a possible implementation manner, the computer device may identify the target text content through semantic recognition, determine the character names corresponding to the different characters in the target text content, and identify the keyword corresponding to each character name within a reference number of characters after the character name, where the reference number of characters may be 3 characters or 5 characters, which is not limited by the embodiment of the present application. The keyword is used to identify the starting position of the text content corresponding to the character name. Illustratively, the keywords may be "says", "answers", "asks", and so on. In the embodiment of the present application, the keywords corresponding to the different roles of the target text content may also be other characters or words, which is not limited by the embodiment of the present application.
For example, the target text content includes character A and character B, and the dialogue between character A and character B is as follows:
Character A says: "Teacher, he is afraid of needles. Last time he got a vaccination shot, he said he wanted to become a pangolin."
Character B asks: "Why does he want to become a pangolin?"
Character A answers: "Because a pangolin's skin is so thick that a needle cannot get in."
Character B says with a laugh: "Hahaha!"
In the above example, the computer device can identify this piece of text content, determine that the character names appearing in it are character A and character B, and identify the keywords "says", "asks", and "answers" in the text following the character names.
In one possible implementation manner, the computer device may also identify the punctuation marks in the target text content that indicate spoken content, determine the content delimited by those punctuation marks as the text content corresponding to a character name, and determine the character name from the punctuation marks, for example, within a reference number of characters before and after the punctuation marks. The reference number of characters may be 3 characters or 5 characters, which is not limited by the embodiment of the present application. Illustratively, the character name may be determined within the 3 characters before and after the punctuation marks.
In one possible implementation, after identifying the character name and keyword, the computer device may determine the colon or opening quotation mark located after the character name and/or keyword as the start identifier of the text content corresponding to the character name, determine the matching closing quotation mark as the end identifier, and then determine the text between the start identifier and the end identifier as the text content corresponding to the character name. Optionally, the start identifier and the end identifier may be of other types, which is not limited by the embodiment of the present application.
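A minimal sketch of this identification step is shown below; the keyword list, the regular expression (character name, then a keyword, then a colon as start identifier and a closing quotation mark as end identifier), and the function name are all assumptions made for illustration.

```python
import re

KEYWORDS = ["says", "answers", "asks"]  # assumed keyword list

# Character name, then a keyword, then the start identifier (colon plus
# opening quote), the spoken text, and the end identifier (closing quote).
PATTERN = re.compile(
    r'(?P<name>[\w ]+?)\s+(?:' + '|'.join(KEYWORDS) + r')\s*:\s*"(?P<speech>[^"]*)"'
)

def extract_role_text(target_text):
    # Returns (character name, text content) pairs in order of appearance.
    return [(m.group("name").strip(), m.group("speech"))
            for m in PATTERN.finditer(target_text)]

dialogue = 'Character A says: "Teacher, he is afraid of needles." Character B asks: "Why?"'
print(extract_role_text(dialogue))
# [('Character A', 'Teacher, he is afraid of needles.'), ('Character B', 'Why?')]
```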
In a possible implementation manner, based on the phoneme sequence corresponding to the target text content determined in step 2021 and the text content corresponding to the different roles in the target text content obtained in step 2022, the phoneme sequences corresponding to the text content of the different roles are determined within the phoneme sequence corresponding to the target text content.
Step 2023, based on the tone types corresponding to the different roles in the target text content, synthesizes the phoneme sequences corresponding to the text content corresponding to the different roles in the target text content, and obtains the audio resources corresponding to the different roles.
In the embodiment of the application, based on tone types corresponding to different roles in the target text content, before synthesizing phoneme sequences corresponding to the text content corresponding to the different roles in the target text content, the tone types corresponding to the different roles in the target text content also need to be determined. The determination process of tone types corresponding to different roles is as follows:
Step one, acquiring an audio configuration file, where the audio configuration file includes a plurality of tone models, and each tone model includes a plurality of tone types.
In one possible implementation, a computer device may obtain various configuration files for a plurality of text contents, such as audio configuration files, from the provider of the text content. The audio configuration file includes a plurality of tone models, such as a male voice model, a female voice model, a child voice model, and the like. Each tone model includes a plurality of tone types; for example, the male voice model may include tone types such as a little-boy voice, a young man's voice, an uncle voice, an elderly voice, a deep male voice, and the like.
In one possible implementation, each text content and an audio profile corresponding to the text content are stored in a computer device. After receiving the target audio resource acquisition request of the target text content, the computer equipment analyzes the target audio resource acquisition request to obtain the identification of the target text content. And acquiring an audio configuration file corresponding to the target text content based on the identification of the target text content.
And step two, determining tone types corresponding to each role in the target text content according to the audio configuration file.
In one possible implementation, the computer device determines, according to an audio profile corresponding to the target text content, a plurality of tone types that occur in the audio profile. And determining the tone type corresponding to each role in the target text content according to the tone types and the roles in the target text content.
For example, if three roles appear in the target text content, then based on the audio configuration file it is determined that the tone types corresponding to the three roles are respectively: role 1 is a sunny male voice, role 2 is a sweet female voice, and role 3 is a young child's voice.
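The following sketch illustrates steps one and two with an assumed JSON layout for the audio configuration file; the field names roles, model, and tone_type are inventions of this sketch, not a format defined by the embodiment.

```python
import json

# Assumed audio configuration file: one tone model and tone type per role.
config_json = '''
{
  "roles": {
    "role 1": {"model": "male voice",   "tone_type": "sunny male voice"},
    "role 2": {"model": "female voice", "tone_type": "sweet female voice"},
    "role 3": {"model": "child voice",  "tone_type": "young child voice"}
  }
}
'''

def tone_types_for_roles(audio_profile):
    profile = json.loads(audio_profile)
    # Map each role appearing in the target text content to its tone type.
    return {name: entry["tone_type"] for name, entry in profile["roles"].items()}

print(tone_types_for_roles(config_json))
```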
In one possible implementation manner, based on the tone type corresponding to the different roles in the determined target text content, the phoneme sequence corresponding to the text content corresponding to the different roles determined in the step 2022 is synthesized, so as to obtain the audio resource corresponding to the text content of the different roles.
Step 2024, obtaining the target audio resources corresponding to the target text content according to the audio resources corresponding to the different roles.
In one possible implementation manner, the audio resources corresponding to the roles are synthesized according to the order of appearance of each role, that is, the order in which the text content corresponding to each role appears in the target text content, so as to obtain the target audio resource corresponding to the target text content.
Illustratively, the audio resources of each role obtained in step 2023 may be spliced according to the order of appearance of each role to obtain the target audio resource of the target text content. For example, for the text content shown in step 2022 above, according to the order of appearance of character A and character B, the first audio resource of character A, the first audio resource of character B, the second audio resource of character A, and the second audio resource of character B are spliced in that order, thereby obtaining the target audio resource.
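As a sketch of the splicing in step 2024, the following fragment concatenates per-role audio clips in their order of appearance using only Python's standard-library wave module; the file names are placeholders, and all clips are assumed to share the same sample rate, sample width, and channel count.

```python
import wave

def splice_wavs(clip_paths, out_path):
    frames, params = [], None
    for path in clip_paths:
        with wave.open(path, "rb") as clip:
            if params is None:
                params = clip.getparams()  # assume identical audio parameters
            frames.append(clip.readframes(clip.getnframes()))
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        for chunk in frames:
            out.writeframes(chunk)

# Order of appearance A, B, A, B, as in the dialogue example above.
splice_wavs(["a_1.wav", "b_1.wav", "a_2.wav", "b_2.wav"], "target_audio.wav")
```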
In step 203, a target audio resource of the target text content is saved.
In an embodiment of the present application, the computer device may save the target audio resource of the target text content obtained in step 202.
In one possible implementation manner, the computer device may further determine the music meeting the target condition in the music library as background music, synthesize the target audio resource obtained in the step 202 and the background music, obtain a synthesized target audio resource, and store the synthesized target audio resource.
In one possible implementation manner, the music meeting the target condition in the music library can be determined as background music in any one of the following implementation manners:
in the first implementation manner, the matching degree between the music in the music library and the target text content is calculated, and the music with the matching degree meeting the first target condition is determined to be background music.
In one possible implementation, the computer device may obtain all the music in its storage space and calculate the matching degree between each piece of music and the target text content in turn. The computer device may also sort the pieces of music by their matching degree, either from high to low or from low to high, and, according to the sorting result, determine the music whose matching degree satisfies the first target condition as the background music. For example, the music satisfying the first target condition may be the music with the highest matching degree, which is not limited by the embodiment of the present application.
In one possible implementation, the process of calculating the matching degree between the music in the music library and the target text content may be as follows: and inputting the target text content and the music in the music library into a target matching degree calculation model, and obtaining the matching degree between the music in the music library and the target text content based on the output result of the target matching degree calculation model.
For example, the target text content and music 1, music 2, and music 3 are input into the target matching degree calculation model, and the matching degree of music 1 is 85%, the matching degree of music 2 is 95%, and the matching degree of music 3 is 80% based on the output result of the target matching degree calculation model. Therefore, the music 2 can be regarded as background music conforming to the first target condition.
In one possible implementation manner, the target matching degree calculation model may be obtained as follows: the computer device may obtain the genre of each piece of music in the music library, which may be, for example, pop music, rock music, jazz music, or the like. The computer device may acquire a plurality of text contents and train an initial matching degree calculation model based on the plurality of text contents and the genre of each piece of music in the music library, so as to obtain a more accurate target matching degree calculation model.
The initial matching degree calculation model may be any type of neural network model, which is not limited in the embodiment of the present application. For example, the initial matching calculation model may be a trend value matching model (Price Sensitivity Measurement, PSM).
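The sketch below stands in for the trained matching degree calculation model with a simple bag-of-words cosine similarity between the target text content and assumed per-song tags; this substitution, and the tag data, are illustrative assumptions rather than the model described above.

```python
from collections import Counter
import math

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pick_background_music(target_text, music_tags):
    text_vec = Counter(target_text.lower().split())
    scores = {name: cosine(text_vec, Counter(tags))
              for name, tags in music_tags.items()}
    # Assumed first target condition: the highest matching degree wins.
    return max(scores, key=scores.get)

library = {
    "music 1": ["calm", "school", "childhood"],
    "music 2": ["rock", "loud"],
}
print(pick_background_music("a calm childhood story at school", library))  # music 1
```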
In the second implementation manner, the ranking index of the music in the music library is obtained, and the music with the ranking index meeting the second target condition is determined to be background music.
In one possible implementation, the computer device may obtain a music ranking list in the music library, determine, based on the music ranking list, music whose ranking index satisfies the second target condition as background music, e.g., the music whose ranking index satisfies the second target condition may be the music whose ranking index is highest.
In one possible implementation, the computer device may also obtain a ranking list of music in the music library, determine music with ranking indices within a reference number as candidate background music, and randomly determine a piece of music as background music among the candidate background music. The music with the ranking index within the reference number may be, for example, the music with the ranking index in the first 3 bits or the music with the ranking index in the first 5 bits, which is not limited by the embodiment of the present application.
For example, the music with the ranking index of the top 5 bits in the music library is determined as candidate background music, namely candidate background music 1, candidate background music 2, candidate background music 3, candidate background music 4 and candidate background music 5, and one candidate background music is randomly determined as background music in the 5 candidate background music, for example, candidate background music 3 is determined as background music.
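A minimal sketch of this second implementation, with an assumed ranking list and a reference number of 5:

```python
import random

def pick_by_ranking(ranking_list, reference_number=5):
    # Music whose ranking index is within the reference number becomes
    # candidate background music; one candidate is chosen at random.
    candidates = ranking_list[:reference_number]
    return random.choice(candidates)

chart = ["song 1", "song 2", "song 3", "song 4", "song 5", "song 6"]
print(pick_by_ranking(chart))  # e.g. "song 3"
```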
In the third implementation manner, in response to a determination operation by a user, music indicated by the determination operation is determined as background music.
In one possible implementation, the user may access a music library in the computer device, and the user may manually select a song in the music library according to his own preference, and after the computer device detects a determining operation by the user, determine the music indicated by the determining operation as background music.
It should be noted that, the computer device may select any of the above implementations to determine the background music in the music library, which is not limited by the embodiment of the present application.
In the embodiment of the application, when a user wants to listen to a target audio resource of a certain target text content, a play button can be clicked in a display interface of the target text content, that is, a play request of the target audio resource of the target text content is sent to the computer device, and the play request carries an identifier of the target text content. After receiving the play request of the target audio resource, the computer equipment analyzes the play request to obtain the identification of the target text content carried in the play request. And acquiring the target text content and the target audio resource of the target text content based on the identification of the target text content. The computer device may also play a target audio asset of the target text content. Of course, the computer device may also display the target text content while playing the target audio asset of the target text content.
In one possible implementation, the computer device may also add the target audio resource of the target text content to a VR scene or a video file, resulting in a new VR scene or video file. That is, if a video file is stored in the computer device, the target audio resource may be imported into the video file so that the video file has not only pictures but also sound, making it richer; this can, to a certain extent, improve the user's experience of watching the video file.
According to the method, the target audio resource of the target text content is acquired based on the tone types of the different roles in the target text content and the text content corresponding to those roles. The target audio resource not only preserves the target text content but also represents the text content of the different roles with different tone types, making the target text content more lifelike. Moreover, the processing of the target text content requires no user participation, which saves processing time and improves processing efficiency.
Fig. 3 is a flowchart of a text content processing method according to an embodiment of the present application, which is described in terms of interactions between a computer device 101 and a target server 102. Referring to fig. 3, the method includes:
In step 301, the computer device receives a target audio resource acquisition request for target text content, the target text content comprising the text content of at least one role.
In the embodiment of the present application, the process of receiving the target audio resource obtaining request of the target text content by the computer device is consistent with the process of step 201, and will not be described herein.
In step 302, the computer device sends a target audio resource acquisition instruction to the target server in response to the target audio resource acquisition request, the target audio resource acquisition instruction carrying an identification of the target text content.
In the embodiment of the present application, after receiving the target audio resource acquisition request in step 301 above, the computer device may directly send the target audio resource acquisition instruction to the target server, or it may send the instruction after receiving an acquisition request from the target server. The embodiment of the application does not limit when the target audio resource acquisition instruction is sent.
In step 303, the target server receives the target audio resource obtaining instruction, obtains a target audio resource of the target text content based on the target audio resource obtaining instruction, where the target audio resource is generated based on tone types corresponding to different roles in the target text content and text contents corresponding to different roles in the target text content, and the target audio resource is an audio resource corresponding to an identifier of the target text content.
In the embodiment of the present application, the process of obtaining the target audio resource of the target text content by the target server is consistent with the process of step 202, which is not described herein.
In step 304, the target server transmits the target audio resource of the target text content to the computer device.
In the embodiment of the present application, after the target server obtains the target audio resource of the target text content in the step 303, the target audio resource of the target text content may be directly sent to the computer device. The target audio resource of the target text content may also be sent to the computer device after receiving the audio resource acquisition request sent by the computer device. The embodiment of the application does not limit the sending time of the target audio resource of the target text content.
In step 305, the computer device receives the target audio resource of the target text content returned by the target server, and saves the target audio resource of the target text content.
In the embodiment of the present application, the process of saving the target audio resource of the target text content by the computer device is consistent with the process of step 203, which is not described herein.
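The exchange of steps 302 to 305 could look like the following sketch on the computer-device side; the HTTP transport, endpoint path, and query parameter are assumptions for illustration, since the embodiment specifies only that the instruction carries the identifier of the target text content and that the server returns the generated target audio resource.

```python
import urllib.request

def fetch_target_audio(server_url, text_id, out_path):
    # Step 302: send the acquisition instruction carrying the identifier.
    url = f"{server_url}/audio_resource?text_id={text_id}"
    # Steps 303-304: the target server generates or looks up the target
    # audio resource and returns it.
    with urllib.request.urlopen(url) as resp:
        audio_bytes = resp.read()
    # Step 305: the computer device saves the returned target audio resource.
    with open(out_path, "wb") as f:
        f.write(audio_bytes)
```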
According to the method, the target audio resource of the target text content is acquired based on the tone types of the different roles in the target text content and the text content corresponding to those roles. The target audio resource not only preserves the target text content but also represents the text content of the different roles with different tone types, making the target text content more lifelike. Moreover, the processing of the target text content requires no user participation, which saves processing time and improves processing efficiency.
Fig. 4 is a schematic structural diagram of a text content processing device according to an embodiment of the present application, where, as shown in fig. 4, the device includes:
a receiving module 401, configured to receive a target audio resource acquisition request of target text content, where the target text content includes text content of at least one character;
an obtaining module 402, configured to obtain a target audio resource of the target text content based on tone types corresponding to different roles in the target text content and text contents corresponding to different roles in the target text content, where the target audio resource uses different tone types to represent text contents corresponding to different roles;
a saving module 403, configured to save the target audio resource of the target text content.
In a possible implementation manner, the obtaining module 402 is configured to send, to a target server, a target audio resource obtaining instruction in response to the target audio resource obtaining request, where the target audio resource obtaining instruction carries an identifier of the target text content;
and receiving a target audio resource returned by the target server, wherein the target audio resource is generated based on tone types corresponding to different roles in the target text content and text contents corresponding to different roles in the target text content, and the target audio resource is an audio resource corresponding to the identification of the target text content.
In one possible implementation, the obtaining module 402 is configured to determine a phoneme sequence corresponding to the target text content;
identifying text contents corresponding to different roles in the target text contents, and determining phoneme sequences corresponding to the text contents corresponding to the different roles in the target text contents;
synthesizing phoneme sequences corresponding to text contents corresponding to different roles in the target text contents based on tone types corresponding to the different roles in the target text contents to obtain audio resources corresponding to the different roles;
and obtaining the target audio resources corresponding to the target text content according to the audio resources corresponding to the different roles.
In a possible implementation manner, the obtaining module 402 is configured to identify a character name and a keyword corresponding to different characters in the target text content, where the keyword is used to identify a starting position of the text content corresponding to the character name;
and determining the text content corresponding to the character name and the keyword as the text content corresponding to the character name.
In one possible implementation, the obtaining module 402 is further configured to obtain an audio configuration file, where the audio configuration file includes a plurality of tone models, and each tone model includes a plurality of tone types;
The apparatus further comprises:
and the determining module is used for determining tone types corresponding to each role in the target text content according to the audio configuration file.
In a possible implementation manner, the storing module 403 is configured to determine, as background music, music in the music library that meets the target condition;
synthesizing the target audio resource and the background music to obtain a synthesized target audio resource;
and storing the synthesized target audio resource.
In a possible implementation manner, the saving module 403 is configured to calculate a matching degree between the music in the music library and the target text content, and determine the music with the matching degree meeting the first target condition as the background music;
or, acquiring a ranking index of the music in the music library, and determining the music with the ranking index meeting the second target condition as background music.
In a possible implementation manner, the receiving module 401 is further configured to receive a play request of the target audio resource, where the play request carries an identifier of the target text content;
the obtaining module 402 is further configured to obtain the target text content and a target audio resource of the target text content based on the identification of the target text content;
The apparatus further comprises:
the display module is used for displaying the target text content;
and the playing module is used for playing the target audio resource of the target text content.
The device acquires the target audio resource of the target text content based on the tone types of the different roles in the target text content and the text content corresponding to those roles. The target audio resource not only preserves the target text content but also expresses the text content of the different roles with different tone types, making the target text content more lifelike. Moreover, the processing of the target text content requires no user participation, which saves processing time and improves processing efficiency.
It should be noted that: in the text content processing device provided in the above embodiment, only the division of the above functional modules is used for illustration, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the text content processing device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the text content processing device and the text content processing method provided in the foregoing embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments, which are not repeated herein.
Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device 500 may be a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The computer device 500 may also be referred to by other names, such as user device, portable computer device, laptop computer device, or desktop computer device.
In general, the computer device 500 includes: one or more processors 501 and one or more memories 502.
Processor 501 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 501 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 501 may also include a main processor and a coprocessor. The main processor, also referred to as a CPU (Central Processing Unit), is a processor for processing data in the awake state; the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 501 may integrate a GPU (Graphics Processing Unit) for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 501 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 502 may include one or more computer-readable storage media, which may be non-transitory. Memory 502 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 502 is used to store at least one program code for execution by processor 501 to implement the text content processing method provided by the method embodiments of the present application.
In some embodiments, the computer device 500 may optionally further include a peripheral interface 503 and at least one peripheral. The processor 501, the memory 502, and the peripheral interface 503 may be connected by buses or signal lines. Each peripheral may be connected to the peripheral interface 503 by a bus, a signal line, or a circuit board. Specifically, the peripherals include at least one of a radio frequency circuit 504, a display screen 505, a camera assembly 506, an audio circuit 507, a positioning assembly 508, and a power supply 509.
The peripheral interface 503 may be used to connect at least one Input/Output (I/O) related peripheral to the processor 501 and the memory 502. In some embodiments, the processor 501, the memory 502, and the peripheral interface 503 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 501, the memory 502, and the peripheral interface 503 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 504 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 504 communicates with a communication network and other communication devices via electromagnetic signals, converting an electrical signal into an electromagnetic signal for transmission, or converting a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 504 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 504 may communicate with other computer devices via at least one wireless communication protocol, including but not limited to metropolitan area networks, the various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 504 may also include NFC (Near Field Communication) related circuitry, which is not limited in the present application.
The display screen 505 is used to display a UI (User Interface), which may include graphics, text, icons, video, and any combination thereof. When the display screen 505 is a touch display, it can also collect touch signals on or above its surface; such a touch signal may be input to the processor 501 as a control signal for processing, and the display screen 505 may then also provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 505, provided on the front panel of the computer device 500; in other embodiments, there may be at least two display screens 505, respectively disposed on different surfaces of the computer device 500 or in a folded design; in still other embodiments, the display screen 505 may be a flexible display disposed on a curved or folded surface of the computer device 500. The display screen 505 may even be arranged in a non-rectangular irregular pattern, that is, an irregularly shaped screen. The display screen 505 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other materials.
The camera assembly 506 is used to capture images or video. Optionally, the camera assembly 506 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the computer device and the rear camera is disposed on the rear surface. In some embodiments, there are at least two rear cameras, each of which is any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera can be fused with the depth-of-field camera to implement a background blurring function, or fused with the wide-angle camera to implement panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, the camera assembly 506 may also include a flash, which may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash combines a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.
The audio circuit 507 may include a microphone and a speaker. The microphone collects sound waves from the user and the environment, converts them into electrical signals, and inputs them to the processor 501 for processing or to the radio frequency circuit 504 for voice communication. For stereo acquisition or noise reduction, a plurality of microphones may be provided at different locations of the computer device 500; the microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker converts electrical signals from the processor 501 or the radio frequency circuit 504 into sound waves, and may be a conventional thin-film speaker or a piezoelectric ceramic speaker. A piezoelectric ceramic speaker can convert an electrical signal not only into sound waves audible to humans but also into sound waves inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 507 may also include a headphone jack.
The positioning assembly 508 is used to locate the current geographic position of the computer device 500 to enable navigation or LBS (Location Based Service). The positioning assembly 508 may be based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 509 is used to supply power to the various components in the computer device 500, and may be an alternating-current supply, a direct-current supply, a disposable battery, or a rechargeable battery. When the power supply 509 includes a rechargeable battery, the battery may support wired or wireless charging, and may also support fast-charging technology.
In some embodiments, the computer device 500 further includes one or more sensors 510. The one or more sensors 510 include, but are not limited to: an acceleration sensor 511, a gyro sensor 512, a pressure sensor 513, a fingerprint sensor 514, an optical sensor 515, and a proximity sensor 516.
The acceleration sensor 511 can detect the magnitudes of acceleration on the three coordinate axes of the coordinate system established with the computer device 500; for example, it may be used to detect the components of gravitational acceleration on the three coordinate axes. The processor 501 may control the display screen 505 to display the user interface in a landscape or portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 511. The acceleration sensor 511 may also be used to collect game or user motion data.
The gyro sensor 512 can detect the body direction and rotation angle of the computer device 500, and may cooperate with the acceleration sensor 511 to collect the user's 3D motion on the computer device 500. Based on the data collected by the gyro sensor 512, the processor 501 may implement functions such as motion sensing (for example, changing the UI according to a tilting operation by the user), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 513 may be disposed on a side frame of the computer device 500 and/or a lower layer of the display screen 505. When disposed on the side frame, it can detect the user's grip signal on the computer device 500, and the processor 501 performs left/right-hand recognition or quick operations based on that grip signal. When disposed at the lower layer of the display screen 505, the processor 501 controls operability controls on the UI according to the user's pressure operations on the display screen 505. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 514 is used to collect the user's fingerprint; either the processor 501 identifies the user's identity from the fingerprint collected by the fingerprint sensor 514, or the fingerprint sensor 514 itself identifies the user's identity from the collected fingerprint. Upon recognizing that the user's identity is trusted, the processor 501 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 514 may be provided on the front, back, or side of the computer device 500; when a physical key or vendor logo is provided on the computer device 500, the fingerprint sensor 514 may be integrated with it.
The optical sensor 515 is used to collect the ambient light intensity. In one embodiment, the processor 501 may control the display brightness of the display screen 505 based on the ambient light intensity collected by the optical sensor 515: when the ambient light intensity is high, the display brightness of the display screen 505 is turned up; when it is low, the display brightness is turned down. In another embodiment, the processor 501 may also dynamically adjust the shooting parameters of the camera assembly 506 based on the ambient light intensity collected by the optical sensor 515.
The proximity sensor 516, also referred to as a distance sensor, is typically provided on the front panel of the computer device 500 and is used to collect the distance between the user and the front of the device. In one embodiment, when the proximity sensor 516 detects that this distance gradually decreases, the processor 501 controls the display screen 505 to switch from the screen-on state to the screen-off state; when the distance gradually increases, the processor 501 controls the display screen 505 to switch from the screen-off state back to the screen-on state.
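As a hedged illustration of the light- and proximity-driven display control just described (the `display` object, its methods, and the thresholds are all assumptions, not APIs from the embodiment):

```python
def on_sensor_update(ambient_lux: float, distance_cm: float, display) -> None:
    """Toy control loop: map ambient light to display brightness and
    switch the screen off when the user is close to the front panel."""
    # Clamp ambient light into a 0.1..1.0 brightness level (assumed mapping).
    display.set_brightness(min(max(ambient_lux / 1000.0, 0.1), 1.0))
    if distance_cm < 5.0:          # assumed proximity threshold
        display.screen_off()
    else:
        display.screen_on()
```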
Those skilled in the art will appreciate that the structure shown in Fig. 5 does not constitute a limitation on the computer device 500, which may include more or fewer components than shown, combine certain components, or employ a different arrangement of components.
Fig. 6 is a schematic structural diagram of a target server according to an embodiment of the present application. The target server 600 may vary considerably with configuration or performance, and may include one or more processors (Central Processing Units, CPUs) 601 and one or more memories 602, where at least one program code is stored in the one or more memories 602 and is loaded and executed by the one or more processors 601 to implement the text content processing method provided by each of the method embodiments described above. Of course, the target server 600 may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing device functions, which are not described here.
In an exemplary embodiment, a computer-readable storage medium is also provided, in which at least one program code is stored, the program code being loaded and executed by a processor of a computer device to implement any of the text content processing methods described above.
Optionally, the above computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
It should be understood that references herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate that A exists alone, that both A and B exist, or that B exists alone. The character "/" generally indicates an "or" relationship between the associated objects.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
The foregoing is merely illustrative of the present application and is not intended to limit it; any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall be included within the protection scope of the present application.

Claims (7)

1. A method of text content processing, the method comprising:
receiving a target audio resource acquisition request of target text content, wherein the target text content comprises text content of at least one role;
performing semantic recognition on the target text content to obtain role names corresponding to different roles in the target text content;
determining, within a reference number of characters following each role name, a keyword corresponding to the role name, wherein the keyword is used to identify the starting position of the text content corresponding to the role name;
determining a left double quotation mark located after the keyword as a start identifier of the text content corresponding to the role name, determining a right double quotation mark as an end identifier of the text content corresponding to the role name, and determining the text content located between the start identifier and the end identifier as the text content corresponding to the role name;
determining, according to the phoneme sequence corresponding to the target text content, the phoneme sequences corresponding to the text content of the different roles in the target text content;
determining a tone type corresponding to each role in the target text content, wherein the tone types corresponding to different roles are different;
synthesizing the phoneme sequences corresponding to the text content of the different roles according to the tone types corresponding to those roles in the target text content, to obtain audio resources corresponding to the text content of the different roles;
splicing the audio resources corresponding to the different roles according to the order of appearance of the text content of the different roles in the target text content, to obtain the target audio resource corresponding to the target text content;
calculating a matching degree between music in a music library and the target text content and determining music whose matching degree meets a first target condition as background music; or acquiring a ranking index of the music in the music library and determining music whose ranking index meets a second target condition as the background music; or, in response to a determination operation of a user, determining the music indicated by the determination operation as the background music;
synthesizing the target audio resource and the background music to obtain a synthesized target audio resource;
and storing the synthesized target audio resource.
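Purely as an illustration of the extraction steps in claim 1 (a sketch, not claim language), the code below assumes the keyword is a colon, that role names have already been obtained by semantic recognition, and that dialogue is delimited by curly double quotation marks:

```python
import re
from typing import List, Tuple

def extract_role_text(content: str, role_names: List[str],
                      reference_chars: int = 4) -> List[Tuple[str, str]]:
    """For each occurrence of a role name, look for the keyword (a colon
    here) within `reference_chars` characters after the name, then take the
    text between the next left and right double quotation marks."""
    pattern = "|".join(map(re.escape, role_names))
    results = []
    for match in re.finditer(pattern, content):
        tail = content[match.end():match.end() + reference_chars]
        if ":" not in tail and "：" not in tail:
            continue  # no keyword close behind the name: not a line of dialogue
        start = content.find("\u201c", match.end())   # left double quote “
        stop = content.find("\u201d", start + 1)      # right double quote ”
        if start != -1 and stop != -1:
            results.append((match.group(), content[start + 1:stop]))
    return results

# e.g. extract_role_text("小明说：“你好。”", ["小明"]) -> [("小明", "你好。")]
```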
2. The method according to claim 1, wherein the method further comprises:
responding to the target audio resource acquisition request, and sending a target audio resource acquisition instruction to a target server, wherein the target audio resource acquisition instruction carries the identification of the target text content;
and receiving a target audio resource returned by the target server, wherein the target audio resource is generated based on tone types corresponding to different roles in the target text content and text contents corresponding to different roles in the target text content, and the target audio resource is an audio resource corresponding to the identification of the target text content.
3. The method of claim 1, wherein the determining a tone type corresponding to each character in the target text content comprises:
acquiring an audio configuration file, wherein the audio configuration file comprises a plurality of tone models, and each tone model comprises a plurality of tone types;
and determining tone types corresponding to each role in the target text content according to the audio configuration file.
4. The method according to claim 1, wherein the method further comprises:
receiving a play request of the target audio resource, wherein the play request carries an identifier of the target text content;
acquiring the target text content and a target audio resource of the target text content based on the identification of the target text content;
displaying the target text content;
and playing the target audio resource of the target text content.
5. A text content processing apparatus, the apparatus comprising:
the receiving module is used for receiving a target audio resource acquisition request of target text content, wherein the target text content comprises text content of at least one role;
the acquisition module is used for performing semantic recognition on the target text content to obtain role names corresponding to different roles in the target text content; determining, within a reference number of characters following each role name, a keyword corresponding to the role name, wherein the keyword is used to identify the starting position of the text content corresponding to the role name; determining a left double quotation mark located after the keyword as a start identifier of the text content corresponding to the role name, determining a right double quotation mark as an end identifier of the text content corresponding to the role name, and determining the text content located between the start identifier and the end identifier as the text content corresponding to the role name; determining, according to the phoneme sequence corresponding to the target text content, the phoneme sequences corresponding to the text content of the different roles in the target text content; determining a tone type corresponding to each role in the target text content, wherein the tone types corresponding to different roles are different; synthesizing the phoneme sequences corresponding to the text content of the different roles according to the tone types corresponding to those roles, to obtain audio resources corresponding to the text content of the different roles; and splicing the audio resources corresponding to the different roles according to the order of appearance of the text content of the different roles in the target text content, to obtain the target audio resource corresponding to the target text content;
the storage module is used for calculating a matching degree between music in a music library and the target text content and determining music whose matching degree meets a first target condition as background music, or acquiring a ranking index of the music in the music library and determining music whose ranking index meets a second target condition as the background music, or determining, in response to a determination operation of a user, the music indicated by the determination operation as the background music; synthesizing the target audio resource and the background music to obtain a synthesized target audio resource; and storing the synthesized target audio resource.
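For illustration only, the three background-music selection alternatives of the storage module might be sketched as follows; the scoring functions and thresholds are assumptions introduced for the sketch, not the patent's method:

```python
from typing import Callable, List, Optional

def pick_background_music(library: List[str],
                          match_score: Callable[[str], float],    # hypothetical matching degree
                          ranking_index: Callable[[str], float],  # hypothetical ranking metric
                          user_choice: Optional[str] = None,
                          min_match: float = 0.8,                 # assumed "first target condition"
                          min_rank: float = 0.9                   # assumed "second target condition"
                          ) -> Optional[str]:
    """Pick background music via one of three routes: the user's explicit
    choice, the best text/music match above a threshold, or the
    best-ranked music above a threshold."""
    if user_choice is not None:
        return user_choice
    matched = [m for m in library if match_score(m) >= min_match]
    if matched:
        return max(matched, key=match_score)
    ranked = [m for m in library if ranking_index(m) >= min_rank]
    return max(ranked, key=ranking_index) if ranked else None
```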
6. A computer device comprising a processor and a memory, wherein the memory has stored therein at least one piece of program code that is loaded and executed by the processor to implement the text content processing method of any of claims 1 to 4.
7. A computer readable storage medium having stored therein at least one program code, the at least one program code being loaded and executed by a processor to implement the text content processing method of any of claims 1 to 4.
CN202010209314.8A 2020-03-23 2020-03-23 Text content processing method, device, computer equipment and storage medium Active CN111428079B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010209314.8A CN111428079B (en) 2020-03-23 2020-03-23 Text content processing method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111428079A CN111428079A (en) 2020-07-17
CN111428079B true CN111428079B (en) 2023-11-28

Family

ID=71549558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010209314.8A Active CN111428079B (en) 2020-03-23 2020-03-23 Text content processing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111428079B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590467B (en) * 2021-06-30 2023-07-21 平安健康保险股份有限公司 Data comparison method, system, computer device and computer readable storage medium
CN113658458B (en) * 2021-08-20 2024-02-13 北京得间科技有限公司 Reading processing method, computing device and storage medium for dialogue novels

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109523986A (en) * 2018-12-20 2019-03-26 百度在线网络技术(北京)有限公司 Phoneme synthesizing method, device, equipment and storage medium
CN109658916A (en) * 2018-12-19 2019-04-19 腾讯科技(深圳)有限公司 Phoneme synthesizing method, device, storage medium and computer equipment
CN109726309A (en) * 2018-11-22 2019-05-07 百度在线网络技术(北京)有限公司 Audio generation method, device and storage medium
CN109979430A (en) * 2017-12-28 2019-07-05 深圳市优必选科技有限公司 A kind of method, apparatus that robot tells a story, robot and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10319375B2 (en) * 2016-12-28 2019-06-11 Amazon Technologies, Inc. Audio message extraction

Also Published As

Publication number Publication date
CN111428079A (en) 2020-07-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant