CN117690407A - Text information processing method, device, equipment and storage medium - Google Patents

Text information processing method, device, equipment and storage medium

Info

Publication number
CN117690407A
CN117690407A
Authority
CN
China
Prior art keywords
text
audio
information
dialogue
literary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311691691.XA
Other languages
Chinese (zh)
Inventor
黄杰雄
高阳升
谭家俊
缪晓鋆
李剑扬
江景敏
轩晓光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Kugou Computer Technology Co Ltd filed Critical Guangzhou Kugou Computer Technology Co Ltd
Priority to CN202311691691.XA priority Critical patent/CN117690407A/en
Publication of CN117690407A publication Critical patent/CN117690407A/en
Pending legal-status Critical Current

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a text information processing method, device, equipment and storage medium, belonging to the technical field of natural language processing. The method comprises the following steps: acquiring a literary work, wherein the literary work is natural language information comprising a plurality of characters; determining dialogue text and bystander text in the literary work, wherein the dialogue text is text with which at least one character initiates communication, and the bystander text is the information in the literary work other than the dialogue text; generating first audio in a speech synthesis manner based on the bystander text; generating second audio in a recording manner based on the dialogue text; and splicing the first audio and the second audio to obtain broadcast audio corresponding to the literary work. Dialogue information and bystander information are determined in the text information; the rich emotional expression of the dialogue text is fully taken into account, and obtaining the second audio by recording ensures the fullness of the emotional expression in the second audio corresponding to the dialogue text, so that both the fullness of emotional expression in the broadcast audio and its generation efficiency are achieved.

Description

Text information processing method, device, equipment and storage medium
Technical Field
The present invention relates to the field of natural language technologies, and in particular, to a method, an apparatus, a device, and a storage medium for processing text information.
Background
With the development of internet technology, users are increasingly listening to novel contents in the form of audio books.
In the related art, an artificial neural network can be used to convert the text of a novel into audio, thereby generating an audio book of the novel.
However, the above manner of generating an audio book of a novel is relatively limited.
Disclosure of Invention
The application provides a text information processing method, device, equipment and storage medium, wherein the technical scheme is as follows:
according to an aspect of the present application, there is provided a method for processing text information, the method including:
acquiring a literary work, wherein the literary work is natural language information comprising a plurality of characters;
determining dialogue text and bystander text in the literary work, wherein the dialogue text is text with which at least one role initiates communication, and the bystander text is the information in the literary work other than the dialogue text;
generating first audio in a speech synthesis manner based on the bystander text; and generating second audio in a recording manner based on the dialogue text;
and splicing the first audio and the second audio to obtain the broadcasting audio corresponding to the literary work.
According to another aspect of the present application, there is provided a text information processing apparatus including:
the apparatus comprises an acquisition module, a processing module and a generation module, wherein the acquisition module is used for acquiring a literary work which is natural language information comprising a plurality of characters;
the processing module is used for determining dialogue text and bystander text in the literary work, wherein the dialogue text is text initiated by at least one role, and the bystander text is the information in the literary work other than the dialogue text;
the generation module is used for generating first audio in a speech synthesis manner based on the bystander text; and generating second audio in a recording manner based on the dialogue text;
and the processing module is also used for splicing the first audio and the second audio to obtain the broadcasting audio corresponding to the literary works.
According to another aspect of the present application, there is provided a computer apparatus including a processor and a memory having stored therein at least one instruction, at least one program, a set of codes or a set of instructions, the at least one instruction, the at least one program, the set of codes or the set of instructions being loaded and executed by the processor to implement the method of processing text information as described in the above aspect.
According to another aspect of the present application, there is provided a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes or a set of instructions, the at least one instruction, the at least one program, the set of codes or the set of instructions being loaded and executed by a processor to implement the method of processing text information as described in the above aspect.
According to another aspect of the present application, there is provided a computer program product comprising computer instructions stored in a computer readable storage medium, from which a processor reads and executes the computer instructions to implement the method of processing text information as described in the above aspect.
The beneficial effects brought by the technical solution provided by the application include at least the following:
By determining dialogue information and bystander information in the text information, different types of text content in the text information are distinguished. The rich emotional expression of dialogue text is fully taken into account, and obtaining the second audio by recording ensures the fullness of the emotional expression in the second audio corresponding to the dialogue text. The bystander text is text content describing events and plot, and generating the first audio quickly by speech synthesis improves the efficiency of audio acquisition. The ways of generating broadcast audio corresponding to text information are thus enriched, and both the fullness of emotional expression in the broadcast audio and its generation efficiency are achieved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a computer system provided in an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of a method for processing text information provided in an exemplary embodiment of the present application;
FIG. 3 is a flow chart of a method of processing text information provided by an exemplary embodiment of the present application;
FIG. 4 is a flowchart of a method for processing text information provided by an exemplary embodiment of the present application;
FIG. 5 is a flowchart of a method for processing text information provided by an exemplary embodiment of the present application;
FIG. 6 is a flowchart of a method for processing text information provided by an exemplary embodiment of the present application;
FIG. 7 is a flowchart of a method for processing text information provided by an exemplary embodiment of the present application;
FIG. 8 is a flowchart of a method for processing text information provided by an exemplary embodiment of the present application;
FIG. 9 is a block diagram of a text message processing apparatus provided in an exemplary embodiment of the present application;
fig. 10 is a block diagram of a server according to an exemplary embodiment of the present application.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to comply with the related laws and regulations and standards of the related countries and regions. For example, information such as literary works referred to in this application is obtained with sufficient authorization.
It should be understood that, although the terms first, second, etc. may be used in this disclosure to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first parameter may also be referred to as a second parameter, and similarly, a second parameter may also be referred to as a first parameter, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
FIG. 1 illustrates a schematic diagram of a computer system provided in one embodiment of the present application. The computer system may be implemented as a system architecture for the text information processing method. The computer system may include: a terminal 100 and a server 200.
The terminal 100 may be an electronic device such as a mobile phone, a tablet computer, a vehicle-mounted terminal, a wearable device, or a PC (Personal Computer). The terminal 100 may be provided with a client for running a target application, which may be a text information processing application or another application provided with a text information processing function; this is not limited in this application. In addition, the form of the target application is not limited in this application: it may be, but is not limited to, an App (Application) installed in the terminal 100, an applet, or even a web page.
The server 200 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. The server 200 may be a background server of the target application program, and is configured to provide a background service for a client of the target application program.
According to the text information processing method provided by the embodiments of the application, each step may be executed by a computer device, where a computer device refers to an electronic device with data computing, processing and storage capabilities. Taking the implementation environment of the solution shown in fig. 1 as an example, the text information processing method may be executed by the terminal 100 (for example, by the client of the target application program installed and running in the terminal 100), by the server 200, or by the terminal 100 and the server 200 in interactive cooperation, which is not limited in this application.
In addition, the technical solution of the present application can be combined with blockchain technology. For example, some of the data involved in the disclosed text information processing method may be saved on a blockchain. Communication between the terminal 100 and the server 200 may be performed through a network, such as a wired or wireless network.
Fig. 2 is a schematic diagram of a method for processing text information according to an embodiment of the present application.
Acquiring a literary work 300, the literary work 300 being natural language information including a plurality of characters; illustratively, the literary composition 300 is a partial or complete segment of a novel text;
invoking the natural language model 310 to perform information recognition on the literary work 300 to obtain at least one sentence tag 311, wherein the at least one sentence tag 311 corresponds to at least one natural sentence in the literary work 300 one by one, and the sentence tag 311 divides the literary work 300 into a bystander text 301 and a dialogue text 302;
if the first natural sentence corresponding to the sentence tag 311 is the dialogue text 302, dialogue identifiers such as label-role are added at the starting position and the ending position of the first natural sentence in the literary work 300; in this embodiment, the first natural sentence attributed to the dialogue text 302 is: "This matter must be thoroughly investigated!". Accordingly, the other natural sentences to which no dialogue identifier is added are attributed to the bystander text 301. The dialogue text is text initiated by at least one role, and the bystander text is the information in the literary work other than the dialogue text;
The natural language model 310 is invoked to perform speaking character recognition on the context of the dialog text 302, so as to obtain a speaking character 313 of the dialog text, wherein the speaking character 313 corresponds to character information including at least one of gender, age and character characteristics. In one example, the character information is obtained by invoking the natural language model 310 to perform character recognition on the literary work 300.
A first recording party 321 is screened and determined from the plurality of candidate recording parties 320 according to the character information corresponding to the speaking role 313, wherein the first recording party 321 satisfies at least one of the following conditions: the timbre gender of the first recording party 321 is the same as the gender of the speaking role 313, the timbre age of the first recording party 321 is the same as the age of the speaking role 313, and the timbre characteristics of the first recording party 321 are the same as the character characteristics of the speaking role 313.
Second audio 402 of the first recording party 321 reading the dialogue text 302 aloud is obtained in a recording manner; and speech synthesis is performed on the bystander text 301 to generate first audio 401.
Invoking the natural language model 310 performs text segmentation on the literary work 300 resulting in a paragraph tag 312 and adding a paragraph identifier, such as 00003-00004, in the literary work 300 based on the paragraph tag 312. In this embodiment, based on text segmentation, dialog text and bystander text are assigned to different paragraphs;
The naming information of the first audio 401 and the naming information of the second audio 402 are determined based on the paragraph tag 312. Specifically, the naming information of the first audio 401 carries a first paragraph identifier corresponding to the bystander text 301, and the naming information of the second audio 402 carries a second paragraph identifier corresponding to the dialog text 302.
Invoking the natural language model 310 to perform sound effect prediction on the bystander text 301 in the literary work 300 to obtain a sound effect name 314 for the bystander text 301; the sound effect name 314 has corresponding sound effect audio 403. The sound effect audio 403 may be audio found in an existing sound effect library, or may be audio predicted based on the sound effect name 314, which will be described separately by way of example below. The sound effect audio 403 is superimposed on the first audio 401, adding a sound effect to the first audio 401. For example, for the fifth sentence of the literary work, "At this time, a staff member, after a moment of silence, slapped the table and said the sentence very seriously", a table-slapping sound effect is superimposed on the audio of this bystander text.
Based on the naming information of the first audio 401 and the second audio 402, the first audio 401 and the second audio 402 are spliced into corresponding broadcast audio 410 of the literary work 300.
Invoking the natural language model 310 to perform scene prediction on the literature 300 to obtain scene description information 315; the natural language model 310 extracts scene information implicitly carried in the literary work 300, and outputs scene description information 315 for the scene in a natural language manner.
And calling the image prediction model 350 to perform image prediction on the scene description information 315 to obtain an illustration image 420 of the literary work 300, so as to realize visual presentation of the literary work.
Fig. 3 is a flowchart illustrating a method for processing text information according to an exemplary embodiment of the present application. The method may be performed by a computer device. The method comprises the following steps:
step 510: obtaining a literary work;
the literary work is natural language information including a plurality of characters; the literary work may be any text segment or the whole text of an article, a novel, a newspaper or any other text form. The embodiments of the present application generally use novel text as an example, but this does not preclude applying the embodiments to the processing of other text works.
Step 520: determining dialogue text and bystander text in literary works;
illustratively, the literary work comprises a plurality of natural sentences, and the dialogue text and the bystander text are two mutually independent parts in the literary work; illustratively, the dialog text includes at least one natural sentence and the bypass text includes at least one natural sentence.
Illustratively, the dialogue text is text with which at least one role initiates communication, the role being a character described in the literary work, such as at least one of a person, an anthropomorphic animal, a cartoon character, a personified virtual object, and the like. Illustratively, the bystander text is the information in the literary work other than the dialogue text. Further, the bystander text is used to describe at least one of a scene, character characteristics, character relationships, event plot, and the like in the literary work. Further, the dialogue text is text that initiates communication in the first person, and the bystander text is descriptive text from a third-person bystander perspective.
Illustratively, the dialog text and the bystander text in the literary works can be predicted based on an artificial neural network (Artificial Neural Networks, ANN) or classified based on format information of different natural sentences in the literary works. This application is not limited in this regard.
Step 530: generating a first audio in a speech synthesis manner based on the bystander; and generating a second audio in a recorded manner based on the dialog text;
illustratively, speech synthesis is a synthesis technique that converts a literary work into audio information based on an artificial neural network model, so as to simulate read-aloud audio of the bystander text. Illustratively, the pronunciation of each character in the bystander text is analyzed, and the bystander text is converted into first audio matching the bystander text; the first audio is the read-aloud audio of the bystander text predicted by the artificial neural network model.
Illustratively, the second audio is the audio of a recording party reading the dialogue text aloud. Illustratively, the speaking character corresponding to the dialogue text is dubbed by the recording party, which may be a voice actor.
Step 540: splicing the first audio and the second audio to obtain broadcasting audio corresponding to the literary works;
the broadcast audio is the read-aloud audio of the literary work, presenting the text content of the literary work in audio form, such as an audio book (also called a voice album, an audio album, etc.) of a novel.
The first audio and the second audio are spliced based on the positions of the dialogue text and the bystander text in the literary works, so that the word order of the broadcast audio is ensured to be the same as the word order of the literary works.
In summary, according to the method provided by this embodiment, dialogue information and bystander information are determined in the text information, thereby distinguishing different types of text content in the text information; the rich emotional expression of dialogue text is fully taken into account, and obtaining the second audio by recording ensures the fullness of the emotional expression in the second audio corresponding to the dialogue text; the bystander text is text content describing events and plot, and generating the first audio quickly by speech synthesis improves the efficiency of audio acquisition; the ways of generating broadcast audio corresponding to the text information are thus enriched, and both the fullness of emotional expression in the broadcast audio and its generation efficiency are achieved.
Fig. 4 is a flowchart illustrating a method for processing text information according to an exemplary embodiment of the present application. The method may be performed by a computer device. That is, in the embodiment shown in fig. 3, step 520 may be implemented as steps 522, 524:
step 522: calling a natural language model to perform information identification on the literary works to obtain at least one statement tag;
at least one sentence label corresponds to at least one natural sentence in the literary works one by one, and the sentence label is used for indicating that the corresponding natural sentence belongs to a dialogue text or a bystander text; illustratively, some or all of the natural sentences in the literary work correspond to sentence tags. In one example, sentence tags are used to indicate that the corresponding natural sentence is attributed to dialogue text and that the natural sentence without sentence tags is attributed to bystander text.
Illustratively, the natural language model is an artificial neural network with dialogue text and bystander text classification capabilities. In one example, the natural language model is a model trained from tasks based on dialogue text and bystander classification; in another example, the natural language model is a large language model (Large Language Model, LLM), and the LLM model is instructed to perform tasks of dialogue text and bystander text classification based on the task prompt text, and information recognition is performed on literary works input to the LLM model, resulting in at least one sentence tag.
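A minimal sketch of how such prompt-driven sentence tagging could look is given below; the `llm_complete` callable, the prompt wording and the label strings are illustrative assumptions, not the model interface actually used in this application.

```python
# Hypothetical sketch: tag each natural sentence of a literary work as dialogue
# text or bystander text with a prompted language model. `llm_complete` stands
# in for whatever LLM interface is actually available.
from typing import Callable, List

TASK_PROMPT = (
    "For each numbered sentence below, answer 'dialogue' if it is speech "
    "initiated by a character, otherwise answer 'bystander'. "
    "Return exactly one label per line, in order."
)

def tag_sentences(sentences: List[str], llm_complete: Callable[[str], str]) -> List[str]:
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(sentences, 1))
    reply = llm_complete(TASK_PROMPT + "\n\n" + numbered)
    # One sentence tag per natural sentence, in the original order.
    return [line.strip().lower() for line in reply.splitlines() if line.strip()]
```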
Step 524: based on at least one sentence label, adding an identifier into the literary works to obtain a dialogue text and a bystander text;
illustratively, the identifier is used to indicate the location of the dialog text and/or the bypass text in the literary work. In one example, an identifier is inserted at any location of a natural sentence in a literary work, indicating that the natural sentence is attributed to dialogue text or bystander text. To avoid the insertion of identifiers that would disrupt the semantics of the natural language sentence, the identifiers are typically inserted at the beginning and/or ending positions of the natural language sentence.
In an alternative implementation, this step may be implemented as at least one of the following two sub-steps:
in the case where the first sentence tag indicates that the corresponding first natural sentence is a dialogue text, adding a dialogue identifier to a start position and an end position of the first natural sentence in the literary work;
in the case where the second sentence tag indicates that the corresponding second natural sentence is a bystander, a bystander identifier is added to the starting position and the ending position of the second natural sentence in the literary work.
Taking a dialogue identifier as an example, the dialogue identifier is </label-role>; the dialogue identifier is inserted before the first character of the first natural sentence, and the dialogue identifier is inserted after the end symbol (such as a period) of the first natural sentence.
In one example, if at least two adjacent natural sentences both belong to a dialog text, a dialog identifier may be inserted before the first character of the first sentence in the at least two adjacent natural sentences and after the end symbol of the last sentence in the at least two adjacent natural sentences. A plurality of natural sentences are identified that are adjacent based on the two dialog identifiers.
Illustratively, by adding identifiers to the literary work, the dialogue text and the bystander text of different types are obtained in the literary work based on whether an identifier is present or on different identifier styles. In one example, the two sub-steps are performed alternatively, and in the literary work the dialogue text and the bystander text are distinguished based on whether an identifier is present. In another example, the two sub-steps are performed together, and in the literary work the dialogue text and the bystander text are distinguished based on the difference between the dialogue identifier style and the bystander identifier style.
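A minimal sketch of this identifier-insertion sub-step, assuming the sentence tags from step 522 and the "</label-role>" identifier style used in this embodiment, might look as follows; adjacent dialogue sentences share one pair of identifiers, as described above.

```python
# Hypothetical sketch: add dialogue identifiers at the start and end of the
# natural sentences tagged as dialogue text; sentences left unmarked are
# attributed to the bystander text.
from typing import List

def add_dialogue_identifiers(sentences: List[str], labels: List[str],
                             tag: str = "</label-role>") -> str:
    pieces: List[str] = []
    in_dialogue = False
    for sentence, label in zip(sentences, labels):
        if label == "dialogue" and not in_dialogue:
            pieces.append(tag)          # identifier before the first character
            in_dialogue = True
        elif label != "dialogue" and in_dialogue:
            pieces.append(tag)          # identifier after the end symbol
            in_dialogue = False
        pieces.append(sentence)
    if in_dialogue:
        pieces.append(tag)
    return "".join(pieces)
```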
In summary, according to the method provided by this embodiment, the text information is marked by adding identifiers, thereby distinguishing different types of text content in the text information; the rich emotional expression of dialogue text is fully taken into account, and obtaining the second audio by recording ensures the fullness of the emotional expression in the second audio corresponding to the dialogue text; the bystander text is text content describing events and plot, and generating the first audio quickly by speech synthesis improves the efficiency of audio acquisition; the ways of generating broadcast audio corresponding to the text information are thus enriched, and both the fullness of emotional expression in the broadcast audio and its generation efficiency are achieved.
Fig. 5 shows a flowchart of a method for processing text information according to an exemplary embodiment of the present application. The method may be performed by a computer device. I.e. on the basis of the embodiment shown in fig. 3, further comprising step 525, step 526, step 527, step 530 may be implemented as step 530a:
step 525: calling a natural language model to execute character recognition on the literary works to obtain character information of at least one character in the literary works;
illustratively, the literary composition includes at least one character that participates in an event described in the literary composition. Similar to the above, the character in the literature may be at least one of a character, an anthropomorphic animal, a cartoon character, a dummied avatar, and the like.
Character information of a role describes global features extracted from the text information. Exemplary character information includes at least one of the following: character name, gender, age, and character characteristics. In one example, taking the role of a fictional virtual character as an example, the character information describing the features of the virtual character that hold globally throughout the literary work may be explicitly recorded in the literary work, or may be information implicitly carried in the literary work and extracted from it. For example, the literary work describes: "Zhang San has just graduated and goes on a graduation trip with his good friends from the dormitory." This content implicitly carries that the gender of the virtual character Zhang San is male and his age is young. If the dialogue text of Zhang San takes up little space in the literary work compared with that of other virtual characters, the character characteristic of Zhang San in the literary work is taciturn.
Illustratively, the natural language model is an artificial neural network with character recognition capabilities. In one example, the natural language model is a model trained based on tasks of character recognition; in another example, the natural language model is a LLM model, and the LLM model is instructed to perform a task of character recognition based on the task prompt text, and information recognition is performed on literary works input to the LLM model, resulting in character information of at least one character. For example, the task prompt text is used for finding out characters appearing in the following text, and summarizing the character's character name, sex, age and character characteristics.
Step 526: calling a natural language model to execute speaking role recognition on the context of the dialogue text to obtain the speaking role of the dialogue text;
the context of the dialogue text illustratively includes at least one natural sentence preceding and/or following the dialogue text. The context of the dialogue text carries the role name of the speaking role and carries the semantics of the speaking role initiating communication; for example: "Zhang San asked earnestly", "Zhang San wrote slowly on the letter paper", "Zhang San shouted hurriedly", etc.
Illustratively, the natural language model is an artificial neural network with speech character recognition capabilities. In one example, the natural language model is a LLM model. For example, the natural language model can extract semantics of the context of the dialog text, determine whether semantics of the talk character initiation communication exist, and extract talk characters described in the context of the dialog text if semantics of the initiation communication exist. Illustratively, at least one of the roles in the literary composition obtained in step 525 includes a speaking role.
Further, the LLM model in the present embodiment and the LLM model in step 522 have the same model parameters.
Step 527: invoking a natural language model to execute emotion recognition on the dialogue text to obtain a dialogue emotion label of the dialogue text;
illustratively, the dialogue emotion tag indicates the emotion of the dialogue text, such as describing the emotion of the speaking character when communicating based on the dialogue text. Illustratively, the emotion tag is extracted from the dialogue text. In some examples, obtaining the emotion tag also requires performing emotion recognition on the context of the dialogue text, such as performing emotion recognition on the dialogue text together with its context.
Illustratively, the emotion tag obtained by emotion recognition and the character information above are mutually independent labels and usually have no mutually constraining relationship. For example, a role may be a taciturn virtual character, while the emotion tags of its dialogue text in different contexts may be happy, sad or excited, without restricting one another.
Illustratively, the natural language model is an artificial neural network with emotion recognition capability. In one example, the natural language model is an LLM model. Illustratively, the natural language model can extract the semantics of the dialogue text and summarize the dialogue emotion tag of the dialogue text. In one example, when the dialogue text is long, the first recording party corresponding to the second audio can quickly learn the emotional state of the dialogue text based on the emotion tag, so as to adjust the emotion when recording the second audio, read the dialogue text with the corresponding emotion, and quickly enter the speaking role; the recording efficiency of the second audio is thereby improved.
Illustratively, the audio semantic content of the second audio is dialog text and the dialog emotion tag provides an emotion reference for the second audio recording process. Further, the dialogue text carries dialogue emotion labels, and emotion references are provided for a first recording party by showing the dialogue emotion labels to the first recording party for recording the second audio.
For example, the execution timing between any two steps in steps 525, 526, 527 in this embodiment is not limited, and any two steps in the three steps may be executed sequentially in any manner, or may be executed simultaneously; in the case where at least two steps are performed simultaneously, there may be a plurality of natural language models connected in parallel to achieve the simultaneous invocation of the plurality of natural language models.
Step 530a: generating first audio in a speech synthesis manner based on the bystander text; and obtaining, in a recording manner, second audio of the first recording party reading the dialogue text aloud;
for an exemplary description of the first audio, please refer to step 530 above.
The second audio is an audio of the first recording party reading the dialogue text, and in this embodiment, the first recording party satisfies at least one of the following: the gender of the tone color of the first recording party is the same as the gender of the speaking character, the age of the tone color of the first recording party is the same as the age of the speaking character, and the tone color characteristic of the first recording party is the same as the character characteristic of the speaking character.
Illustratively, the first recording party may be a voice actor, and the timbre characteristics describe the sound characteristics of the first recording party in terms of timbre gender, timbre age and timbre traits. It will be appreciated that the first recording party may correspond to one or more timbre ages, one or more timbre characteristics, and one or both timbre genders. This application is not limited in this regard.
In an alternative implementation, before step 530a, further includes:
determining target recording parties corresponding to each character one by one in a plurality of candidate recording parties based on character information of at least one character in literary works;
and searching a first recording party corresponding to the speaking role from the plurality of candidate recording parties based on the speaking role.
Illustratively, each role is provided with a unique target recording party, so that the dialogue text initiated by a role in the literary work is dubbed by a unique recording party, ensuring timbre consistency for that role's dialogue text. Illustratively, one recording party may correspond to one or more roles, which is not limited in this application.
Illustratively, the target recording party corresponding one-to-one to each role satisfies at least one of the following: the timbre gender of the target recording party is the same as the gender of the role, the timbre age of the target recording party is the same as the age of the role, and the timbre characteristics of the target recording party are the same as the character characteristics of the role.
Illustratively, the plurality of candidate recording parties includes the first recording party; based on the association between the candidate recording parties and the roles in the literary work, the speaking role is looked up to obtain the first recording party that has an association relationship with the speaking role.
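As a sketch of the screening described above, the following hypothetical routine scores each candidate recording party against the speaking role's character information; the field names and scoring rule are assumptions for illustration only.

```python
# Hypothetical sketch: screen a first recording party from candidate recording
# parties by matching timbre attributes against the speaking role's character
# information. At least one matching condition is required, as described above.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Character:
    name: str
    gender: str
    age: str           # e.g. "young", "middle-aged"
    traits: str        # character characteristics, e.g. "taciturn"

@dataclass
class Recorder:
    name: str
    timbre_gender: str
    timbre_age: str
    timbre_traits: str

def pick_recorder(character: Character, candidates: List[Recorder]) -> Optional[Recorder]:
    def score(r: Recorder) -> int:
        return (int(r.timbre_gender == character.gender)
                + int(r.timbre_age == character.age)
                + int(r.timbre_traits == character.traits))
    best = max(candidates, key=score, default=None)
    return best if best is not None and score(best) >= 1 else None
```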
In summary, according to the method provided by this embodiment, dialogue information and bystander information are determined in the text information, thereby distinguishing different types of text content in the text information; the rich emotional expression of dialogue text is fully taken into account, the second audio is obtained by recording, and the first recording party of the second audio has the same gender, age and timbre characteristics as the speaking object, which ensures the fullness of the emotional expression in the second audio corresponding to the dialogue text and allows the first recording party to provide an immersive timbre for dubbing the speaking role; the bystander text is text content describing events and plot, and generating the first audio quickly by speech synthesis improves the efficiency of audio acquisition; the ways of generating broadcast audio corresponding to the text information are thus enriched, and both the fullness of emotional expression in the broadcast audio and its generation efficiency are achieved.
In an alternative example, on the basis of the embodiment shown in fig. 5, the following two steps are further included:
Invoking the natural language model to perform bystander viewpoint recognition on the bystander text to obtain the description role of the bystander text;
illustratively, the natural language model is an artificial neural network with bystander viewpoint recognition capability. In one example, the natural language model is an LLM model. By way of example, the bystander text may be a viewpoint of a role in the literary work; for example, the bystander text includes: "The teacher looked at Zhang San's exercise and thought that this student's learning attitude was very good, but his understanding of the knowledge points was not deep enough and he needed to keep practicing." The above bystander text is a viewpoint on the event of Zhang San's studying, from the perspective of the teacher role. In the case where the bystander text is a viewpoint of a role in the literary work, the description role is that role's name in the literary work.
Illustratively, the natural language model may perform viewpoint recognition using the bystander text together with at least one of the context of the bystander text and the adjacent chapters of the literary work. Extracting the description role from more text information avoids the problem that the description role cannot be identified when the current chapter of the literary work does not directly carry the description role.
In one example, the first chapter of the literary work describes a teacher whom the students like, a teacher who left a deep impression on the students when teaching ancient poems in class. The bystander text of the current chapter records only an ancient poem, and the description role of the bystander text cannot be determined from the current chapter alone. The natural language model performs bystander viewpoint recognition on the preceding and following chapters of the literary work together with the bystander text, so that the description role of the bystander text can be identified as the teacher the students like; reading the audio corresponding to the ancient poem with that teacher's timbre can provide an immersive audio environment and convey the students' remembrance.
By way of example, the bystander text may also be text used only to describe the event plot; in this case the bystander text is an objective description of the plot of the literary work and is not a specific viewpoint from a role's perspective. In the case where the bystander text is used only to describe the event plot, the description role is null.
Modifying the first audio based on the timbre of the second audio in the case that the descriptive character and the speaking character are the same;
Illustratively, in the case where the description role and the speaking role are the same, the second audio is recorded audio provided by the recording party with rich emotional expression. Correcting the first audio based on the timbre of the second audio ensures that the dialogue text of a role in the literary work and the bystander text expressing that role's viewpoint have the same timbre, which further improves the immersion of the broadcast audio of the literary work.
Further, due to the different properties of the bystander text and the dialogue text, the dialogue text often carries a character's emotion, such as "Zhang San shouted", "Zhang San said slowly", "Zhang San called out hurriedly", etc., carrying different emotions in different contexts. The bystander information is a role's viewpoint, and its different text contents do not differ much in tone or emotion. It can be seen that the emotional information carried by the bystander text is less than that carried by the dialogue text. Therefore, correcting the speech-synthesized first audio avoids the problem that the recording party would need to record a large amount of audio, which would affect the generation efficiency of the broadcast audio.
In summary, according to the method provided by this embodiment, dialogue information and bystander information are determined in the text information, thereby distinguishing different types of text content in the text information; the bystander text is text content describing events and plot, and the first audio is generated quickly by speech synthesis; the first audio is corrected according to the timbre of the recording party, ensuring that a viewpoint from the description role's perspective and the dialogue text have the same timbre, which further improves the immersion of the broadcast audio of the literary work and improves the efficiency of audio acquisition; the ways of generating broadcast audio corresponding to the text information are thus enriched, and both the fullness of emotional expression in the broadcast audio and its generation efficiency are achieved.
Fig. 6 shows a flowchart of a method for processing text information according to an exemplary embodiment of the present application. The method may be performed by a computer device. I.e. on the basis of the embodiment shown in fig. 3, further comprising step 515, step 535, step 536, step 540 may be implemented as step 542:
step 515: calling a natural language model to execute text segmentation on the literary works to obtain paragraph labels, and adding paragraph identifiers based on the paragraph labels in the literary works;
illustratively, the natural language model is an artificial neural network with text segmentation capability. In one example, the natural language model is an LLM model. Illustratively, the natural language model separates different natural sentences into different paragraphs, wherein the dialogue text and the bystander text belong to different paragraphs. Illustratively, a paragraph tag is used to indicate the end position of at least one natural sentence; the literary work is segmented at the end position of the natural sentence indicated by the paragraph tag, and a paragraph identifier is added.
In one example, in connection with the corresponding embodiment of fig. 2, the literary work includes the following characters:
"chapter three: you look at the thief-!
At this point a group of workers is looking at the picture coming from the living room. All people are looking tightly at the screen. It is well known to the theft shown in video. "this thing must be looked up exactly to-! "at this time, a worker, after having silenced for a while, very serious uttered the sentence. "
Adding paragraph identifiers based on paragraph labels in literary works to obtain:
"00003-00000 |$" third chapter $: you look at the thief-!
00003-00001 the group of staff is now looking at the picture coming from the live broadcasting room.
00003-00002| all closely look at the screen.
00003-00003| are very clear to the theft shown in the video.
00003-00004| "this must be strictly looked up to the bottom-! "
00003-00005. At this time, a worker slams the table after a while, and very serious speaks the sentence. "
In one example, a paragraph identifier is used to indicate the sequential paragraph number of a natural sentence within the literary work; further, a paragraph identifier also indicates the chapter number of the literary work. In one example, one paragraph identifier in the literary work is "00003-00004", where the first half "00003" indicates that the natural sentence belongs to the third chapter of the literary work, the second half "00004" indicates that the sequential paragraph number of the natural sentence is the fourth paragraph, and the symbol "|" indicates the end of the paragraph identifier. The symbols "$ $" indicate that the paragraph is the chapter name of the literary work.
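Under the identifier format described above (five-digit chapter number, five-digit sequential paragraph number, "|" as terminator, and "$...$" around the chapter name), a minimal sketch of adding paragraph identifiers could look like this; treating paragraph 0 as the chapter name is an assumed convention for illustration.

```python
# Hypothetical sketch: prefix each paragraph with an identifier such as
# "00003-00004|". The chapter name is additionally wrapped in "$...$".
from typing import List

def add_paragraph_identifiers(paragraphs: List[str], chapter_no: int) -> List[str]:
    out = []
    for idx, paragraph in enumerate(paragraphs):
        identifier = f"{chapter_no:05d}-{idx:05d}|"
        if idx == 0:
            paragraph = f"${paragraph}$"   # paragraph 0 is assumed to be the chapter name
        out.append(identifier + paragraph)
    return out
```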
Step 535: constructing a text pronunciation sequence of the dialogue text based on the dialogue text;
illustratively, the pronunciation sequence is used for indicating the pronunciation mode of the dialogue text in the first language; characters included in the dialog text are assigned to a first language. And constructing a text pronunciation sequence by searching the pronunciation mode of each character in the dialogue text in the first language.
In an alternative implementation, step 535 may be implemented as:
performing text regularization on the dialogue text, converting at least one of digital information, date information and currency information in the dialogue text into a first language text, and constructing a text pronunciation sequence of the first language text;
Illustratively, at least one of the digital information, the date information and the currency information has different pronunciation modes in different languages; this information is converted into first-language text through text regularization, thereby determining the pronunciation of the digital information, the date information and the currency information.
In one example, the dialogue text includes: "On 2023/5/8 I have a 50% probability of getting up at 6:30", and the first-language text obtained by performing text regularization includes: "On May the eighth, two thousand and twenty-three, I have a fifty percent probability of getting up at half past six", where the first-language text corresponds to Chinese.
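A toy sketch of step 535 is shown below; the digit table and the pronunciation lexicon are illustrative stand-ins (a real system would also expand dates and currency and use a full grapheme-to-phoneme lexicon for the first language).

```python
# Hypothetical sketch: regularize digits into first-language words, then build
# the text pronunciation sequence by looking up each character of the
# regularized text in a pronunciation lexicon.
from typing import Dict, List

DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def regularize(text: str) -> str:
    # Toy text regularization: spell out each digit as a word.
    return "".join(DIGIT_WORDS[ch] + " " if ch in DIGIT_WORDS else ch for ch in text)

def text_pronunciation_sequence(text: str, lexicon: Dict[str, str]) -> List[str]:
    regularized = regularize(text)
    # One pronunciation unit per character found in the lexicon.
    return [lexicon[ch] for ch in regularized if ch in lexicon]
```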
Step 536: calling an acoustic model to execute pronunciation splitting on the second audio to obtain pronunciation units corresponding to each audio frame in the second audio, and constructing an audio pronunciation sequence;
illustratively, the acoustic model has a pronunciation splitting capability, in one example, the acoustic model frames the second audio in sequence, and calculates a pronunciation unit for each audio frame in the second audio. Each audio frame illustratively corresponds to a duration of 10ms. The constructed audio pronunciation sequence is used for indicating the pronunciation mode of each audio frame in the second audio.
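A sketch of this framing step under the assumptions above (10 ms frames, one pronunciation unit per frame) could look like this; `acoustic_model` is a stand-in for the model actually invoked.

```python
# Hypothetical sketch: split the second audio into 10 ms frames and map each
# frame to a pronunciation unit with an acoustic model.
from typing import Callable, List, Sequence

def audio_pronunciation_sequence(samples: Sequence[float], sample_rate: int,
                                 acoustic_model: Callable[[Sequence[float]], str],
                                 frame_ms: int = 10) -> List[str]:
    frame_len = max(int(sample_rate * frame_ms / 1000), 1)
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    return [acoustic_model(frame) for frame in frames]
```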
Illustratively, step 535 and step 536 are used to audit the recorded second audio; under the condition that the difference between the text pronunciation sequence and the audio pronunciation sequence is smaller than a preset threshold, the second audio passes the auditing, and step 542 is executed to realize the splicing of the broadcasting audio; that is, in this embodiment, the broadcast audio is obtained based on the first audio and the second audio when the difference between the text-to-sound sequence and the audio-to-sound sequence is smaller than the preset threshold.
For example, in the case where the difference between the text pronunciation sequence and the audio pronunciation sequence is not less than the preset threshold, the second audio is not a reliable reading of the dialogue text and the audit is not passed. Further, since dialogue texts differ in length, the difference between the text pronunciation sequence and the audio pronunciation sequence is measured as a relative proportion between the two sequences, so as to avoid the influence of the dialogue text length and improve the auditing accuracy.
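The audit itself could be sketched as a relative difference between the two sequences compared against the preset threshold; the edit-distance measure and the threshold value below are assumptions for illustration.

```python
# Hypothetical sketch: measure the difference between the text pronunciation
# sequence and the audio pronunciation sequence as an edit distance divided by
# the text sequence length, and pass the audit only below a preset threshold.
from typing import Sequence

def edit_distance(a: Sequence, b: Sequence) -> int:
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[len(b)]

def audit_passes(text_seq: Sequence, audio_seq: Sequence, threshold: float = 0.1) -> bool:
    diff = edit_distance(text_seq, audio_seq) / max(len(text_seq), 1)
    return diff < threshold
```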
Step 542: splicing the first audio and the second audio into broadcasting audio based on naming information of the first audio and the second audio;
illustratively, the naming information of the first audio carries a first paragraph identifier corresponding to the bystander text, and the naming information of the second audio carries a second paragraph identifier corresponding to the dialogue text. The first paragraph identifier, the second paragraph identifier indicate a literal order in which the bypass text and the dialog text make up the literary work.
According to the sequence of the first paragraph identifier or the second paragraph identifier carried in the naming information from small to large, the first audio and the second audio are arranged, and the broadcasting audio is obtained by splicing, so that the word sequence of the broadcasting audio is ensured to be the same as the word sequence of the literary works.
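A minimal sketch of this splicing, assuming the clips are WAV files with identical audio parameters and file names that carry the paragraph identifier (e.g. "00003-00004.wav"), is given below.

```python
# Hypothetical sketch: order the audio clips by the paragraph identifier in
# their file names and concatenate them into the broadcast audio, so that the
# word order of the broadcast audio matches the literary work.
import wave
from typing import List

def splice(paths: List[str], out_path: str) -> str:
    ordered = sorted(paths)  # "00003-00004.wav"-style names sort into text order
    with wave.open(out_path, "wb") as out:
        for i, path in enumerate(ordered):
            with wave.open(path, "rb") as clip:
                if i == 0:
                    out.setparams(clip.getparams())
                out.writeframes(clip.readframes(clip.getnframes()))
    return out_path
```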
Illustratively, the first audio corresponds to one or more natural sentences; in one example, where the first audio corresponds to a plurality of natural sentences, the plurality of natural sentences are adjacent in the literary work. Similarly, the second audio corresponds to one or more natural sentences, which will not be described in detail.
In one example, when the first audio and the second audio are spliced, the first audio and the second audio have undergone volume balancing. Taking the first audio as an example, voiced segments in the first audio are detected against a reference volume threshold (e.g., segments below -30 dB are regarded as silence). In each voiced segment, the peak energy is mapped to a preset volume normalization value (ranging from 0 to 1, such as 0.5), and the other audio points in the voiced segment vary with the ratio of their energy to the peak energy. For example, an audio segment contains N audio points T1, T2, T3, ..., Tn, the volume-peak audio point is T100 with peak energy 0.8, and the preset volume normalization value is 0.5. The volume value Vn of every other point is adjusted as: Vn = Tn / 0.8 × 0.5.
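The per-segment peak normalization described above might be sketched as follows, assuming float samples in [-1, 1]; the -30 dB silence threshold and 0.5 normalization value are taken from the example.

```python
# Hypothetical sketch of volume balancing: detect voiced segments against a
# reference threshold, then scale every sample in a voiced segment by the
# ratio of the target normalization value to the segment's peak, i.e.
# Vn = Tn / peak * 0.5.
import numpy as np

def balance_volume(samples: np.ndarray, silence_db: float = -30.0,
                   target_peak: float = 0.5) -> np.ndarray:
    out = samples.astype(np.float64).copy()
    threshold = 10 ** (silence_db / 20)          # amplitude corresponding to -30 dB
    voiced = np.abs(out) > threshold
    # Boundaries of contiguous voiced / silent segments.
    edges = np.flatnonzero(np.diff(voiced.astype(int)))
    bounds = np.concatenate(([0], edges + 1, [len(out)]))
    for start, end in zip(bounds[:-1], bounds[1:]):
        if voiced[start:end].any():              # normalize only voiced segments
            peak = np.abs(out[start:end]).max()
            if peak > 0:
                out[start:end] = out[start:end] / peak * target_peak
    return out
```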
In one example, when the first audio and the second audio are spliced, the first audio and the second audio are dereverberated dry sound files. Illustratively, by invoking a dereverberation neural network, dereverberation processing is performed on the first audio and the second audio, respectively, resulting in a corresponding dry sound file.
It should be noted that, steps 515 and 542 in the present application may be combined into a new embodiment in combination with steps 510 to 530 in fig. 3 to be implemented separately, which is not limited in the present application.
In summary, according to the method provided by this embodiment, dialogue information and bystander information are determined in the text information, thereby distinguishing different types of text content in the text information; the rich emotional expression of dialogue text is fully taken into account, the second audio is obtained by recording, and the first audio is generated quickly by speech synthesis, improving the efficiency of audio acquisition; the ways of generating broadcast audio corresponding to the text information are thus enriched, and both the fullness of emotional expression in the broadcast audio and its generation efficiency are achieved; and the paragraph numbers of the dialogue text and the bystander text are used as references for splicing the first audio and the second audio, ensuring that the word order of the broadcast audio is the same as the word order of the literary work.
Fig. 7 is a flowchart illustrating a method for processing text information according to an exemplary embodiment of the present application. The method may be performed by a computer device. I.e. on the basis of the embodiment shown in fig. 3, further comprises a step 528, a step 537:
Step 528: calling the natural language model to perform sound effect prediction on the bystander text, and marking a sound effect name in front of the bystander text;
illustratively, the natural language model is an artificial neural network with sound effect prediction capabilities. In one example, the natural language model is a LLM model. For example, the natural language model can convert words or sentences possibly sounding in the bystander text into descriptive words of sound according to the semantics of the bystander text, and then the sound effect name is obtained. Illustratively, the sound effect name is marked in front of the side text to prompt the sound effect corresponding to the side text.
In one example, in connection with the embodiment corresponding to fig. 2, the bystander text includes: "At this time, a staff member, after a moment of silence, slapped the table and said the sentence very seriously", where the bystander text contains the phrase "slapped the table", which may make a sound. The sound effect tag in front of the bystander text carries the sound effect name, and the sound effect tag is: <label-effect effect="table slap" time=(10.20, 15.00)>, where label-effect indicates that the tag type is a sound effect tag, effect="table slap" indicates that the sound effect name is a table slap, and time=(10.20, 15.00) indicates that the insertion timestamp of the sound effect is from 10.20 seconds to 15.00 seconds.
Step 537: superposing the sound effect audio corresponding to the sound effect name in the first audio;
illustratively, the sound effect name corresponds to a piece of sound effect audio; in one example, the sound effect audio is retrieved from an existing audio library based on the sound effect name; the existing audio library comprises at least two existing sound effect audios, and each existing sound effect audio has a labeled name. The sound effect name predicted by the natural language model is either the same as, or semantically equivalent to, the labeled name of the sound effect audio in the existing audio library.
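As an illustration of this retrieval step only, the sketch below matches a predicted sound effect name against the labeled names in an existing library, first by exact match and otherwise by embedding similarity; the sentence-transformers model name and the toy library are assumptions of this sketch, not components specified by the embodiment.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed sentence-embedding model

def find_effect_audio(predicted_name: str, library: dict) -> str:
    """Return the path of the library audio whose labeled name best matches predicted_name.
    library maps a labeled sound effect name to the path of an existing audio file."""
    if predicted_name in library:          # identical name: direct hit
        return library[predicted_name]
    names = list(library.keys())           # otherwise: pick the semantically closest label
    sims = util.cos_sim(encoder.encode([predicted_name], convert_to_tensor=True),
                        encoder.encode(names, convert_to_tensor=True))[0]
    return library[names[int(sims.argmax())]]

library = {"slam the table": "effects/slam_table.wav", "door creak": "effects/door_creak.wav"}
print(find_effect_audio("banging on a desk", library))   # likely effects/slam_table.wav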
In an alternative implementation, where the sound effect audio is generated by prediction, step 537 may be implemented as:
performing text encoding on the sound effect name, and extracting the semantic features of the sound effect name;
sequentially performing audio encoding and audio decoding on the semantic features to obtain the sound effect audio corresponding to the sound effect name;
wherein the semantic features are hidden-layer feature representations of the sound effect name in a semantic hidden-layer space; the audio encoding is used to extract, from the semantic features, a hidden-layer feature representation in an audio hidden-layer space, and the audio decoding is used to decode that hidden-layer feature representation of the audio hidden-layer space into audio information.
In one example, a text encoder is invoked to perform the text encoding during sound effect prediction, and an audio encoder-decoder is invoked to perform the audio encoding and the audio decoding; the cascade model formed by the above text encoder and audio encoder-decoder is an artificial neural network with audio generation capability, such as an audio generation (AudioGen) model. Illustratively, the generated sound effect audio is audio that does not exist in the existing audio library.
In one example, during training of the audio generation model, the text encoder and the audio encoder-decoder constitute a generation network; the audio generation model further includes a discrimination network, i.e., an audio classifier.
The function of the text encoder is to convert the input text description into a vector representation of text features. The audio encoder-decoder adopts a Transformer structure and, referring to the output of the text encoder over the time sequence, autoregressively decodes to generate the output audio sequence, i.e., the predicted audio.
In the training process of the audio generation model, a sample audio and a sample name form a training information pair; the difference between the sample audio and the predicted audio generated by decoding serves as the loss function of the generation network, based on which the network parameters of the generation network are adjusted.
Illustratively, the discrimination network is configured to perform classification on the predicted audio, determining whether the audio output by the generation network is real audio or audio generated by the generation network, and the classification loss is calculated as the loss value of the discrimination network. The loss values of the generation network and the discrimination network oppose each other, and the audio generation model is trained adversarially; the audio generation model obtained by training retains only the generation network, which is thereby guaranteed to have the capability of generating audio files from text input.
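The following PyTorch sketch illustrates, under heavy simplification, the generation network (text encoder plus autoregressive audio decoder) and the adversarial training step described above; all module sizes, layer counts, tensor shapes, and the use of random toy data are assumptions made for brevity, not the actual parameters of an AudioGen-style model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    """Converts token ids of the sound effect name into text feature vectors."""
    def __init__(self, vocab=10000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, dim)
    def forward(self, tokens):                       # (B, L_text) -> (B, L_text, dim)
        return self.proj(self.embed(tokens))

class AudioDecoder(nn.Module):
    """Transformer decoder that autoregressively predicts discrete audio codes,
    attending to the text encoder output at every step."""
    def __init__(self, codes=1024, dim=256):
        super().__init__()
        self.embed = nn.Embedding(codes, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(dim, codes)
    def forward(self, audio_codes, text_feats):      # (B, L_audio), (B, L_text, dim)
        x = self.embed(audio_codes)
        causal = torch.triu(torch.full((x.size(1), x.size(1)), float("-inf")), diagonal=1)
        return self.head(self.decoder(x, text_feats, tgt_mask=causal))   # code logits

class Discriminator(nn.Module):
    """Audio classifier judging whether a code sequence is real or generated."""
    def __init__(self, codes=1024, dim=256):
        super().__init__()
        self.embed = nn.Embedding(codes, dim)
        self.cls = nn.Linear(dim, 1)
    def forward(self, audio_codes):                  # (B, L_audio) -> (B,)
        return self.cls(self.embed(audio_codes).mean(dim=1)).squeeze(-1)

# One simplified adversarial training step on a (sample name, sample audio) pair.
text_enc, audio_dec, disc = TextEncoder(), AudioDecoder(), Discriminator()
bce = nn.BCEWithLogitsLoss()
name_tokens = torch.randint(0, 10000, (2, 8))        # toy tokenized sound effect names
real_codes = torch.randint(0, 1024, (2, 32))         # toy discretized sample audio

logits = audio_dec(real_codes[:, :-1], text_enc(name_tokens))
recon_loss = F.cross_entropy(logits.reshape(-1, 1024),        # generation-network loss:
                             real_codes[:, 1:].reshape(-1))   # predicted vs. sample audio
fake_codes = logits.argmax(-1)                        # (a real system would keep a differentiable path)
gen_adv_loss = bce(disc(fake_codes), torch.ones(2))           # generator wants a "real" verdict
disc_loss = (bce(disc(real_codes), torch.ones(2)) +
             bce(disc(fake_codes.detach()), torch.zeros(2)))  # discriminator opposes it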
In summary, in the method provided by this embodiment, the dialogue information and the bystander information are determined in the text information, so that different types of text content in the text information are distinguished; the bystander text is the text content that describes events and plot, and the first audio is rapidly generated by speech synthesis, which improves the efficiency of audio acquisition; the sound effect audio corresponding to the sound effect name is superimposed in the first audio, which enriches the sound effects of the first audio and provides an immersive listening effect; the ways of generating the broadcast audio corresponding to the text information are enriched, and both the fullness of emotional expression in the broadcast audio and the generation efficiency are taken into account.
Fig. 8 is a flowchart illustrating a method for processing text information according to an exemplary embodiment of the present application. The method may be performed by a computer device. That is, on the basis of the embodiment shown in Fig. 3, the method further comprises step 517 and step 518:
Step 517: calling a natural language model to perform scene prediction on the literary work to obtain scene description information corresponding to the literary work;
the scene description information is predicted by the natural language model and describes the scenes in the literary work. It should be noted that, owing to the complexity of literary works, a scene may not be described directly, but only implied through descriptions of the relationships between characters, the psychological activities of the characters, expressions of emotion, and the like. For example, a literary work reads: "Zhang San and Li Si are close friends. After Zhang San fell ill, his colleagues went to the hospital to visit him; that day, Li Si happened to be there, and the colleagues learned that it was Li Si who had been taking care of Zhang San." Although the work does not directly describe the specific scene of Zhang San lying ill in the hospital, such as "Zhang San lies on a sickbed while Li Si keeps him company at the bedside", scene description information such as "Zhang San is the patient", "Li Si is the person accompanying Zhang San", and "the colleagues are the people visiting Zhang San" can be predicted from the friendship between Zhang San and Li Si and from the description of Li Si caring for Zhang San. Likewise, based on the account in the literary work of being hospitalized for illness, scene description information such as the bed, the medical equipment, and the infusion apparatus in the ward can be predicted.
The natural language model has scene prediction capability; it extracts the scene information implicitly carried in the literary work and outputs the scene description information of the scene in natural language.
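Read purely as an illustration, the scene prediction of step 517 could be driven by prompting a large language model as below; the prompt wording and the complete() stand-in are assumptions of this sketch, and no particular model interface is implied by the embodiment.

def complete(prompt: str) -> str:
    """Hypothetical stand-in for whatever large language model interface is used;
    it should return the model's natural-language completion for the prompt."""
    raise NotImplementedError

SCENE_PROMPT = (
    "Read the following passage from a literary work. Infer the scene it takes place in, "
    "even if the scene is only implied through character relationships, psychological "
    "activity, or expressions of emotion. Describe the setting, the people present, and "
    "the objects likely to be in the scene, in plain natural language.\n\nPassage:\n{passage}"
)

def predict_scene_description(passage: str) -> str:
    """Step 517 sketch: ask the language model for scene description information."""
    return complete(SCENE_PROMPT.format(passage=passage))

# With the Zhang San / Li Si passage above, the returned description might read:
# "A hospital ward: Zhang San lies in a sickbed, Li Si sits at the bedside,
#  colleagues stand around the bed, an infusion stand and medical equipment nearby."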
Step 518: image prediction is carried out on the scene description information by calling an image prediction model, so as to obtain an illustration image of the literature work;
the image prediction model extracts semantic information in the scene description information, encodes the scene description information into feature representation of hidden layer space, then decodes the feature representation, converts the scene description information into image information, and predicts the picture-inserting image. Illustratively, the image prediction model is an artificial neural network including, but not limited to, at least one of text-generated Images (Drawing Attention to Text-based Images), contrast language-image pre-training (Contrastive Language-Image Pretraining, CLIP), attention-generating countermeasure networks (Attention Generative Adversarial Network, attnGAN).
In an alternative implementation, the method further comprises the following two steps:
acquiring the text above the literary work, wherein the text above and the literary work belong to the same article;
invoking a natural language model to jointly perform scene recognition on the text above and the literary work, to obtain a scene label in the article to which the literary work belongs and the appearing characters present in the scene;
illustratively, the text above and the literary work are continuous in plot, and the text above can provide more background scenes for the literary work. In one example, the literary work only describes a fight scene between two characters, and because the fight occupies a long passage, the two characters fill the whole literary work, while the text above the literary work describes the specific place where the two characters fight, which provides more sources of information for extracting the scene description information.
The natural language model performs scene recognition by extracting all the characters appearing in the text above and the literary work, and extracting the scenes in which these characters are located.
Accordingly, step 517 may be implemented as:
calling the natural language model to perform scene prediction on the literary work, the scene label, and the appearing characters, so as to obtain the scene description information corresponding to the literary work.
Illustratively, the scene label and the appearing characters provide constraints for the scene prediction, limiting the scene type and the characters that the scene description information is to describe, and avoiding omission of appearing characters and deviation of the scene type. Illustratively, the literary work provides the visual details of the scene type and of the appearing characters, such as at least one of the objects within the broad scene and the posture, position, and actions of the characters.
In summary, in the method provided by this embodiment, the dialogue information and the bystander information are determined in the text information, so that different types of text content in the text information are distinguished; the second audio is obtained by recording and the first audio is rapidly generated by speech synthesis, which improves the efficiency of audio acquisition; the ways of generating the broadcast audio corresponding to the text information are enriched, and both the fullness of emotional expression in the broadcast audio and the generation efficiency are taken into account. The illustration image is predicted based on the scene description information carried in the literary work; the semantic information in the literary work is fully utilized when the scene description information is extracted, and a visual presentation of the literary work is achieved.
It will be understood by those skilled in the art that the foregoing embodiments may be implemented independently, or the foregoing embodiments may be freely combined to form a new embodiment to implement the text information processing method of the present application.
Fig. 9 is a block diagram showing a configuration of a text information processing apparatus according to an exemplary embodiment of the present application. The device comprises:
an acquisition module 810 for acquiring a literary work, the literary work being natural language information including a plurality of characters;
A processing module 820, configured to determine a dialogue text and a bystander text in the literature, where the dialogue text is a text that initiates communication by at least one character, and the bystander text is information in the literature except the dialogue text;
a generating module 830, configured to generate a first audio in a speech synthesis manner based on the bystander text; and generate a second audio in a recorded manner based on the dialogue text;
the processing module 820 is further configured to splice the first audio and the second audio to obtain the broadcast audio corresponding to the literature.
In an alternative implementation of this embodiment, the processing module 820 is further configured to:
calling a natural language model to perform information identification on the literary works to obtain at least one sentence label, wherein the at least one sentence label corresponds to at least one natural sentence in the literary works one by one, and the sentence label is used for indicating that the corresponding natural sentence belongs to the dialogue text or the bystander text;
and adding an identifier to the literary work based on the at least one sentence label to obtain the dialogue text and the bystander text, wherein the identifier is used for indicating the position of the dialogue text and/or the bystander text in the literary work.
In an alternative implementation of this embodiment, the processing module 820 is further configured to:
adding a dialogue identifier at a starting position and an ending position of a first natural sentence in the literary work under the condition that the first sentence label indicates that the corresponding first natural sentence is the dialogue text;
and/or adding a bystander identifier at the starting position and the ending position of the second natural sentence in the literary work under the condition that the second sentence label indicates that the corresponding second natural sentence is the bystander text.
In an alternative implementation of this embodiment, the processing module 820 is further configured to:
calling a natural language model to execute character recognition on the literary works to obtain character information of at least one character in the literary works, wherein the character information comprises at least one of the following information: character name, sex, age, character characteristics;
calling a natural language model to execute speaking role recognition on the context of the dialogue text to obtain speaking roles of the dialogue text, wherein at least one role in the literature comprises the speaking roles;
the generating module 830 is further configured to:
Acquiring the second audio of the dialogue text read by a first recording party in a recording mode;
wherein the first recording party satisfies at least one of: the sex of the tone color of the first recording party is the same as the sex of the speaking character, the age of the tone color of the first recording party is the same as the age of the speaking character, and the characteristics of the tone color of the first recording party are the same as the characteristics of the speaking character.
In an alternative implementation of this embodiment, the processing module 820 is further configured to:
based on the character information of the at least one character in the literature work, determining target recording parties corresponding to each character one by one in a plurality of candidate recording parties;
and searching, among the plurality of candidate recording parties, for the first recording party corresponding to the speaking character.
In an alternative implementation of this embodiment, the processing module 820 is further configured to:
invoking a natural language model to execute emotion recognition on the dialogue text to obtain a dialogue emotion label of the dialogue text;
the audio semantic content of the second audio is the dialogue text, and the dialogue emotion tag provides emotion reference for the second audio recording process.
In an alternative implementation of this embodiment, the processing module 820 is further configured to:
calling a natural language model to execute text segmentation on the literary work to obtain paragraph labels, and adding paragraph identifiers based on the paragraph labels in the literary work, wherein the dialogue text and the bystander text belong to different paragraphs;
splicing the first audio and the second audio into the broadcasting audio based on the naming information of the first audio and the second audio;
the naming information of the first audio carries a first paragraph identifier corresponding to the bystander, and the naming information of the second audio carries a second paragraph identifier corresponding to the dialogue text.
In an alternative implementation of this embodiment, the processing module 820 is further configured to:
constructing a text pronunciation sequence of the dialogue text based on the dialogue text;
calling an acoustic model to execute pronunciation splitting on the second audio to obtain pronunciation units corresponding to each audio frame in the second audio, and constructing an audio pronunciation sequence;
and the broadcasting audio is obtained based on the first audio and the second audio by splicing under the condition that the difference between the text pronunciation sequence and the audio pronunciation sequence is smaller than a preset threshold.
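Read only as a sketch, the pronunciation-consistency check performed by this module can be illustrated by comparing the two pronunciation sequences with an edit distance; the pronunciation units, the threshold value, and the helper names are assumptions of this sketch rather than details fixed by the embodiment.

def edit_distance(a: list, b: list) -> int:
    """Levenshtein distance between two pronunciation-unit sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def audio_matches_text(text_pron: list, audio_pron: list, threshold: int = 3) -> bool:
    """Splice only when the recorded audio's pronunciation sequence is close enough
    to the pronunciation sequence constructed from the dialogue text."""
    return edit_distance(text_pron, audio_pron) < threshold

text_seq = ["n", "i", "h", "ao", "sh", "i", "j", "ie"]     # toy pronunciation units
audio_seq = ["n", "i", "h", "ao", "sh", "i", "j", "ie"]
print(audio_matches_text(text_seq, audio_seq))             # True: difference below threshold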
In an alternative implementation of this embodiment, the processing module 820 is further configured to:
performing text regularization on the dialogue text, converting at least one of digital information, date information and currency information in the dialogue text into a first language text, and constructing a text pronunciation sequence of the first language text;
the first language text is a text belonging to a language corresponding to characters in the dialogue text.
In an alternative implementation of this embodiment, the processing module 820 is further configured to:
calling a natural language model to execute sound effect prediction on the bystander text, and marking the sound effect name in front of the bystander text;
and superposing the sound effect audio corresponding to the sound effect name in the first audio.
In an alternative implementation of this embodiment, the processing module 820 is further configured to:
executing text coding on the sound effect name, and extracting to obtain semantic features of the sound effect name, wherein the semantic features are hidden feature representations of the sound effect name in a semantic hidden space;
and sequentially executing audio coding and audio decoding on the semantic features to obtain the sound effect audio corresponding to the sound effect name, wherein the audio coding is used for extracting a hidden layer feature representation of an audio hidden layer space from the semantic features, and the audio decoding is used for decoding the hidden layer feature representation of the audio hidden layer space to obtain audio information.
In an alternative implementation of this embodiment, the processing module 820 is further configured to:
calling a natural language model to execute scene prediction on the literary works to obtain scene description information corresponding to the literary works;
and calling an image prediction model to execute image prediction on the scene description information to obtain an illustration image of the literary work.
In an optional implementation manner of this embodiment, the obtaining module 810 is further configured to:
acquiring a text above the literary work, wherein the text above and the literary work belong to the same article;
the processing module 820 is further configured to:
invoking the natural language model to execute scene recognition on the text above and the literary work together, to obtain a scene label in the article to which the literary work belongs and the appearing characters in the scene;
and calling the natural language model to execute scene prediction on the literary work, the scene label and the appearing characters to obtain the scene description information corresponding to the literary work.
It should be noted that, when the apparatus provided in the above embodiments implements its functions, the division into the above functional modules is merely used as an example; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
With respect to the apparatus in the above embodiments, the specific manner in which the respective modules perform the operations has been described in detail in the embodiments regarding the method; the technical effects achieved by the execution of the operations by the respective modules are the same as those in the embodiments related to the method, and will not be described in detail herein.
The embodiment of the application also provides a computer device, which comprises: a processor and a memory, the memory storing a computer program; the processor is configured to execute the computer program in the memory to implement the method for processing text information provided by the foregoing method embodiments.
Optionally, the computer device is a server. Illustratively, fig. 10 is a block diagram of a server provided in an exemplary embodiment of the present application.
In general, the server 2300 includes: a processor 2301 and a memory 2302.
The processor 2301 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 2301 may be implemented in hardware in at least one of digital signal processing (Digital Signal Processing, DSP), field programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA). The processor 2301 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also referred to as a central processing unit (Central Processing Unit, CPU), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 2301 may be integrated with a graphics processing unit (Graphics Processing Unit, GPU), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 2301 may also include an artificial intelligence (Artificial Intelligence, AI) processor for processing computing operations related to machine learning.
Memory 2302 may include one or more computer-readable storage media, which may be non-transitory. Memory 2302 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 2302 is used to store at least one instruction for execution by processor 2301 to implement the method of processing text information provided by the method embodiments herein.
In some embodiments, server 2300 may further optionally include: an input interface 2303 and an output interface 2304. The processor 2301 and the memory 2302 may be connected to the input interface 2303 and the output interface 2304 through buses or signal lines. The respective peripheral devices may be connected to the input interface 2303 and the output interface 2304 through buses, signal lines, or a circuit board. Input interface 2303, output interface 2304 may be used to connect at least one Input/Output (I/O) related peripheral device to processor 2301 and memory 2302. In some embodiments, the processor 2301, memory 2302, and input interface 2303, output interface 2304 are integrated on the same chip or circuit board; in some other embodiments, the processor 2301, the memory 2302, and either or both of the input interface 2303 and the output interface 2304 may be implemented on separate chips or circuit boards, which are not limited in this application.
Those skilled in the art will appreciate that the structures shown above are not limiting of server 2300 and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
In an exemplary embodiment, a chip is also provided, which includes programmable logic circuits and/or program instructions for implementing the method of processing text information as described in the above aspects when the chip is run on a computer device.
In an exemplary embodiment, a computer program product is also provided, the computer program product comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor reads and executes the computer instructions from the computer readable storage medium to implement the method for processing text information provided by the above method embodiments.
In an exemplary embodiment, there is also provided a computer-readable storage medium having stored therein a computer program loaded and executed by a processor to implement the method of processing text information provided by the above-described method embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
The foregoing description is merely of preferred embodiments of the present application and is not intended to limit the present application; any modifications, equivalent replacements, improvements, and the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (18)

1. A method for processing text information, the method comprising:
acquiring a literary work, wherein the literary work is natural language information comprising a plurality of characters;
determining dialogue text and bystander text in the literary work, wherein the dialogue text is text in which at least one character initiates communication, and the bystander text is information in the literary work other than the dialogue text;
generating a first audio in a speech synthesis manner based on the bystander text; and generating a second audio in a recorded manner based on the dialogue text;
and splicing the first audio and the second audio to obtain the broadcasting audio corresponding to the literary work.
2. The method of claim 1, wherein the determining dialogue text and bystander text in the literary work comprises:
calling a natural language model to perform information identification on the literary works to obtain at least one sentence label, wherein the at least one sentence label corresponds to at least one natural sentence in the literary works one by one, and the sentence label is used for indicating that the corresponding natural sentence belongs to the dialogue text or the bystander text;
And adding an identifier to the literary work based on the at least one sentence label to obtain the dialogue text and the bystander text, wherein the identifier is used for indicating the position of the dialogue text and/or the bystander text in the literary work.
3. The method of claim 1, wherein the adding an identifier to the literary work based on the at least one sentence label comprises:
adding a dialogue identifier at a starting position and an ending position of a first natural sentence in the literary work under the condition that the first sentence label indicates that the corresponding first natural sentence is the dialogue text;
and/or adding a bystander identifier at the starting position and the ending position of the second natural sentence in the literary work under the condition that the second sentence label indicates that the corresponding second natural sentence is the bystander text.
4. A method according to any one of claims 1 to 3, wherein the method further comprises:
calling a natural language model to execute character recognition on the literary works to obtain character information of at least one character in the literary works, wherein the character information comprises at least one of the following information: character name, sex, age, character characteristics;
Calling a natural language model to execute speaking role recognition on the context of the dialogue text to obtain speaking roles of the dialogue text, wherein at least one role in the literature comprises the speaking roles;
the generating the second audio based on the dialogue text in a recording mode comprises the following steps:
acquiring the second audio of the dialogue text read by a first recording party in a recording mode;
wherein the first recording party satisfies at least one of: the sex of the tone color of the first recording party is the same as the sex of the speaking character, the age of the tone color of the first recording party is the same as the age of the speaking character, and the characteristics of the tone color of the first recording party are the same as the characteristics of the speaking character.
5. The method according to claim 4, wherein the method further comprises:
based on the character information of the at least one character in the literature work, determining target recording parties corresponding to each character one by one in a plurality of candidate recording parties;
and searching, among the plurality of candidate recording parties, for the first recording party corresponding to the speaking character.
6. The method according to claim 4, wherein the method further comprises:
calling the natural language model to perform perspective recognition on the bystander text to obtain a describing character of the bystander text;
and correcting the first audio based on the tone color of the second audio in a case where the describing character and the speaking character are the same.
7. A method according to any one of claims 1 to 3, wherein the method further comprises:
invoking a natural language model to execute emotion recognition on the dialogue text to obtain a dialogue emotion label of the dialogue text;
the audio semantic content of the second audio is the dialogue text, and the dialogue emotion tag provides emotion reference for the second audio recording process.
8. A method according to any one of claims 1 to 3, wherein the method further comprises:
calling a natural language model to execute text segmentation on the literary work to obtain paragraph labels, and adding paragraph identifiers based on the paragraph labels in the literary work, wherein the dialogue text and the bystander text belong to different paragraphs;
the step of splicing the first audio and the second audio to obtain the broadcasting audio corresponding to the literary works comprises the following steps:
Splicing the first audio and the second audio into the broadcasting audio based on the naming information of the first audio and the second audio;
the naming information of the first audio carries a first paragraph identifier corresponding to the bystander, and the naming information of the second audio carries a second paragraph identifier corresponding to the dialogue text.
9. The method of claim 8, wherein the method further comprises:
constructing a text pronunciation sequence of the dialogue text based on the dialogue text;
calling an acoustic model to execute pronunciation splitting on the second audio to obtain pronunciation units corresponding to each audio frame in the second audio, and constructing an audio pronunciation sequence;
and the broadcasting audio is obtained based on the first audio and the second audio by splicing under the condition that the difference between the text pronunciation sequence and the audio pronunciation sequence is smaller than a preset threshold.
10. The method of claim 9, wherein constructing a text-to-sound sequence of the dialog text based on the dialog text comprises:
performing text regularization on the dialogue text, converting at least one of digital information, date information and currency information in the dialogue text into a first language text, and constructing a text pronunciation sequence of the first language text;
The first language text is a text belonging to a language corresponding to characters in the dialogue text.
11. A method according to any one of claims 1 to 3, wherein the method further comprises:
calling a natural language model to execute sound effect prediction on the bystander text, and marking a sound effect name in front of the bystander text;
and superposing the sound effect audio corresponding to the sound effect name in the first audio.
12. The method of claim 11, wherein the superimposing the audio corresponding to the audio name in the first audio comprises:
executing text coding on the sound effect name, and extracting to obtain semantic features of the sound effect name, wherein the semantic features are hidden feature representations of the sound effect name in a semantic hidden space;
and sequentially executing audio coding and audio decoding on the semantic features to obtain the sound effect audio corresponding to the sound effect name, wherein the audio coding is used for extracting a hidden layer feature representation of an audio hidden layer space from the semantic features, and the audio decoding is used for decoding the hidden layer feature representation of the audio hidden layer space to obtain audio information.
13. A method according to any one of claims 1 to 3, wherein the method further comprises:
Calling a natural language model to execute scene prediction on the literary works to obtain scene description information corresponding to the literary works;
and calling an image prediction model to execute image prediction on the scene description information to obtain an illustration image of the literary work.
14. The method of claim 13, wherein the method further comprises:
acquiring a text above the literary work, wherein the text above and the literary work belong to the same article;
invoking the natural language model to execute scene recognition on the text above and the literary work together, to obtain a scene label in the article to which the literary work belongs and the appearing characters in the scene;
the step of calling a natural language model to execute scene prediction on the literary work to obtain scene description information corresponding to the literary work comprises:
and calling the natural language model to execute scene prediction on the literary work, the scene label and the appearing characters to obtain the scene description information corresponding to the literary work.
15. A text information processing apparatus, characterized in that the apparatus comprises:
the system comprises an acquisition module, a storage module and a storage module, wherein the acquisition module is used for acquiring a literary work which is natural language information comprising a plurality of characters;
The processing module is used for determining a dialogue text and a bystander text in the literary work, wherein the dialogue text is text initiated by at least one character, and the bystander text is information in the literary work other than the dialogue text;
the generation module is used for generating first audio in a voice synthesis mode based on the bystander text; and generating a second audio in a recorded manner based on the dialog text;
and the processing module is also used for splicing the first audio and the second audio to obtain the broadcasting audio corresponding to the literary works.
16. A computer device, the computer device comprising: a processor and a memory, wherein at least one section of program is stored in the memory; the processor is configured to execute the at least one program in the memory to implement the method for processing text information according to any one of claims 1 to 14.
17. A computer readable storage medium having stored therein executable instructions that are loaded and executed by a processor to implement a method of processing text information according to any of the preceding claims 1 to 14.
18. A computer program product, characterized in that it comprises computer instructions stored in a computer-readable storage medium, from which a processor reads and executes them to implement a method of processing text information according to any of the preceding claims 1 to 14.
CN202311691691.XA 2023-12-08 2023-12-08 Text information processing method, device, equipment and storage medium Pending CN117690407A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311691691.XA CN117690407A (en) 2023-12-08 2023-12-08 Text information processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311691691.XA CN117690407A (en) 2023-12-08 2023-12-08 Text information processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117690407A true CN117690407A (en) 2024-03-12

Family

ID=90129588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311691691.XA Pending CN117690407A (en) 2023-12-08 2023-12-08 Text information processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117690407A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination