CN116486779A - Audio generation method and device

Info

Publication number
CN116486779A
Authority
CN
China
Prior art keywords
information
audio
emotion
preset
target content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310366111.3A
Other languages
Chinese (zh)
Inventor
凌小明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd
Priority to CN202310366111.3A
Publication of CN116486779A
Legal status: Pending

Classifications

    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L25/63 Speech or voice analysis techniques specially adapted for comparison or discrimination for estimating an emotional state
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses an audio generation method and device, belonging to the field of voice processing. The method comprises: obtaining target content in a document, wherein the target content comprises character information and sentence information of dialogue text corresponding to the character information; analyzing the target content to obtain emotion information corresponding to the target content; and converting the dialogue text in the target content into dialogue audio based on the emotion information and the character information.

Description

Audio generation method and device
Technical Field
The application belongs to the field of voice processing, and particularly relates to an audio generation method and device.
Background
At present, listening to audio content such as audio novels and audio news on mobile phones and tablet computers has become a daily habit for many people. Such audio content usually needs to be recorded manually, which is time-consuming and costly, so text content is now often converted into audio automatically to obtain the corresponding audio content.
However, the sound of audio content obtained in this way is mechanical and monotonous, making it difficult for the user to become immersed in the content while listening, resulting in a poor experience.
Disclosure of Invention
The embodiments of the application aim to provide an audio generation method and device, which can solve the problems of mechanical, monotonous sound and poor user experience that arise when text content is directly converted into audio content.
In a first aspect, an embodiment of the present application provides an audio generating method, including:
obtaining target content in a document, wherein the target content comprises: character information and sentence information of dialogue text corresponding to the character information;
analyzing the target content to obtain emotion information corresponding to the target content;
and converting the dialogue text in the target content into dialogue audio based on the emotion information and the character information.
In a second aspect, an embodiment of the present application provides an audio generating apparatus, including:
the acquisition module is used for acquiring target content in the document, wherein the target content comprises: character information and sentence information of dialogue text corresponding to the character information;
the analysis module is used for analyzing the target content to obtain emotion information corresponding to the target content;
And the conversion module is used for converting the dialogue text in the target content into dialogue audio based on the emotion information and the character information.
In a third aspect, embodiments of the present application provide an electronic device comprising a processor and a memory storing a program or instructions executable on the processor, which when executed by the processor, implement the steps of the method as described in the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium having stored thereon a program or instructions which when executed by a processor implement the steps of the method according to the first aspect.
In a fifth aspect, embodiments of the present application provide a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and where the processor is configured to execute a program or instructions to implement a method according to the first aspect.
In a sixth aspect, embodiments of the present application provide a computer program product stored in a storage medium, the program product being executable by at least one processor to implement the method according to the first aspect.
In the embodiment of the application, by acquiring the target content in the document, the sentences containing character information and the corresponding dialogue text can be accurately selected, and the sound characteristics conforming to the character's characteristics can then be accurately determined; by analyzing the target content to obtain emotion information, the emotion corresponding to the dialogue text in the target content can be effectively determined, so that when the dialogue text is converted into dialogue audio, the emotion information and the character information can be combined to obtain dialogue audio rich in character features and character emotion. The listener can then better immerse in the content of the audio file while listening, and user experience is improved.
Drawings
Fig. 1 is a schematic flow chart of an audio generating method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of emotion axis structure provided in an embodiment of the present application;
fig. 3 is a schematic diagram of an emotion vector structure provided in an embodiment of the present application;
fig. 4 is a schematic diagram of an audio acquisition interface provided in an embodiment of the present application;
fig. 5 is a schematic diagram of an audio import interface according to an embodiment of the present application;
fig. 6 is a schematic diagram of an audio playing interface according to an embodiment of the present application;
FIG. 7 is a second schematic flowchart of an audio generation method according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of an audio generating apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 10 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
Detailed Description
Technical solutions in the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application fall within the protection scope of the present application.
The terms "first", "second", and the like in the description and claims are used to distinguish between similar objects and are not necessarily used to describe a particular sequence or chronological order. It should be understood that terms so used are interchangeable where appropriate, so that embodiments of the present application can be implemented in sequences other than those illustrated or described herein. In addition, the objects distinguished by "first", "second", etc. are generally of one type, and the number of objects is not limited; for example, the first object may be one or more than one. Furthermore, in the description and claims, "and/or" denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The audio generation method and device provided by the embodiment of the application are described in detail below through specific embodiments and application scenes thereof with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of an audio generating method according to an embodiment of the present application, as shown in fig. 1, including:
step 110, obtaining target content in a document, wherein the target content comprises: character information and sentence information of dialogue text corresponding to the character information;
the document described in the embodiments of the present application may be specifically any document having text, such as a novel, news, biography, fairy tale, or magazine.
The character information described in the embodiments of the present application may refer to information of any character, such as a novel character, a news figure, or a biographical figure, and may specifically include information such as a character name or key character features. For example, if character A is named "Zhang San", the corresponding character information A can be determined from "Zhang San" in the document; for another example, if character B is called "Master of Zhuang Tang", the corresponding character information B can be determined from "Master of Zhuang Tang" in the document; for another example, if the key feature of character C is "the man with a deep scar on his face", the corresponding character information C can be determined from "the man with a deep scar on his face" in the document.
The dialogue text described in the embodiments of the application may be character dialogue text, character psychological-activity text, or character self-statement text.
In an alternative embodiment, the character dialogue text may specifically be verbal communication between two or more characters, in written or spoken language, such as "What have you been busy with recently?" or "I'm preparing for an exam."
In an alternative embodiment, the character psychological-activity text describes mental activities such as the character's thoughts, emotions, attitudes, or feelings. Such text does not necessarily appear in dialogue and may be presented in narrative form, often with words describing thought, such as "think", "imagine", "guess", or "suspect", e.g., "He silently wondered whether this decision was correct, or whether he had been too impulsive."
In an alternative embodiment, the character self-statement text may specifically be text in which the character expresses his or her own thoughts or feelings; its subject is often the first person, and common sentence patterns include rhetorical questions, exclamations, or questions, e.g., "Could this really be my fate?" or "That scene just now was truly terrifying!"
In an alternative embodiment, regular expression matching may be used. A regular expression is composed of ordinary characters and metacharacters and represents a specific pattern in a string, for example a date, a time, or a URL. Rules can be set so that regular expression matching identifies dialogue text, specifically according to the typesetting format, punctuation marks, and sentence patterns of the text. For example, a colon and double quotation marks appearing together before a sentence, a question mark or exclamation mark at the end of a sentence, an interjection or colloquial word within a sentence, or a psychological-description word within a sentence can each serve as a judgment basis. For example, the question mark in "What have you been busy with recently?" indicates that the sentence is dialogue text; the colloquial tone of "I'm preparing for an exam." indicates dialogue text; and the psychological-description word "wondered" in "He silently wondered whether this decision was correct, or whether he had been too impulsive." indicates dialogue text.
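As an illustration of the rule-based matching described above, the following Python sketch checks a few assumed patterns (colon plus opening quote, sentence-final question/exclamation mark, psychological-description words); the patterns and sample sentences are illustrative assumptions, not taken from the patent.

# Illustrative sketch only: rule-based detection of dialogue sentences
# using regular expressions; patterns are assumptions for illustration.
import re

DIALOGUE_PATTERNS = [
    re.compile(r'[:：]\s*[“"]'),        # colon followed by an opening quote
    re.compile(r'[?？!！][”"]?\s*$'),    # question/exclamation mark at sentence end
    re.compile(r'(想|心想|猜|怀疑)'),      # psychological-description words (Chinese)
]

def looks_like_dialogue(sentence: str) -> bool:
    """Return True if any preset rule matches the sentence."""
    return any(p.search(sentence) for p in DIALOGUE_PATTERNS)

print(looks_like_dialogue('他心想，这个决定到底对不对？'))  # True
print(looks_like_dialogue('窗外下着小雨。'))                # False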
The target content described in the embodiments of the present application needs to include at least character information and the dialogue text corresponding to that character information. The target content may be one complete sentence or several sentences.
In an alternative embodiment, when the character information and the corresponding dialogue text are in the same sentence, that sentence can be determined as the target content. For example, a novel may contain the sentence "The man with the deep scar on his face laughed heartily: 'So you really did it!'" This sentence contains the character information "the man with the deep scar on his face" and the dialogue text "So you really did it!" corresponding to that character information, so the sentence may be determined as the target content.
In another alternative embodiment, when other text or a text passage appears between the character information and the corresponding dialogue text within a whole passage, the whole passage is used as the target content. For example, a novel may contain the passage "The man with the deep scar on his face looked around as if searching for something. Suddenly he saw a person in a black coat; his eyes turned sharp, and staring closely at the man in the black coat he said: 'So you really came.'" This passage contains the character information "the man with the deep scar on his face" and the corresponding dialogue text "So you really came.", so the whole passage may be determined as the target content.
In an alternative embodiment, the target content is obtained from the document by, for example, dividing the text of the document into sentences or text passages using natural language processing techniques such as word segmentation, named entity recognition, and sentence boundary detection, identifying the sentences or passages that contain character information and the dialogue text corresponding to that character information, and taking the sentences containing character information and corresponding dialogue text as the target content.
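A minimal sketch of this extraction step is given below, under simplifying assumptions: sentences are split naively on end punctuation and a list of character names is already known; a real system would use word segmentation, named entity recognition, and sentence-boundary detection as described above.

# Minimal sketch, under assumptions: naive sentence splitting plus an assumed
# character-name list, standing in for the NLP pipeline described above.
import re

KNOWN_CHARACTERS = ['张三', '白五']          # assumed character names

def split_sentences(text: str):
    return [s for s in re.split(r'(?<=[。！？!?])', text) if s.strip()]

def extract_target_content(text: str):
    targets = []
    for sentence in split_sentences(text):
        has_character = any(name in sentence for name in KNOWN_CHARACTERS)
        has_dialogue = bool(re.search(r'[:：]\s*[“"]', sentence))
        if has_character and has_dialogue:
            targets.append(sentence)      # character info + dialogue text together
    return targets

doc = '张三皱着眉说：“你最近在忙什么？”天色渐渐暗了下来。'
print(extract_target_content(doc))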
Step 120, analyzing the target content to obtain emotion information corresponding to the target content;
in an alternative embodiment, analyzing the target content may specifically involve performing emotion analysis on the emotion words in the target content, or performing emotion analysis on the sentences of the dialogue text in the target content.
The emotion information described in the embodiments of the present application may be classified into three types: positive emotion, neutral emotion, and negative emotion. Positive emotions include liking, pleasure, gratitude, etc.; neutral emotions include calm, confusion, surprise, etc.; and negative emotions include complaint, anger, disgust, fear, sadness, etc.
In another alternative embodiment, the emotion information may be classified using preset emotion axes comprising two axes, namely a high-low axis and a pleasant-unpleasant axis. The quadrants of the emotion axes correspond to four emotions: high-pleasant, high-unpleasant, low-unpleasant, and low-pleasant, and an emotion is represented as position coordinates in the emotion-axis coordinate system.
Fig. 2 is a schematic diagram of the emotion axis structure provided in the embodiment of the present application. As shown in fig. 2, the vertical axis runs from high (top) to low (bottom) and the horizontal axis from unpleasant (left) to pleasant (right); the first quadrant of the coordinate system is the high-pleasant quadrant, the second quadrant is the high-unpleasant quadrant, the third quadrant is the low-unpleasant quadrant, and the fourth quadrant is the low-pleasant quadrant. In another alternative embodiment, the emotion information may also be represented by an emotion vector, where each dimension represents one emotion state; for example, a two-dimensional vector may be used to represent happy and sad emotions, with each dimension of the vector corresponding to the degree of the respective emotion.
Fig. 3 is a schematic diagram of the emotion vector structure provided in an embodiment of the present application. As shown in fig. 3, it includes an x-axis, a y-axis, and a diagonal line: the x-axis represents happy emotion, increasing from left to right; the y-axis represents sad emotion, increasing from bottom to top; and the diagonal represents neutral emotion, so a coordinate point lying on the diagonal represents a neutral emotion. In an alternative embodiment, the emotion information corresponding to the target content may specifically represent the emotion information corresponding to the dialogue text in the target content, so that when the dialogue text is converted into dialogue audio, the emotion information can conveniently be combined to generate dialogue audio with emotion.
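The two representations above (axis coordinates and an emotion vector) can be sketched as follows; the numeric scales are assumptions for illustration only.

# Sketch of the two emotion representations described above; scales are assumed.
from dataclasses import dataclass

@dataclass
class EmotionPoint:
    pleasantness: float   # horizontal axis: -1 (unpleasant) .. +1 (pleasant)
    arousal: float        # vertical axis:   -1 (low)        .. +1 (high)

    def quadrant(self) -> str:
        if self.arousal >= 0:
            return 'high-pleasant' if self.pleasantness >= 0 else 'high-unpleasant'
        return 'low-pleasant' if self.pleasantness >= 0 else 'low-unpleasant'

# Emotion vector: one dimension per emotion state (here happy and sad).
emotion_vector = {'happy': 0.8, 'sad': 0.1}

print(EmotionPoint(pleasantness=0.7, arousal=0.6).quadrant())  # high-pleasant
print(emotion_vector)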
And step 130, converting dialogue text in the target content into dialogue audio based on the emotion information and the character information.
In an alternative embodiment, a preset information base storing a plurality of pieces of preset character information may be generated in advance, and each piece of preset character information in the preset information base may be stored in association with one piece of tone information. In the embodiment of the application, the tone information stored in association with the preset character information can be found through the character information and the preset character information in the preset information base, thereby determining the correspondence between the tone information and the character information.
In an alternative embodiment, the tone information may specifically include information such as timbre, tone, pitch, loudness, or speech rate.
In an alternative embodiment, for example, for a character named "Liqu" characterized as a rough, burly man, the tone information corresponding to the character information may specifically be: the timbre is stronger in the low-frequency range of 100-500 Hz, giving a coarse, deep feel; the tone value is in the range of 80-150 Hz, so that the tone is stable; the pitch value is in the range of 70-120 Hz, in the bass region; the loudness is in the range of 70-90 dB, i.e. relatively loud; and the speech rate is 120-150 words per minute, i.e. relatively fast. In the embodiment of the application, converting the dialogue text in the target content into dialogue audio based on the emotion information and the character information may specifically be done by adjusting the tone information corresponding to the character information according to the emotion information and converting the dialogue text into dialogue audio according to the adjusted tone information; alternatively, the dialogue text may first be converted into dialogue audio according to the tone information corresponding to the character information, and the dialogue audio may then be further adjusted in combination with the emotion information.
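The "Liqu" example above can be encoded as a simple tone-information record; the field names and the idea of keying the preset information base by character name are assumptions for illustration.

# Hypothetical data structure for one preset character's tone information,
# following the "Liqu" example above; field names and ranges are assumptions.
liqu_tone_info = {
    'timbre_band_hz':  (100, 500),   # low-frequency emphasis: coarse, deep feel
    'tone_hz':         (80, 150),    # stable tone
    'pitch_hz':        (70, 120),    # bass region
    'loudness_db':     (70, 90),     # relatively loud
    'speech_rate_wpm': (120, 150),   # relatively fast
}

# A preset information base simply maps preset character information to tone info.
preset_info_base = {'Liqu': liqu_tone_info}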
In an alternative embodiment, the conversion of the dialogue text in the target content into dialogue audio may be implemented by a deep-learning-based speech synthesis algorithm such as Tacotron (Towards End-to-End Speech Synthesis) or WaveNet (a waveform-generation network).
In an alternative embodiment, a narration timbre that does not correspond to any character information may be preset, and for the portions of the target content that are not dialogue text, the non-dialogue text may be converted into audio based on the narration timbre alone, without combining emotion information. The user may also adjust the narration timbre according to personal preference.
In an alternative embodiment, the document audio is composed of both the dialogue audio and the audio of the non-dialogue text.
In the embodiment of the application, by acquiring the target content in the document, the sentence information containing character information and the corresponding dialogue text can be accurately selected, and the sound characteristics conforming to the character's characteristics can then be accurately determined; by analyzing the target content to obtain emotion information, the emotion corresponding to the dialogue text in the target content can be effectively determined, so that when the dialogue text is converted into dialogue audio, the emotion information and the character information can be combined to obtain dialogue audio rich in character features and character emotion. The listener can then better immerse in the content of the audio file while listening, and user experience is improved.
Optionally, analyzing the target content to obtain emotion information corresponding to the target content, including:
matching the target content with emotion words in a preset emotion word bank, and determining the emotion words included in the target content, wherein the preset emotion word bank comprises at least one preset emotion word;
and determining emotion information corresponding to the target content based on the emotion words in the target content.
The preset emotion word library described in the embodiment of the present application may specifically be a vocabulary library containing a large number of preset emotion words, for example, emotion words expressing happiness, pleasure, or delight; emotion words expressing anger or rage; emotion words expressing sadness, grief, or complaint; and emotion words expressing fear, panic, or timidity.
In the embodiment of the application, matching the target content with the emotion words in the preset emotion word library may specifically be implemented by an exact text-matching algorithm, such as the KMP (Knuth-Morris-Pratt) algorithm or a brute-force matching algorithm: it is checked whether a word in the target content is identical to an emotion word in the preset emotion word library; if so, the matching succeeds, otherwise the matching fails. A successful match means that the target content contains words that clearly express emotion information.
In an optional embodiment, determining the emotion information corresponding to the target content based on the emotion words in the target content may specifically involve converting the emotion words in the target content into vector representations using an algorithm such as TF-IDF (term frequency-inverse document frequency) or Word2vec (word to vector), then performing emotion analysis on the emotion words using a trained bag-of-words model, a trained emotion dictionary, or the like, and determining the category and intensity of the emotion words, thereby determining the emotion information of the emotion words in the target content and finally the emotion information corresponding to the target content.
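A tiny sketch of the lexicon-matching step follows; the lexicon entries, categories, and intensity values are assumptions standing in for the exact-matching and classification steps described above.

# Sketch, under assumptions: a tiny preset emotion lexicon and a direct lookup
# standing in for the exact-matching and classification steps described above.
PRESET_EMOTION_LEXICON = {
    'delighted': ('positive', 0.8),
    'furious':   ('negative', 0.9),
    'calm':      ('neutral',  0.3),
}

def match_emotion_words(target_content: str):
    """Return (word, category, intensity) for every lexicon word found."""
    hits = []
    for word, (category, intensity) in PRESET_EMOTION_LEXICON.items():
        if word in target_content:
            hits.append((word, category, intensity))
    return hits

print(match_emotion_words('She was furious and slammed the door: "Get out!"'))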
In an alternative embodiment, the target content may contain several emotion words and thus correspond to several pieces of emotion information. When multiple pieces of emotion information exist, further processing in combination with the context is required; for example, if the character, after being sad, laughs as he or she begins to speak, the happy emotion may be selected as the standard for subsequent processing.
In the embodiment of the application, by matching the target content against the preset emotion words in the preset emotion word library, the emotion words in the target content can be effectively determined. Because such emotion words express emotion explicitly, analyzing them helps ensure that accurate emotion information corresponding to the dialogue text is obtained, so that accurate emotion can be expressed when the dialogue text is converted into dialogue audio; the listener can then better immerse in the content of the audio file while listening, and user experience is improved.
Optionally, after matching the target content with the emotion words in the preset emotion word library, the method further includes:
and under the condition that the matching of the target content and the preset emotion word bank fails, analyzing the dialogue text in the target content to obtain emotion information corresponding to the target content.
In an alternative embodiment, matching the target content with the preset emotion word library may specifically be implemented by an exact text-matching algorithm, such as the KMP (Knuth-Morris-Pratt) algorithm or a brute-force matching algorithm, comparing one by one whether the words in the target content are identical to the emotion words in the preset emotion word library.
The case in which the matching of the target content with the preset emotion word library fails, as described in the embodiment of the present application, specifically means that the target content does not contain any emotion word from the preset emotion word library. To determine the emotion information corresponding to the target content, the dialogue text in the target content, or the context combined with the dialogue text, may be analyzed, because the dialogue text or its context usually implies information that can express emotion. For example, in "'Just try touching him again.' she said through clenched teeth, her hands balled into fists", although there is no emotion word in the words following the dialogue text, the descriptions "clenched teeth" and "balled into fists" still convey emotion information.
In an alternative embodiment, the analysis of the dialogue text in the target content may specifically use a trained LSTM (Long Short-Term Memory) network to learn the context information of the dialogue text in the target content, or a trained recurrent neural network to learn features of the dialogue text; techniques such as data augmentation and regularization may be used to improve the generalization capability of the model. A softmax classifier is then used for emotion classification, so as to obtain the emotion information corresponding to the target content.
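A minimal PyTorch sketch of such a fallback classifier is shown below: an LSTM over token ids followed by a softmax over emotion classes. The vocabulary size, number of emotion classes, and all hyperparameters are assumptions, not values from the patent.

# Minimal PyTorch sketch of the LSTM + softmax fallback classifier described above.
import torch
import torch.nn as nn

class DialogueEmotionClassifier(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=256, num_emotions=5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_emotions)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)
        _, (hidden, _) = self.lstm(embedded)      # final hidden state summarises context
        logits = self.classifier(hidden[-1])
        return torch.softmax(logits, dim=-1)      # probability per emotion class

model = DialogueEmotionClassifier()
probs = model(torch.randint(0, 5000, (1, 20)))    # one dialogue of 20 tokens
print(probs.shape)                                # torch.Size([1, 5])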
In the embodiment of the application, analyzing the dialogue text in the target content effectively handles the case in which the matching of the target content with the preset emotion word library fails. When no emotion word clearly expressing emotion information exists in the target content, the accuracy of the emotion information can still be effectively ensured by analyzing the dialogue text, so that accurate emotion can be expressed when the dialogue text is converted into dialogue audio, and user experience is improved.
Optionally, converting the dialogue text in the target content into the dialogue audio based on the emotion information and the character information includes:
Acquiring first tone information based on preset character information matched with the character information under the condition that the character information is successfully matched with the preset character information in a preset information base; wherein, at least one tone information is stored in the preset information base, and each tone information corresponds to one preset character information;
performing tone adjustment on the first tone information corresponding to the character information based on the emotion information to obtain second tone information;
and performing text-to-audio conversion on the dialogue text based on the second tone information to obtain dialogue audio.
The preset information base described in the embodiments of the present application may specifically store a plurality of pieces of preset character information and the tone information corresponding to each piece of preset character information, and may also store narration tone information that corresponds to the narrator rather than to any preset character information.
In an alternative embodiment, the gender of the narrator may be preset, the tone, pitch, loudness, or speech rate of the narration tone information may be preset to intermediate values, and the narrator's gender and the narration tone information may be modified by the user.
The preset character information described in the embodiment of the present application may specifically be character information stored in advance in the preset information base according to the characters appearing in the document, and may specifically include information such as the character's name or key character features.
In an alternative embodiment, the tone information in the preset information base corresponds one-to-one to the preset character information and represents the sound characteristics of the corresponding preset character information; it may specifically include information such as timbre, tone, pitch, loudness, and speech rate, and may be preset according to the character features of the corresponding preset character information. For example, for a character named "White Five" characterized as a gentle woman, the corresponding tone information may be set as follows: the timbre lies in the low-frequency range of 100-200 Hz, giving a smooth and mild feel; the tone value is in the range of 180-220 Hz, so that the tone is stable; the pitch value is in the range of 400-500 Hz, which is softer; the loudness is in the range of 50-60 dB, i.e. not loud; and the speech rate is 90-100 words per minute, i.e. relatively slow.
In an alternative embodiment, the character information is matched with the preset character information in the preset information base specifically by comparing one by one, through an exact text-matching algorithm such as the KMP algorithm or a brute-force matching algorithm, whether the character information is identical to the preset character information in the preset information base. When the character information contains a character name or title, the character name or title in the preset character information is matched preferentially, and when the character names or titles match exactly, the matching is considered successful.
In another alternative embodiment, if the character information does not include the character's name or title and only includes key character features or some of them, the character information is matched against the key character features in the preset character information. A matching-degree threshold may be preset, for example 90% or 85%, and if the degree of match between the character information and the key features exceeds the threshold, the matching is considered successful.
In another alternative embodiment, if the character information contains only a personal pronoun, such as "he" or "she", analysis in combination with the context of the target content is required to further supplement the character information; the matching may be considered successful when the supplemented character information satisfies either of the two embodiments above.
In an alternative embodiment, a successful match between the character information and the preset character information in the preset information base means that, among the plurality of pieces of preset character information, there is one piece of preset character information that matches the character information.
In an alternative embodiment, obtaining the first tone information based on the preset character information matched with the character information specifically means establishing a correspondence between the character information and the tone information stored in association with the matched preset character information in the preset information base, and using the tone information for which the correspondence has been established as the first tone information.
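The matching and retrieval steps can be sketched as follows; the similarity measure (difflib's SequenceMatcher) and the threshold value are illustrative stand-ins for the exact-match and key-feature schemes described above, not the patent's own algorithm.

# Sketch, under assumptions, of matching character information against the preset
# information base and retrieving the first tone information.
from difflib import SequenceMatcher

PRESET_INFO_BASE = {
    'Liqu':       {'pitch_hz': (70, 120),  'speech_rate_wpm': (120, 150)},
    'White Five': {'pitch_hz': (400, 500), 'speech_rate_wpm': (90, 100)},
}
MATCH_THRESHOLD = 0.85   # assumed matching-degree threshold

def find_first_tone_info(character_info: str):
    for preset_name, tone_info in PRESET_INFO_BASE.items():
        if preset_name == character_info:                        # exact name match
            return tone_info
        ratio = SequenceMatcher(None, preset_name, character_info).ratio()
        if ratio >= MATCH_THRESHOLD:                             # key-feature match
            return tone_info
    return None                                                  # no match: see first-input flow

print(find_first_tone_info('Liqu'))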
The first tone color information described in the embodiments of the present application may specifically represent sound characteristics conforming to characteristics of the personal information, and may include information such as tone color, tone, pitch, loudness, or speed of speech of the sound.
In an alternative embodiment, the user may adjust the first tone information according to personal preference. In an optional embodiment, adjusting the first tone information corresponding to the character information specifically involves determining, according to the category and intensity of the emotion information, the direction and magnitude of the adjustment to the first tone information. For example, when the emotion information falls into the "anger" category, the timbre in the first tone information may be adjusted towards increasing the bass component and decreasing the treble component, and the tone, pitch, loudness, and speech rate may be adjusted upwards; the closer the emotion information is to extreme anger, the larger the overall adjustment to the first tone information may be. The adjusted first tone information is the second tone information.
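The derivation of the second tone information from the first can be sketched as below; only the "anger" rule from the example above is shown, and the field names and adjustment amounts are assumptions.

# Sketch, under assumptions, of deriving second tone information from first tone
# information according to the emotion category and intensity described above.
def adjust_tone(first_tone_info: dict, emotion: str, intensity: float) -> dict:
    """intensity in [0, 1]; only the 'anger' rule from the example is sketched."""
    second = dict(first_tone_info)
    if emotion == 'anger':
        second['bass_gain_db'] = first_tone_info.get('bass_gain_db', 0) + 6 * intensity
        second['treble_gain_db'] = first_tone_info.get('treble_gain_db', 0) - 4 * intensity
        second['loudness_db'] = first_tone_info.get('loudness_db', 60) + 10 * intensity
        second['speech_rate_wpm'] = int(first_tone_info.get('speech_rate_wpm', 120) * (1 + 0.2 * intensity))
    return second

first = {'bass_gain_db': 0, 'treble_gain_db': 0, 'loudness_db': 65, 'speech_rate_wpm': 130}
print(adjust_tone(first, 'anger', 0.8))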
The text-to-audio conversion of the dialog text based on the second timbre information described in the embodiments of the present application may specifically be to obtain the dialog audio by using a trained Tacotron model or a trained WaveNet model.
In an alternative embodiment, the Tacotron model is a speech synthesis algorithm based on recurrent neural networks. The dialogue text and the second tone information are input into the trained Tacotron model; the input dialogue text is first processed with part-of-speech tagging, phonetic transcription, and similar steps and converted into a continuous vector space; a recurrent-neural-network-based decoder then combines the text feature vectors and the second tone information to generate the audio sequence of the next time step, finally producing smooth, natural, emotion-rich speech. The generated speech is further processed, e.g. denoised, enhanced, and equalized, and finally exported as audio, yielding the dialogue audio.
In another alternative embodiment, the WaveNet model is a neural-network-based speech synthesis algorithm. The dialogue text and the second tone information are input into the trained WaveNet model; the dialogue text is converted into a corresponding speech feature sequence, the speech feature sequence is converted into a discrete feature sequence through one-hot coding, and a convolutional neural network converts the discrete feature sequence, conditioned on the second tone information, into a continuous audio waveform matching the second tone information, finally generating the dialogue audio.
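The overall conversion step can be wrapped as below. The synthesize() function here is a hypothetical placeholder standing in for a trained Tacotron- or WaveNet-style model conditioned on tone information; it is not a real library API, and the sine-tone body exists only so the sketch runs end to end.

# Illustrative wrapper only: synthesize() is a hypothetical stand-in for a trained
# Tacotron/WaveNet-style model conditioned on tone information.
import wave, struct, math

def synthesize(dialogue_text: str, tone_info: dict) -> bytes:
    """Placeholder: returns a short sine tone whose pitch follows tone_info,
    standing in for the model's generated waveform."""
    sr, seconds = 16000, 1.0
    freq = sum(tone_info.get('pitch_hz', (100, 200))) / 2
    samples = [int(8000 * math.sin(2 * math.pi * freq * t / sr))
               for t in range(int(sr * seconds))]
    return struct.pack('<' + 'h' * len(samples), *samples)

def text_to_dialogue_audio(dialogue_text: str, second_tone_info: dict, path: str):
    pcm = synthesize(dialogue_text, second_tone_info)
    with wave.open(path, 'wb') as f:
        f.setnchannels(1); f.setsampwidth(2); f.setframerate(16000)
        f.writeframes(pcm)

text_to_dialogue_audio('So you really came.', {'pitch_hz': (70, 120)}, 'dialogue.wav')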
In another alternative embodiment, text-to-audio conversion may first be performed on the dialogue text according to the first tone information to obtain unadjusted dialogue audio, and the timbre of the unadjusted dialogue audio may then be adjusted based on the emotion information. Specifically, the timbre of the audio may be adjusted through a sound filter; the tone may be adjusted by adjusting the frequency of the audio; the pitch may be adjusted by changing the sampling rate of the audio; the loudness may be adjusted through a volume control; and the speech rate may be adjusted by changing the playback speed of the audio, finally obtaining the dialogue audio.
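A sketch of this post-hoc adjustment path follows, assuming librosa's pitch-shift and time-stretch utilities and soundfile for output; the input file name and the adjustment amounts are illustrative assumptions, not values from the patent.

# Sketch, assuming librosa/soundfile, of adjusting already-synthesised dialogue audio.
import librosa
import soundfile as sf

def adjust_dialogue_audio(in_path: str, out_path: str,
                          pitch_steps: float = 2.0,   # semitones up
                          speed_rate: float = 1.1,    # 10% faster
                          gain: float = 1.2):         # louder
    y, sr = librosa.load(in_path, sr=None)
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=pitch_steps)
    y = librosa.effects.time_stretch(y, rate=speed_rate)
    sf.write(out_path, gain * y, sr)

adjust_dialogue_audio('dialogue.wav', 'dialogue_adjusted.wav')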
In the embodiment of the application, if the character information is successfully matched with the preset character information, the first tone information corresponding to the character information is already stored in the preset information base, and it can be acquired based on the matched preset character information, so that the sound characteristics corresponding to the character information can be effectively obtained. Based on the emotion information, its category and intensity can be determined, and the adjustment direction and strength for each item of the first tone information can then be effectively determined, so that the first tone information is adjusted accordingly. The adjusted first tone information, i.e. the second tone information, can effectively express sound characteristics with emotion, so the dialogue audio obtained based on the second tone information presents emotion-rich sound; the listener can then better immerse in the content of the audio file while listening, and user experience is improved.
Optionally, before the step of performing tone adjustment on the first tone information corresponding to the character information based on the emotion information to obtain the second tone information, the method further includes:
receiving a first input for importing target audio by a user under the condition that the character information is not matched with preset character information in the preset information base;
and generating first tone information corresponding to the character information based on the audio characteristics of the target audio in response to the first input, and storing the first tone information corresponding to the character information into the preset information base.
In an alternative embodiment, the character information is matched with the preset character information in the preset information base specifically by comparing one by one, through an exact text-matching algorithm such as the KMP algorithm or a brute-force matching algorithm, whether the character information is identical to the preset character information in the preset information base.
The case in which the character information does not match the preset character information in the preset information base, described in the embodiment of the present application, specifically means that none of the preset character information in the preset information base matches the character information.
In another alternative embodiment, if the character information is merely a personal pronoun, for example "he" or "she", the character information needs to be further supplemented by analysis in combination with the context of the target content; if the supplemented character information still cannot be matched with any preset character information through the exact text-matching step, the character information is likewise considered not to match any preset character information in the preset information base.
In an alternative embodiment, the fact that the character information does not match the preset character information means that the character information has not been stored in the preset information base before, and the character information may therefore be added to the preset information base.
In an alternative embodiment, the audio may be specifically imported by the user, and tone information corresponding to the character information newly added into the preset information base is determined according to the audio imported by the user.
As described in the embodiments of the present application, the target audio may specifically be audio that the user wants to import, and the sound in that audio represents the sound characteristics the user wants the character to present. In an alternative embodiment, the user may either select a local audio file on the device as the target audio or record audio and use the recording as the target audio.
In an alternative embodiment, the first input is used to import the target audio; the first input may be an operation of importing stored target audio or an operation of recording the target audio. Illustratively, the first input includes, but is not limited to: a click input on the recording identifier or on the audio import identifier made through a touch device such as a finger or a stylus, a voice instruction input by the user, a specific gesture input by the user, or another feasible input, which is not limited in the embodiments of the application and may be determined according to actual use requirements. The specific gesture in the embodiment of the application may be any one of a single-tap gesture, a slide gesture, a drag gesture, a pressure-recognition gesture, a long-press gesture, an area-change gesture, a double-press gesture, or a double-tap gesture; the click input in the embodiment of the application may be a single click, a double click, or any number of clicks, and may also be a long-press or short-press input.
In an alternative embodiment, for example, fig. 4 is a schematic diagram of an audio collection interface provided in an embodiment of the present application. As shown in fig. 4, the audio collection interface 400 includes a recording identifier 401, and the first input may be the user long-pressing the recording identifier 401 to record the target audio. In another alternative embodiment, fig. 5 is a schematic diagram of an audio import interface provided in an embodiment of the present application. As shown in fig. 5, the audio import interface 500 includes an audio import identifier 501, and the first input may be selecting locally stored target audio and clicking the audio import identifier 501 to import the target audio.
In an alternative embodiment, after receiving the first input and acquiring the target audio, audio feature analysis is performed on the target audio to extract audio features of the target audio.
In an alternative embodiment, extracting the audio features of the target audio may specifically be done as follows: the timbre in the audio features may be extracted through spectrum analysis or cepstrum analysis; the tone and pitch may be extracted through autocorrelation or fundamental-frequency detection; the loudness may be extracted through energy analysis or short-time energy analysis; and the speech rate may be extracted through time-domain and frequency-domain methods.
In an alternative embodiment, the audio characteristics of the target audio may include, in particular, tone, pitch, loudness, or speed of speech.
In an optional embodiment, the extracted audio feature of the target audio is used as the first tone information and is stored in association with the character information newly added into the preset information base, so that the first tone information corresponding to the character information is stored in the preset information base.
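A sketch of deriving first tone information from a user-imported target audio is given below, assuming librosa's yin pitch estimator and RMS energy as stand-ins for the feature-extraction methods above; the file name, percentile mapping, and field names are assumptions.

# Sketch, under assumptions, of turning a user-imported target audio into
# first tone information via pitch tracking and short-time energy.
import librosa
import numpy as np

def tone_info_from_target_audio(path: str) -> dict:
    y, sr = librosa.load(path, sr=None)
    f0 = librosa.yin(y, fmin=50, fmax=600, sr=sr)      # fundamental-frequency track
    rms = librosa.feature.rms(y=y)[0]                   # short-time energy as loudness proxy
    return {
        'pitch_hz': (float(np.percentile(f0, 10)), float(np.percentile(f0, 90))),
        'loudness_rms': float(np.mean(rms)),
    }

first_tone_info = tone_info_from_target_audio('recorded_target.wav')
print(first_tone_info)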
In the embodiment of the application, by receiving the first input that imports the target audio, generating the first tone information according to the audio features of the target audio, and storing the first tone information in the preset information base in association with the character information, the case in which the character information does not match the preset character information in the preset information base can be handled. The situation in which the character information, and the first tone information corresponding to it, is missing from the preset information base can thus be effectively dealt with, and it is effectively ensured that every piece of character information has corresponding first tone information.
Fig. 6 is a schematic diagram of an audio playing interface provided in an embodiment of the present application. As shown in fig. 6, the audio playing interface 600 includes a playing identifier 601; the user clicks the playing identifier 601 to play the audio content converted from the document.
Fig. 7 is a second schematic flow chart of an audio generating method according to an embodiment of the present application, as shown in fig. 7, including:
in step 701, a sentence containing dialogue text is obtained; this may specifically refer to a single sentence containing the dialogue text, a natural paragraph containing the dialogue text, or several natural paragraphs containing the dialogue text.
In step 702, it is determined whether character information is recognized; specifically, the sentence containing the dialogue text may be examined to determine whether character information can be recognized. If character information can be recognized, it is taken as the character information corresponding to the sentence and the process proceeds to step 704. If no character information is recognized, the process continues with step 703.
In step 703, default narrator information is set; specifically, the character information corresponding to the sentence may be set to the default narrator information, and the process proceeds to step 706.
Step 704: judge whether the character information is in the preset information base; specifically, match the character information corresponding to the sentence with the preset character information in the preset information base. If the matching succeeds, continue with step 706; if the matching is unsuccessful, continue with step 705.
Step 705: create first tone information corresponding to the character information and store it in the preset information base. Specifically, add the character information to the preset information base, have the user import target audio, extract the audio features of the target audio, determine the first tone information according to those audio features, and store the first tone information in the preset information base in association with the newly added character information. Then continue with step 706.
Step 706: determine whether emotion information is recognized. Specifically, extract emotion words from the sentence containing the dialogue text and recognize the emotion information corresponding to those emotion words; if no emotion words are detected, perform emotion analysis on the dialogue text to recognize the emotion information. If emotion information can be identified, it is deemed recognized and the process continues with step 708. If emotion information cannot be identified, the process continues with step 707.
Step 707: use the narration tone information. Specifically, a piece of narration tone information different from the first tone information corresponding to all character information is preset; this narration tone information corresponds to a default neutral emotion and may be calm and steady.
Step 708: generate dialogue audio using the emotion information corresponding to the dialogue text, and play the dialogue audio. Specifically, the dialogue text can be converted into dialogue audio in combination with the emotion information, and the dialogue audio can effectively express emotion-rich sound.
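An end-to-end sketch mirroring steps 701-708 is given below; all helpers are trivial stubs standing in for the components discussed earlier (recognition, matching, emotion analysis, synthesis), not real APIs, and the data values are assumptions.

# End-to-end sketch mirroring steps 701-708 above; all helpers are stubs.
PRESET_INFO_BASE = {'Liqu': {'pitch_hz': (70, 120)}}
NARRATOR_TONE = {'pitch_hz': (150, 250)}                  # default neutral narration

def recognize_character(sentence):                        # step 702 stub
    return next((n for n in PRESET_INFO_BASE if n in sentence), None)

def analyse_emotion(sentence):                            # step 706 stub
    return 'anger' if '!' in sentence else None

def adjust_tone(tone, emotion):                           # step 708 stub
    return {**tone, 'emotion': emotion}

def synthesize(sentence, tone):                           # stands in for the TTS model
    return f'<audio of "{sentence}" with {tone}>'

def generate_sentence_audio(sentence):
    character = recognize_character(sentence)             # steps 702-703
    tone = PRESET_INFO_BASE.get(character)                # step 704 (step 705 omitted)
    emotion = analyse_emotion(sentence)                   # step 706
    if character is None or emotion is None:
        return synthesize(sentence, NARRATOR_TONE)        # steps 703/707
    return synthesize(sentence, adjust_tone(tone, emotion))   # step 708

print(generate_sentence_audio('Liqu roared: "Get out!"'))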
For the audio generation method provided by the embodiment of the application, the execution subject may be an audio generating apparatus. In the embodiment of the present application, the audio generating apparatus provided in the embodiment of the application is described by taking, as an example, the audio generating apparatus executing the audio generation method. Fig. 8 is a schematic structural diagram of an audio generating apparatus according to an embodiment of the present application, as shown in fig. 8, including:
an obtaining module 810, configured to obtain target content in a document, where the target content includes: character information and sentence information of dialogue text corresponding to the character information;
the analysis module 820 is configured to analyze the target content to obtain emotion information corresponding to the target content;
and a conversion module 830, configured to convert the dialogue text in the target content into dialogue audio based on the emotion information and the character information.
Optionally, the analysis module is specifically configured to:
matching the target content with emotion words in a preset emotion word bank, and determining the emotion words included in the target content, wherein the preset emotion word bank comprises at least one preset emotion word;
And determining emotion information corresponding to the target content based on the emotion words in the target content.
Optionally, the analysis module is specifically further configured to:
and under the condition that the matching of the target content and the preset emotion word bank fails, analyzing the dialogue text in the target content to obtain emotion information corresponding to the target content.
Optionally, the conversion module is specifically configured to:
acquiring first tone information based on preset character information matched with the character information under the condition that the character information is successfully matched with the preset character information in a preset information base; wherein, at least one tone information is stored in the preset information base, and each tone information corresponds to one preset character information;
performing tone adjustment on the first tone information corresponding to the character information based on the emotion information to obtain second tone information;
and performing text-to-audio conversion on the dialogue text based on the second tone information to obtain dialogue audio.
Optionally, the conversion module is specifically further configured to:
receiving a first input for importing target audio by a user under the condition that the character information is not matched with preset character information in the preset information base;
And generating first tone information corresponding to the character information based on the audio characteristics of the target audio in response to the first input, and storing the first tone information corresponding to the character information into the preset information base.
In the embodiment of the application, by acquiring the target content in the document, the sentence information containing character information and the corresponding dialogue text can be accurately selected, and the sound characteristics conforming to the character's characteristics can then be accurately determined; by analyzing the target content to obtain emotion information, the emotion corresponding to the dialogue text in the target content can be effectively determined, so that when the dialogue text is converted into dialogue audio, the emotion information and the character information can be combined to obtain dialogue audio rich in character features and character emotion. The listener can then better immerse in the content of the audio file while listening, and user experience is improved.
The audio generation apparatus in the embodiments of the present application may be an electronic device, or may be a component in the electronic device, for example, an integrated circuit or a chip. The electronic device may be a terminal, or may be a device other than a terminal. By way of example, the electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a mobile internet device (Mobile Internet Device, MID), an augmented reality (augmented reality, AR)/virtual reality (virtual reality, VR) device, a robot, a wearable device, an ultra-mobile personal computer (ultra-mobile personal computer, UMPC), a netbook or a personal digital assistant (personal digital assistant, PDA), etc., and may also be a server, a network attached storage (Network Attached Storage, NAS), a personal computer (personal computer, PC), a television (television, TV), a teller machine or a self-service machine, etc., which is not specifically limited in the embodiments of the present application.
The audio generation apparatus in the embodiments of the present application may be a device having an operating system. The operating system may be an Android operating system, an iOS operating system, or other possible operating systems, which is not specifically limited in the embodiments of the present application.
The audio generation apparatus provided in the embodiments of the present application can implement each process implemented by the method embodiments of fig. 1 to fig. 7; to avoid repetition, details are not repeated here.
Optionally, fig. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present application. As shown in fig. 9, an electronic device 900 is further provided, including a processor 901 and a memory 902, where the memory 902 stores a program or instructions executable on the processor 901. When executed by the processor 901, the program or instructions implement the steps of the foregoing audio generation method embodiments and achieve the same technical effects, which are not repeated here to avoid repetition.
The electronic device in the embodiment of the application includes the mobile electronic device and the non-mobile electronic device described above.
Fig. 10 is a schematic hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 1000 includes, but is not limited to: radio frequency unit 1001, network module 1002, audio output unit 1003, input unit 1004, sensor 1005, display unit 1006, user input unit 1007, interface unit 1008, memory 1009, and processor 1010.
Those skilled in the art will appreciate that the electronic device 1000 may also include a power source (e.g., a battery) for supplying power to the various components. The power source may be logically connected to the processor 1010 through a power management system, so as to implement functions such as charging management, discharging management, and power consumption management through the power management system. The electronic device structure shown in fig. 10 does not constitute a limitation of the electronic device, and the electronic device may include more or fewer components than shown, combine certain components, or use a different arrangement of components, which is not described in detail herein.
Wherein the processor 1010 is configured to obtain target content in a document, wherein the target content comprises: character information and sentence information of dialogue text corresponding to the character information;
analyzing the target content to obtain emotion information corresponding to the target content;
based on the emotion information and the character information, conversation text in the target content is converted into conversation audio.
The processor 1010 is configured to match the target content with emotion words in a preset emotion word bank, and determine the emotion words included in the target content, where the preset emotion word bank includes at least one preset emotion word;
and determining emotion information corresponding to the target content based on the emotion words in the target content.
The processor 1010 is configured to analyze a dialogue text in the target content to obtain emotion information corresponding to the target content if the target content fails to match with the preset emotion word bank.
The processor 1010 is configured to acquire first tone information based on the preset character information matched with the character information, if the character information is successfully matched with preset character information in the preset information base; wherein at least one piece of tone information is stored in the preset information base, and each piece of tone information corresponds to one piece of preset character information;
performing tone adjustment on the first tone information corresponding to the character information based on the emotion information to obtain second tone information;
and performing text-to-audio conversion on the dialogue text based on the second tone information to obtain dialogue audio.
Wherein, the user input unit 1007 is configured to receive a first input of importing a target audio from a user, if the character information does not match any preset character information in the preset information base;
and, in response to the first input, generate first tone information corresponding to the character information based on the audio characteristics of the target audio, and store the first tone information corresponding to the character information into the preset information base.
In the embodiments of the present application, by acquiring the target content in the document, the sentence information containing the character information and the corresponding dialogue text can be accurately selected, so that sound characteristics conforming to the character's features can be accurately determined. By analyzing the target content to obtain emotion information, the emotion corresponding to the dialogue text in the target content can be effectively determined. Therefore, when the dialogue text is converted into dialogue audio, the emotion information and the character information can be combined to obtain dialogue audio rich in character features and character emotion, so that listeners can be better immersed in the content of the audio file while listening, and user experience is improved.
It should be understood that in the embodiment of the present application, the input unit 1004 may include a graphics processor (Graphics Processing Unit, GPU) 10041 and a microphone 10042, and the graphics processor 10041 processes image data of still pictures or videos obtained by an image capturing device (such as a camera) in a video capturing mode or an image capturing mode. The display unit 1006 may include a display panel 10061, and the display panel 10061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 1007 includes at least one of a touch panel 10071 and other input devices 10072. The touch panel 10071 is also referred to as a touch screen. The touch panel 10071 can include two portions, a touch detection device and a touch controller. Other input devices 10072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and so forth, which are not described in detail herein.
The memory 1009 may be used to store software programs as well as various data. The memory 1009 may mainly include a first storage area storing programs or instructions and a second storage area storing data, wherein the first storage area may store an operating system, and application programs or instructions required for at least one function (such as a sound playing function, an image playing function, etc.). Further, the memory 1009 may include a volatile memory or a nonvolatile memory, or the memory 1009 may include both volatile and nonvolatile memories. The nonvolatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable ROM (Programmable ROM, PROM), an erasable PROM (Erasable PROM, EPROM), an electrically erasable PROM (Electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (Random Access Memory, RAM), a static RAM (Static RAM, SRAM), a dynamic RAM (Dynamic RAM, DRAM), a synchronous DRAM (Synchronous DRAM, SDRAM), a double data rate SDRAM (Double Data Rate SDRAM, DDR SDRAM), an enhanced SDRAM (Enhanced SDRAM, ESDRAM), a synchlink DRAM (Synchlink DRAM, SLDRAM), or a direct rambus RAM (Direct Rambus RAM, DRRAM). The memory 1009 in the embodiments of the present application includes, but is not limited to, these and any other suitable types of memory.
The processor 1010 may include one or more processing units. Optionally, the processor 1010 integrates an application processor and a modem processor, wherein the application processor mainly processes operations relating to the operating system, the user interface, application programs, and the like, and the modem processor mainly processes wireless communication signals, such as a baseband processor. It will be appreciated that the modem processor may also not be integrated into the processor 1010.
The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored, and when the program or the instruction is executed by a processor, the processes of the embodiment of the audio generation method are implemented, and the same technical effects can be achieved, so that repetition is avoided, and no further description is given here.
Wherein, the processor is the processor in the electronic device described in the above embodiments. The readable storage medium includes a computer readable storage medium, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, etc.
The embodiments of the present application further provide a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or instructions to implement each process of the above audio generation method embodiments and achieve the same technical effects; to avoid repetition, details are not repeated here.
It should be understood that the chip referred to in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, or a system-on-chip.
The embodiments of the present application provide a computer program product, where the program product is stored in a storage medium and is executed by at least one processor to implement each process of the above audio generation method embodiments and achieve the same technical effects, which are not repeated here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatuses in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, and may also include performing the functions in a substantially simultaneous manner or in the reverse order depending on the functions involved; for example, the described methods may be performed in an order different from that described, and various steps may also be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, and may of course also be implemented by hardware, but in many cases the former is the preferred implementation. Based on such understanding, the technical solutions of the present application, or the part thereof contributing to the prior art, may be embodied in the form of a computer software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk), including several instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device, etc.) to perform the methods described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above specific embodiments, which are merely illustrative and not restrictive. In light of the teaching of the present application, many other forms may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, and all of these fall within the protection of the present application.

Claims (10)

1. An audio generation method, comprising:
obtaining target content in a document, wherein the target content comprises: character information and sentence information of dialogue text corresponding to the character information;
analyzing the target content to obtain emotion information corresponding to the target content;
based on the emotion information and the character information, conversation text in the target content is converted into conversation audio.
2. The audio generation method according to claim 1, wherein analyzing the target content to obtain emotion information corresponding to the target content includes:
matching the target content with emotion words in a preset emotion word bank, and determining the emotion words included in the target content, wherein the preset emotion word bank comprises at least one preset emotion word;
and determining emotion information corresponding to the target content based on the emotion words in the target content.
3. The audio generation method according to claim 2, further comprising, after matching the target content with emotion words in a preset emotion word bank:
and under the condition that the matching of the target content and the preset emotion word bank fails, analyzing the dialogue text in the target content to obtain emotion information corresponding to the target content.
4. The audio generation method according to claim 1, wherein converting the dialog text in the target content into dialog audio based on the emotion information and the character information, comprises:
acquiring first tone information based on preset character information matched with the character information under the condition that the character information is successfully matched with the preset character information in a preset information base; wherein, at least one tone information is stored in the preset information base, and each tone information corresponds to one preset character information;
performing tone adjustment on the first tone information corresponding to the character information based on the emotion information to obtain second tone information;
and performing audio conversion on the dialogue text based on the second tone information to obtain dialogue audio.
5. The audio generation method according to claim 4, wherein, before performing tone adjustment on the first tone information corresponding to the character information based on the emotion information to obtain the second tone information, the method further comprises:
receiving a first input for importing target audio by a user under the condition that the character information is not matched with preset character information in the preset information base;
and generating first tone information corresponding to the character information based on the audio characteristics of the target audio in response to the first input, and storing the first tone information corresponding to the character information into the preset information base.
6. An audio generating apparatus, comprising:
the acquisition module is used for acquiring target content in the document, wherein the target content comprises: character information and sentence information of dialogue text corresponding to the character information;
the analysis module is used for analyzing the target content to obtain emotion information corresponding to the target content;
and the conversion module is used for converting the dialogue text in the target content into dialogue audio based on the emotion information and the character information.
7. The audio generating device according to claim 6, wherein the analysis module is specifically configured to:
matching the target content with emotion words in a preset emotion word bank, and determining the emotion words included in the target content, wherein the preset emotion word bank comprises at least one preset emotion word;
and determining emotion information corresponding to the target content based on the emotion words in the target content.
8. The audio generating device according to claim 7, wherein the analysis module is further specifically configured to:
and under the condition that the matching of the target content and the preset emotion word bank fails, analyzing the dialogue text in the target content to obtain emotion information corresponding to the target content.
9. The audio generating device according to claim 6, wherein the conversion module is specifically configured to:
acquiring first tone information based on preset character information matched with the character information under the condition that the character information is successfully matched with the preset character information in a preset information base; wherein, at least one tone information is stored in the preset information base, and each tone information corresponds to one preset character information;
performing tone adjustment on the first tone information corresponding to the character information based on the emotion information to obtain second tone information;
and performing audio conversion on the dialogue text based on the second tone information to obtain dialogue audio.
10. The audio generating device according to claim 9, wherein the conversion module is further specifically configured to:
receiving a first input for importing target audio by a user under the condition that the character information is not matched with preset character information in the preset information base;
and generating first tone information corresponding to the character information based on the audio characteristics of the target audio in response to the first input, and storing the first tone information corresponding to the character information into the preset information base.
CN202310366111.3A 2023-04-06 2023-04-06 Audio generation method and device Pending CN116486779A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310366111.3A CN116486779A (en) 2023-04-06 2023-04-06 Audio generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310366111.3A CN116486779A (en) 2023-04-06 2023-04-06 Audio generation method and device

Publications (1)

Publication Number Publication Date
CN116486779A true CN116486779A (en) 2023-07-25

Family

ID=87213026

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310366111.3A Pending CN116486779A (en) 2023-04-06 2023-04-06 Audio generation method and device

Country Status (1)

Country Link
CN (1) CN116486779A (en)

Similar Documents

Publication Publication Date Title
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
CN111459290B (en) Interactive intention determining method and device, computer equipment and storage medium
US20240070405A1 (en) Computer vision based sign language interpreter
US10977452B2 (en) Multi-lingual virtual personal assistant
CN108334583B (en) Emotion interaction method and device, computer readable storage medium and computer equipment
US11226673B2 (en) Affective interaction systems, devices, and methods based on affective computing user interface
US9547471B2 (en) Generating computer responses to social conversational inputs
US9031293B2 (en) Multi-modal sensor based emotion recognition and emotional interface
CN111144367B (en) Auxiliary semantic recognition method based on gesture recognition
KR102174922B1 (en) Interactive sign language-voice translation apparatus and voice-sign language translation apparatus reflecting user emotion and intention
Abdulsalam et al. Emotion recognition system based on hybrid techniques
JP6222465B2 (en) Animation generating apparatus, animation generating method and program
McTear et al. Affective conversational interfaces
CN116486779A (en) Audio generation method and device
Karpouzis et al. Induction, recording and recognition of natural emotions from facial expressions and speech prosody
JPWO2019167848A1 (en) Data conversion system, data conversion method and program
Kim et al. An empirical user-study of text-based nonverbal annotation systems for human–human conversations
CN110795581B (en) Image searching method and device, terminal equipment and storage medium
Fujita et al. Virtual cognitive model for Miyazawa Kenji based on speech and facial images recognition.
KR102261548B1 (en) Apparatus of transforming art work into multimedia based on emotion
US20240023857A1 (en) System and Method for Recognizing Emotions
US20240038225A1 (en) Gestural prompting based on conversational artificial intelligence
Egorow Accessing the interlocutor: recognition of interaction-related interlocutor states in multiple modalities
Alshamsi Real Time Facial Expression and Speech Emotion Recognition App Development on Smart Phones using Cloud Computing
Alshamsi Real-Time Facial Expression and Speech Emotion Recognition App Development on Mobile Phones using Cloud Computing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination