CN109801618B - Audio information generation method and device - Google Patents

Audio information generation method and device

Info

Publication number
CN109801618B
Authority
CN
China
Prior art keywords
word
audio information
audio
duration
module
Prior art date
Legal status
Active
Application number
CN201711137172.3A
Other languages
Chinese (zh)
Other versions
CN109801618A (en)
Inventor
李廣之
王楠
康世胤
陀得意
朱晓龙
张友谊
林少彬
郑永森
邹子馨
何静
陈在真
Current Assignee
Shenzhen Tencent Computer Systems Co Ltd
Original Assignee
Shenzhen Tencent Computer Systems Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Tencent Computer Systems Co Ltd filed Critical Shenzhen Tencent Computer Systems Co Ltd
Priority to CN201711137172.3A
Publication of CN109801618A
Application granted
Publication of CN109801618B
Legal status: Active

Landscapes

  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The embodiment of the invention discloses a method and an apparatus for generating audio information, which are used to generate fused audio matching a rhythm from input text. The method for generating audio information provided by the embodiment of the invention comprises the following steps: acquiring text information and first audio information, wherein the text information comprises at least one word; performing linguistic analysis on the text information to respectively obtain linguistic features of the at least one word; performing phoneme-level duration prediction and duration self-adaptive adjustment on the at least one word through a duration prediction model to obtain a phoneme duration prediction value of the at least one word; generating second audio information corresponding to the at least one word according to the phoneme duration prediction value of the at least one word and the corresponding linguistic features; and synthesizing the first audio information and the second audio information to obtain fused audio information.

Description

Audio information generation method and device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for generating audio information.
Background
Music plays an irreplaceable role in people's lives and can be divided into many genres according to rhythm. Hip-hop music (rap) is a style with rhythmic recitation (vocalization) over an accompaniment, where the accompaniment is typically produced by music sampling. At present, audio of this kind is mainly created manually; for example, hip-hop music is usually composed by a professional hip-hop singer. People without a musical background, however, have no way to compose such music at all.
To enable music creation without this threshold, music that ordinary users can enjoy needs to be generated. Two generation approaches exist in the prior art: the first converts the sound of a video into music, and the second converts speech recorded by a user into music. The first approach requires processing the video data to extract the sound it carries and then matching that sound with background music to produce music for the user to enjoy. The second approach does not require processing video data; the recorded speech only needs to be synthesized with background music to produce music for the user to enjoy.
In the above solutions for generating music, the sound or speech can only be simply matched with background music. Neither approach takes the audio characteristics of the sound or speech itself into account, so the generated music cannot match the content input by the user.
Disclosure of Invention
The embodiment of the invention provides a method and an apparatus for generating audio information, which are used to generate rhythm-matched audio information from input text.
In order to solve the above technical problems, embodiments of the present invention provide the following technical solutions:
in a first aspect, an embodiment of the present invention provides a method for generating audio information, where the method includes:
acquiring text information and first audio information, wherein the text information comprises at least one word;
performing linguistic analysis on the text information to respectively obtain linguistic characteristics of at least one word;
respectively carrying out phoneme-level duration prediction and duration self-adaptive adjustment on the at least one word through a duration prediction model to obtain a phoneme duration prediction value of the at least one word;
generating second audio information corresponding to the at least one word according to the phoneme duration prediction value of the at least one word and the corresponding linguistic feature;
and synthesizing the first audio information and the second audio information to obtain fused audio information.
In a second aspect, an embodiment of the present invention further provides an apparatus for generating audio information, where the apparatus includes:
an acquisition module, used for acquiring text information and first audio information, wherein the text information comprises at least one word;
the linguistic analysis module is used for carrying out linguistic analysis on the text information to respectively obtain the linguistic characteristics of at least one word;
the duration prediction module is used for respectively carrying out duration prediction and duration self-adaptive adjustment on the at least one word at a phoneme level through a duration prediction model to obtain a phoneme duration prediction value of the at least one word;
the audio generation module is used for generating second audio information corresponding to the at least one word according to the phoneme duration predicted value of the at least one word and the corresponding linguistic feature;
and the audio fusion module is used for synthesizing the first audio information and the second audio information to obtain fused audio information.
In a third aspect, the present application provides a computer-readable storage medium having stored therein instructions, which when executed on a computer, cause the computer to perform the method of the above aspects.
According to the technical scheme, the embodiment of the invention has the following advantages:
in the embodiment of the invention, the text information and the first audio information are obtained, and the linguistic analysis is carried out on the text information to respectively obtain the linguistic characteristics of at least one word. And performing phoneme-level duration prediction and duration self-adaptive adjustment on at least one word through a duration prediction model to obtain a phoneme duration prediction value of the at least one word. And generating second audio information corresponding to the at least one word according to the phoneme duration predicted value of the at least one word and the corresponding linguistic feature, and finally synthesizing the first audio information and the second audio information to obtain fused audio information. According to the embodiment of the invention, linguistic analysis can be carried out on the text information only by acquiring the text information, and the second audio information generated through the phoneme duration prediction value and the linguistic feature is subjected to duration prediction and duration self-adaptive adjustment through the duration prediction model, so that the second audio information is more easily adapted to the rhythm of the first audio information, and further, the fused audio information with more rhythm can be formed. The finally generated fusion audio information can be closely associated with the acquired text information and the first audio information, and fusion audio information matched with the rhythm can be generated through automatic processing of the text information.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings.
Fig. 1 is a schematic flowchart illustrating a method for generating audio information according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a music generation process based on a duration prediction model according to an embodiment of the present invention;
fig. 3-a is a schematic structural diagram of a device for generating audio information according to an embodiment of the present invention;
FIG. 3-b is a schematic diagram of a linguistic analysis module according to an embodiment of the present invention;
fig. 3-c is a schematic diagram of a structure of an audio generating module according to an embodiment of the present invention;
fig. 3-d is a schematic structural diagram of another audio information generating apparatus according to an embodiment of the present invention;
fig. 3-e is a schematic structural diagram of another audio information generating apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a terminal to which the audio information generating method according to the embodiment of the present invention is applied.
Detailed Description
The embodiment of the invention provides a method and an apparatus for generating audio information, which are used to generate rhythm-matched audio information from input text.
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein, are intended to be within the scope of the present invention.
The terms "comprises" and "comprising," and any variations thereof, in the description and claims of this invention and the above-described drawings are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The following are detailed below.
An embodiment of the method for generating audio information according to the present invention is particularly applicable to generating fused audio information with a rhythm matching the text based on the text information. Referring to fig. 1, a method for generating audio information according to an embodiment of the present invention includes the following steps:
101. text information and first audio information are obtained, the text information including at least one word.
In the embodiment of the present invention, the terminal may first acquire the text information and the first audio information, where the text information may be text information input to the terminal by a user, and the text information may be used to synthesize a fused audio with the first audio information, where the text information input in the terminal may be text previously stored in the terminal by the user, obtained by browsing a web page by the user, or converted by inputting a voice by the user. The first audio information may specifically be background music stored in the terminal, or song tracks, station audio content, and the like stored in the terminal, which is not limited herein.
102. And performing linguistic analysis on the text information to respectively obtain the linguistic characteristics of at least one word.
In the embodiment of the present invention, after the text information and the first audio information are read, the text information may be subjected to linguistic analysis: at least one word is segmented from the text information, and a corresponding linguistic feature is generated for each word. Linguistic features are features that describe the text at the language level. For example, the text information stored in the terminal may be analyzed sentence by sentence; lexical, grammatical, and semantic analysis determine the low-level structure of each sentence and the phoneme composition of each word.
In some embodiments of the present invention, the step 102 of performing linguistic analysis on the text information to obtain linguistic characteristics of at least one word respectively includes:
sentence breaking is carried out on the text information to obtain a sub-text of at least one sentence;
performing word segmentation on the sub-text of each sentence according to part of speech and prosody to obtain words corresponding to each sub-text;
and respectively extracting the linguistic characteristics of the words corresponding to each subfile to obtain the linguistic characteristics of at least one word.
The terminal may perform sentence segmentation on the text information, that is, split a piece of text information into the sub-texts of at least one sentence, and then segment each sub-text into words, for example according to part-of-speech and prosodic characteristics, so that each sub-text is split into one or more words; finally, linguistic features are extracted for each word. When extracting the linguistic features of the at least one word, processing such as polyphone disambiguation may also be performed, so that this text analysis of the words provides information for subsequent feature extraction. The main processing steps are pronunciation generation, prosody prediction, part-of-speech prediction, and the like.
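To make the three sub-steps above concrete, the sketch below splits text into sentences, segments each sentence into words, and attaches a feature record to every word. It is a minimal illustration: the punctuation set, the whitespace-based segment_words stub, and the feature fields are assumptions, not the segmentation or feature inventory actually used by the patent.

```python
import re

def split_sentences(text):
    # Sentence breaking: split the text into sub-texts at common sentence punctuation.
    return [s.strip() for s in re.split(r"[。！？!?.]+", text) if s.strip()]

def segment_words(sentence):
    # Stand-in segmenter; a real system would use a lexicon- and POS-aware tokenizer.
    return sentence.split()

def linguistic_features(word, position, sentence_len):
    # Toy feature record; pronunciation, prosody and part-of-speech prediction
    # would fill in richer fields here.
    return {
        "word": word,
        "position_in_sentence": position,
        "sentence_length": sentence_len,
        "num_chars": len(word),
    }

def analyze(text):
    features = []
    for sentence in split_sentences(text):
        words = segment_words(sentence)
        for i, w in enumerate(words):
            features.append(linguistic_features(w, i, len(words)))
    return features

print(analyze("hello world. this is a test!"))
```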
103. And performing phoneme-level duration prediction and duration self-adaptive adjustment on at least one word through a duration prediction model to obtain a phoneme duration prediction value of the at least one word.
In the embodiment of the present invention, after the linguistic features of the at least one word are extracted, the phoneme durations of the at least one word may be predicted using those linguistic features. For example, the linguistic features of the at least one word may be input into a duration prediction model, which may be generated by a neural network algorithm from words with known phoneme durations. The duration prediction model generated in advance in the embodiment of the invention can be used for phoneme-level duration prediction and duration self-adaptive adjustment. Here, a phoneme is a pronunciation element constituting a word: it is the smallest unit or smallest speech fragment constituting a syllable, i.e., the smallest speech unit divided from the viewpoint of sound quality. For example, the Chinese syllable "wen" has two phonemes and "jian" has four. In the embodiment of the invention, a word may comprise at least one phoneme. Phoneme-level duration prediction means that the duration prediction model takes the phoneme as the duration unit; if a word consists of several phonemes, duration prediction for the word yields the sum of the durations of all the phonemes that form it. Because music differs from ordinary speech in that it has rhythm, a self-adaptive adjustment is applied to the duration prediction result so that each word can fall on the beat while its original pronunciation remains unchanged. In the embodiment of the present invention, the linguistic features of the at least one word may be input into the duration prediction model, which then outputs the phoneme duration prediction value of the at least one word. The duration prediction model can thus predict the phoneme duration of each word and adaptively adjust it, so that the phonemes of the at least one word segmented from the text all have duration prediction values, and these values can be used to generate audio information that matches the rhythm more easily.
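The idea of phoneme-level prediction followed by duration self-adaptive adjustment can be sketched as follows: raw per-phoneme durations are scaled so each word fills a whole number of beats while the relative proportions of its phonemes, and hence the pronunciation pattern, are preserved. The fixed base duration, the beat length, and the rounding rule are assumptions for illustration, not the patent's model.

```python
def predict_phoneme_durations(word_phonemes, base_ms=120):
    # Stand-in for the duration prediction model: give every phoneme a raw duration.
    return {word: [base_ms] * len(phonemes) for word, phonemes in word_phonemes.items()}

def adapt_to_beat(durations_ms, beat_ms=500):
    # Duration self-adaptive adjustment: stretch or compress a word so that it
    # spans a whole number of beats, keeping the phoneme proportions unchanged.
    total = sum(durations_ms)
    beats = max(1, round(total / beat_ms))
    scale = (beats * beat_ms) / total
    return [d * scale for d in durations_ms]

raw = predict_phoneme_durations({"wen": ["w", "en"], "jian": ["j", "i", "a", "n"]})
adjusted = {word: adapt_to_beat(durations) for word, durations in raw.items()}
print(adjusted)  # every word now sums to a multiple of the 500 ms beat
```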
In some embodiments of the present invention, the generation of the duration prediction model may be accomplished as follows. The method for generating the audio information provided by the embodiment of the invention further comprises the following steps:
extracting phoneme duration from training samples in a training corpus;
taking the extracted phoneme duration as an input parameter of a neural network, and carrying out phoneme duration training on a duration prediction model;
after the training of the duration prediction model is finished, testing the phoneme duration of the duration prediction model by using a test sample in the test corpus;
and outputting the time length prediction model after the test is finished.
In the embodiment of the present invention, a text corpus may first be obtained to generate the duration prediction model. For example, a training corpus storing training samples is obtained, the phoneme duration of each word in every sample is extracted, and these durations are used as known values to train the duration prediction model; model training may be completed, for example, by neural network learning. Training with the known phoneme durations mainly trains the model parameters so that the speech can adapt to the rhythm. The embodiment of the invention may also provide a test corpus for storing test samples: after training of the duration prediction model is completed, the test samples are used to test the model's phoneme durations, and once the duration prediction model has converged, the tested model is output.
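A minimal training-and-testing sketch of this procedure is shown below, assuming a tiny linear duration model and synthetic corpora in place of the patent's neural network and real training/test corpora.

```python
import numpy as np

# Synthetic "training corpus": one feature vector per phoneme and its observed duration (ms).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
true_w = np.array([30.0, -10.0, 5.0, 20.0])
y_train = X_train @ true_w + 150.0 + rng.normal(scale=5.0, size=200)

# Train a linear duration model by gradient descent on the MSE loss.
w, b, lr = np.zeros(4), 0.0, 0.01
for _ in range(2000):
    err = X_train @ w + b - y_train
    w -= lr * (2.0 / len(y_train)) * (X_train.T @ err)
    b -= lr * (2.0 / len(y_train)) * err.sum()

# Synthetic "test corpus": held-out samples to check that the model has converged.
X_test = rng.normal(size=(50, 4))
y_test = X_test @ true_w + 150.0 + rng.normal(scale=5.0, size=50)
print("test MSE:", round(float(np.mean((X_test @ w + b - y_test) ** 2)), 2))
```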
104. Second audio information corresponding to the at least one word is generated based on the phoneme duration prediction value of the at least one word and the corresponding linguistic feature.
In the embodiment of the present invention, after the phoneme duration prediction value of the at least one word is obtained in step 103, audio information may be generated based on that prediction value and the linguistic features of the corresponding word. To distinguish it from the first audio information obtained in step 101, the audio information generated from the phoneme duration prediction value and the linguistic features of the at least one word is defined as the second audio information. For example, with reference to the duration prediction value of each phoneme, the linguistic features of each word are converted from text to speech; specifically, Text To Speech (TTS) may be used to convert the phoneme duration prediction values and the linguistic features obtained in the preceding steps into speech.
In some embodiments of the present invention, the step 104 of generating the second audio information corresponding to the at least one word from the phoneme duration prediction value and the corresponding linguistic feature of the at least one word comprises:
respectively carrying out acoustic feature prediction on at least one word according to the phoneme duration prediction value of the at least one word and the corresponding linguistic feature to obtain acoustic features respectively corresponding to the at least one word;
converting the acoustic characteristics corresponding to the at least one word into sound segments corresponding to the at least one word;
and synthesizing the sound segments respectively corresponding to at least one word together to obtain second audio information.
The phoneme duration prediction value of each word, together with the word's linguistic features, can be used to predict the word's acoustic features (also called sound features); the acoustic features of the word are then converted into the sound segment corresponding to that word by a speech synthesis tool, which turns the linguistic description of each word or phrase into a speech waveform. After each word has generated its corresponding sound segment, the sound segments of all the words are synthesized together to obtain the complete second audio information.
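The segment-and-concatenate structure can be pictured with the sketch below, which substitutes a sine tone for the acoustic-feature prediction and waveform generation of a real TTS back end; the sample rate, pitch, and amplitude are arbitrary assumptions.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed output sample rate

def word_to_segment(duration_ms, pitch_hz=220.0):
    # Stand-in for acoustic feature prediction plus waveform synthesis:
    # produce a sound segment of the predicted duration for one word.
    n = int(SAMPLE_RATE * duration_ms / 1000.0)
    t = np.arange(n) / SAMPLE_RATE
    return 0.3 * np.sin(2 * np.pi * pitch_hz * t)

def synthesize_second_audio(word_durations_ms):
    # Concatenate the per-word sound segments into the complete second audio.
    segments = [word_to_segment(d) for d in word_durations_ms]
    return np.concatenate(segments) if segments else np.zeros(0)

audio = synthesize_second_audio([500, 250, 250, 1000])
print(audio.shape)  # (32000,) -> 2 seconds of audio at 16 kHz
```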
105. And synthesizing the first audio information and the second audio information to obtain fused audio information.
In this embodiment of the present invention, the text information may be converted into second audio information through step 104, where the second audio information is speech carrying the lyric content corresponding to the text information; the second audio information is then combined with the first audio information to generate the final fused audio information. For example, if the first audio information is background music, the terminal may synthesize the second audio information and the background music together to obtain the fused audio information. Because the fused audio information is obtained by synthesizing the second audio information converted from the text with the first audio information, the user hears fused audio with both lyrics and rhythm when it is played. For example, the second audio information converted from the text information is synthesized with hip-hop background music to obtain hip-hop music, thereby completing the Text To Rap (TTR) processing.
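Fusing the generated voice with background music amounts to mixing two sample arrays, as in the minimal sketch below; the gain values, the padding of the shorter track, and the peak normalization are assumptions.

```python
import numpy as np

def fuse_audio(voice, background, voice_gain=1.0, music_gain=0.5):
    # Mix the second audio (voice) and the first audio (background music),
    # padding the shorter track with silence so both have equal length.
    n = max(len(voice), len(background))
    mix = np.zeros(n)
    mix[:len(voice)] += voice_gain * voice
    mix[:len(background)] += music_gain * background
    # Normalize only if the mix would clip.
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix

sr = 16000
voice = 0.3 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)      # 1 s of voice
music = 0.3 * np.sin(2 * np.pi * 110 * np.arange(2 * sr) / sr)  # 2 s of background
fused = fuse_audio(voice, music)
print(len(fused) / sr, "seconds of fused audio")
```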
In some embodiments of the present invention, after the step 104 generates the second audio information corresponding to the at least one word according to the phoneme duration prediction value and the corresponding linguistic feature of the at least one word, embodiments of the present invention may further include the following steps in addition to performing the step 105:
judging whether the second audio information and the first audio information meet rhythm matching according to the phoneme duration prediction value of the second audio information;
if the second audio information and the first audio information satisfy the prosody matching, the step 105 of synthesizing the first audio information and the second audio information to obtain the fused audio information is triggered.
In the embodiment of the present invention, a corresponding prosodic feature may be set for the first audio information, and whether the second audio information and the first audio information satisfy prosody matching is judged according to the phoneme duration prediction value of the second audio information. A prosodic feature is a feature of the prosody possessed by the audio information and can be output, for example, by neural network model detection. After its prosodic feature has been detected, the first audio information may be stored in an audio database, so that when the first audio information is acquired in step 101, its corresponding prosodic feature can be acquired as well. Only when the rhythm of the second audio information matches that of the first audio information can each word of the lyrics be kept on the beat while the original pronunciation remains unchanged.
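One way to picture this check, assuming the prosodic feature of the first audio information is reduced to a single beat length and the tolerance value is arbitrary:

```python
def prosody_match(word_phoneme_durations_ms, beat_ms, tolerance=0.05):
    # The second audio matches the first audio's prosody if every word's total
    # predicted duration is close to a whole number of beats.
    for durations in word_phoneme_durations_ms:
        total = sum(durations)
        beats = round(total / beat_ms)
        if beats == 0 or abs(total - beats * beat_ms) > tolerance * beat_ms:
            return False
    return True

print(prosody_match([[250, 250], [480, 520]], beat_ms=500))  # True: on the beat
print(prosody_match([[250, 250], [300, 333]], beat_ms=500))  # False: off the beat
```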
In some embodiments of the present invention, the method for generating audio information provided by the embodiments of the present invention may further include, in addition to performing the foregoing steps, the following steps:
if the prosody matching is not satisfied between the second audio information and the first audio information, performing prosody matching on the phoneme duration prediction value of the second audio information and audio data in an audio database, and screening out audio data from the audio database, wherein each audio data in the audio database corresponds to a prosody feature;
and synthesizing the generated second audio information and the audio data screened from the audio database to obtain fused audio information.
In the embodiment of the present invention, an audio database may be provided, for example a background music library, which stores a plurality of audio data, each corresponding to a prosodic feature. Prosody matching is performed between the phoneme duration prediction value of the second audio information and the audio data in the audio database, and the audio data screened from the database, i.e., the audio data whose prosody matches the second audio information, is used to generate the final fused audio information.
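The screening step can be sketched as picking the database entry whose prosodic feature best fits the predicted durations; here each entry's prosodic feature is assumed to be a single beat length, and the mismatch score is the distance to the nearest whole number of beats.

```python
def pick_background(total_voice_ms, audio_database):
    # Return the audio data whose beat grid the voice duration fits best.
    def mismatch(entry):
        beat = entry["beat_ms"]
        beats = max(1, round(total_voice_ms / beat))
        return abs(total_voice_ms - beats * beat)
    return min(audio_database, key=mismatch)

audio_database = [
    {"name": "track_a", "beat_ms": 430},
    {"name": "track_b", "beat_ms": 500},
    {"name": "track_c", "beat_ms": 610},
]
print(pick_background(2000, audio_database)["name"])  # -> "track_b"
```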
As can be seen from the description of the embodiment of the present invention, the text information and the first audio information are obtained, and then the text information is subjected to linguistic analysis to respectively obtain linguistic characteristics of at least one word, where the at least one word is a word obtained by segmenting a text. And respectively carrying out phoneme-level duration prediction and duration self-adaptive adjustment on at least one word through a duration prediction model to obtain a phoneme duration prediction value of at least one word. Second audio information corresponding to the at least one word is generated based on the phoneme duration prediction value of the at least one word and the corresponding linguistic feature. And synthesizing the second audio information and the first audio information to obtain fused audio information. According to the embodiment of the invention, linguistic analysis can be carried out on the text information only by acquiring the text information, and the second audio information generated through the phoneme duration prediction value and the linguistic feature is subjected to duration prediction and duration self-adaptive adjustment through the duration prediction model, so that the second audio information is more easily adapted to the rhythm of the first audio information, and further, the fused audio information with more rhythm can be formed. The finally generated fusion audio information can be closely associated with the acquired text information and the first audio information, and fusion audio information matched with the rhythm can be generated through automatic processing of the text information.
In order to better understand and implement the above-mentioned schemes of the embodiments of the present invention, the following description specifically illustrates corresponding application scenarios.
In the embodiment of the invention, songs can be composed through Artificial Intelligence (AI), which is a forward-looking attempt and provides reference value for applying AI in larger scenarios in the future. Taking the generation of hip-hop music as an example, TTR (Text To Rap) converts text into rap music: for the text information input by the user, after the linguistic features are extracted, phoneme-level duration prediction and duration self-adaptive adjustment are performed and the text information is converted into speech; background music with a specific rhythm is then added and seamlessly joined with the text speech to complete the hip-hop music, finally producing a pleasant piece of music with hip-hop characteristics.
In the embodiment of the invention, the starting point is a piece of text input by the user. The text is divided into single words or phrases, phoneme-level duration prediction and duration self-adaptive adjustment are performed for the words, and the words are finally converted into speech through TTS. Fig. 2 is a schematic diagram of a music generation process based on a duration prediction model according to an embodiment of the present invention, which mainly includes the following steps:
step 1, extracting parameters from the corpus A.
The corpus A is a training corpus, and training corpus texts are stored in the corpus A.
And 2, extracting phoneme duration from the text of the corpus A.
And 3, carrying out parametric modeling.
And 4, training a model.
And 5, generating a duration prediction model.
The training corpus text in corpus A can be used to train the duration prediction model. The phoneme durations extracted in step 2 are the actual phoneme durations contained in the words; these actual durations are used to train the model parameters so that the rhythm of the speech can adapt, yielding speech with more rhythm.
The duration prediction model generated by the embodiment of the invention mainly performs duration prediction at the phoneme level. The rhythm of hip-hop is judged according to duration; since hip-hop differs from ordinary speech in having rhythm, a duration self-adaptive adjustment is applied to the duration prediction result so that each word can stay on the beat without its original pronunciation being changed.
In the embodiment of the present invention, the duration prediction model may use a loss function (cost function) to determine whether the model has converged. The loss function reflects how well the model fits the data: the worse the fit, the larger the value of the loss function should be. The larger the loss function, the larger its gradient, so the variables being updated can be updated faster. In this invention, the loss function used is the mean squared error (MSE) criterion:

C = ||G(X) - Y||^2

where C represents the loss function, G is the duration prediction model, which outputs a prediction vector G(X) for an input matrix X, and Y is the true value.
From this loss function it can be seen that the larger the Euclidean distance between the predicted value G(X) and the true value Y, the larger the loss, and vice versa. Differentiating with respect to the model parameters gives

∂C/∂w = 2 (G(X) - Y) · ∂G(X)/∂w

where w is a parameter to be trained in the model G; w denotes a weight, and updating the model weights is the core of model training.
This is explained below in connection with the neural network model. Taking a Back Propagation (BP) neural network as an example, the error is propagated backwards through the network and the weights are updated as

w ← w - η · ∂C/∂w

where η denotes the learning rate and the backward-propagated values are the error terms passed back through the layers of the BP network. The output of the last layer corresponds to the aforementioned G(X); at the loss layer, this output and the true value Y yield the loss, and the neural network then trains the duration prediction model by minimizing the value of this loss function.
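A small numerical sketch of this loss and one gradient-descent update follows, with a toy linear model standing in for the duration prediction network; the data values and the learning rate are made up for illustration.

```python
import numpy as np

def mse_loss(pred, target):
    # C = ||G(X) - Y||^2 over the prediction vector.
    return float(np.sum((pred - target) ** 2))

# Toy linear "duration model" G(X) = X @ w and one weight update.
X = np.array([[1.0, 2.0], [3.0, 1.0]])
Y = np.array([5.0, 4.0])
w = np.array([0.5, 0.5])
eta = 0.05  # learning rate

grad = 2.0 * X.T @ (X @ w - Y)  # dC/dw for C = ||Xw - Y||^2
print("loss before:", mse_loss(X @ w, Y))
w = w - eta * grad              # w <- w - eta * dC/dw
print("loss after: ", mse_loss(X @ w, Y))  # the loss decreases
```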
Step 6. Extract parameters from corpus B.
The corpus B is a test corpus in which test corpus texts are stored.
Step 7. Extract phoneme durations from the text of corpus B.
Step 8. Perform speech adaptation.
The phoneme durations of the test text may be predicted after the duration prediction model is generated to obtain an optimal phoneme duration prediction result.
Step 9. Perform linguistic feature extraction on the text.
The user can input a piece of text as lyrics; text analysis is then performed on the lyrics to provide information for subsequent feature extraction, mainly comprising pronunciation generation, prosody prediction, part-of-speech prediction, and the like. Linguistic feature extraction is then carried out: after the text analysis result is obtained, Chinese-language and linguistic features are extracted from the result and converted into an input vector for the neural network.
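A sketch of turning one word's analysis result into the network input vector mentioned above; the part-of-speech tag set and the chosen numeric features are assumptions for illustration.

```python
import numpy as np

POS_TAGS = ["noun", "verb", "adj", "particle", "other"]  # assumed tag set

def feature_vector(word_features):
    # One-hot encode the part of speech and append simple numeric features.
    tag = word_features.get("pos", "other")
    if tag not in POS_TAGS:
        tag = "other"
    pos = np.zeros(len(POS_TAGS))
    pos[POS_TAGS.index(tag)] = 1.0
    numeric = np.array([
        word_features["num_phonemes"],
        word_features["position_in_sentence"],
        1.0 if word_features.get("prosodic_break_after") else 0.0,
    ], dtype=float)
    return np.concatenate([pos, numeric])

vec = feature_vector({"pos": "noun", "num_phonemes": 3,
                      "position_in_sentence": 0, "prosodic_break_after": True})
print(vec)  # an 8-dimensional input vector for the duration model
```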
Step 10. Adjust the phoneme durations according to the linguistic features to obtain the phoneme duration prediction result.
The generated duration prediction model can be used for phoneme-level duration prediction. Because hip-hop music differs from ordinary speech in having rhythm, duration self-adaptive adjustment is performed on the duration prediction result so that each word can fall on the beat while the original pronunciation is guaranteed to remain unchanged.
Step 11. Predict acoustic features according to the linguistic features and the phoneme duration prediction result.
Step 12. Generate speech.
The acoustic features can be predicted by combining the result predicted by the duration prediction model with the previous linguistic features, and the sound is synthesized on the basis of the result.
Step 13. Synthesize the speech and the background music, and output the music.
Finally, the speech and background music may be synthesized into a piece of music, so that a final song may be generated.
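Putting steps 9 through 13 together, the following sketch shows only how the stages chain into one another; every function body is a placeholder stub, not the real feature extraction, duration model, TTS back end, or mixer.

```python
def extract_linguistic_features(text):                      # step 9 (stub)
    return [{"word": w, "num_phonemes": max(1, len(w) // 2)} for w in text.split()]

def predict_durations(features, beat_ms=500):                # step 10 (stub)
    return [f["num_phonemes"] * beat_ms for f in features]

def generate_voice(features, durations_ms):                  # steps 11-12 (stub)
    return [("segment", f["word"], d) for f, d in zip(features, durations_ms)]

def mix_with_background(voice, background="hiphop_beat"):    # step 13 (stub)
    return {"voice": voice, "background": background}

features = extract_linguistic_features("hello world from text to rap")
durations = predict_durations(features)
voice = generate_voice(features, durations)
song = mix_with_background(voice)
print(song["background"], "+", len(voice), "voice segments")
```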
In the embodiment of the present invention, synthesized voice quality refers to the quality of the voice output by the speech synthesis system, generally evaluated subjectively in terms of clarity (or intelligibility), naturalness, continuity, and so on. Here speech synthesis is extended to hip-hop music synthesis; because hip-hop differs from ordinary speech in having rhythm, a self-adaptive adjustment is applied to the duration prediction result so that each word stays on the beat without its original pronunciation changing. In the step that adds the music, the speech to be turned into music needs rhythm, which is why duration prediction is added: the rhythm-adjusted speech and the music are synthesized together, so music with more rhythm can be formed.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
To facilitate a better implementation of the above-described aspects of embodiments of the present invention, the following also provides relevant means for implementing the above-described aspects.
Referring to fig. 3-a, an apparatus 300 for generating audio information according to an embodiment of the present invention may include: an acquisition module 301, a linguistic analysis module 302, a duration prediction module 303, an audio generation module 304, and an audio fusion module 305, wherein,
the acquisition module 301 is configured to acquire text information and first audio information, where the text information comprises at least one word;
the linguistic analysis module 302 is configured to perform linguistic analysis on the text information to obtain linguistic features of at least one word respectively;
the duration prediction module 303 is configured to perform phoneme-level duration prediction and duration adaptive adjustment on the at least one word through a duration prediction model, so as to obtain a phoneme duration prediction value of the at least one word;
an audio generating module 304, configured to generate second audio information corresponding to the at least one word according to the phoneme duration prediction value of the at least one word and the corresponding linguistic feature;
and an audio fusion module 305, configured to synthesize the first audio information and the second audio information to obtain fused audio information.
In some embodiments of the present invention, referring to fig. 3-b, the linguistic analysis module 302 includes:
a sentence-breaking module 3021, configured to perform sentence breaking on the text information to obtain a sub-text of at least one sentence;
a segmentation module 3022, configured to perform word segmentation on the sub-text of each sentence according to part of speech and prosody to obtain the words corresponding to each sub-text;
a feature extraction module 3023, configured to respectively extract a linguistic feature from the word corresponding to each sub-text, so as to obtain the linguistic feature of the at least one word.
In some embodiments of the present invention, referring to fig. 3-c, the audio generating module 304 comprises:
the acoustic prediction module 3041 is configured to perform acoustic feature prediction on the at least one word according to the phoneme duration prediction value of the at least one word and the corresponding linguistic feature, so as to obtain acoustic features corresponding to the at least one word;
a feature conversion module 3042, configured to convert acoustic features corresponding to the at least one word into sound segments corresponding to the at least one word;
a fragment synthesizing module 3043, configured to synthesize the sound segments corresponding to the at least one word together to obtain the second audio information.
In some embodiments of the present invention, referring to fig. 3-d, the apparatus 300 for generating audio information further includes:
a prosody matching module 306, configured to determine whether prosody matching is satisfied between the second audio information and the first audio information according to the predicted phoneme duration value of the second audio information; if the second audio information and the first audio information satisfy the prosody matching, the audio fusion module 305 is triggered to execute.
In some embodiments of the present invention, the prosody matching module 306 is further configured to, if prosody matching is not satisfied between the second audio information and the first audio information, perform prosody matching on the predicted phoneme duration value of the second audio information and audio data in an audio database to obtain audio data screened from the audio database, where each audio data in the audio database corresponds to a prosody feature;
the audio fusion module 305 is further configured to synthesize the generated second audio information and the audio data screened from the audio database, so as to obtain fused audio information.
In some embodiments of the present invention, referring to fig. 3-e, the apparatus 300 for generating audio information further includes:
a sample extraction module 307, configured to extract phoneme durations from training samples in a training corpus;
the model training module 308 is configured to train the phoneme durations of the time duration prediction model by using the extracted phoneme durations as input parameters of the neural network;
the model testing module 309 is configured to test the phoneme time length of the duration prediction model by using a test sample in a test corpus after the duration prediction model is trained;
and an output module 310, configured to output the duration prediction model after the test is completed.
As can be seen from the description of the embodiment of the present invention, the text information and the first audio information are obtained, and then the text information is subjected to linguistic analysis to respectively obtain linguistic characteristics of at least one word, where the at least one word is a word obtained by segmenting a text. And performing phoneme-level duration prediction and duration self-adaptive adjustment on at least one word through a duration prediction model to obtain a phoneme duration prediction value of the at least one word. Second audio information corresponding to the at least one word is generated based on the phoneme duration prediction value of the at least one word and the corresponding linguistic feature. And synthesizing the second audio information and the first audio information to obtain fused audio information. According to the embodiment of the invention, the linguistic analysis can be carried out on the text information only by acquiring the text information, and the second audio information generated through the phoneme duration prediction value and the linguistic feature is subjected to duration prediction and duration self-adaptive adjustment through the duration prediction model, so that the second audio information is more easily adapted to the rhythm of the first audio information, and the fused audio information with more rhythm can be formed. The finally generated fusion audio information can be closely associated with the acquired text information and the first audio information, and fusion audio information matched with the rhythm can be generated through automatic processing of the text information.
Another terminal is provided in the embodiment of the present invention, as shown in fig. 4, for convenience of description, only a part related to the embodiment of the present invention is shown, and details of the specific technology are not disclosed, please refer to the method part in the embodiment of the present invention. The terminal may be any terminal device including a mobile phone, a tablet computer, a PDA (Personal Digital Assistant), a POS (Point of Sales), a vehicle-mounted computer, etc., taking the terminal as the mobile phone as an example:
fig. 4 is a block diagram illustrating a partial structure of a mobile phone related to a terminal according to an embodiment of the present invention. Referring to fig. 4, the handset includes: radio Frequency (RF) circuit 1010, memory 1020, input unit 1030, display unit 1040, sensor 1050, audio circuit 1060, wireless fidelity (WiFi) module 1070, processor 1080, and power source 1090. Those skilled in the art will appreciate that the handset configuration shown in fig. 4 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 4:
RF circuit 1010 may be used for receiving and transmitting signals during information transmission and reception or during a call; in particular, downlink information from a base station is received and passed to processor 1080 for processing, and uplink data is transmitted to the base station. In general, RF circuit 1010 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuitry 1010 may communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), and the like.
The memory 1020 can be used for storing software programs and modules, and the processor 1080 executes various functional applications and data processing of the mobile phone by operating the software programs and modules stored in the memory 1020. The memory 1020 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 1020 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The input unit 1030 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the cellular phone. Specifically, the input unit 1030 may include a touch panel 1031 and other input devices 1032. The touch panel 1031, also called a touch screen, may collect a touch operation performed by a user on or near the touch panel 1031 (e.g., an operation performed by a user on or near the touch panel 1031 using any suitable object or accessory such as a finger, a stylus, etc.) and drive the corresponding connection device according to a predetermined program. Optionally, the touch panel 1031 may include two parts, namely a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 1080, and can receive and execute commands sent by the processor 1080. In addition, the touch panel 1031 may be implemented by various types such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. The input unit 1030 may include other input devices 1032 in addition to the touch panel 1031. In particular, other input devices 1032 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a track ball, a mouse, a joystick, or the like.
The display unit 1040 may be used to display information input by a user or information provided to the user and various menus of the cellular phone. The Display unit 1040 may include a Display panel 1041, and optionally, the Display panel 1041 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 1031 can cover the display panel 1041, and when the touch panel 1031 detects a touch operation on or near the touch panel 1031, the touch operation is transmitted to the processor 1080 to determine the type of the touch event, and then the processor 1080 provides a corresponding visual output on the display panel 1041 according to the type of the touch event. Although in fig. 4, the touch panel 1031 and the display panel 1041 are two independent components to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 1031 and the display panel 1041 may be integrated to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 1050, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 1041 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 1041 and/or the backlight when the mobile phone moves to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
Audio circuitry 1060, speaker 1061, microphone 1062 may provide an audio interface between the user and the handset. The audio circuit 1060 can transmit the electrical signal converted from the received audio data to the speaker 1061, and convert the electrical signal into a sound signal for output by the speaker 1061; on the other hand, the microphone 1062 converts the collected sound signals into electrical signals, which are received by the audio circuit 1060 and converted into audio data, which are then processed by the audio data output processor 1080 and then sent to another mobile phone via the RF circuit 1010, or output to the memory 1020 for further processing.
WiFi belongs to short-distance wireless transmission technology, and the mobile phone can help the user to send and receive e-mail, browse web pages, access streaming media, etc. through the WiFi module 1070, which provides wireless broadband internet access for the user. Although fig. 4 shows the WiFi module 1070, it is understood that it does not belong to the essential constitution of the handset, and can be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 1080 is a control center of the mobile phone, connects various parts of the whole mobile phone by using various interfaces and lines, and performs various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 1020 and calling data stored in the memory 1020, thereby integrally monitoring the mobile phone. Optionally, processor 1080 may include one or more processing units; preferably, the processor 1080 may integrate an application processor, which handles primarily the operating system, user interfaces, applications, etc., and a modem processor, which handles primarily the wireless communications. It is to be appreciated that the modem processor described above may not be integrated into processor 1080.
The handset also includes a power source 1090 (e.g., a battery) for powering the various components, which may preferably be logically coupled to the processor 1080 via a power management system to manage charging, discharging, and power consumption via the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In the embodiment of the present invention, the processor 1080 included in the terminal also has a function of controlling and executing the above method flow executed by the terminal.
It should be noted that the above-described embodiments of the apparatus are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention may be implemented by software plus necessary general hardware, and may also be implemented by special hardware including special integrated circuits, special CPUs, special memories, special components and the like. Generally, functions performed by computer programs can be easily implemented by corresponding hardware, and specific hardware structures for implementing the same functions may be various, such as analog circuits, digital circuits, or dedicated circuits. However, the software program implementation is a better implementation mode for the present invention in more cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk of a computer, and includes instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
In summary, the above embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the above embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the above embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (13)

1. A method for generating audio information, the method comprising:
acquiring text information and first audio information, wherein the text information comprises at least one word;
performing linguistic analysis on the text information to respectively obtain linguistic characteristics of the at least one word;
respectively carrying out phoneme-level duration prediction and duration self-adaptive adjustment on the at least one word through a duration prediction model to obtain a phoneme duration prediction value of the at least one word, wherein the duration self-adaptive adjustment is used for enabling each word to be in a beat;
generating second audio information corresponding to the at least one word according to the phoneme duration predicted value of the at least one word and the corresponding linguistic feature;
and synthesizing the first audio information and the second audio information to obtain fused audio information.
2. The method according to claim 1, wherein the performing linguistic analysis on the text information to obtain linguistic features of the at least one word respectively comprises:
sentence breaking is carried out on the text information to obtain a sub-text of at least one sentence;
performing word segmentation on the sub-text of each sentence according to the part of speech and prosody to obtain words corresponding to each sub-text;
and respectively extracting linguistic characteristics from the words corresponding to each sub-text to obtain the linguistic characteristics of the at least one word.
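A minimal sketch of the three sub-steps of claim 2 (sentence breaking, word segmentation, feature extraction) is given below. The punctuation-based sentence splitting, the whitespace word segmentation, and the toy feature set are placeholders for the part-of-speech and prosody driven analysis described in the claim.
```python
# Sketch of claim 2: sentence breaking -> word segmentation -> feature extraction.
import re

def break_sentences(text):
    # Sentence breaking on common sentence-final punctuation (assumed rule).
    return [s.strip() for s in re.split(r"[。！？!?.;；]+", text) if s.strip()]

def segment_words(sentence):
    # Placeholder segmentation; the claim segments by part of speech and prosody,
    # which a real system would obtain from a segmenter / POS tagger.
    return sentence.split()

def word_features(word, index, n_words):
    # Toy linguistic features for one word.
    return {
        "word": word,
        "length": len(word),
        "position": index,
        "sentence_final": index == n_words - 1,
    }

def linguistic_features(text):
    feats = []
    for sentence in break_sentences(text):
        words = segment_words(sentence)
        feats += [word_features(w, i, len(words)) for i, w in enumerate(words)]
    return feats

print(linguistic_features("Make it rhyme. Keep it on beat!"))
```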
3. The method of claim 1, wherein generating second audio information corresponding to the at least one word from the phoneme duration prediction value of the at least one word and the corresponding linguistic feature comprises:
respectively predicting acoustic features of the at least one word according to the phoneme duration predicted value of the at least one word and the corresponding linguistic features to obtain acoustic features respectively corresponding to the at least one word;
converting the acoustic features corresponding to the at least one word into sound segments corresponding to the at least one word;
and synthesizing the sound segments respectively corresponding to the at least one word together to obtain the second audio information.
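The per-word generation of claim 3 can be illustrated as follows, assuming a 5 ms frame shift, a flat 200 Hz pitch contour, and a sine-wave conversion in place of a real acoustic model and vocoder.
```python
# Sketch of claim 3: acoustic features per word -> sound segment per word -> second audio.
import numpy as np

SAMPLE_RATE = 16000
FRAME_SHIFT = 0.005   # 5 ms frames (assumed)

def predict_acoustic_features(duration_sec):
    # Stand-in acoustic model: flat 200 Hz pitch and constant energy per frame.
    n_frames = max(1, int(duration_sec / FRAME_SHIFT))
    return {"f0": np.full(n_frames, 200.0), "energy": np.full(n_frames, 0.3)}

def features_to_segment(feats):
    # Stand-in vocoder: turn frame-level f0/energy into a waveform segment.
    hop = int(FRAME_SHIFT * SAMPLE_RATE)
    f0 = np.repeat(feats["f0"], hop)
    amp = np.repeat(feats["energy"], hop)
    phase = 2 * np.pi * np.cumsum(f0) / SAMPLE_RATE
    return amp * np.sin(phase)

def second_audio(word_durations):
    segments = [features_to_segment(predict_acoustic_features(d)) for d in word_durations]
    return np.concatenate(segments)      # splice the per-word segments together

print(second_audio([0.33, 0.66, 0.33]).shape)
```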
4. The method of claim 1, wherein after generating the second audio information corresponding to the at least one word based on the phoneme duration prediction value and the corresponding linguistic feature of the at least one word, the method further comprises:
determining, according to the phoneme duration prediction value of the second audio information, whether the second audio information and the first audio information satisfy rhythm matching;
and if the second audio information and the first audio information satisfy rhythm matching, triggering execution of the step of synthesizing the first audio information and the second audio information to obtain the fused audio information.
5. The method of claim 4, further comprising:
if the second audio information and the first audio information do not satisfy rhythm matching, performing rhythm matching between the phoneme duration prediction value of the second audio information and audio data in an audio database, and screening out audio data from the audio database, wherein each piece of audio data in the audio database corresponds to a rhythm feature;
and synthesizing the generated second audio information and the audio data screened out from the audio database to obtain the fused audio information.
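A rough sketch of the rhythm-matching check of claim 4 and the database screening of claim 5 follows; the 10% beat tolerance, the beat-length rhythm feature, and the absolute-difference score are assumptions, not the claimed criteria.
```python
# Sketch of claims 4-5: rhythm-matching check, with database screening as fallback.

def rhythm_matches(word_durations, beat_sec, tol=0.1):
    # A word fits the rhythm if its duration is within tol beats of a whole beat count.
    for d in word_durations:
        beats = d / beat_sec
        if abs(beats - round(beats)) > tol:
            return False
    return True

def screen_audio_database(word_durations, database):
    # Each database entry carries a rhythm feature, assumed here to be its beat
    # length in seconds; pick the entry whose beats best fit the durations.
    def mismatch(entry):
        b = entry["beat_sec"]
        return sum(abs(d / b - round(d / b)) for d in word_durations)
    return min(database, key=mismatch)

durations = [0.50, 1.00, 0.52]                      # predicted word durations in seconds
database = [{"name": "loop_90bpm", "beat_sec": 60 / 90},
            {"name": "loop_120bpm", "beat_sec": 60 / 120}]

if rhythm_matches(durations, beat_sec=60 / 90):
    print("fuse second audio with the first audio")
else:
    print("fuse second audio with", screen_audio_database(durations, database)["name"])
```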
6. The method according to any one of claims 1 to 5, further comprising:
extracting phoneme durations from training samples in a training corpus;
using the extracted phoneme durations as input parameters of a neural network to perform phoneme duration training of the duration prediction model;
after the training of the duration prediction model is finished, testing the phoneme duration prediction of the duration prediction model by using test samples in a test corpus;
and outputting the duration prediction model after the test is finished.
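Claim 6's train/test procedure for the duration prediction model might look as follows in outline; a one-layer linear regressor trained by gradient descent on synthetic data stands in for the neural network and the real corpora, which are assumptions of this sketch.
```python
# Sketch of claim 6: train the duration model on a training corpus, test it, output it.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "corpora": each row is a phoneme feature vector, the target is its duration (s).
X_train, X_test = rng.normal(size=(200, 8)), rng.normal(size=(50, 8))
true_w = rng.normal(size=8)
y_train = 0.08 + 0.01 * (X_train @ true_w)
y_test = 0.08 + 0.01 * (X_test @ true_w)

# Training: extracted phoneme durations are the regression targets.
w, b, lr = np.zeros(8), 0.0, 0.05
for _ in range(500):
    err = X_train @ w + b - y_train
    w -= lr * X_train.T @ err / len(y_train)
    b -= lr * err.mean()

# Testing: evaluate duration prediction on the held-out corpus, then "output" the model.
rmse = np.sqrt(np.mean((X_test @ w + b - y_test) ** 2))
print(f"test RMSE of phoneme duration: {rmse:.4f} s")
duration_model = {"weights": w, "bias": b}
```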
7. An apparatus for generating audio information, the apparatus comprising:
an acquisition module, used for acquiring text information and first audio information, wherein the text information comprises at least one word;
the linguistic analysis module is used for performing linguistic analysis on the text information to respectively obtain the linguistic features of the at least one word;
the duration prediction module is used for performing phoneme-level duration prediction and adaptive duration adjustment on the at least one word through a duration prediction model to obtain a phoneme duration prediction value of the at least one word, wherein the adaptive duration adjustment is used for aligning each word with a beat;
the audio generation module is used for generating second audio information corresponding to the at least one word according to the phoneme duration predicted value of the at least one word and the corresponding linguistic feature;
and the audio fusion module is used for synthesizing the first audio information and the second audio information to obtain fused audio information.
8. The apparatus of claim 7, wherein the linguistic analysis module comprises:
the sentence-breaking module is used for breaking sentences of the text information to obtain at least one sentence of sub-text;
the segmentation module is used for performing word segmentation on the sub-text of each sentence according to part of speech and prosody to obtain the words corresponding to each sub-text;
and the feature extraction module is used for respectively extracting linguistic features from the words corresponding to each sub-text to obtain the linguistic features of the at least one word.
9. The apparatus of claim 7, wherein the audio generation module comprises:
the acoustic prediction module is used for respectively predicting acoustic features of the at least one word according to the phoneme duration prediction value of the at least one word and the corresponding linguistic features to obtain the acoustic features respectively corresponding to the at least one word;
the feature conversion module is used for converting the acoustic features corresponding to the at least one word into sound segments corresponding to the at least one word;
and the fragment synthesis module is used for synthesizing the sound fragments respectively corresponding to the at least one word together to obtain the second audio information.
10. The apparatus of claim 7, wherein the apparatus for generating audio information further comprises:
the rhythm matching module is used for determining, according to the phoneme duration prediction value of the second audio information, whether the second audio information and the first audio information satisfy rhythm matching; and if the second audio information and the first audio information satisfy rhythm matching, triggering the audio fusion module to execute.
11. The apparatus of claim 10, wherein the rhythm matching module is further used for, if the second audio information and the first audio information do not satisfy rhythm matching, performing rhythm matching between the phoneme duration prediction value of the second audio information and audio data in an audio database to screen out audio data from the audio database, wherein each piece of audio data in the audio database corresponds to a rhythm feature;
and the audio fusion module is further used for synthesizing the generated second audio information and the audio data screened out from the audio database to obtain the fused audio information.
12. The apparatus according to any one of claims 7 to 11, wherein the apparatus for generating audio information further comprises:
the sample extraction module is used for extracting phoneme durations from training samples in a training corpus;
the model training module is used for using the extracted phoneme durations as input parameters of a neural network to perform phoneme duration training of the duration prediction model;
the model testing module is used for testing the phoneme duration of the duration prediction model by using a testing sample in a testing corpus after the duration prediction model is trained;
and the output module is used for outputting the duration prediction model after the test is finished.
13. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method of any of claims 1-6.
CN201711137172.3A 2017-11-16 2017-11-16 Audio information generation method and device Active CN109801618B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711137172.3A CN109801618B (en) 2017-11-16 2017-11-16 Audio information generation method and device

Publications (2)

Publication Number Publication Date
CN109801618A CN109801618A (en) 2019-05-24
CN109801618B true CN109801618B (en) 2022-09-13

Family

ID=66555411

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711137172.3A Active CN109801618B (en) 2017-11-16 2017-11-16 Audio information generation method and device

Country Status (1)

Country Link
CN (1) CN109801618B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110264993B (en) * 2019-06-27 2020-10-09 百度在线网络技术(北京)有限公司 Speech synthesis method, device, equipment and computer readable storage medium
CN111445897B (en) * 2020-03-23 2023-04-14 北京字节跳动网络技术有限公司 Song generation method and device, readable medium and electronic equipment
CN111627422B (en) * 2020-05-13 2022-07-12 广州国音智能科技有限公司 Voice acceleration detection method, device and equipment and readable storage medium
CN112331219B (en) * 2020-11-05 2024-05-03 北京晴数智慧科技有限公司 Voice processing method and device
CN113674731A (en) * 2021-05-14 2021-11-19 北京搜狗科技发展有限公司 Speech synthesis processing method, apparatus and medium
CN114639371B (en) * 2022-03-16 2023-08-01 马上消费金融股份有限公司 Voice conversion method, device and equipment

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2290684A (en) * 1994-06-22 1996-01-03 Ibm Speech synthesis using hidden Markov model to determine speech unit durations
US6173263B1 (en) * 1998-08-31 2001-01-09 At&T Corp. Method and system for performing concatenative speech synthesis using half-phonemes
CN101452699A (en) * 2007-12-04 2009-06-10 株式会社东芝 Rhythm self-adapting and speech synthesizing method and apparatus
CN101694772B (en) * 2009-10-21 2014-07-30 北京中星微电子有限公司 Method for converting text into rap music and device thereof
CN102203853B (en) * 2010-01-04 2013-02-27 株式会社东芝 Method and apparatus for synthesizing a speech with information
CN101901598A (en) * 2010-06-30 2010-12-01 北京捷通华声语音技术有限公司 Humming synthesis method and system
CN103854643B (en) * 2012-11-29 2017-03-01 株式会社东芝 Method and apparatus for synthesizing voice
CN104391980B (en) * 2014-12-08 2019-03-08 百度在线网络技术(北京)有限公司 The method and apparatus for generating song
CN105788589B (en) * 2016-05-04 2021-07-06 腾讯科技(深圳)有限公司 Audio data processing method and device
CN106652995A (en) * 2016-12-31 2017-05-10 深圳市优必选科技有限公司 Voice broadcasting method and system for text

Also Published As

Publication number Publication date
CN109801618A (en) 2019-05-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant