CN116403561A - Method and device for making an audio book, and storage medium


Info

Publication number: CN116403561A
Application number: CN202310312863.1A
Authority: CN
Prior art keywords: scene, sound, audio, sentence, information
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 徐东
Current and original assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202310312863.1A
Publication of CN116403561A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027: Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; Stress or intonation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application discloses a method, a device, and a storage medium for making an audio book, for use in the technical field of audio. The method comprises the following steps: acquiring a text corresponding to the audio book; determining target sentences in the text that relate to both a character and a scene; voicing each target sentence according to the audio features corresponding to its character information, to obtain a character reading voice matched with those audio features; obtaining a scene sound effect matched with the scene information of the target sentence; determining the sentence position of the scene information in the target sentence, and adding the scene sound effect to the audio segment of the character reading voice corresponding to that sentence position, to obtain the target audio corresponding to the target sentence. Because the corresponding scene sound effect is added to the character reading voice of the target sentence, the audio of the target sentence contains not only the character reading voice but also the scene sound effect, improving the listening experience of the audio book.

Description

Method and device for making an audio book, and storage medium
Technical Field
The embodiment of the application relates to the technical field of audio, and in particular to a method and a device for making an audio book, and a storage medium.
Background
In an existing audio book, text is generally read aloud into audio, and a user acquires the information in the text by listening, as with a novel audio book. The reading may be performed manually or generated by technology.
The existing method of making an audio book through manual reading requires the text to be read sentence by sentence; when the text is long, the production cost and production time are enormous and the generation efficiency is low. With the development of deep neural network technology, an AI audio book generated directly by AI reading technology can synthesize reading voices quickly; AI reading is a technique for converting text into sound through a deep neural network. Existing AI reading is synthesized through text-to-speech technology: when converting text into speech, the AI reading distinguishes the character corresponding to each sentence in the text and converts the sentences of different characters into sound waveforms having the audio features of the corresponding characters, thereby obtaining the character reading voice for each sentence.
However, in the existing method of making an audio book through AI (artificial intelligence) reading, the audio of a sentence corresponding to a character contains only the character reading voice; the resulting reading is monotonous and the listening experience of the audio book is poor.
Disclosure of Invention
The embodiment of the application provides a method and a device for making an audio book, and a storage medium, which can effectively improve the listening experience of the audio book.
The embodiment of the application provides a method for making an audio book, comprising the following steps:
acquiring a text corresponding to the audio book;
determining a target sentence in the text that relates to both a character and a scene;
voicing the target sentence according to the audio features corresponding to the character information of the target sentence, to obtain a character reading voice matched with the audio features;
obtaining a scene sound effect matched with the scene information of the target sentence;
determining the sentence position of the scene information in the target sentence, and adding the scene sound effect to the audio segment of the character reading voice corresponding to the sentence position, to obtain the target audio corresponding to the target sentence.
Further, the determining the sentence position of the scene information in the target sentence includes:
acquiring a first phoneme sequence of the scene content word corresponding to the scene information;
and determining the sequence position of the first phoneme sequence within the phoneme sequence corresponding to the target sentence.
Further, the adding the scene sound effect to the audio segment of the character reading voice corresponding to the sentence position includes:
determining, according to the sequence position, the audio frame sequence corresponding to the first phoneme sequence in the character reading voice;
taking a preset audio frame corresponding to the audio frame sequence as the sound effect start frame of the scene sound effect;
determining the sound effect duration of the scene sound effect according to the sound source in the scene information;
and adding the scene sound effect to the character reading voice according to the sound effect start frame and the sound effect duration.
Further, the determining the sound effect duration of the scene sound effect according to the sound source in the scene information includes:
if the volume change value of the sound source within a preset duration is greater than a preset volume threshold, determining that the scene sound effect is a trigger sound, the sound effect duration of the trigger sound being shorter than the preset duration;
if the volume change value of the sound source within the preset duration is smaller than the preset volume threshold, determining that the scene sound effect is an ambient background sound, the sound effect duration of the ambient background sound being longer than the preset duration.
Further, the adding the scene sound effect to the character reading voice includes:
if a plurality of scene sound effects fall on the same audio frame and the plurality of scene sound effects include both a trigger sound and an ambient background sound, increasing the volume of the trigger sound based on a preset volume-balancing technique, so that the volume of the trigger sound is higher than that of the ambient background sound.
Further, the adding the scene sound effect to the character reading voice according to the sound effect start frame and the sound effect duration includes:
fading the scene sound effect into the character reading voice at the sound effect start frame, and continuously increasing the volume of the scene sound effect until a preset volume is reached;
and reducing the volume of the scene sound effect during a preset ending period of the sound effect duration, fading the scene sound effect out of the character reading voice.
Further, the determining the target sentence in the text that relates to both a character and a scene includes:
determining whether preset character dialogue information exists in the semantic information of a sentence in the text, or whether the semantic information of the sentence matches semantic information in a preset character sentence set;
if the preset character dialogue information exists, or the semantic information matches the preset character sentence set, determining that the sentence is a character sentence related to a character;
and if preset scene semantics exist in the character sentence, determining the character sentence as the target sentence.
Further, the voicing the target sentence according to the audio features corresponding to the character information of the target sentence, to obtain a character reading voice matched with the audio features, includes:
inputting the target sentence into a preset reading model, and determining, in the preset reading model, the timbre feature corresponding to the character information of the target sentence;
converting the phoneme sequence corresponding to the target sentence into target audio features based on the timbre feature corresponding to the character information;
and converting the target audio features into a sound waveform, to obtain the character reading voice matched with the audio features.
Further, the obtaining a scene sound effect matched with the scene information of the target sentence includes:
determining, according to preset scene semantics in the scene information, a scene sound effect matched with the scene information from a preset sound effect library, wherein the preset sound effect library contains scene sound effects corresponding to various scene semantics.
The embodiment of the application also provides a device for making an audio book, comprising:
a central processing unit, a memory, and an input/output interface;
the memory being a transient memory or a persistent memory;
the central processing unit being configured to communicate with the memory and to execute the instructions in the memory to perform the method described above.
Embodiments of the present application also provide a computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the above-described method.
From the above technical solutions, the embodiments of the present application have the following advantages:
the method comprises the following steps: acquiring a text corresponding to the audio book; determining target sentences related to roles and scenes in the text; performing sounding processing on the target sentence according to the audio features corresponding to the character information of the target sentence to obtain character reading sounds matched with the audio features; obtaining scene sound effects matched with the scene information according to the scene information of the target sentence; determining the sentence position of the scene information in the target sentence, and adding the scene sound effect into the audio segment of the character reading sound corresponding to the sentence position to obtain the target audio corresponding to the target sentence. By adding the corresponding scene sound effect into the character reading sound of the target sentence, the character reading sound corresponding to the target sentence is combined with the scene sound effect, so that the audio of the target sentence not only contains the character reading sound, but also contains the scene sound effect, and the audio hearing effect of the audio book is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; a person of ordinary skill in the art may derive other drawings from these drawings without inventive effort.
FIG. 1 is a diagram of a communication architecture for audio book production according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for making an audio book according to an embodiment of the present application;
FIG. 3 is a flow chart of another method for making an audio book according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an audio book according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an acoustic model according to an embodiment of the present application;
FIG. 6 is a diagram of a device for making an audio book according to an embodiment of the present application;
FIG. 7 is a diagram of another device for making an audio book according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the present application.
In the following description, references to "an embodiment" or "some embodiments" describe a subset of all possible embodiments; the subsets may be the same or different and may be combined with each other when there is no conflict. In the following description, the term "plurality" means at least two.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
In the existing process of making an audio book, the corresponding text is obtained and then converted into corresponding audio, and a user obtains the information in the text by listening to that audio. As shown in fig. 1, an audio book making device 101 is connected to an audio book reader 102; the connection may be a wired or wireless network connection, which is not limited herein. It will be appreciated that the audio book reader 102 may be a mobile phone, a handheld e-book device, or the like, and the text may be a novel or another written work, which is likewise not limited herein. The audio book making device 101 may obtain the text of the audio book to be made from the audio book reader 102, convert the text into audio, and transmit the audio back to the audio book reader 102, which may play the corresponding audio while displaying the text. The audio book making device 101 may be coupled to one or more audio book readers 102, and may be provided inside or outside the audio book reader 102; neither is limited herein. Existing AI reading is synthesized through text-to-speech technology: when converting text into speech, the AI reading distinguishes the character corresponding to each sentence in the text and converts the sentences of different characters into sound waveforms having the audio features of the corresponding characters, thereby obtaining the character reading voice for each sentence. However, in this existing method of making an audio book through AI reading, the audio of a sentence corresponding to a character contains only the character reading voice; the resulting reading is monotonous and the listening experience of the audio book is poor. Therefore, the embodiment of the application provides a method for making an audio book that can improve the listening experience of the audio book, as shown in fig. 2, specifically as follows:
201. Acquire the text corresponding to the audio book.
In this embodiment of the present application, the audio book making device may obtain the text corresponding to the audio book. Specifically, the text to be made into the audio book may be obtained from an audio book reader connected to the audio book making device, converted into audio, and the audio transmitted back to the audio book reader. It will be appreciated that the audio book making device may obtain the text corresponding to the audio book from a connected audio book reader, or from its own repository, which is not limited herein.
202. Determine a target sentence in the text that relates to both a character and a scene.
After the audio book making device obtains the text corresponding to the audio book, it can determine the target sentences in the text that relate to both a character and a scene. Relating to a character means that the sentence is spoken by a preset character in the text; relating to a scene means that the sentence contains corresponding scene information. A target sentence relates to a character and a scene at the same time, that is, it is a sentence spoken by a preset character that also carries scene information. It will be appreciated that a text generally contains a plurality of target sentences. Specifically, whether a sentence is a target sentence related to a character and a scene may be determined from the semantic information of the sentence. Whether the sentence is related to a character may be determined from the character information in its semantic information; the character information identifies the character who reads the sentence. For example, if the sentence is "character A speaks to character B", the character corresponding to the sentence is character A. The character information may be understood as a character name, and a character may be an old person, a young person, a student, a teacher, and so on, which is not specifically limited herein. After determining that the sentence is related to a character, it may further be determined whether the sentence is related to a scene, that is, whether its semantic information contains scene information. The scene information refers to a scene appearing in the sentence; for example, if the sentence contains "the engine starting sound of a car being started", the scene corresponding to the sentence is an engine-start scene. If the sentence is further determined to be related to a scene, it is determined to be a target sentence related to both a character and a scene. It is also possible to first determine whether a sentence in the text is related to a scene and then determine whether it is related to a character; the order is not limited herein.
203. Voice the target sentence according to the audio features corresponding to the character information of the target sentence, to obtain a character reading voice matched with the audio features.
After the target sentences related to characters and scenes in the text are determined, each target sentence can be voiced according to the audio features corresponding to its character information, to obtain a character reading voice matched with those audio features. The audio features corresponding to the character information refer to the audio features of the character's pronunciation (speech), and include timbre, loudness, and the like, which are not limited herein. If the character of the target sentence is an old person, the corresponding audio features are deeper and huskier; if the character is a student, the corresponding audio features are lighter and clearer. Voicing the target sentence according to these audio features yields a reading voice consistent with them; for example, if the character of the target sentence is a student, the reading voice of the target sentence is a lighter voice.
204. Obtain a scene sound effect matched with the scene information of the target sentence.
After the target sentences related to characters and scenes in the text are determined, a scene sound effect matched with the scene information can be obtained according to the scene information of each target sentence. Specifically, the scene sound effect can be obtained according to the scene semantics in the scene information. If the scene semantics are "the train is coming", the scene sound effect can be determined to be the sound of a train; the sound of the train may then be downloaded from the network or retrieved from the sound effect library of the audio book making device, which is not limited herein.
It is understood that the execution sequence of step 203 and step 204 is not limited.
205. Add the scene sound effect to the audio segment of the character reading voice corresponding to the scene information, to obtain the target audio corresponding to the target sentence.
After the character reading voice and the scene sound effect matched with the scene information are obtained, the scene sound effect can be added to the audio segment of the character reading voice corresponding to the scene information, yielding the target audio for the target sentence. Specifically, the sentence position of the scene information in the target sentence is determined, and the scene sound effect is added to the audio segment of the character reading voice corresponding to that sentence position. It can be understood that the scene content text describing the scene information occupies a certain position in the target sentence. For example, if the target sentence is "Zhang San says: I hear a knock at the door", the scene content text of the scene information is "knock at the door", and the position of "knock at the door" in the target sentence can be determined. The scene sound effect is then added to the audio segment of the character reading voice corresponding to that position, i.e., to the audio segment in which the scene content text is being read. The scene sound effect may be added at the first audio frame in which the scene content text is read, or a few frames earlier, which is not limited herein; for example, the knocking sound effect may be added at the first audio frame in which the word "knock" is read, or a few frames before it.
It can be understood that different target sentences in the text have different semantic information, so the obtained character information and scene sound effects also differ. When the character reading voices and scene sound effects of a plurality of target sentences are mixed, a target audio containing multiple character timbres and multiple scene sound effects is obtained; when the target audio is played, the audio of the current sentence is played with the reading voice matched with the audio features of its character information together with the scene sound effect of the current sentence.
As can be seen, the method of the embodiment of the present application includes: acquiring a text corresponding to the audio book; determining a target sentence in the text that relates to both a character and a scene; voicing the target sentence according to the audio features corresponding to its character information, to obtain a character reading voice matched with the audio features; obtaining a scene sound effect matched with the scene information of the target sentence; determining the sentence position of the scene information in the target sentence, and adding the scene sound effect to the audio segment of the character reading voice corresponding to the sentence position, to obtain the target audio corresponding to the target sentence. Because the corresponding scene sound effect is added to the character reading voice of the target sentence, the reading voice is combined with the scene sound effect, so that the audio of the target sentence contains not only the character reading voice but also the scene sound effect, improving the listening experience of the audio book.
The overall process of making the audio book has been described above; the process is now described in detail with reference to fig. 3, with the following specific steps:
301. Acquire the text corresponding to the audio book.
It should be noted that step 301 is similar to step 201 described above and is not repeated here.
302. Determine the target sentences related to characters and scenes according to the semantic information of the sentences in the text.
In the embodiment of the application, the audio book making device can determine the target sentences related to characters and scenes according to the semantic information of the sentences in the text. Specifically, it can be determined whether preset character dialogue information exists in the semantic information of a sentence in the text, or whether the semantic information of the sentence matches semantic information in a preset character sentence set; if the preset character dialogue information exists, or the semantic information matches, the sentence is determined to be a character sentence related to a character; and if preset scene semantics exist in the character sentence, the character sentence is determined to be a target sentence.
It will be appreciated that the audio book making device may analyze whether the semantic information of a sentence in the text contains character information, to determine whether the sentence is a character sentence. Determining the character information includes deciding whether or not to classify the sentences of the text by character. Specifically, as shown in fig. 4, whether multi-character deduction is needed can be selected according to the text type: news, for example, is typically broadcast by a single person and needs no character classification, whereas a novel or drama typically contains multiple characters, each with different character features, and does need character classification. When classifying characters, text types containing multiple characters, such as novels and dramas, can be processed by natural language processing technology to obtain splitting results for the different characters; for example, the current sentence is spoken by the character Zhang San, the next sentence by the character Li Si, and the last sentence is narration. Determining whether the semantic information of a sentence matches the semantic information in the preset character sentence set can be done by a natural language processing model trained on big data, while determining whether preset character dialogue information exists can be done by a rule-based post-processing method. Specifically, the former is determination mode A and the latter is determination mode B.
Mode A: determining whether the semantic information of a sentence in the text matches semantic information in the preset character sentence set.
When the audio book making device determines whether the semantic information of a sentence matches semantic information in the preset character sentence set, it can compare the semantic information of the sentence with the semantic information in a sentence training set to decide whether the sentence carries character information. Specifically, a preset language processing model can be trained on a sentence training set in which each sentence carries a corresponding character label; the sentence training set can be large-scale Chinese text data, and the preset language processing model can be a BERT model. Text features of the paragraph containing the current sentence can be extracted; a text feature is a representation vector extracted by the natural language processing model, i.e., a set of data representing the position of the current text in a feature space, and texts with different features lie in different regions of that space. For example, the three sentences before and after the current sentence can be taken as the paragraph. In the training stage, supervised training can be performed with the preset language processing model by labelling the characters of the sentences in the text, such as the character name of each sentence of a novel. When determining whether a sentence carries character information, the sentence to be classified is input, and if a predicted character name is obtained, the sentence is determined to carry character information. Specifically, the sentence is input into the preset language processing model, which compares the semantic information of the sentence with the semantic information of the sentence training set; if the training set contains a first sentence matching the input sentence, the sentence is determined to carry character information, and the character label of the first sentence is taken as the character information of the input sentence. It can be understood that, during training, the preset language processing model learns the sentences of the training set and their character labels by itself; for example, it can pick out specific fields that occur with the same or a certain probability among the sentences belonging to the same character label, and then judge whether an input sentence contains such a specific field. If so, the sentence is determined to carry character information, and its semantic information is determined to match the semantic information in the preset character sentence set.
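As an illustrative sketch of mode A (not the patent's actual model or training data), the snippet below compares a sentence's embedding against a small character-labelled sentence set using a pretrained Chinese BERT encoder; the model name, the labelled sentences, and the similarity threshold are all assumptions made for the example.

```python
# Sketch of mode A: match a sentence against a character-labelled sentence
# set via BERT sentence embeddings. Model name and threshold are illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

def embed(text: str) -> torch.Tensor:
    """Mean-pooled token embeddings as a crude sentence vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

# Hypothetical character sentence set: sentences with known character labels.
role_sentences = [
    ("张三说:我听到了敲门声", "张三"),
    ("李四说:今天天气真好", "李四"),
]

def match_role(sentence: str, threshold: float = 0.85):
    """Return the character label of the closest labelled sentence,
    if it is similar enough; None means no character information found."""
    query = embed(sentence)
    best_label, best_sim = None, threshold
    for text, label in role_sentences:
        sim = torch.cosine_similarity(query, embed(text), dim=0).item()
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label

print(match_role("张三说:有人在敲门"))
```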
Mode B: determining whether preset character dialogue information exists in the semantic information of a sentence in the text.
Determining whether preset character dialogue information exists in the semantic information of a sentence is a rule-based post-processing method: when preset character dialogue information appears, the character field contained in the current sentence is analyzed to establish that the preset character dialogue information exists. When it is determined that preset character dialogue information exists, the character information of the sentence can be determined. Specifically, it is first determined whether preset character dialogue information exists in the semantic information of the sentence; if so, the name of the speaking character is extracted from the preset character dialogue information and taken as the character information of the sentence. For example, for the sentence: Zhang San says to Li Si: "The weather today is really good!", the sentence contains the two character names "Zhang San" and "Li Si", and someone is clearly speaking, so the character of the sentence is "Zhang San". If the text is a continuous dialogue and the speaker is not explicitly marked, a reader can only infer who the speaker is through semantic understanding; for sentence patterns where the character is not obvious in this way, the prediction result of the natural language processing model (mode A) can be used to determine the character information of the sentence.
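As a minimal sketch of mode B (the patent does not give its exact rules; the pattern below is an assumption for illustration), a regular expression can extract the speaking character from dialogue sentences of the form 张三对李四说:"…":

```python
# Sketch of mode B: rule-based extraction of the speaking character from a
# dialogue sentence. The pattern is illustrative, not the patent's rule set.
import re

# Matches forms like  张三对李四说:"..."  or  张三说:"..."
DIALOGUE = re.compile(
    r'^(?P<speaker>[\u4e00-\u9fff]{2,4})(?:对[\u4e00-\u9fff]{2,4})?说[:：]'
)

def speaker_of(sentence: str):
    """Return the speaking character's name, or None if no dialogue rule fires."""
    m = DIALOGUE.match(sentence)
    return m.group("speaker") if m else None

print(speaker_of('张三对李四说:"今天天气真好!"'))  # -> 张三
print(speaker_of('他们继续沉默地走着。'))           # -> None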
It can be understood that only one of the mode a and the mode B may be executed, or the mode B may be further executed after the mode a is executed, and after determining that the sentence is a sentence related to the role, the role information of the sentence may be more accurately determined by comparing the role information obtained by the two modes. If the execution mode A is finished to obtain first character information, the execution mode B is finished to obtain second character information, and when the first character information is the same as the second character information, the character information of the sentence is determined to be the first character information (namely the second character information); when the first character information is different from the second character information, the second character information may be taken as character information of the target sentence, and the second character information of the mode B may be taken as the control.
303. Voice the target sentence according to the audio features corresponding to the character information of the target sentence in a preset reading model, to obtain a character reading voice matched with the audio features.
After the target sentences related to characters and scenes are determined, each target sentence can be voiced according to the audio features corresponding to its character information in a preset reading model, to obtain a character reading voice matched with the audio features. The target sentence can be input into the preset reading model, and the timbre feature corresponding to the character information of the target sentence determined in the model; the phoneme sequence corresponding to the target sentence is converted into target audio features based on that timbre feature; and the target audio features are converted into a sound waveform, yielding the character reading voice matched with the audio features.
Specifically, in the preset reading model, the target sentence is voiced into a sound waveform based on the audio features corresponding to its character information. The preset reading model can be an AI reading model; AI reading is generally technology-driven, and sentences of different characters generate sound content with different timbres through speech synthesis technology. For example, inputting the sentence content of Zhang San into the AI reading model yields a voice of a first timbre; inputting the sentence content of Li Si yields a voice of a second timbre, the first and second timbres being different.
Specifically, the audio features corresponding to the character information include a timbre feature, and the target sentence is input into the preset reading model, which comprises an acoustic front end, an acoustic model, and a vocoder. The acoustic front end performs text-to-phoneme conversion (such as Chinese G2P), word segmentation (such as jieba segmentation), text regularization, and the like, converting the input text into a phoneme sequence; that is, the target sentence is converted into a phoneme sequence at the acoustic front end. Specifically, the text can be converted into the phoneme sequence according to its pronunciation order: if the input target sentence is 天气好 ("the weather is good"), the corresponding pinyin sequence is "tian1 qi4 hao3"; the pinyin sequence is then further split into initials and finals, with word segmentation, prosody, and similar information added, to form the phoneme sequence. The acoustic model, which converts the phoneme sequence into audio features (such as a mel spectrogram), may be a non-autoregressive scheme (e.g., FastSpeech) or an autoregressive scheme (e.g., Tacotron). Specifically, the timbre feature corresponding to the character information of the target sentence is determined in the acoustic model, and the phoneme sequence is converted into the target audio features based on that timbre feature, i.e., into the audio features corresponding to each audio frame; the audio features include timbre, energy, fundamental frequency, and similar information. The vocoder, which converts the audio features into a sound waveform, may be an adversarial-network scheme (e.g., HiFi-GAN) or a recurrent-neural-network scheme (e.g., WaveRNN); that is, the target audio features are converted into a sound waveform in the vocoder, yielding a reading voice matched with the audio features. It can be understood that the AI reading model can be trained in advance to learn the timbre feature of each character; in use, different timbre features are selected according to the different character names, the phoneme sequence is converted into audio features based on the corresponding timbre feature, and the audio features are converted into a sound waveform. It can also be understood that the character reading voice of a character dialogue generally does not voice the character name itself, but only the content spoken by the character; for example, if the target sentence is "Zhang San says: I hear a knock at the door", the character reading voice of Zhang San is: I hear a knock at the door.
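The three-stage pipeline can be sketched as follows. The front end uses the real jieba and pypinyin libraries to approximate the Chinese G2P and word segmentation described above; `acoustic_model` and `vocoder` are hypothetical stand-ins for, e.g., a FastSpeech-style model and a HiFi-GAN-style vocoder, since the patent does not fix their interfaces.

```python
# Sketch of the acoustic front end -> acoustic model -> vocoder pipeline.
# jieba and pypinyin are real; acoustic_model and vocoder are assumed callables.
import jieba
from pypinyin import lazy_pinyin, Style

def front_end(text: str) -> list[str]:
    """Text -> phoneme sequence: segment words, then split each syllable
    into initial and final (a simplified Chinese G2P)."""
    phonemes = []
    for word in jieba.cut(text):
        initials = lazy_pinyin(word, style=Style.INITIALS, strict=False)
        finals = lazy_pinyin(word, style=Style.FINALS_TONE3, strict=False)
        for ini, fin in zip(initials, finals):
            if ini:
                phonemes.append(ini)
            phonemes.append(fin)
    return phonemes

def synthesize(text: str, speaker_id: int):
    phonemes = front_end(text)                  # acoustic front end
    mel = acoustic_model(phonemes, speaker_id)  # hypothetical: phonemes -> mel spectrogram
    waveform = vocoder(mel)                     # hypothetical: mel -> sound waveform
    return waveform

print(front_end("天气好"))  # e.g. ['t', 'ian1', 'q', 'i4', 'h', 'ao3']
```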
Further, to provide as many timbre options as possible for the AI reading model, the acoustic model may be modified, as shown in fig. 5. The character information of the target sentence includes a character identifier of the target sentence; a plurality of character identifiers and their corresponding timbre features can be added to the audio coding network of the acoustic model. A target character identifier matching the character identifier of the target sentence is determined from the plurality of character identifiers, and the timbre feature corresponding to the target character identifier is taken as the timbre feature for the character identifier of the target sentence. It will be appreciated that ID coding information of different speakers (characters) and audio feature coding information of the input speech may be added to the audio coding network. The ID coding information may be obtained through a convolutional network; the audio feature coding information may be obtained through a pretrained timbre extraction model, such as wav2vec, to extract feature vectors containing the timbre information while discarding content information unrelated to timbre, i.e., decoupling timbre and content in the input speech. Then, after the audio features predicted by the acoustic model, a feature classification network is added to analyze whether the predicted audio features have the expected timbre; this post-processing helps the acoustic model better model inputs of different timbres. In the figure, the Encoder is the encoder of the acoustic model, the Variance Adaptor is the variance adaptor, and the Decoder is the decoder. Meanwhile, the vocoder can be trained with voices of different timbres, so that it learns the pronunciation characteristics of the different timbres and synthesizes the corresponding sound waveforms more accurately in use.
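A minimal PyTorch sketch of conditioning the encoder on a character (speaker) ID, in the spirit of fig. 5; all dimensions and layer choices are assumptions, and the feature-classification post-network and wav2vec-style timbre extractor are omitted:

```python
# Minimal sketch of a speaker-ID-conditioned acoustic model encoder.
# Dimensions, layer counts, and vocabulary size are assumed values.
import torch
import torch.nn as nn

class SpeakerConditionedEncoder(nn.Module):
    def __init__(self, num_speakers: int = 32, dim: int = 256, vocab: int = 100):
        super().__init__()
        self.phoneme_emb = nn.Embedding(vocab, dim)          # phoneme id -> vector
        self.speaker_emb = nn.Embedding(num_speakers, dim)   # speaker id -> timbre vector
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, phoneme_ids: torch.Tensor, speaker_id: torch.Tensor):
        x = self.phoneme_emb(phoneme_ids)                    # (B, T, dim)
        x = x + self.speaker_emb(speaker_id).unsqueeze(1)    # broadcast timbre over time
        return self.encoder(x)                               # speaker-aware hidden states

enc = SpeakerConditionedEncoder()
hidden = enc(torch.randint(0, 100, (1, 12)), torch.tensor([3]))
print(hidden.shape)  # torch.Size([1, 12, 256])
```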
Furthermore, the voicing may also be done by manual reading: the target sentence may be read aloud by a person into a sound waveform based on the timbre feature corresponding to the character information of the target sentence. Specifically, real-person reading means that the target sentence is manually read into a sound waveform based on the timbre feature corresponding to its character information, the audio features including the timbre feature. A target character requiring manual reading is selected from the plurality of character information contained in the text; it can be appreciated that, after the text is classified by character, the sentences belonging to the same character can be grouped together. The sentences corresponding to the target character are sent to a live voice anchor whose timbre matches the timbre feature of the target character for recording, the returned sound waveform is received, and the recorded voice is taken as the reading voice matched with the audio features. It can be understood that, for real-person reading, the recording party can select the character sentences to be read by a real person, send them over the network to the designated anchor to record the voice, and then receive the recorded voice over the network. Taking a novel as an example, the text may contain dozens of characters, spanning ages from old to middle-aged to young and both sexes; real-person reading can supply the voice content for some of these characters.
304. Determine the scene sound effect matched with the scene information of the target sentence from a preset sound effect library.
After the target sentences related to characters and scenes are determined, the scene sound effect matched with the scene information of each target sentence can be determined from a preset sound effect library. The scene sound effect matched with the scene information can be determined from the preset sound effect library according to the preset scene semantics in the semantic information of the target sentence, the preset sound effect library containing scene sound effects corresponding to various scene semantics. It can first be judged whether preset scene semantics exist in the semantic information of the target sentence; if so, the scene sound effect matched with the preset scene semantics is obtained from the preset sound effect library. It can be understood that the preset sound effect library can pre-store various scene sound effects, such as whistle sounds and reading sounds, and can also obtain more scene sound effects in real time through a network connection; the preset sound effect library is generally provided in the audio book making device. By analyzing the semantic information of the input text, when semantics matching a specified scene are present, the corresponding sound in the sound effect library is used. Natural language processing technology determines whether the current sentence contains specified scene semantics; for example, for "the train is coming", a sound file of a train whistle is found in the sound effect library, and for "walks over in high heels", a sound file of high-heel footsteps is found.
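A minimal sketch of the sound-effect library lookup, assuming a simple keyword match between scene semantics and library entries; the keywords and file paths are illustrative placeholders:

```python
# Sketch of matching scene semantics against a preset sound-effect library.
# Keywords and file paths are illustrative placeholders.
SOUND_EFFECT_LIBRARY = {
    "火车": "effects/train_whistle.wav",      # "the train is coming"
    "敲门": "effects/knock.wav",              # "a knock at the door"
    "高跟鞋": "effects/heels_footsteps.wav",  # "walks over in high heels"
}

def find_scene_effect(sentence: str):
    """Return (keyword, effect file) for the first library keyword found
    in the sentence, or None when no scene semantics match."""
    for keyword, path in SOUND_EFFECT_LIBRARY.items():
        if keyword in sentence:
            return keyword, path
    return None

print(find_scene_effect("张三说:我听到了敲门声"))  # ('敲门', 'effects/knock.wav')
```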
It should be understood that the execution sequence of step 303 and step 304 is not limited herein.
305. Determine the sentence position of the scene information in the target sentence.
In the embodiment of the application, when mixing the character reading voice and the scene sound effect, the sentence position of the scene information in the target sentence needs to be determined. The first phoneme sequence of the scene content text corresponding to the scene information can be obtained, that is, the scene content text is converted into the first phoneme sequence. It can be understood that phonemes are the smallest phonetic units divided according to the natural attributes of a language; each word has corresponding phonemes, which can be represented by pinyin. If the target sentence is "Zhang San says: I hear a knock at the door", the scene content text is "knock at the door" (qiao men), and the corresponding first phoneme sequence is "q, iao, m, en". After the first phoneme sequence corresponding to the scene information is obtained, its sequence position within the phoneme sequence corresponding to the target sentence is determined, and that sequence position is taken as the sentence position. It can be understood that the target sentence can be converted into its corresponding phoneme sequence, and the sequence position of the first phoneme sequence within it then determined.
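A sketch of this step, assuming the simplified initial/final phoneme split used in the examples above (pypinyin is real; the splitting scheme is illustrative):

```python
# Sketch of step 305: locate the scene word's phoneme sequence inside the
# target sentence's phoneme sequence. Phoneme splitting is simplified.
from pypinyin import lazy_pinyin, Style

def to_phonemes(text: str) -> list[str]:
    """Split each syllable into initial + final, e.g. 敲门 -> ['q','iao1','m','en2']."""
    out = []
    for ini, fin in zip(
        lazy_pinyin(text, style=Style.INITIALS, strict=False),
        lazy_pinyin(text, style=Style.FINALS_TONE3, strict=False),
    ):
        if ini:
            out.append(ini)
        out.append(fin)
    return out

def find_subsequence(seq: list[str], sub: list[str]) -> int:
    """Index of the first occurrence of `sub` in `seq`, or -1."""
    for i in range(len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1

sentence = to_phonemes("我听到了敲门声")
scene = to_phonemes("敲门")
print(find_subsequence(sentence, scene))  # sequence position of the scene word
```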
306. Add the scene sound effect to the audio segment of the character reading voice corresponding to the sentence position, to obtain the target audio corresponding to the target sentence.
After the sentence position of the scene information in the target sentence is determined, the scene sound effect is added to the audio segment of the character reading voice corresponding to the sentence position, yielding the target audio corresponding to the target sentence. The scene sound effect can be added to the audio segment of the character reading voice according to the sequence position of the first phoneme sequence within the phoneme sequence corresponding to the target sentence, as in the following steps 3061 to 3064.
3061. Determine the audio frame sequence, in the character reading voice, corresponding to the first phoneme sequence of the scene content text.
In the embodiment of the application, the audio frame sequence in the character reading voice corresponding to the first phoneme sequence of the scene content text can be determined. Specifically, after the sequence position of the first phoneme sequence within the phoneme sequence of the target sentence is determined, the audio frame sequence corresponding to the first phoneme sequence in the character reading voice is determined according to that sequence position. Suppose the target sentence is "Zhang San says: I hear a knock at the door", the first phoneme sequence is "q, iao, m, en", and the character reading voice is Zhang San reading: I hear a knock at the door. The character reading voice can be converted into a corresponding reading phoneme sequence according to its acoustic features, for example through RNN/LSTM speech recognition technology. It can be understood that, because the character reading voice was produced by voicing the target sentence according to the audio features, the reading phoneme sequence converted from it is the same as the phoneme sequence corresponding to the target sentence. In the reading phoneme sequence, each reading phoneme corresponds to a plurality of audio frames. The first phoneme sequence can be compared with the reading phoneme sequence, that is, the phoneme sequence "q, iao, m, en" is located within the reading phoneme sequence and the audio frames corresponding to each of its phonemes are determined, thereby determining the audio frame sequence corresponding to the first phoneme sequence in the character reading voice, i.e., the audio segment corresponding to "q, iao, m, en".
3062. Take a preset audio frame corresponding to the audio frame sequence as the sound effect start frame of the scene sound effect.
In the embodiment of the application, it must be determined at which audio frame of the character reading voice the scene sound effect starts, i.e., the start frame of the scene sound effect in the character reading voice. After the audio frame sequence corresponding to the first phoneme sequence is determined, a preset audio frame of that sequence can be taken as the sound effect start frame. Specifically, the first audio frame in the audio frame sequence, or the audio frame just before it, or the last audio frame in the sequence may be used as the sound effect start frame, which is not limited herein. If the first phoneme sequence is "q, iao, m, en", the first of the audio frames corresponding to the phoneme "q" in pronunciation order may be used as the sound effect start frame, or the audio frame just before it. It can be understood that the sound effect start frame is an audio frame in the character reading voice indicating where the scene sound effect begins to be added.
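A sketch of steps 3061-3062, assuming per-phoneme frame durations are available from the acoustic model's duration predictor or from forced alignment; the duration values below are invented for illustration:

```python
# Sketch of mapping a phoneme position to an audio frame and picking the
# sound effect start frame. Durations per phoneme are assumed inputs.
def effect_start_frame(durations: list[int], phoneme_index: int,
                       lead_frames: int = 1) -> int:
    """Frame index where the scene effect should start: the first frame of
    the phoneme at `phoneme_index`, optionally a few frames earlier."""
    first_frame = sum(durations[:phoneme_index])  # frames before this phoneme
    return max(0, first_frame - lead_frames)

# 8 phonemes, each lasting some number of mel frames (assumed values):
durations = [12, 9, 11, 10, 8, 14, 9, 13]
print(effect_start_frame(durations, phoneme_index=4))  # start of phoneme 'q'
```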
3063. Determine the sound effect duration of the scene sound effect according to the sound source in the scene information.
In the embodiment of the application, when adding the scene sound effect to the character reading voice, the ending time of the scene sound effect needs to be determined. The sound effect duration can be determined according to the sound source in the scene information; specifically, the type of the scene sound effect can be determined from how the volume of the sound source changes in the actual scene. If the volume change value of the sound source within a preset duration is greater than a preset volume threshold, the scene sound effect is determined to be a trigger sound, and the sound effect duration of the trigger sound is shorter than the preset duration. The preset duration may be, for example, 10 seconds or 12 seconds, or 2 seconds or 3 seconds, and is not limited herein; the preset volume threshold may be 50 dB or 60 dB, and is likewise not limited herein. It can be understood that if the volume change value of the sound source within the preset duration is greater than the preset volume threshold, the sound emitted by the source changes abruptly, as with a knock at a door or an object falling, and such a source generally lasts only briefly; the corresponding scene sound effect is therefore a trigger sound. The sound effect duration of a trigger sound is generally short and can be set to 1 second or 2 seconds, i.e., the sound effect duration of the scene sound effect is 1 or 2 seconds.
If the volume change value of the sound source within the preset duration is smaller than the preset volume threshold, the scene sound effect is determined to be an ambient background sound, and the sound effect duration of the ambient background sound is longer than the preset duration; the duration of an ambient background sound is longer than that of a trigger sound. It can be understood that if the volume change value of the sound source within the preset duration is smaller than the preset volume threshold, the sound emitted by the source changes gently, as with a continuously surrounding bell or the sound of traffic, and such a source generally lasts a long time; the corresponding scene sound effect is therefore an ambient background sound. Its sound effect duration is generally long and can be set to 10 seconds or 8 seconds, i.e., the sound effect duration of the scene sound effect is 10 or 8 seconds. It can also be understood that an ambient background sound can be set to last until the end of the target sentence being read, so that the corresponding background sound is present in the character reading voice from the sound effect start frame to the end of the reading voice.
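A sketch of this classification under the stated rule, with the window length, analysis frame size, and threshold as assumed example values:

```python
# Sketch of step 3063: classify a scene effect as a trigger sound or an
# ambient background sound from the volume change of its source clip.
import numpy as np

def classify_effect(samples: np.ndarray, sr: int,
                    window_s: float = 2.0, threshold_db: float = 50.0) -> str:
    """Compare max and min short-term levels (dB) inside the first window."""
    win = samples[: int(sr * window_s)]
    frame = int(sr * 0.05)  # 50 ms analysis frames
    levels = []
    for i in range(0, len(win) - frame, frame):
        power = np.mean(win[i:i + frame] ** 2) + 1e-12
        levels.append(10 * np.log10(power))
    change = max(levels) - min(levels)
    # Abrupt change -> short trigger sound; gentle change -> long background sound.
    return "trigger" if change > threshold_db else "background"

sr = 16000
t = np.arange(sr * 2) / sr
# A knock-like clip: loud for 0.1 s, then nearly silent.
knock = np.where(t < 0.1, np.sin(2 * np.pi * 200 * t),
                 1e-4 * np.sin(2 * np.pi * 200 * t))
print(classify_effect(knock, sr))  # -> "trigger"
```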
3064. And adding scene sound effects into the character reading sound according to the sound effect starting frame and the sound effect duration.
After the sound effect initial frame and the sound effect duration of the scene sound effect are determined, the scene sound effect can be added into the character reading sound according to the sound effect initial frame and the sound effect duration. Specifically, the scene sound effect can be faded into the character reading sound in the sound effect initial frame corresponding to the character reading sound, and the volume of the scene sound effect is enhanced until reaching the preset volume, and the preset volume is generally smaller than the volume of the character reading sound because the added scene sound effect is used as background sound; adding the low-volume scene sound effect into the character reading sound in the sound effect starting frame, and then continuously enhancing the volume of the scene sound effect until the preset volume is reached. And reducing the volume of the scene sound effect in the preset ending time period of the sound effect duration, and fading out the character-read sound of the scene sound effect, namely, gradually reducing the volume of the scene sound effect from the preset volume when the sound effect duration is about to end until the volume of the scene sound effect is reduced to zero or the scene sound effect is not heard. It can be understood that the volume of the scene sound effect is continuously enhanced when the scene sound effect is faded in, and when the scene sound effect is enhanced to the preset volume, the preset volume is kept until the character reading sound is faded out, or the character reading sound can be faded out directly without being enhanced to the preset volume.
It can be understood that fading the scene sound effect in and out of the character reading sound avoids the clicking artifacts of a hard splice where the two sounds (the character reading sound and the scene sound effect) overlap in time. For example, suppose the character reads the target sentence: Zhang San said, "Suddenly I heard a knock at the door." The first of the audio frames corresponding to the pronunciation phoneme "qi" is used as the sound effect start frame of the scene sound effect, and the scene sound effect is faded into the character reading sound; if the sound effect duration is 2 seconds, the scene sound effect may begin fading out of the character reading sound at 1.5 seconds.
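A minimal sketch of this fade-in/fade-out mixing, assuming float waveforms in the range [-1, 1] at a shared sample rate (the function name, the gain, and the ramp lengths are illustrative assumptions, not values fixed by this embodiment):

```python
import numpy as np

def mix_scene_effect(reading: np.ndarray, effect: np.ndarray,
                     start_frame: int, sample_rate: int,
                     fade_in_s: float = 0.2, fade_out_s: float = 0.5,
                     gain: float = 0.4) -> np.ndarray:
    """Fade a scene sound effect into the character reading sound at
    start_frame, keep it quieter than the reading, and fade it out
    near the end of its duration, avoiding the click of a hard splice."""
    out = reading.astype(np.float64)
    n = max(0, min(len(effect), len(out) - start_frame))
    if n == 0:
        return out
    env = np.ones(n)
    fade_in = min(int(fade_in_s * sample_rate), n)
    fade_out = min(int(fade_out_s * sample_rate), n)
    if fade_in:
        env[:fade_in] = np.linspace(0.0, 1.0, fade_in)        # fade in
    if fade_out:
        env[n - fade_out:] = np.linspace(1.0, 0.0, fade_out)  # fade out
    out[start_frame:start_frame + n] += effect[:n] * gain * env
    return np.clip(out, -1.0, 1.0)
```

With a 2-second effect and fade_out_s = 0.5, the effect begins fading out at 1.5 seconds, matching the example above.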
Furthermore, if multiple scene sound effects overlap on the same audio frame when being added to the character reading sound, and both a trigger sound and an environmental background sound are among them, the environmental background sound can be attenuated and the trigger sound strengthened in order to highlight the trigger sound. It can be understood that a trigger sound is abrupt and its volume rises quickly. For example, the target sentence is: Zhang San says, "The traffic noise outside disturbs me, and suddenly I hear someone knocking at the door." Here the environmental background sound corresponding to "traffic noise" may last until the trigger sound corresponding to "knocking at the door" begins, and to improve the user's listening experience the trigger sound needs to be highlighted. The volume of the trigger sound may be increased based on a preset volume-balancing technique so that it is higher than the volume of the environmental background sound. Specifically, the environmental background sound and the trigger sound in the same audio frame can first be balanced, i.e., normalized to the preset volume. Volume balancing computes the average power of the sound files over time: each sample of the audio is squared, the squares are averaged over time, and the logarithm is taken, giving a value in decibels; the tracks are then normalized to the same, or a specified, volume level. Afterwards, the volume of the trigger sound in the audio frame is increased and the volume of the environmental background sound is decreased, so that the trigger sound is louder than the environmental background sound. It can be appreciated that whenever multiple scene sound effects overlap, any one of them can be highlighted by increasing its volume through the preset volume-balancing technique.
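The volume-balancing step might be sketched as follows; the target level and the 6 dB offsets are illustrative assumptions, and only the power computation (square each sample, average over time, take the logarithm) follows the description above:

```python
import numpy as np

def average_power_db(x: np.ndarray) -> float:
    """Average power in decibels: square each sample, average over
    time, then take the logarithm (10 * log10)."""
    return 10.0 * np.log10(np.mean(np.square(x, dtype=np.float64)) + 1e-12)

def normalize_to(x: np.ndarray, target_db: float) -> np.ndarray:
    """Scale a track so its average power matches target_db."""
    gain_db = target_db - average_power_db(x)
    return x * 10.0 ** (gain_db / 20.0)  # dB power gain -> amplitude factor

def highlight_trigger(trigger: np.ndarray, ambient: np.ndarray,
                      preset_db: float = -20.0, offset_db: float = 6.0):
    """Balance both overlapping effects around the preset volume,
    then raise the trigger sound and lower the background sound."""
    return (normalize_to(trigger, preset_db + offset_db),
            normalize_to(ambient, preset_db - offset_db))
```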
In one embodiment, after the target audio corresponding to the target sentence is obtained, the character reading sound of the target sentence and the corresponding text are input into a preset speech alignment model, which outputs time-stamped subtitles; that is, through a speech alignment technique, the reading sound without scene sound effects and its corresponding text serve as input, and time-stamped subtitles are produced as output. Specifically, a Kaldi-based speech-text alignment technique may be used to obtain the start and end times of each character or word in the speech, though other technical schemes may also be used. For long texts, the start and end times of a paragraph or a sentence, rather than word-by-word times, are generally what is finally presented to the user. In this way the user can tell which passage is currently being read while avoiding visual fatigue.
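Given word-level timestamps from such an aligner, the sentence-level subtitles described above could be assembled as in the following sketch (the input format and function name are assumptions; the joining step assumes text without inter-word spaces, as in Chinese):

```python
from typing import List, Tuple

# Each entry is (word, start_s, end_s), as produced by a forced
# aligner such as a Kaldi-based alignment pipeline.
Word = Tuple[str, float, float]

SENTENCE_ENDS = set("。!?.!?")  # full-width and ASCII sentence-final marks

def sentence_subtitles(words: List[Word]) -> List[Tuple[str, float, float]]:
    """Group word-level timestamps into one time-stamped subtitle per
    sentence, instead of showing word-by-word timing."""
    subtitles, buf = [], []
    for word, start, end in words:
        buf.append((word, start, end))
        if word and word[-1] in SENTENCE_ENDS:
            subtitles.append(("".join(w for w, _, _ in buf),
                              buf[0][1], buf[-1][2]))
            buf = []
    if buf:  # trailing words without sentence-final punctuation
        subtitles.append(("".join(w for w, _, _ in buf),
                          buf[0][1], buf[-1][2]))
    return subtitles
```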
The embodiment of the application provides a device for producing an audio book, as shown in fig. 6, including:

an obtaining unit 601, configured to obtain the text corresponding to the audio book;

a determining unit 602, configured to determine the target sentences in the text that are related to a character and to a scene;

a processing unit 603, configured to perform sounding processing on the target sentence according to the audio features corresponding to the character information of the target sentence, to obtain a character reading sound matched with the audio features;

an execution unit 604, configured to obtain a scene sound effect matched with the scene information according to the scene information of the target sentence;

and a mixing unit 605, configured to determine the sentence position of the scene information in the target sentence, and add the scene sound effect to the audio segment of the character reading sound corresponding to the sentence position, to obtain the target audio corresponding to the target sentence.
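How the five units might be composed is sketched below; the class and method names are hypothetical stand-ins for the units of fig. 6, not an implementation disclosed by this application:

```python
class AudiobookMaker:
    """Hypothetical composition of the five units shown in fig. 6."""

    def make(self, book_id: str) -> list:
        text = self.obtain_text(book_id)                   # obtaining unit 601
        target_audios = []
        for sentence in self.find_target_sentences(text):  # determining unit 602
            reading = self.synthesize_reading(sentence)    # processing unit 603
            effect = self.match_scene_effect(sentence)     # execution unit 604
            target_audios.append(
                self.mix_at_position(sentence, reading, effect))  # mixing unit 605
        return target_audios

    # Each method below stands in for the corresponding unit's logic.
    def obtain_text(self, book_id): ...
    def find_target_sentences(self, text): ...
    def synthesize_reading(self, sentence): ...
    def match_scene_effect(self, sentence): ...
    def mix_at_position(self, sentence, reading, effect): ...
```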
The embodiment of the application further provides a device 700 for producing an audio book, as shown in fig. 7, including:

a central processing unit 701, a memory 702, and an input/output interface 703;

the memory 702 is a transient memory or a persistent memory;

and the central processing unit 701 is configured to communicate with the memory 702 and execute the instructions in the memory 702 so as to perform the method of producing an audio book described above.
It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding processes in the foregoing method embodiments for the specific working processes of the systems, apparatuses and units described above, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Claims (11)

1. A method of making an audio book, comprising:
acquiring a text corresponding to the audio book;
determining target sentences related to characters and scenes in the text;
performing sounding processing on the target sentence according to the audio features corresponding to the character information of the target sentence to obtain character reading sounds matched with the audio features;
obtaining scene sound effects matched with the scene information according to the scene information of the target sentence;
determining the sentence position of the scene information in the target sentence, and adding the scene sound effect into the audio segment of the character reading sound corresponding to the sentence position to obtain the target audio corresponding to the target sentence.
2. The method of claim 1, wherein the determining the sentence position of the scene information in the target sentence comprises:
acquiring a first phoneme sequence of a scene content word corresponding to the scene information;
and determining a sequence position of the first phoneme sequence in the phoneme sequence corresponding to the target sentence, and taking the sequence position as the sentence position.
3. The method of claim 2, wherein adding the scene sound effect to the audio segment of the character reading sound corresponding to the sentence position comprises:

determining an audio frame sequence corresponding to the first phoneme sequence in the character reading sound according to the sequence position;

taking a preset audio frame corresponding to the audio frame sequence as the sound effect start frame of the scene sound effect;

determining the sound effect duration of the scene sound effect according to the sound source in the scene information;

and adding the scene sound effect to the character reading sound according to the sound effect start frame and the sound effect duration.
4. The method of claim 3, wherein the determining the sound effect duration of the scene sound effect according to the sound source in the scene information comprises:
if the volume change value of the sound source within the preset duration is greater than a preset volume threshold, determining that the scene sound effect is a trigger sound, wherein the sound effect duration of the trigger sound is shorter than the preset duration;

and if the volume change value of the sound source within the preset duration is smaller than the preset volume threshold, determining that the scene sound effect is an environmental background sound, wherein the sound effect duration of the environmental background sound is longer than the preset duration.
5. The method of claim 4, wherein adding the scene sound effect to the character reading sound comprises:

if a plurality of scene sound effects are located in the same audio frame and the plurality of scene sound effects include the trigger sound and the environmental background sound, increasing the volume of the trigger sound based on a preset volume balancing technique, so that the volume of the trigger sound is higher than that of the environmental background sound.
6. The method of claim 3, wherein adding the scene sound effect to the character reading sound according to the sound effect start frame and the sound effect duration comprises:

fading the scene sound effect into the character reading sound at the sound effect start frame, and increasing the volume of the scene sound effect until a preset volume is reached;

and reducing the volume of the scene sound effect within a preset ending period of the sound effect duration, and fading the scene sound effect out of the character reading sound.
7. The method of claim 1, wherein the determining the character-related and scene-related target sentences in the text comprises:

determining whether preset character dialogue information exists in the semantic information of a sentence in the text, or whether the semantic information of the sentence matches semantic information in a preset character sentence set;

if the preset character dialogue information exists, or the semantic information matches the preset character sentence set, determining that the sentence is a character sentence related to a character;

and if the character sentence has preset scene semantics, determining the character sentence as the target sentence.
8. The method of claim 1, wherein the performing sounding processing on the target sentence according to the audio feature corresponding to the character information of the target sentence to obtain the character reading sound matched with the audio feature comprises:
inputting the target sentence into a preset reading model, and determining tone characteristics corresponding to character information of the target sentence in the preset reading model;
converting the phoneme sequence corresponding to the target sentence into a target audio feature based on the tone color feature corresponding to the character information;
and converting the target audio characteristics into sound waveforms to obtain character reading sounds matched with the audio characteristics.
9. The method of claim 1, wherein the obtaining, according to the scene information of the target sentence, a scene sound effect matched with the scene information comprises:

determining a scene sound effect matched with the scene information from a preset sound effect library according to preset scene semantics in the scene information, wherein the preset sound effect library comprises scene sound effects corresponding to various scene semantics.
10. A device for producing an audio book, comprising:
a central processing unit, a memory and an input/output interface;
the memory is a transient memory or a persistent memory;

and the central processor is configured to communicate with the memory and execute the instructions in the memory to perform the method of any one of claims 1 to 9.
11. A computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 9.
CN116403561A (en): Method and device for manufacturing audio book and storage medium. Application CN202310312863.1A, priority date 2023-03-28, filing date 2023-03-28, status Pending.

Priority Applications (1)

Application Number: CN202310312863.1A. Priority Date / Filing Date: 2023-03-28. Publication: CN116403561A (en). Title: Method and device for manufacturing audio book and storage medium.


Publications (1)

Publication Number: CN116403561A. Publication Date: 2023-07-07.

Family

ID=87017157


Country Status (1)

CN: CN116403561A (en)


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination