
Sound reproduction method, device, readable medium and electronic equipment

Info

Publication number: CN111429881A (application CN202010197181.7A)
Authority: CN (China)
Other versions: CN111429881B (granted publication)
Other languages: Chinese (zh)
Prior art keywords: target, copied, template, spectrum data, sound
Inventors: 殷翔, 顾宇
Original and current assignee: Beijing ByteDance Network Technology Co Ltd
Priority/filing date: 2020-03-19
Publication of CN111429881A: 2020-07-17
Publication of CN111429881B (grant): 2023-08-18
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The present disclosure relates to a sound reproduction method, apparatus, readable medium, and electronic device. The method includes: acquiring a sound to be copied and a target template input by a user, wherein the sound to be copied is a sound segment of arbitrary length spoken by the user; extracting spectrum data to be copied from the sound to be copied; determining template text information corresponding to the target template; and determining, according to the spectrum data to be copied and the template text information, target spectrum data corresponding to both the target template and the timbre of the sound to be copied. In this way, the user's voice can be copied from a sound segment of arbitrary length, so that text is voiced in the user's own timbre, enabling text reading or song singing. The user neither has to record prescribed content nor record for a long time, which simplifies voice copying for the user while preserving the copying quality.

Description

Sound reproduction method, device, readable medium and electronic equipment
Technical Field
The present disclosure relates to the field of speech synthesis technology, and in particular to a sound reproduction method and apparatus, a readable medium, and an electronic device.
Background
In the prior art, to copy a speaker's timbre so that arbitrary speech can be generated in that timbre, a large amount of the speaker's voice, or even a large amount of specific, prescribed content, must be collected to achieve a good voice copying effect. Moreover, a voice copying model trained on the voice data of a single speaker does not generalize to other speakers, which makes such models difficult to use in practical applications.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a sound reproduction method, the method comprising:
acquiring a sound to be copied and a target template input by a user, wherein the sound to be copied is a sound segment of arbitrary length spoken by the user;
extracting spectrum data to be copied from the sound to be copied;
determining template text information corresponding to the target template;
determining, according to the spectrum data to be copied and the template text information, target spectrum data corresponding to both the target template and the timbre of the sound to be copied;
and synthesizing the target spectrum data into target speech waveform data.
In a second aspect, the present disclosure also provides a sound reproduction apparatus, the apparatus comprising:
an acquisition module, configured to acquire a sound to be copied input by a user and a target template, wherein the sound to be copied is a sound segment of arbitrary length spoken by the user;
an extraction module, configured to extract spectrum data to be copied from the sound to be copied;
a first determining module, configured to determine template text information corresponding to the target template;
a second determining module, configured to determine, according to the spectrum data to be copied and the template text information, target spectrum data corresponding to both the target template and the timbre of the sound to be copied;
and a synthesis module, configured to synthesize the target spectrum data into target speech waveform data.
In a third aspect, the present disclosure also provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first aspect.
In a fourth aspect, the present disclosure also provides an electronic device, including:
a storage device having one or more computer programs stored thereon;
one or more processing devices for executing the one or more computer programs in the storage device to implement the steps of the method in the first aspect.
Through the above technical solution, the user's voice can be copied from a sound segment of arbitrary length input by the user, so that text is voiced in the user's own timbre, enabling text reading or song singing. The user neither has to record prescribed content nor record for a long time, which greatly simplifies voice copying for the user while preserving the copying quality.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
In the drawings:
fig. 1 is a flow chart illustrating a method of sound reproduction according to an exemplary embodiment of the present disclosure.
Fig. 2 is a flowchart illustrating a sound reproducing method according to still another exemplary embodiment of the present disclosure.
Fig. 3 is a flowchart illustrating a sound reproducing method according to still another exemplary embodiment of the present disclosure.
Fig. 4 is a flowchart illustrating a method of obtaining an acoustic model of a target neural network in a sound reproduction method according to still another exemplary embodiment of the present disclosure.
Fig. 5 is a flowchart illustrating a method of determining target spectrum data using a preset neural network acoustic model in a sound reproducing method according to still another exemplary embodiment of the present disclosure.
Fig. 6 is a flowchart illustrating a method of determining template text information in a sound reproducing method according to still another exemplary embodiment of the present disclosure.
Fig. 7 is a block diagram illustrating a structure of a sound reproducing apparatus according to an exemplary embodiment of the present disclosure.
FIG. 8 shows a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" in this disclosure are illustrative rather than limiting; those skilled in the art will understand that they mean "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 1 is a flow chart illustrating a method of sound reproduction according to an exemplary embodiment of the present disclosure. As shown in fig. 1, the method includes steps 101 to 105.
In step 101, a sound to be copied input by the user and a target template are obtained, wherein the sound to be copied is a sound segment of arbitrary length spoken by the user. The target template may be, for example, a reading template or a song template. The sound to be copied may be a sound segment of any length, and its content may be any text.
The sound to be copied may be input in any way: for example, the user may record it live, or upload an existing sound segment. Depending on the length and content of the sound the user is asked to input, a suggested upper and/or lower time limit for the input, for example 10-30 minutes, can be given according to the actual situation, so as to guarantee the quality of the generated song.
The target template may be chosen by the user, or a default template may be used automatically when the user makes no selection. That is, when multiple templates are available, the user may, after inputting the sound to be copied, select the template to be generated; alternatively, generation may proceed directly with a default template, or, where the template selection function supports it, with a template chosen at random from all existing templates.
In step 102, spectrum data to be copied is extracted from the sound to be copied. Any extraction method may be used, as long as it can extract the spectrum data from the sound to be copied. The spectrum data to be copied may be, for example, a mel spectrum (mel bank features).
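By way of illustration only, the following minimal Python sketch shows how the mel spectrum of this step might be extracted with the librosa library. The sample rate, FFT size, hop length, and number of mel bands are illustrative assumptions, not values fixed by the disclosure.

    import librosa
    import numpy as np

    def extract_mel_to_copy(wav_path: str,
                            sr: int = 16000,       # assumed sample rate
                            n_fft: int = 1024,     # assumed frame parameters
                            hop_length: int = 256,
                            n_mels: int = 80) -> np.ndarray:
        """Extract the 'spectrum data to be copied' (a log-mel spectrogram)
        from the user's sound segment."""
        audio, _ = librosa.load(wav_path, sr=sr)
        mel = librosa.feature.melspectrogram(
            y=audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
        # Log compression is the usual representation fed to acoustic models.
        return np.log(np.maximum(mel, 1e-5))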
In step 103, template text information corresponding to the target template is determined. When the target template is a reading template, the template text information corresponding to it may be, for example, the phoneme information of the text to be read.
In step 104, target spectrum data corresponding to both the target template and the timbre of the sound to be copied is determined according to the spectrum data to be copied and the template text information. That is, once the spectrum data to be copied has been obtained, the timbre it carries can be copied, and target spectrum data can be generated in that timbre from the template text information of the determined target template.
In step 105, the target spectrum data is synthesized into target speech waveform data, for example waveform data in the WAV format. The target speech waveform data is the speech generated from the user's copied voice and the target template.
The target speech waveform data may be synthesized by a pre-trained neural network vocoder, for example a WaveNet vocoder.
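Because the WaveNet vocoder itself is not specified here, the following sketch substitutes classical Griffin-Lim inversion (via librosa) as a stand-in for step 105 so the pipeline can be exercised end to end; the frame parameters must match those used when the mel spectrum was extracted.

    import librosa
    import numpy as np
    import soundfile as sf

    def mel_to_waveform(log_mel: np.ndarray, sr: int = 16000,
                        n_fft: int = 1024, hop_length: int = 256) -> np.ndarray:
        """Griffin-Lim stand-in for the neural vocoder of step 105.
        A WaveNet-style vocoder would map mel frames to samples directly;
        this classical inversion is only for a quick end-to-end test."""
        mel = np.exp(log_mel)  # undo the log compression used at extraction
        wav = librosa.feature.inverse.mel_to_audio(
            mel, sr=sr, n_fft=n_fft, hop_length=hop_length)
        sf.write("target_speech.wav", wav, sr)
        return wav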
Through the above technical solution, the user's voice can be copied from a sound segment of arbitrary length input by the user, so that text is voiced in the user's own timbre, enabling text reading or song singing. The user neither has to record prescribed content nor record for a long time, which greatly simplifies voice copying for the user while preserving the copying quality.
Fig. 2 is a flowchart illustrating a sound reproducing method according to still another exemplary embodiment of the present disclosure. As shown in fig. 2, the method includes step 201 in addition to steps 101 to 103 and 105 shown in fig. 1.
In step 201, target spectrum data corresponding to both the target template and the timbre of the sound to be copied is determined, through a preset neural network acoustic model, according to the spectrum data to be copied and the template text information.
The preset neural network acoustic model is obtained by pre-training; it can convert the spectrum data to be copied from the user's voice into target spectrum data according to the template text information of the target template, thereby producing the target template content in the user's voice.
The training data used to train the preset neural network acoustic model may be a large number of sound segments from multiple speakers together with the text information corresponding to each segment. The sound segments may be in reading form or in song form. Segments of both forms may be trained together or, provided the quality of the generated target spectrum data is guaranteed, separate models may be trained for the different forms. When separate models are trained, the model matching the template form of the target template is used to copy the user's voice.
Fig. 3 is a flowchart illustrating a sound reproducing method according to still another exemplary embodiment of the present disclosure. As shown in fig. 3, the method includes steps 301 and 302 in addition to steps 101 to 103 and 105 shown in fig. 1.
In step 301, the preset neural network acoustic model is trained according to the spectrum data to be copied and the template text information, so as to obtain a target neural network acoustic model corresponding to the spectrum data to be copied.
In step 302, the template text information is used as the input of the target neural network acoustic model to obtain the target spectrum data.
That is, rather than converting the spectrum data directly with the preset neural network acoustic model (obtained by training on a large number of sound segments from multiple speakers), the template text information of the target template and the spectrum data to be copied extracted from the user's input are used as the input and the output, respectively, to further train the preset model. This yields a target neural network acoustic model that has learned the acoustic characteristics of the user's voice, and the target spectrum data is then generated by this target neural network acoustic model.
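A minimal PyTorch sketch of this adaptive fine-tuning step follows. The preset model is a hypothetical stand-in for the pre-trained conversion network and is assumed to emit mel frames aligned with the training target; the optimizer, L1 loss, step count, and learning rate are likewise illustrative assumptions.

    import torch
    import torch.nn as nn

    def adapt_to_user(preset_model: nn.Module,
                      text_features: torch.Tensor,   # template text information
                      user_mel: torch.Tensor,        # spectrum data to be copied
                      steps: int = 200, lr: float = 1e-4) -> nn.Module:
        """Fine-tune the preset acoustic model on one user's segment
        (text information as input, the user's mel as target), yielding
        the 'target neural network acoustic model' of step 301."""
        optimizer = torch.optim.Adam(preset_model.parameters(), lr=lr)
        loss_fn = nn.L1Loss()  # an assumed choice of spectral loss
        preset_model.train()
        for _ in range(steps):
            optimizer.zero_grad()
            predicted_mel = preset_model(text_features)
            loss = loss_fn(predicted_mel, user_mel)
            loss.backward()
            optimizer.step()
        return preset_model  # now the target acoustic model for this user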
With this technical solution, the preset neural network acoustic model is adaptively trained in real time, so that the target neural network acoustic model used to copy the user's voice achieves a better copying effect for each individual user.
Specifically, in one possible implementation, the preset neural network acoustic model may include a preset speaker verification network sub-model and a preset conversion sub-model. Step 301 shown in fig. 3 may then include steps 401 and 402 shown in fig. 4.
In step 401, the spectrum data to be copied is used as the input of the preset speaker verification network sub-model to extract the target speaker characterization vector (speaker embedding) corresponding to the spectrum data to be copied. The target speaker characterization vector is the output of an intermediate layer of the preset speaker verification network sub-model rather than its final output; for example, the speaker embedding produced by the dense layer before the softmax layer may be used.
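The following toy PyTorch sketch shows only the shape of this idea: a speaker classification network whose dense layer before the softmax head yields the speaker characterization vector. The single-GRU architecture and all dimensions are illustrative assumptions, not the disclosure's network.

    import torch
    import torch.nn as nn

    class SpeakerVerificationNet(nn.Module):
        """Toy stand-in for the preset speaker verification sub-model."""
        def __init__(self, n_mels=80, emb_dim=256, n_speakers=1000):
            super().__init__()
            self.encoder = nn.GRU(n_mels, 256, batch_first=True)
            self.dense = nn.Linear(256, emb_dim)              # embedding layer
            self.classifier = nn.Linear(emb_dim, n_speakers)  # softmax head (training only)

        def forward(self, mel):                # mel: (batch, frames, n_mels)
            _, h = self.encoder(mel)
            embedding = self.dense(h[-1])      # speaker embedding, pre-softmax
            logits = self.classifier(embedding)
            return logits, embedding

    # At reproduction time only the embedding is used (step 401):
    # _, speaker_embedding = model(mel_to_copy)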
In step 402, the target speaker characterization vector and the template text information are used as the input of the preset conversion sub-model, and the spectrum data to be copied is used as its output, for training. That is, after the target speaker embedding has been obtained from the spectrum data to be copied through the preset speaker verification network sub-model, it is combined with the template text information of the target template as the input of the preset conversion sub-model, with the spectrum data to be copied as the training target, so that a target neural network acoustic model containing the further-trained conversion sub-model is obtained.
When the preset conversion sub-model is further trained, its weights may be initialized from the pre-trained preset conversion sub-model.
During the pre-training of the preset neural network acoustic model, the preset speaker verification network sub-model and the preset conversion sub-model are trained in the same manner. For example, for any item of training data, the spectrum data extracted from its sound segment is input into the preset speaker verification network sub-model, the speaker embedding output by one of the network's layers is taken out, and this speaker embedding together with the text information corresponding to the sound segment is used as the input of the preset conversion sub-model, with the spectrum data extracted from the sound segment as the output, to train the preset conversion sub-model. Training proceeds in this way over all the training data.
In step 301, the preset speaker verification network sub-model in the preset neural network acoustic model does not need to be further trained; the target speaker characterization vector it extracts from the spectrum data to be copied is used directly.
In addition, in another possible implementation, the preset neural network acoustic model may include a preset speaker voice feature encoding module (speaker encoder) and a preset conversion sub-model. Step 201 shown in fig. 2 may then include steps 501 and 502 shown in fig. 5.
In step 501, a target speaker representation in the spectrum data to be copied is obtained through the preset speaker voice feature encoding module.
In step 502, the target speaker representation and the template text information are used as the input of the preset conversion sub-model to obtain the target spectrum data. That is, after the target speaker representation has been obtained from the spectrum data to be copied through the preset speaker voice feature encoding module, it can be used directly, together with the template text information of the target template, as the input of the preset conversion sub-model, whose output is then the target spectrum data.
Both the preset speaker voice feature encoding module and the preset conversion sub-model are obtained by pre-training.
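A minimal sketch of this zero-shot path follows; speaker_encoder and conversion_model are hypothetical pre-trained modules standing in for the preset speaker voice feature encoding module and the preset conversion sub-model.

    import torch

    @torch.no_grad()
    def reproduce_voice(speaker_encoder, conversion_model,
                        mel_to_copy: torch.Tensor,
                        text_features: torch.Tensor) -> torch.Tensor:
        """Steps 501-502: encode the user's mel into a speaker
        representation, then condition the conversion sub-model on it
        together with the template text information."""
        speaker_repr = speaker_encoder(mel_to_copy)        # step 501
        target_mel = conversion_model(text_features,       # step 502
                                      speaker_repr)
        return target_mel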
In one possible implementation, when the target template is a song template, the template text information includes at least lyric phoneme information and music information, where the music information is the music information related to the lyrics, for example pitch information and melody information.
When the target template is a song template, determining the template text information corresponding to the target template further includes steps 601 to 603 shown in fig. 6.
In step 601, fundamental frequency data is extracted from the sound to be copied.
In step 602, the pitch information in the target template is adjusted according to the fundamental frequency data.
In step 603, the template text information containing the adjusted pitch information is used as the template text information of the target template.
In step 602, the pitch information in the target template may be adjusted according to the fundamental frequency data of the sound to be copied via the following relation:
F0 = 261.63 * 2^(midi/12),
where F0 is the fundamental frequency data and midi is the pitch information.
With this technical solution, when a song is sung after the user's voice has been copied, the pitch information in the song template is adjusted toward the pitch of the user's voice, so that the finally generated song sounds more natural. For example, if the original pitch information in the target template is G3 and the sound to be copied input by the user is higher-pitched, the pitch information in the target template can be adjusted using the fundamental frequency information extracted from the sound to be copied. Pitch information in the target template that is too high or too low for the sound to be copied can thus be identified and adjusted.
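The sketch below inverts the relation above, midi = 12 * log2(F0 / 261.63), where 261.63 Hz is middle C (C4), to shift a template pitch track toward the user's register. Estimating the user's fundamental frequency with librosa.pyin, matching medians, and rounding to whole semitones are illustrative choices; the disclosure only specifies the frequency-to-midi relation.

    import numpy as np
    import librosa

    def shift_template_pitch(wav_path: str, template_midi: np.ndarray,
                             sr: int = 16000) -> np.ndarray:
        """Adjust the song template's pitch information (step 602) using
        fundamental frequency data extracted from the sound to be copied."""
        audio, _ = librosa.load(wav_path, sr=sr)
        f0, voiced, _ = librosa.pyin(audio,
                                     fmin=librosa.note_to_hz("C2"),
                                     fmax=librosa.note_to_hz("C6"),
                                     sr=sr)
        user_f0 = np.nanmedian(f0[voiced]) if voiced.any() else 261.63
        user_midi = 12.0 * np.log2(user_f0 / 261.63)  # inverse of the formula
        # Shift by a whole number of semitones so the melody shape is
        # preserved while the register moves toward the user's voice.
        shift = np.round(user_midi - np.median(template_midi))
        return template_midi + shift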
In one possible implementation, when the target template is a song template, after the target speech waveform data is obtained in step 105 shown in fig. 1 and before it is actually output, the accompaniment information corresponding to the target template may further be mixed with the target speech waveform data to obtain a complete song with accompaniment.
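A minimal sketch of this final mixing step is shown below; mono files, equal sample rates, and the gain values are illustrative assumptions.

    import numpy as np
    import soundfile as sf

    def mix_with_accompaniment(vocal_path: str, accomp_path: str,
                               out_path: str = "song_with_accompaniment.wav",
                               vocal_gain: float = 1.0, accomp_gain: float = 0.6):
        """Overlay the target speech waveform on the template's accompaniment."""
        vocal, sr = sf.read(vocal_path)
        accomp, sr2 = sf.read(accomp_path)
        assert sr == sr2, "resample first if the sample rates differ"
        n = max(len(vocal), len(accomp))
        mix = np.zeros(n)
        mix[:len(vocal)] += vocal_gain * vocal
        mix[:len(accomp)] += accomp_gain * accomp
        # Peak-normalize to avoid clipping after the overlay.
        mix /= max(1.0, np.abs(mix).max())
        sf.write(out_path, mix, sr)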
Fig. 7 is a block diagram illustrating the structure of a sound reproduction apparatus 100 according to an exemplary embodiment of the present disclosure. As shown in fig. 7, the apparatus 100 includes: an acquisition module 10, configured to acquire a sound to be copied input by a user and a target template, wherein the sound to be copied is a sound segment of arbitrary length spoken by the user; an extraction module 20, configured to extract spectrum data to be copied from the sound to be copied; a first determining module 30, configured to determine template text information corresponding to the target template; a second determining module 40, configured to determine, according to the spectrum data to be copied and the template text information, target spectrum data corresponding to both the target template and the timbre of the sound to be copied; and a synthesis module 50, configured to synthesize the target spectrum data into target speech waveform data.
Through the above technical solution, the user's voice can be copied from a sound segment of arbitrary length input by the user, so that text is voiced in the user's own timbre, enabling text reading or song singing. The user neither has to record prescribed content nor record for a long time, which greatly simplifies voice copying for the user while preserving the copying quality.
In one possible implementation, the second determining module includes: a first determining sub-module, configured to determine, through a preset neural network acoustic model, target spectrum data corresponding to both the target template and the timbre of the sound to be copied, according to the spectrum data to be copied and the template text information.
In one possible implementation, the first determining sub-module includes: a training sub-module, configured to train the preset neural network acoustic model according to the spectrum data to be copied and the template text information, so as to obtain a target neural network acoustic model corresponding to the spectrum data to be copied; and a second determining sub-module, configured to use the template text information as the input of the target neural network acoustic model to obtain the target spectrum data.
In one possible implementation, the preset neural network acoustic model includes a preset speaker verification network sub-model and a preset conversion sub-model, and the training sub-module includes: a first training sub-module, configured to use the spectrum data to be copied as the input of the preset speaker verification network sub-model, so as to extract the target speaker characterization vector corresponding to the spectrum data to be copied; and a second training sub-module, configured to use the target speaker characterization vector and the template text information as the input of the preset conversion sub-model, and the spectrum data to be copied as its output, for training, so as to obtain the target neural network acoustic model.
In one possible implementation, the preset neural network acoustic model includes a preset speaker voice feature encoding module and a preset conversion sub-model, and the first determining sub-module includes: an extraction sub-module, configured to obtain the target speaker representation in the spectrum data to be copied through the preset speaker voice feature encoding module; and a third determining sub-module, configured to use the target speaker representation and the template text information as the input of the preset conversion sub-model, so as to obtain the target spectrum data.
In one possible implementation, when the target template is a song template, the template text information includes at least lyric phoneme information and music information.
In one possible implementation, the music information includes at least pitch information, and the first determining module includes: a first processing sub-module, configured to extract fundamental frequency data from the sound to be copied; a second processing sub-module, configured to adjust the pitch information in the target template according to the fundamental frequency data; and a third processing sub-module, configured to use the template text information containing the adjusted pitch information as the template text information of the target template.
Referring now to FIG. 8, shown is a schematic diagram of an electronic device 800 suitable for use in implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 8, an electronic device 800 may include a processing apparatus (e.g., a central processing unit, a graphics processor, etc.) 801, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage apparatus 808 into a random access memory (RAM) 803. The RAM 803 also stores various programs and data necessary for the operation of the electronic device 800. The processing apparatus 801, the ROM 802, and the RAM 803 are connected to one another via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
In general, the following may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 807 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 808 including, for example, magnetic tape, hard disk, etc.; and communication devices 809, which may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. Although fig. 8 illustrates an electronic device 800 with various devices, it is to be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 809, or installed from the storage means 808, or installed from the ROM 802. The computer program, when executed by the processing apparatus 801, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a sound to be copied and a target template input by a user, wherein the sound to be copied is a sound segment of arbitrary length spoken by the user; extract spectrum data to be copied from the sound to be copied; determine template text information corresponding to the target template; determine, according to the spectrum data to be copied and the template text information, target spectrum data corresponding to both the target template and the timbre of the sound to be copied; and synthesize the target spectrum data into target speech waveform data.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including but not limited to an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a module does not in some cases constitute a limitation of the module itself; for example, the acquisition module may also be described as "a module for acquiring the sound to be copied and the target template input by the user".
For example, without limitation, exemplary types of hardware logic components that may be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so forth.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, Example 1 provides a sound reproduction method, including:
acquiring a sound to be copied and a target template input by a user, wherein the sound to be copied is a sound segment of arbitrary length spoken by the user;
extracting spectrum data to be copied from the sound to be copied;
determining template text information corresponding to the target template;
determining, according to the spectrum data to be copied and the template text information, target spectrum data corresponding to both the target template and the timbre of the sound to be copied;
and synthesizing the target spectrum data into target speech waveform data.
According to one or more embodiments of the present disclosure, Example 2 provides the method of Example 1, wherein the determining of target spectrum data corresponding to both the target template and the timbre of the sound to be copied, according to the spectrum data to be copied and the template text information, includes:
determining, through a preset neural network acoustic model, target spectrum data corresponding to both the target template and the timbre of the sound to be copied, according to the spectrum data to be copied and the template text information.
According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 2, wherein the determining of the target spectrum data through the preset neural network acoustic model includes:
training the preset neural network acoustic model according to the spectrum data to be copied and the template text information to obtain a target neural network acoustic model corresponding to the spectrum data to be copied;
and using the template text information as the input of the target neural network acoustic model to obtain the target spectrum data.
According to one or more embodiments of the present disclosure, Example 4 provides the method of Example 3, wherein the preset neural network acoustic model includes a preset speaker verification network sub-model and a preset conversion sub-model, and the training of the preset neural network acoustic model according to the spectrum data to be copied and the template text information to obtain the target neural network acoustic model includes:
using the spectrum data to be copied as the input of the preset speaker verification network sub-model to extract the target speaker characterization vector corresponding to the spectrum data to be copied;
and using the target speaker characterization vector and the template text information as the input of the preset conversion sub-model, and the spectrum data to be copied as its output, for training, to obtain the target neural network acoustic model.
According to one or more embodiments of the present disclosure, Example 5 provides the method of Example 2, wherein the preset neural network acoustic model includes a preset speaker voice feature encoding module and a preset conversion sub-model, and the determining of the target spectrum data through the preset neural network acoustic model includes:
obtaining the target speaker representation in the spectrum data to be copied through the preset speaker voice feature encoding module;
and using the target speaker representation and the template text information as the input of the preset conversion sub-model to obtain the target spectrum data.
According to one or more embodiments of the present disclosure, Example 6 provides the method of any one of Examples 1 to 5, wherein, when the target template is a song template, the template text information includes at least lyric phoneme information and music information.
According to one or more embodiments of the present disclosure, Example 7 provides the method of Example 6, wherein the music information includes at least pitch information, and the determining of the template text information corresponding to the target template includes:
extracting fundamental frequency data from the sound to be copied;
adjusting the pitch information in the target template according to the fundamental frequency data;
and using the template text information containing the adjusted pitch information as the template text information of the target template.
According to one or more embodiments of the present disclosure, Example 8 provides a sound reproduction apparatus, including:
an acquisition module, configured to acquire a sound to be copied input by a user and a target template, wherein the sound to be copied is a sound segment of arbitrary length spoken by the user;
an extraction module, configured to extract spectrum data to be copied from the sound to be copied;
a first determining module, configured to determine template text information corresponding to the target template;
a second determining module, configured to determine, according to the spectrum data to be copied and the template text information, target spectrum data corresponding to both the target template and the timbre of the sound to be copied;
and a synthesis module, configured to synthesize the target spectrum data into target speech waveform data.
According to one or more embodiments of the present disclosure, Example 9 provides the apparatus of Example 8, wherein the second determining module includes: a first determining sub-module, configured to determine, through a preset neural network acoustic model, target spectrum data corresponding to both the target template and the timbre of the sound to be copied, according to the spectrum data to be copied and the template text information.
According to one or more embodiments of the present disclosure, Example 10 provides the apparatus of Example 9, wherein the first determining sub-module includes: a training sub-module, configured to train the preset neural network acoustic model according to the spectrum data to be copied and the template text information, so as to obtain a target neural network acoustic model corresponding to the spectrum data to be copied; and a second determining sub-module, configured to use the template text information as the input of the target neural network acoustic model to obtain the target spectrum data.
According to one or more embodiments of the present disclosure, Example 11 provides the apparatus of Example 10, wherein the preset neural network acoustic model includes a preset speaker verification network sub-model and a preset conversion sub-model, and the training sub-module includes: a first training sub-module, configured to use the spectrum data to be copied as the input of the preset speaker verification network sub-model, so as to extract the target speaker characterization vector corresponding to the spectrum data to be copied; and a second training sub-module, configured to use the target speaker characterization vector and the template text information as the input of the preset conversion sub-model, and the spectrum data to be copied as its output, for training, so as to obtain the target neural network acoustic model.
According to one or more embodiments of the present disclosure, Example 12 provides the apparatus of Example 9, wherein the preset neural network acoustic model includes a preset speaker voice feature encoding module and a preset conversion sub-model, and the first determining sub-module includes: an extraction sub-module, configured to obtain the target speaker representation in the spectrum data to be copied through the preset speaker voice feature encoding module; and a third determining sub-module, configured to use the target speaker representation and the template text information as the input of the preset conversion sub-model, so as to obtain the target spectrum data.
According to one or more embodiments of the present disclosure, Example 13 provides the apparatus of any one of Examples 8 to 12, wherein, when the target template is a song template, the template text information includes at least lyric phoneme information and music information.
According to one or more embodiments of the present disclosure, Example 14 provides the apparatus of Example 13, wherein the music information includes at least pitch information, and the first determining module includes: a first processing sub-module, configured to extract fundamental frequency data from the sound to be copied; a second processing sub-module, configured to adjust the pitch information in the target template according to the fundamental frequency data; and a third processing sub-module, configured to use the template text information containing the adjusted pitch information as the template text information of the target template.
According to one or more embodiments of the present disclosure, Example 15 provides a computer-readable medium having stored thereon a computer program that, when executed by a processing apparatus, performs the steps of the method of any one of Examples 1 to 7.
According to one or more embodiments of the present disclosure, Example 16 provides an electronic device, including:
a storage device having a computer program stored thereon; and
a processing device configured to execute the computer program in the storage device to carry out the steps of the method of any one of Examples 1 to 7.
The foregoing description is merely a description of the preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art will appreciate that the scope of the disclosure is not limited to technical solutions formed by the particular combinations of the features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example technical solutions formed by replacing the above features with features having similar functions disclosed in (but not limited to) this disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims (10)

1. A sound reproduction method, the method comprising:
acquiring a sound to be copied and a target template input by a user, wherein the sound to be copied is a sound segment of arbitrary length spoken by the user;
extracting spectrum data to be copied from the sound to be copied;
determining template text information corresponding to the target template;
determining, according to the spectrum data to be copied and the template text information, target spectrum data corresponding to both the target template and the timbre of the sound to be copied;
and synthesizing the target spectrum data into target speech waveform data.
2. The method according to claim 1, wherein the determining of target spectrum data corresponding to both the target template and the timbre of the sound to be copied, according to the spectrum data to be copied and the template text information, comprises:
determining, through a preset neural network acoustic model, target spectrum data corresponding to both the target template and the timbre of the sound to be copied, according to the spectrum data to be copied and the template text information.
3. The method according to claim 2, wherein the determining, through the preset neural network acoustic model, of the target spectrum data corresponding to both the target template and the timbre of the sound to be copied comprises:
training the preset neural network acoustic model according to the spectrum data to be copied and the template text information to obtain a target neural network acoustic model corresponding to the spectrum data to be copied;
and using the template text information as the input of the target neural network acoustic model to obtain the target spectrum data.
4. The method according to claim 3, wherein the preset neural network acoustic model comprises a preset speaker verification network sub-model and a preset conversion sub-model, and the training of the preset neural network acoustic model according to the spectrum data to be copied and the template text information to obtain the target neural network acoustic model comprises:
using the spectrum data to be copied as the input of the preset speaker verification network sub-model to extract a target speaker characterization vector corresponding to the spectrum data to be copied;
and using the target speaker characterization vector and the template text information as the input of the preset conversion sub-model, and the spectrum data to be copied as its output, for training, to obtain the target neural network acoustic model.
5. The method according to claim 2, wherein the preset neural network acoustic model comprises a preset speaker voice feature encoding module and a preset conversion sub-model, and the determining, through the preset neural network acoustic model, of the target spectrum data comprises:
obtaining a target speaker representation in the spectrum data to be copied through the preset speaker voice feature encoding module;
and using the target speaker representation and the template text information as the input of the preset conversion sub-model to obtain the target spectrum data.
6. The method according to any one of claims 1 to 5, wherein, when the target template is a song template, the template text information comprises at least lyric phoneme information and music information.
7. The method according to claim 6, wherein the music information comprises at least tone information, and the determining template text information corresponding to the target template comprises:
extracting fundamental frequency data from the sound to be copied;
adjusting the tone information in the target template according to the fundamental frequency data;
and taking the template text information including the adjusted tone information as the template text information of the target template.
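A sketch of this tone adjustment, assuming librosa's pYIN for fundamental frequency extraction and a template whose tone information is given as MIDI note numbers (an assumption for illustration); the melody is shifted by whole semitones toward the user's median pitch:

import librosa
import numpy as np

def adjust_template_tone(user_audio_path: str,
                         template_midi_notes: np.ndarray) -> np.ndarray:
    # Extract fundamental frequency (F0) data from the sound to be copied.
    y, sr = librosa.load(user_audio_path, sr=22050)
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'), sr=sr)

    # Median F0 over voiced frames approximates the user's comfortable pitch.
    user_midi = librosa.hz_to_midi(np.nanmedian(f0[voiced_flag]))

    # Shift the template's tone information by whole semitones so its
    # center lands near the user's range, preserving melodic intervals.
    shift = int(round(user_midi - np.median(template_midi_notes)))
    return template_midi_notes + shift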
8. An apparatus for reproducing sound, the apparatus comprising:
an acquisition module, configured to acquire a sound to be copied and a target template input by a user, wherein the sound to be copied is a sound segment of any length uttered by the user;
an extraction module, configured to extract spectrum data to be copied from the sound to be copied;
a first determining module, configured to determine template text information corresponding to the target template;
a second determining module, configured to determine, according to the spectrum data to be copied and the template text information, target spectrum data corresponding to the target template and to the timbre of the sound to be copied;
and a synthesis module, configured to synthesize the target spectrum data into target speech waveform data.
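The apparatus mirrors the method one module per step; a hypothetical composition, with every module injected as an assumed callable rather than the patent's concrete implementation:

class SoundReproductionApparatus:
    """Illustrative composition of the claimed modules."""
    def __init__(self, acquire, extract, determine_text,
                 determine_spectrum, synthesize):
        self.acquisition_module = acquire
        self.extraction_module = extract
        self.first_determining_module = determine_text
        self.second_determining_module = determine_spectrum
        self.synthesis_module = synthesize

    def run(self):
        sound_to_copy, target_template = self.acquisition_module()
        spectrum = self.extraction_module(sound_to_copy)
        text_info = self.first_determining_module(target_template)
        target_spectrum = self.second_determining_module(spectrum, text_info)
        return self.synthesis_module(target_spectrum)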
9. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processing apparatus, implements the steps of the method according to any one of claims 1 to 7.
10. An electronic device, comprising:
a storage apparatus having one or more computer programs stored thereon;
and one or more processing apparatuses, configured to execute the one or more computer programs in the storage apparatus to implement the steps of the method according to any one of claims 1 to 7.
CN202010197181.7A 2020-03-19 2020-03-19 Speech synthesis method and device, readable medium and electronic equipment Active CN111429881B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010197181.7A CN111429881B (en) 2020-03-19 2020-03-19 Speech synthesis method and device, readable medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111429881A true CN111429881A (en) 2020-07-17
CN111429881B CN111429881B (en) 2023-08-18

Family

ID=71548135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010197181.7A Active CN111429881B (en) 2020-03-19 2020-03-19 Speech synthesis method and device, readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111429881B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013149217A1 * 2012-03-30 2013-10-03 Ivanou Aliaksei Systems and methods for automated speech and speaker characterization
WO2019139430A1 * 2018-01-11 2019-07-18 Neosapience Co., Ltd. Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium
WO2019138871A1 * 2018-01-11 2019-07-18 Yamaha Corporation Speech synthesis method, speech synthesis device, and program
CN108182936A * 2018-03-14 2018-06-19 Baidu Online Network Technology (Beijing) Co., Ltd. Voice signal generation method and device
CN108831437A * 2018-06-15 2018-11-16 Baidu Online Network Technology (Beijing) Co., Ltd. Song generation method, device, terminal and storage medium
WO2020007148A1 * 2018-07-05 2020-01-09 Tencent Technology (Shenzhen) Co., Ltd. Audio synthesis method, storage medium and computer equipment
US20200051583A1 * 2018-08-08 2020-02-13 Google LLC Synthesizing speech from text using neural networks
CN108806665A * 2018-09-12 2018-11-13 Baidu Online Network Technology (Beijing) Co., Ltd. Speech synthesis method and device
CN109147758A * 2018-09-12 2019-01-04 iFLYTEK Co., Ltd. Speaker voice conversion method and device
CN110600055A * 2019-08-15 2019-12-20 Hangzhou Dianzi University Singing voice separation method using melody extraction and voice synthesis technology
CN110808027A * 2019-11-05 2020-02-18 Tencent Technology (Shenzhen) Co., Ltd. Speech synthesis method and device, and news broadcasting method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN Zhousi, HU Wenxin: "Speech Synthesis with Simplified LSTM", Computer Engineering and Applications, No. 03 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112614477A * 2020-11-16 2021-04-06 Beijing Baidu Netcom Science and Technology Co., Ltd. Multimedia audio synthesis method and device, electronic equipment and storage medium
CN112614477B * 2020-11-16 2023-09-12 Beijing Baidu Netcom Science and Technology Co., Ltd. Method and device for synthesizing multimedia audio, electronic equipment and storage medium
CN112530400A * 2020-11-30 2021-03-19 Tsinghua Pearl River Delta Research Institute Method, system, device and medium for generating speech from text based on deep learning

Also Published As

Publication number Publication date
CN111429881B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
CN111445892B (en) Song generation method and device, readable medium and electronic equipment
CN111402843B (en) Rap music generation method and device, readable medium and electronic equipment
CN106898340B (en) Song synthesis method and terminal
CN111583900B (en) Song synthesis method and device, readable medium and electronic equipment
CN111402842B (en) Method, apparatus, device and medium for generating audio
CN111445897B (en) Song generation method and device, readable medium and electronic equipment
CN111899720A (en) Method, apparatus, device and medium for generating audio
CN111292717B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112489606B (en) Melody generation method, device, readable medium and electronic equipment
CN110211556B (en) Music file processing method, device, terminal and storage medium
CN111782576B (en) Background music generation method and device, readable medium and electronic equipment
CN111798821A (en) Sound conversion method, device, readable storage medium and electronic equipment
CN112927674B (en) Voice style migration method and device, readable medium and electronic equipment
CN113205793B (en) Audio generation method and device, storage medium and electronic equipment
CN113257218B (en) Speech synthesis method, device, electronic equipment and storage medium
WO2022042418A1 (en) Music synthesis method and apparatus, and device and computer-readable medium
CN111369968A (en) Sound reproduction method, device, readable medium and electronic equipment
CN111429881A (en) Sound reproduction method, device, readable medium and electronic equipment
CN112786013A (en) Voice synthesis method and device based on album, readable medium and electronic equipment
JP7497523B2 (en) Method, device, electronic device and storage medium for synthesizing custom timbre singing voice
CN111477210A (en) Speech synthesis method and device
CN111402856B (en) Voice processing method and device, readable medium and electronic equipment
CN115619897A (en) Image processing method, image processing device, electronic equipment and storage medium
CN112825245B (en) Real-time sound repairing method and device and electronic equipment
CN112685000A (en) Audio processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant