CN111402856A - Voice processing method and device, readable medium and electronic equipment - Google Patents

Voice processing method and device, readable medium and electronic equipment Download PDF

Info

Publication number
CN111402856A
Authority
CN
China
Prior art keywords
processed
target
spectrum data
voice information
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010209023.9A
Other languages
Chinese (zh)
Other versions
CN111402856B (en)
Inventor
殷翔 (Yin Xiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN202010209023.9A priority Critical patent/CN111402856B/en
Publication of CN111402856A publication Critical patent/CN111402856A/en
Application granted granted Critical
Publication of CN111402856B publication Critical patent/CN111402856B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
        • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
                • G10L13/00 - Speech synthesis; Text to speech systems
                    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
                        • G10L13/033 - Voice editing, e.g. manipulating the voice of the synthesiser
                        • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
                • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
                    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
                    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
                        • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The present disclosure relates to a voice processing method and apparatus, a readable medium, and an electronic device. The method includes: receiving to-be-processed voice information input by a user, and determining target music score information corresponding to the to-be-processed voice information; extracting to-be-processed spectrum data from the to-be-processed voice information; obtaining corrected target spectrum data according to the to-be-processed spectrum data and the target music score information; and synthesizing the tuned target voice through a vocoder according to the target spectrum data. Through the above technical solution, the voice information input by the user can be tuned by correcting the spectrum information in that voice information, without extra hardware overhead and without pitch correction at the signal-processing level. This avoids the distorted or unpleasant-sounding results that occur when the input deviates greatly from the target pitch, thereby improving the tuning effect and reducing the tuning cost.

Description

Voice processing method and device, readable medium and electronic equipment
Technical Field
The present disclosure relates to the field of speech synthesis technologies, and in particular, to a speech processing method, apparatus, readable medium, and electronic device.
Background
In the prior art, when the pitch or rhythm of a song sung by a user deviates from the original recording, it can only be corrected toward the original signal by means of hardware (such as a sound card) or signal processing. Such approaches are costly (a sound card must be purchased, for example), and when the user's singing still deviates greatly from the original signal, the tuned voice sounds distorted or unpleasant.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a method for speech processing, the method comprising:
receiving voice information to be processed input by a user, and determining target music score information corresponding to the voice information to be processed;
extracting to-be-processed spectrum data in the to-be-processed voice information;
obtaining corrected target frequency spectrum data according to the frequency spectrum data to be processed and the target music score information;
and synthesizing the tuned target voice through a vocoder according to the target spectrum data.
In a second aspect, the present disclosure also provides a speech processing apparatus, the apparatus comprising:
the receiving module is used for receiving voice information to be processed input by a user and determining target music score information corresponding to the voice information to be processed;
the extraction module is used for extracting the frequency spectrum data to be processed in the voice information to be processed;
the conversion module is used for obtaining modified target frequency spectrum data according to the frequency spectrum data to be processed and the target music score information;
and the synthesis module is used for synthesizing the tuned target voice through a vocoder according to the target spectrum data.
In a third aspect, the present disclosure also provides a computer-readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first aspect.
In a fourth aspect, the present disclosure also provides an electronic device, including:
a storage device having one or more computer programs stored thereon;
one or more processing devices for executing the one or more computer programs in the storage device to implement the steps of the method of the first aspect.
Through the above technical solution, the voice information input by the user can be tuned by correcting the spectrum information in that voice information, without extra hardware overhead and without pitch correction at the signal-processing level. This avoids the distorted or unpleasant-sounding results that occur when the input deviates greatly from the target pitch, thereby improving the tuning effect and reducing the tuning cost.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
In the drawings:
FIG. 1 is a flow chart illustrating a method of speech processing according to an exemplary embodiment of the present disclosure.
Fig. 2 is a flowchart illustrating a method of speech processing according to yet another exemplary embodiment of the present disclosure.
Fig. 3 is a flowchart illustrating a training method of a neural network model preset in a speech processing method according to still another exemplary embodiment of the present disclosure.
Fig. 4 is a flowchart illustrating a method of speech processing according to yet another exemplary embodiment of the present disclosure.
Fig. 5 is a block diagram illustrating a structure of a speech processing apparatus according to an exemplary embodiment of the present disclosure.
FIG. 6 shows a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
FIG. 1 is a flow chart illustrating a method of speech processing according to an exemplary embodiment of the present disclosure. As shown in fig. 1, the method includes steps 101 to 104.
In step 101, to-be-processed voice information input by a user is received, and target music score information corresponding to the to-be-processed voice information is determined.
The target music score information corresponding to the to-be-processed voice information is preset. When the user inputs the to-be-processed voice information, the song template corresponding to it can be determined; each song template has preset music score information, which can be obtained directly from a local or other database. The music score information includes data such as MIDI information and ornaments (decorative notes). Therefore, when the to-be-processed voice information is received, the target music score information corresponding to it can be determined.
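The patent leaves the storage format of the score template unspecified. The sketch below assumes each song template's music score is a standard MIDI file read with the third-party pretty_midi library; the template registry, file paths, and function name are hypothetical.

```python
# A minimal sketch, assuming score templates are stored as standard MIDI
# files; the template registry and file paths are hypothetical.
import pretty_midi

SONG_TEMPLATES = {"song_042": "templates/song_042.mid"}  # hypothetical registry

def load_target_score(template_id: str):
    """Load note pitches and timings from a song template's MIDI file."""
    midi = pretty_midi.PrettyMIDI(SONG_TEMPLATES[template_id])
    notes = []
    for instrument in midi.instruments:
        for note in instrument.notes:
            # Each note carries its pitch (MIDI number) and start/end time.
            notes.append((note.pitch, note.start, note.end))
    return sorted(notes, key=lambda n: n[1])
```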
In step 102, to-be-processed spectrum data in the to-be-processed voice information is extracted.
The to-be-processed spectrum data may be, for example, a Mel spectrum (Mel filter-bank features).
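As a concrete illustration of this step, the following sketch extracts a log-Mel spectrum with librosa; the sample rate and frame parameters are illustrative assumptions, not values taken from the patent.

```python
# A minimal sketch of extracting the to-be-processed Mel spectrum with
# librosa; frame parameters are illustrative, not from the patent.
import librosa

def extract_mel(path: str, sr: int = 22050, n_mels: int = 80):
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=n_mels
    )
    # Work in log amplitude, as is conventional for neural speech models.
    return librosa.power_to_db(mel)
```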
In step 103, modified target spectrum data is obtained according to the to-be-processed spectrum data and the target music score information. After the frequency spectrum data to be processed is obtained through extraction, the frequency spectrum to be processed is corrected according to the target music score information, and therefore the corrected target frequency spectrum data corresponding to the target music score information is obtained.
In step 104, the target voice after tuning is obtained through vocoder synthesis according to the target spectrum data.
After the target spectrum data is obtained, the tuned target voice is synthesized by the vocoder. In one possible embodiment, the vocoder may be a preset neural network vocoder, such as a WaveNet vocoder. Once the target voice is obtained, it can be combined with the corresponding accompaniment template, according to the actual situation, to form a complete tuned song.
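The patent's vocoder is a neural network such as WaveNet, whose implementation is beyond the scope of a short sketch. As a stand-in that only illustrates the data flow from corrected Mel spectrum back to a waveform, the sketch below uses librosa's Griffin-Lim-based Mel inversion; parameters mirror the extraction sketch above.

```python
# Stand-in for the neural vocoder step: Griffin-Lim inversion of the
# corrected Mel spectrum via librosa. The patent's vocoder is a neural
# network (e.g., WaveNet); this substitute only illustrates the data flow.
import librosa
import soundfile as sf

def synthesize(target_mel_db, sr: int = 22050, out_path: str = "tuned.wav"):
    mel_power = librosa.db_to_power(target_mel_db)
    audio = librosa.feature.inverse.mel_to_audio(
        mel_power, sr=sr, n_fft=1024, hop_length=256
    )
    sf.write(out_path, audio, sr)
```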
Through the above technical solution, the voice information input by the user can be tuned by correcting the spectrum information in that voice information, without extra hardware overhead and without pitch correction at the signal-processing level. This avoids the distorted or unpleasant-sounding results that occur when the input deviates greatly from the target pitch, thereby improving the tuning effect and reducing the tuning cost.
Fig. 2 is a flowchart illustrating a method of speech processing according to yet another exemplary embodiment of the present disclosure. As shown in fig. 2, the method includes step 201 in addition to step 101, step 102 and step 104 shown in fig. 1.
In step 201, the to-be-processed spectrum data and the target music score information are input into a preset neural network model to obtain modified target spectrum data.
That is, obtaining the corrected target spectrum data according to the to-be-processed spectrum data and the target music score information may be implemented with the preset neural network model. The preset neural network model is a pre-trained network model: after the to-be-processed spectrum data and the music score information corresponding to the to-be-processed voice information are input into it together, the model corrects the to-be-processed spectrum data according to the target music score information and directly outputs the tuned target spectrum data.
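The patent does not disclose the architecture of the preset neural network model. The following PyTorch sketch assumes a small 1-D convolutional network that consumes the Mel frames concatenated with a per-frame score feature (e.g., a MIDI pitch contour aligned to the frames) and predicts a residual correction; all layer sizes are hypothetical.

```python
# A minimal sketch of the correction model. The patent discloses no
# architecture; this 1-D convolutional network over mel frames with a
# per-frame score (pitch) feature is an assumption for illustration.
import torch
import torch.nn as nn

class SpectrumCorrector(nn.Module):
    def __init__(self, n_mels: int = 80, score_dim: int = 1, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels + score_dim, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, mel, score):
        # mel: (batch, n_mels, frames); score: (batch, score_dim, frames)
        x = torch.cat([mel, score], dim=1)
        # Predict a residual correction so an in-tune input passes through.
        return mel + self.net(x)
```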
In one possible embodiment, the to-be-processed voice information input by the user may be singing voice information without song accompaniment. In this case, the to-be-processed spectrum data can be extracted directly from the to-be-processed voice information, and the extracted spectrum data corresponds to the singing.
In another possible implementation, the to-be-processed voice information input by the user is voice information containing a mix with song accompaniment. In this case, before step 102 shown in fig. 1 is performed, the song accompaniment needs to be separated from the to-be-processed voice information to obtain the to-be-processed singing voice information without the song accompaniment; that is, the song accompaniment and the singing in the to-be-processed voice information are separated, and only the singing is kept. Then, according to step 102 shown in fig. 1, the to-be-processed spectrum data is extracted from the accompaniment-free to-be-processed voice information, and this spectrum data corresponds to the singing.
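The patent does not name a separation technique for this step. A minimal sketch, assuming the pretrained Spleeter 2-stem model is an acceptable substitute:

```python
# A minimal sketch of the accompaniment/vocal separation step using the
# pretrained Spleeter 2-stem model; the patent names no specific
# separation technique, so this choice is an assumption.
from spleeter.separator import Separator

def separate_vocals(mixed_path: str, out_dir: str = "separated"):
    separator = Separator("spleeter:2stems")  # splits into vocals + accompaniment
    separator.separate_to_file(mixed_path, out_dir)
    # The vocals stem is written to <out_dir>/<track name>/vocals.wav and
    # becomes the to-be-processed singing voice information.
```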
Fig. 3 is a flowchart illustrating a training method of a neural network model preset in a speech processing method according to still another exemplary embodiment of the present disclosure. As shown in fig. 3, the training method includes steps 301 to 304.
In step 301, the melody data and/or pitch data in the original singing voice information of a song to be trained are irregularly adjusted to obtain a plurality of pieces of off-tune voice information corresponding to the song to be trained. The song to be trained is any song whose original singing voice information is available for training the preset neural network model. Once the song to be trained is determined, its melody data and/or pitch data can be irregularly adjusted according to its MIDI information, and the adjusted voice information is used as the off-tune voice information of the song to be trained.
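A minimal sketch of the irregular pitch adjustment, implemented here as random per-segment pitch shifting with librosa; the jitter range and segment length are illustrative assumptions, and the patent's melody adjustment is not reproduced.

```python
# A minimal sketch of manufacturing off-tune training inputs by randomly
# pitch-shifting successive segments; ranges are assumptions.
import numpy as np
import librosa

def detune(y, sr, max_steps: float = 1.5, seg_seconds: float = 2.0, seed=None):
    rng = np.random.default_rng(seed)
    seg = int(seg_seconds * sr)
    out = []
    for start in range(0, len(y), seg):
        chunk = y[start:start + seg]
        if len(chunk) < 2048:  # too short to pitch-shift cleanly; keep as-is
            out.append(chunk)
            continue
        # Shift each segment by a random fraction of a tone so the result
        # drifts off key irregularly rather than by a uniform offset.
        n_steps = float(rng.uniform(-max_steps, max_steps))
        out.append(librosa.effects.pitch_shift(chunk, sr=sr, n_steps=n_steps))
    return np.concatenate(out)
```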
In step 302, music score information corresponding to the song to be trained is obtained. The score information of the song to be trained may be prepared in advance.
In step 303, tuned voice information obtained after a tuning engineer tunes the original singing voice information of the song to be trained is acquired. This tuning can be completed by any tuning engineer before the preset neural network model is trained, and the engineer does not need to refer to the music score information of the song to be trained when tuning the original singing voice information.
In step 304, first spectrum data extracted from the off-tune voice information, together with the music score information corresponding to the song to be trained, is used as input training data of the preset neural network model, and second spectrum data extracted from the tuned voice information is used as output training data of the preset neural network model, so as to train the preset neural network model.
After the off-tune voice information and the tuned voice information are obtained, they can be used as a set of training data pairs for training the preset neural network model.
In addition, besides using the off-tune voice information and the tuned voice information as a set of training data pairs, the original singing voice information and the tuned voice information may also be used as a set of training data pairs for training the preset neural network model. For example, the input training data for training the preset neural network model may include 10% off-tune voice information and voice information with imperfect pronunciation.
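Putting the pairs together, the following sketch shows one hypothetical training step: first spectrum data (from the off-tune voice) plus score features as input, second spectrum data (from the engineer-tuned voice) as target. It reuses the SpectrumCorrector sketch above; the optimizer and L1 loss are assumptions.

```python
# A minimal sketch of one training step pairing the first spectrum data
# (off-tune) plus score features with the second spectrum data (tuned by
# the engineer) as the target; optimizer and loss choices are assumptions.
import torch
import torch.nn.functional as F

model = SpectrumCorrector()  # from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(detuned_mel, score, tuned_mel):
    # detuned_mel, tuned_mel: (batch, n_mels, frames); score: (batch, 1, frames)
    predicted = model(detuned_mel, score)
    loss = F.l1_loss(predicted, tuned_mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```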
Through this technical solution, the tuning engineer's corrections of the original singing voice information serve as the learning target of the preset neural network model, so that the trained model learns how a tuning engineer tunes according to the music score information of a song. The target voice produced after tuning by the preset neural network model is therefore closer to the tuning engineer's result, which avoids the distortion and poor listening experience caused by snapping strictly to the score when the pitch deviation is too large, and improves the tuning effect.
In a possible implementation manner, the off-tune voice information and the original singing voice information may also be used as a set of data pairs in the training data for the preset neural network model.
In one possible embodiment, the preset neural network model is trained by classifying its training data according to style types. For example, during training, different IDs are used for different song styles, such as jazz, rock, and ballad, so that the preset neural network model can learn different tuning features for each song style.
Fig. 4 is a flowchart illustrating a method of speech processing according to yet another exemplary embodiment of the present disclosure. As shown in fig. 4, the method includes steps 401 and 402, in addition to steps 101, 102 and 104 shown in fig. 1.
In step 401, a target style type corresponding to the to-be-processed voice information is determined. The target style type may be the style type that has been set in the song template corresponding to the to-be-processed voice information. Alternatively, the target style type may be determined by performing style recognition on the to-be-processed voice information.
If songs of different styles are learned separately during the training of the preset neural network model, then, in addition to inputting the to-be-processed spectrum data extracted from the to-be-processed voice information and the corresponding target music score information into the preset neural network model for correction as in step 103 shown in fig. 1, the target style type may also be input: in step 402 shown in fig. 4, the to-be-processed spectrum data, the target music score information, and the target style type are input into the preset neural network model to obtain the corrected target spectrum data.
The target style type may control the conversion in the form of an ID. For the preset neural network model, the target style type can be input as a one-hot vector, so that the preset neural network model identifies the style type of the to-be-processed voice information and converts the to-be-processed spectrum data using the conversion features corresponding to that style type.
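A minimal sketch of the one-hot style conditioning described above, broadcast across frames so it can be concatenated with the per-frame features from the earlier sketches; the style inventory is hypothetical.

```python
# A minimal sketch of one-hot style conditioning; the style inventory and
# broadcasting scheme are assumptions for illustration.
import torch
import torch.nn.functional as F

STYLE_IDS = {"jazz": 0, "rock": 1, "ballad": 2}  # hypothetical inventory

def style_feature(style: str, frames: int) -> torch.Tensor:
    one_hot = F.one_hot(
        torch.tensor(STYLE_IDS[style]), num_classes=len(STYLE_IDS)
    ).float()
    # Repeat the style vector across frames so it can be concatenated with
    # the per-frame mel and score features: shape (1, num_styles, frames).
    return one_hot[None, :, None].expand(1, len(STYLE_IDS), frames)
```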
Fig. 5 is a block diagram illustrating a structure of a speech processing apparatus according to an exemplary embodiment of the present disclosure. As shown in fig. 5, the apparatus 100 includes: a receiving module 10, configured to receive to-be-processed voice information input by a user and determine target music score information corresponding to the to-be-processed voice information; an extracting module 20, configured to extract to-be-processed spectrum data in the to-be-processed voice information; a conversion module 30, configured to obtain corrected target spectrum data according to the to-be-processed spectrum data and the target music score information; and a synthesis module 40, configured to synthesize, through a vocoder, the tuned target voice according to the target spectrum data.
Through the above technical solution, the voice information input by the user can be tuned by correcting the spectrum information in that voice information, without extra hardware overhead and without pitch correction at the signal-processing level. This avoids the distorted or unpleasant-sounding results that occur when the input deviates greatly from the target pitch, thereby improving the tuning effect and reducing the tuning cost.
In a possible implementation, the conversion module 30 is further configured to:
and inputting the frequency spectrum data to be processed and the target music score information into a preset neural network model to obtain corrected target frequency spectrum data.
In one possible embodiment, the vocoder is a preset neural network vocoder.
In one possible implementation, the voice information to be processed input by the user is the singing voice information without song accompaniment.
In a possible implementation manner, the to-be-processed voice information input by the user is voice information containing a mix with song accompaniment, and before the to-be-processed spectrum data in the to-be-processed voice information is extracted, the apparatus 100 further includes:
a separation module, configured to separate the song accompaniment from the to-be-processed voice information, so as to obtain the to-be-processed singing voice information without the song accompaniment.
In one possible embodiment, the preset neural network model is trained by:
irregularly adjusting melody data and/or pitch data in original singing voice information of a song to be trained, so as to obtain a plurality of pieces of off-tune voice information corresponding to the song to be trained;
acquiring music score information corresponding to the song to be trained;
acquiring tuned voice information obtained after a tuning engineer tunes the original singing voice information of the song to be trained;
and using first spectrum data extracted from the off-tune voice information and the music score information corresponding to the song to be trained as input training data of the preset neural network model, and using second spectrum data extracted from the tuned voice information as output training data of the preset neural network model, so as to train the preset neural network model.
In one possible embodiment, the preset neural network model is trained by:
and classifying the training data of the preset neural network model according to style types.
In a possible implementation, the apparatus 100 further includes: a determining module, configured to determine a target style type corresponding to the to-be-processed voice information; the conversion module is further configured to input the to-be-processed spectrum data, the target music score information, and the target style type into a preset neural network model, so as to obtain corrected target spectrum data.
Referring now to FIG. 6, a block diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
In general, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 608 including, for example, magnetic tape, hard disk, etc.; and communication devices 609.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: receive to-be-processed voice information input by a user, and determine target music score information corresponding to the to-be-processed voice information; extract to-be-processed spectrum data from the to-be-processed voice information; input the to-be-processed spectrum data and the target music score information into a preset neural network model to obtain corrected target spectrum data; and synthesize, through a preset neural network vocoder, the tuned target voice according to the target spectrum data.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including but not limited to an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of the module does not constitute a limitation to the module itself in some cases, and for example, the receiving module may also be described as a "module that receives voice information to be processed input by a user".
For example, without limitation, exemplary types of hardware logic that may be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so forth.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides, in accordance with one or more embodiments of the present disclosure, a method of speech processing, the method comprising: receiving voice information to be processed input by a user, and determining target music score information corresponding to the voice information to be processed;
extracting to-be-processed spectrum data in the to-be-processed voice information;
obtaining corrected target frequency spectrum data according to the frequency spectrum data to be processed and the target music score information;
and synthesizing the tuned target voice through a vocoder according to the target spectrum data.
Example 2 provides the method of example 1, and the obtaining the corrected target spectrum data according to the to-be-processed spectrum data and the target music score information includes:
and inputting the frequency spectrum data to be processed and the target music score information into a preset neural network model to obtain corrected target frequency spectrum data.
Example 3 provides the method of example 1, the vocoder being a preset neural network vocoder, according to one or more embodiments of the present disclosure.
Example 4 provides the method of any one of examples 1-3, the to-be-processed voice information input by the user being singing voice information without song accompaniment, according to one or more embodiments of the present disclosure.
Example 5 provides the method of any one of examples 1 to 3, wherein the to-be-processed voice information input by the user is voice information containing a mix with song accompaniment, and before extracting the to-be-processed spectrum data in the to-be-processed voice information, the method further includes:
separating the song accompaniment from the to-be-processed voice information to obtain the to-be-processed singing voice information without the song accompaniment.
Example 6 provides the method of example 2, the preset neural network model being trained by:
irregularly adjusting melody data and/or pitch data in original singing voice information of a song to be trained, so as to obtain a plurality of pieces of off-tune voice information corresponding to the song to be trained;
acquiring music score information corresponding to the song to be trained;
acquiring tuned voice information obtained after a tuning engineer tunes the original singing voice information of the song to be trained;
and using first spectrum data extracted from the off-tune voice information and the music score information corresponding to the song to be trained as input training data of the preset neural network model, and using second spectrum data extracted from the tuned voice information as output training data of the preset neural network model, so as to train the preset neural network model.
Example 7 provides the method of example 2, the preset neural network model being trained by:
classifying the training data of the preset neural network model according to style types.
Example 8 provides the method of example 7, further comprising, in accordance with one or more embodiments of the present disclosure:
determining a target style type corresponding to the voice information to be processed;
the inputting the frequency spectrum data to be processed and the target music score information into a preset neural network model to obtain the corrected target frequency spectrum data comprises:
and inputting the frequency spectrum data to be processed, the target music score information and the target style type into a preset neural network model to obtain corrected target frequency spectrum data.
Example 9 provides, in accordance with one or more embodiments of the present disclosure, a speech processing apparatus, the apparatus comprising:
the receiving module is used for receiving voice information to be processed input by a user and determining target music score information corresponding to the voice information to be processed;
the extraction module is used for extracting the frequency spectrum data to be processed in the voice information to be processed;
the conversion module is used for obtaining modified target frequency spectrum data according to the frequency spectrum data to be processed and the target music score information;
and the synthesis module is used for synthesizing the tuned target voice through a vocoder according to the target spectrum data.
Example 10 provides the apparatus of example 9, the conversion module further to:
and inputting the frequency spectrum data to be processed and the target music score information into a preset neural network model to obtain corrected target frequency spectrum data.
Example 11 provides the apparatus of example 9, the vocoder being a preset neural network vocoder, according to one or more embodiments of the present disclosure.
Example 12 provides the apparatus of any one of examples 9-11, the to-be-processed voice information input by the user being singing voice information without song accompaniment, according to one or more embodiments of the present disclosure.
Example 13 provides the apparatus of any one of examples 9 to 11, in accordance with one or more embodiments of the present disclosure, where the to-be-processed voice information input by the user is voice information containing a mix with song accompaniment, and the apparatus further includes:
a separation module, configured to separate the song accompaniment from the to-be-processed voice information, so as to obtain the to-be-processed singing voice information without the song accompaniment.
Example 14 provides the apparatus of example 10, the preset neural network model being trained in accordance with one or more embodiments of the present disclosure by:
irregularly adjusting melody data and/or pitch data in original singing voice information of a song to be trained, so as to obtain a plurality of pieces of off-tune voice information corresponding to the song to be trained;
acquiring music score information corresponding to the song to be trained;
acquiring tuned voice information obtained after a tuning engineer tunes the original singing voice information of the song to be trained;
and using first spectrum data extracted from the off-tune voice information and the music score information corresponding to the song to be trained as input training data of the preset neural network model, and using second spectrum data extracted from the tuned voice information as output training data of the preset neural network model, so as to train the preset neural network model.
Example 15 provides the apparatus of example 10, the preset neural network model being trained in accordance with one or more embodiments of the present disclosure by:
and classifying the training data of the preset neural network model according to style types.
Example 16 provides the apparatus of example 15, the apparatus further comprising, in accordance with one or more embodiments of the present disclosure:
the determining module is used for determining a target style type corresponding to the voice information to be processed;
the conversion module is further used for inputting the to-be-processed spectrum data, the target music score information, and the target style type into a preset neural network model, so as to obtain corrected target spectrum data.
Example 17 provides a computer-readable medium, on which a computer program is stored, according to one or more embodiments of the present disclosure, characterized in that the program, when executed by a processing device, implements the steps of the method of any one of examples 1-8.
Example 18 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising: a storage device having one or more computer programs stored thereon; one or more processing devices for executing the one or more computer programs in the storage device to implement the steps of the method of any of examples 1-8.
The foregoing description is merely a description of the preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art will appreciate that the scope of the disclosure is not limited to technical solutions formed by the particular combination of the features described above, but also covers other technical solutions formed by any combination of those features or their equivalents without departing from the concept of the disclosure, for example, a technical solution formed by replacing the features described above with (but not limited to) features having similar functions disclosed in this disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims (11)

1. A method of speech processing, the method comprising:
receiving voice information to be processed input by a user, and determining target music score information corresponding to the voice information to be processed;
extracting to-be-processed spectrum data in the to-be-processed voice information;
obtaining corrected target frequency spectrum data according to the frequency spectrum data to be processed and the target music score information;
and synthesizing the tuned target voice through a vocoder according to the target spectrum data.
2. The method according to claim 1, wherein the obtaining the corrected target spectrum data according to the to-be-processed spectrum data and the target music score information comprises:
and inputting the frequency spectrum data to be processed and the target music score information into a preset neural network model to obtain corrected target frequency spectrum data.
3. The method of claim 1, wherein the vocoder is a preset neural network vocoder.
4. The method according to any one of claims 1 to 3, wherein the to-be-processed voice information input by the user is singing voice information without song accompaniment.
5. The method according to any one of claims 1 to 3, wherein the to-be-processed voice information input by the user is voice information containing a mix with song accompaniment, and before extracting the to-be-processed spectrum data in the to-be-processed voice information, the method further comprises:
separating the song accompaniment from the to-be-processed voice information to obtain the to-be-processed singing voice information without the song accompaniment.
6. The method of claim 2, wherein the pre-set neural network model is trained by:
irregularly adjusting melody data and/or pitch data in original singing voice information of a song to be trained, so as to obtain a plurality of pieces of off-tune voice information corresponding to the song to be trained;
acquiring music score information corresponding to the song to be trained;
acquiring tuned voice information obtained after a tuning engineer tunes the original singing voice information of the song to be trained;
and using first spectrum data extracted from the off-tune voice information and the music score information corresponding to the song to be trained as input training data of the preset neural network model, and using second spectrum data extracted from the tuned voice information as output training data of the preset neural network model, so as to train the preset neural network model.
7. The method of claim 2, wherein the pre-set neural network model is trained by:
and classifying the training data of the preset neural network model according to style types.
8. The method of claim 7, further comprising:
determining a target style type corresponding to the voice information to be processed;
the inputting the frequency spectrum data to be processed and the target music score information into a preset neural network model to obtain the corrected target frequency spectrum data comprises:
and inputting the frequency spectrum data to be processed, the target music score information and the target style type into a preset neural network model to obtain corrected target frequency spectrum data.
9. A speech processing apparatus, characterized in that the apparatus comprises:
the receiving module is used for receiving voice information to be processed input by a user and determining target music score information corresponding to the voice information to be processed;
the extraction module is used for extracting the frequency spectrum data to be processed in the voice information to be processed;
the conversion module is used for obtaining modified target frequency spectrum data according to the frequency spectrum data to be processed and the target music score information;
and the synthesis module is used for synthesizing the tuned target voice through a vocoder according to the target spectrum data.
10. A computer-readable medium, on which a computer program is stored, characterized in that the program, when being executed by processing means, carries out the steps of the method of any one of claims 1 to 8.
11. An electronic device, comprising:
a storage device having one or more computer programs stored thereon;
one or more processing devices for executing the one or more computer programs in the storage device to implement the steps of the method of any one of claims 1-8.
CN202010209023.9A 2020-03-23 2020-03-23 Voice processing method and device, readable medium and electronic equipment Active CN111402856B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010209023.9A CN111402856B (en) 2020-03-23 2020-03-23 Voice processing method and device, readable medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010209023.9A CN111402856B (en) 2020-03-23 2020-03-23 Voice processing method and device, readable medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111402856A true CN111402856A (en) 2020-07-10
CN111402856B CN111402856B (en) 2023-04-14

Family

ID=71429092

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010209023.9A Active CN111402856B (en) 2020-03-23 2020-03-23 Voice processing method and device, readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111402856B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101399044A (en) * 2007-09-29 2009-04-01 国际商业机器公司 Voice conversion method and system
CN101515458A (en) * 2008-02-19 2009-08-26 富士通株式会社 Encoding device, encoding method and computer program produce including the method
CN104347080A (en) * 2013-08-09 2015-02-11 雅马哈株式会社 Voice analysis method and device, voice synthesis method and device, and medium storing voice analysis program
CN105825844A (en) * 2015-07-30 2016-08-03 维沃移动通信有限公司 Sound repairing method and device
CN105304080A (en) * 2015-09-22 2016-02-03 科大讯飞股份有限公司 Speech synthesis device and speech synthesis method
WO2018003849A1 (en) * 2016-06-30 2018-01-04 ヤマハ株式会社 Voice synthesizing device and voice synthesizing method
CN108810075A (en) * 2018-04-11 2018-11-13 北京小唱科技有限公司 The audio update the system realized based on server end
CN109817197A (en) * 2019-03-04 2019-05-28 天翼爱音乐文化科技有限公司 Song generation method, device, computer equipment and storage medium
CN110415722A (en) * 2019-07-25 2019-11-05 北京得意音通技术有限责任公司 Audio signal processing method, storage medium, computer program and electronic equipment

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112786006A (en) * 2021-01-13 2021-05-11 北京有竹居网络技术有限公司 Speech synthesis method, synthesis model training method, apparatus, medium, and device

Also Published As

Publication number Publication date
CN111402856B (en) 2023-04-14

Similar Documents

Publication Publication Date Title
CN111445892B (en) Song generation method and device, readable medium and electronic equipment
CN111402843B (en) Rap music generation method and device, readable medium and electronic equipment
CN110503961B (en) Audio recognition method and device, storage medium and electronic equipment
CN106898340B (en) Song synthesis method and terminal
CN109543064B (en) Lyric display processing method and device, electronic equipment and computer storage medium
CN111798821B (en) Sound conversion method, device, readable storage medium and electronic equipment
CN111369967B (en) Virtual character-based voice synthesis method, device, medium and equipment
CN111292717B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111402842B (en) Method, apparatus, device and medium for generating audio
CN112489620B (en) Speech synthesis method, device, readable medium and electronic equipment
CN111445897B (en) Song generation method and device, readable medium and electronic equipment
CN111583900A (en) Song synthesis method and device, readable medium and electronic equipment
CN111369971A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112153460B (en) Video dubbing method and device, electronic equipment and storage medium
CN110211556B (en) Music file processing method, device, terminal and storage medium
CN111161695B (en) Song generation method and device
CN109308901A (en) Chanteur's recognition methods and device
CN113205793B (en) Audio generation method and device, storage medium and electronic equipment
CN112786013A (en) Voice synthesis method and device based on album, readable medium and electronic equipment
CN111402856B (en) Voice processing method and device, readable medium and electronic equipment
CN111429881B (en) Speech synthesis method and device, readable medium and electronic equipment
CN111916050A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111369968A (en) Sound reproduction method, device, readable medium and electronic equipment
CN110400559B (en) Audio synthesis method, device and equipment
CN109495786B (en) Pre-configuration method and device of video processing parameter information and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant