CN110782866A

CN110782866A - Singing sound converter

Info

Publication number: CN110782866A
Application number: CN201910868874.1A
Authority: CN
Inventors: 杨宇娟; 王小侠; 曹鑫
Original assignee: North University of China
Current assignee: North University of China
Priority date: 2019-09-16
Filing date: 2019-09-16
Publication date: 2020-02-11

Abstract

The invention discloses a singing sound converter and belongs to the technical field of musical instruments. The singing sound converter comprises a user singing system, an audio recognition system, an instant synthesizer, a human voice timbre library, and a player. The user singing system detects in real time the singing voice of a song sung by the user whose voice is to be converted (the conversion user) and sends the singing voice to the audio recognition system. The audio recognition system recognizes the singing voice through a preset neural network model, determines the sound features of the singing voice, and sends the sound features to the instant synthesizer. The instant synthesizer determines a target human voice timbre library from the human voice timbre library, calls timbres in the target human voice timbre library according to the sound features, synthesizes a new singing sound, and sends the new singing sound to the player. The invention can synthesize the new singing sound in real time while the user sings, which reduces the calculation time before synthesis, speeds up synthesis of the new singing sound, and ensures the timeliness of synthesis.

Description

Singing sound converter

Technical Field

The invention relates to the technical field of musical instruments, in particular to a singing sound converter.

Background

In singing, the same song can be sung in different ways, and different people express the same song with different emotions. Sometimes the voice of one singer needs to be converted into the voice of another singer, and sometimes one singer's voice needs to be converted into a different singing style (such as bel canto or folk singing).

In the prior art, a voice changer or voice changing software often collects and identifies an audio signal of a singing sound of a user, processes the identified audio signal, and modifies the audio characteristic of the audio signal, so that the identified audio signal is converted to obtain a new singing sound.

However, because this conversion works by altering the audio signal, the audio characteristics of the original singing voice are still retained, the singing voice cannot be processed well, and the accuracy of the conversion is low. In addition, the audio signal of the user's singing voice is recognized first and converted afterwards, that is, audio recognition and sound conversion are processed separately, so immediate conversion is not possible and the efficiency of the conversion is low.

Disclosure of Invention

In order to solve the problems of low accuracy and low efficiency of singing voice conversion in the related art, an embodiment of the invention provides a singing sound converter, which comprises: a user singing system, an audio recognition system, an instant synthesizer, a human voice timbre library, and a player;

the user singing system is used for detecting in real time the singing voice of a song sung by a conversion user and sending the singing voice to the audio recognition system;

the audio recognition system is used for recognizing the singing voice through a preset neural network model, determining the sound features of the singing voice, and sending the sound features to the instant synthesizer;

the instant synthesizer is used for determining a target human voice timbre library from the human voice timbre library, calling timbres in the target human voice timbre library according to the sound features, synthesizing a new singing sound, and sending the new singing sound to the player;

and the player is used for playing the new singing sound in real time.

Optionally, the detecting in real time the singing voice of a song sung by the conversion user and sending the singing voice to the audio recognition system includes:

when one audio segment of a song sung by the conversion user is detected, determining the audio segment as the singing voice, and sending the singing voice to the audio recognition system; or,

when a preset number of audio segments of a song sung by the conversion user are detected, determining the preset number of audio segments as the singing voice, and sending the singing voice to the audio recognition system.

Optionally, the recognizing the singing voice through a preset neural network model and determining the voice feature of the singing voice include:

and inputting the singing voice into the preset neural network model, and determining the output of the preset neural network model as the voice characteristic of the singing voice.

Optionally, before recognizing the singing voice through the preset neural network model, the method further includes:

and acquiring a singing sound set, and training parameters of the neural network through the singing sound set to obtain the preset neural network model.

Optionally, the training the parameters of the neural network through the singing sound set to obtain the preset neural network model includes:

marking the sound characteristics of each singing sound in the singing sound set;

inputting each singing voice in the singing voice set into a neural network, and adjusting parameters of the neural network according to a difference value between the output of the neural network and the labeled voice characteristics;

and when the difference value between the output of the neural network and the labeled sound characteristic is smaller than a preset parameter threshold value after each singing sound in the singing sound set is input into the neural network, determining the neural network as the preset neural network model.

Optionally, the sound features include pitch and pinyin, and before determining the target human voice timbre library from the human voice timbre library, the method further includes:

a user recording in advance the timbre of every Chinese pinyin combination in a first preset tone;

identifying whether the pitch of each recorded timbre is a first preset pitch, and if it is not, adjusting the pitch of the timbre to the first preset pitch through a variable-speed pitch-shifting algorithm;

and expanding, through a preset algorithm and based on the pitch of the recorded timbre and the tone of the pinyin, timbres at a second preset pitch and in a second preset tone, marking the pitch and the pinyin of each timbre, and generating the human voice timbre library.

Optionally, the calling timbres in the target human voice timbre library according to the sound features to synthesize a new singing sound includes:

searching for the sound features in the target human voice timbre library;

and when the sound features are found, calling the timbre corresponding to the sound features, and synthesizing the new singing sound.

Optionally, after the searching for the sound features in the target human voice timbre library, the method further includes:

when the sound features cannot be found, calling a timbre whose similarity to the sound features is greater than a preset similarity threshold, and synthesizing the new singing sound; or,

when the sound features cannot be found, returning a synthesis failure signal and prompting the user to update the target human voice timbre library.

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

In the embodiment of the invention, the singing voice of the conversion user is replaced with timbres from the target human voice timbre library, that is, by homologous substitution, rather than being changed by altering the audio signal, so the singing voice of the conversion user can be thoroughly filtered out and the accuracy of synthesizing a new singing sound is improved. In addition, timbres covering all pinyin and pitches are generated in advance, so no new timbre needs to be generated by variable-speed pitch shifting in real time. The pre-generated timbres can therefore be called while the user is singing, and the new singing sound is synthesized in real time rather than after the user has finished singing, which reduces the calculation time before synthesis, speeds up synthesis of the new singing sound (that is, improves synthesis efficiency), and ensures the timeliness of synthesis.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a block diagram of a singing sound converter according to an embodiment of the present invention;

Fig. 2 is a schematic flow chart of recording a human voice timbre library according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

For convenience of understanding, before explaining the embodiments of the present invention in detail, an application scenario related to the embodiments of the present invention will be described.

With the rapid development of musical instrument technology, people often use a voice changer or voice changing software to convert singing voices. At present, a voice changer or voice changing software typically collects and recognizes the audio signal of the user's singing voice, processes the recognized audio signal, and modifies its audio characteristics to obtain a new singing sound. As a result, the audio characteristics of the original singing voice are still retained, the singing voice cannot be processed well, and the accuracy of the conversion is low. Moreover, audio recognition and sound conversion are processed separately, so immediate conversion is not possible and the efficiency of the conversion is also low. The invention therefore provides a singing sound converter to improve the accuracy and efficiency of singing voice conversion.

The singing voice converter provided by the embodiment of the invention will be described in detail with reference to fig. 1-2.

Fig. 1 is a block diagram of a singing sound converter according to an embodiment of the present invention. Referring to Fig. 1, the singing sound converter includes a user singing system, an audio recognition system, an instant synthesizer, a human voice timbre library, and a player. The user singing system is used for detecting in real time the singing voice of a song sung by the conversion user and sending the singing voice to the audio recognition system; the audio recognition system is used for recognizing the singing voice through a preset neural network model, determining the sound features of the singing voice, and sending the sound features to the instant synthesizer; the instant synthesizer is used for determining a target human voice timbre library from the human voice timbre library, calling timbres in the target human voice timbre library according to the sound features, synthesizing a new singing sound, and sending the new singing sound to the player; and the player is used for playing the new singing sound in real time.
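The data flow between the components of Fig. 1 can be sketched as follows. This is a minimal illustration only: the class name `SingingSoundConverter`, the `Feature` tuple layout, and the use of plain callables for the recognition system, synthesizer, and player are assumptions made for the example and are not specified by the invention.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Hypothetical representation: a sound feature is (pitch, tone, pinyin),
# and audio is a list of float samples.
Feature = Tuple[str, int, str]
Audio = List[float]


@dataclass
class SingingSoundConverter:
    recognize: Callable[[Audio], List[Feature]]   # audio recognition system
    synthesize: Callable[[List[Feature]], Audio]  # instant synthesizer
    play: Callable[[Audio], None]                 # player

    def run(self, singing_sounds: Iterable[Audio]) -> None:
        """Process each detected singing sound as soon as it arrives."""
        for sound in singing_sounds:
            features = self.recognize(sound)
            new_sound = self.synthesize(features)
            self.play(new_sound)
```

Because `run` consumes an iterable, the user singing system can be modeled as a generator that yields singing sounds while the user is still singing.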

Further, the user singing system may detect in real time the singing voice of a song sung by the conversion user and send it to the audio recognition system as follows: when one audio segment of a song sung by the conversion user is detected, the user singing system determines that audio segment as the singing voice and sends it to the audio recognition system; or, when a preset number of audio segments of a song sung by the conversion user are detected, the user singing system determines the preset number of audio segments as the singing voice and sends it to the audio recognition system.

Here, the conversion user refers to a user who needs singing voice conversion. The preset number may be set in advance and is not limited by the present invention; for example, it may be 5, 10, or 15. If the preset number is 5, the user singing system sends the 5 detected audio segments together to the audio recognition system as the singing voice.

In addition, the user singing system may also work as follows: it determines the detected continuous audio as the singing voice and sends it to the audio recognition system. Specifically, the user singing system continuously detects the audio of the song sung by the conversion user, and when it detects no audio within a preset time, it determines the continuous audio detected before that preset time as the singing voice and sends it to the audio recognition system.

It should be noted that a song is often divided into a plurality of sentences, each sentence is sung in turn when a conversion user sings, and a certain time interval is left between every two sung sentences, so that the user singing system can determine a complete sentence as a singing voice and send the singing voice to the audio recognition system.
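The two segmentation strategies described above (a preset number of audio segments, or continuous audio terminated by a silent interval) can be sketched as follows. The function names, the use of `None` to represent a silent frame, and measuring the preset time as a count of silent frames are assumptions made for this example.

```python
from typing import Iterable, Iterator, List, Optional


def segment_by_count(frames: Iterable[bytes], preset_number: int) -> Iterator[List[bytes]]:
    """Group detected audio segments into batches of `preset_number`."""
    batch: List[bytes] = []
    for frame in frames:
        batch.append(frame)
        if len(batch) == preset_number:
            yield batch
            batch = []
    if batch:  # flush whatever is left when the stream ends
        yield batch


def segment_by_silence(
    frames: Iterable[Optional[bytes]], max_silent_frames: int
) -> Iterator[List[bytes]]:
    """Emit the continuous audio collected so far once `max_silent_frames`
    consecutive silent frames (represented here as None) have been seen."""
    phrase: List[bytes] = []
    silent = 0
    for frame in frames:
        if frame is None:
            silent += 1
            if silent >= max_silent_frames and phrase:
                yield phrase
                phrase = []
        else:
            silent = 0
            phrase.append(frame)
    if phrase:
        yield phrase
```

With a silence limit chosen to match the pause between sung sentences, `segment_by_silence` yields roughly one sentence per singing sound.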

Further, the audio recognition system may recognize the singing voice through the preset neural network model and determine its sound features as follows: the audio recognition system inputs the singing voice into the preset neural network model and determines the output of the model as the sound features of the singing voice. Specifically, the audio recognition system may input parameters of the singing voice, such as its autocorrelation coefficients, power spectral density function, and Fourier-transform frequency-domain representation (including the imaginary part), into the preset neural network model; the pitch, pinyin, and other values output by the model are the sound features of the singing voice.
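As an illustration of the input parameters named above, the following sketch computes autocorrelation coefficients, a periodogram-style power spectral density, and the real and imaginary parts of a discrete Fourier transform for one short frame, using only the standard library. The normalization choices and the way the values are concatenated into one vector in `build_input_vector` are assumptions for the example; the invention does not specify them.

```python
import cmath
import math
from typing import List


def autocorrelation(x: List[float], max_lag: int) -> List[float]:
    """Unnormalized autocorrelation for lags 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(max_lag + 1)]


def dft(x: List[float]) -> List[complex]:
    """Direct O(n^2) discrete Fourier transform (fine for short frames)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def power_spectral_density(spectrum: List[complex]) -> List[float]:
    """Periodogram estimate: |X[k]|^2 / n."""
    n = len(spectrum)
    return [abs(c) ** 2 / n for c in spectrum]


def build_input_vector(x: List[float], max_lag: int) -> List[float]:
    """Concatenate the three parameter groups into one model input."""
    spectrum = dft(x)
    features = autocorrelation(x, max_lag)
    features += power_spectral_density(spectrum)
    for c in spectrum:
        features += [c.real, c.imag]  # keep the imaginary part as well
    return features
```

A production system would more likely use an FFT routine than the direct transform shown here; the direct form is used only to keep the sketch self-contained.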

It should be noted that a neural network model is a mathematical model that processes information using a structure similar to the synaptic connections of the brain. It is formed by a large number of interconnected nodes (also called neurons), processes information by adjusting the connections among these nodes according to the complexity of the system, and has self-learning and self-adapting capabilities.

It should be further noted that the preset neural network model in the present invention may include a CNN (Convolutional Neural Network) and/or an RNN (Recurrent Neural Network). That is, the singing voice may be recognized by the CNN alone, by the RNN alone, or by the CNN and the RNN in combination. When the sound features of the singing voice include pitch and pinyin, the CNN may be used to recognize formants and pitch features, and the RNN may be used to recognize pinyin.

Further, before the singing voice is recognized through the preset neural network model, the audio recognition system can also obtain a singing voice set, and parameters of the neural network are trained through the singing voice set to obtain the preset neural network model. Specifically, the audio recognition system may mark a sound feature of each singing sound in the set of singing sounds, input each singing sound in the set of singing sounds into the neural network, adjust a parameter of the neural network according to a difference between an output of the neural network and the marked sound feature, and determine the neural network as a preset neural network model when a difference between an output of the neural network and the marked sound feature is smaller than a preset parameter threshold after each singing sound in the set of singing sounds is input into the neural network.

The preset parameter threshold is the criterion for judging how well the neural network parameters have been trained. It can be set in advance, and the smaller its value, the higher the accuracy with which the trained preset neural network model recognizes singing voices. In addition, to ensure that accuracy, a large number of singing sounds are required to train the neural network; that is, the singing sound set used for training should contain a large number of singing sounds, and the more it contains, the higher the accuracy of the trained model.
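The stopping rule described above (keep adjusting parameters until the difference between output and label is below the preset parameter threshold for every sample in the set) can be sketched with a deliberately tiny model. A single linear unit trained by per-sample gradient steps stands in for the CNN/RNN here; the function name, learning rate, and epoch limit are assumptions for the example.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # (input parameters, labeled feature value)


def train_until_threshold(
    samples: List[Sample],
    threshold: float,
    learning_rate: float = 0.1,
    max_epochs: int = 10000,
) -> Tuple[List[float], float, int]:
    """Adjust parameters from the output/label difference until the largest
    difference seen over one full pass of the set is below `threshold`."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for epoch in range(1, max_epochs + 1):
        worst = 0.0
        for inputs, label in samples:
            output = sum(w * x for w, x in zip(weights, inputs)) + bias
            diff = output - label
            worst = max(worst, abs(diff))
            # Gradient step on the squared difference for this sample.
            weights = [w - learning_rate * diff * x for w, x in zip(weights, inputs)]
            bias -= learning_rate * diff
        if worst < threshold:
            return weights, bias, epoch  # this becomes the "preset" model
    raise RuntimeError("threshold not reached within max_epochs")
```

The returned parameters play the role of the preset neural network model; a real system would apply the same stopping rule to a much larger network and training set.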

Further, the sound features may include pitch and pinyin, and the instant synthesizer may record the human voice timbre library before determining the target human voice timbre library from it. Referring to Fig. 2, the human voice timbre library may be recorded as follows: a user records the timbre of every Chinese pinyin combination in the first preset tone; it is identified whether the pitch of each recorded timbre is the first preset pitch, and if not, the pitch is adjusted to the first preset pitch through a variable-speed pitch-shifting algorithm; then, based on the pitch of the timbre and the tone of the pinyin, timbres at the second preset pitch and in the second preset tone are expanded through the preset algorithm; finally, the pitch and pinyin of each timbre are marked to generate the human voice timbre library.

The preset algorithm may be a PSOLA (Pitch Synchronous OverLap-and-Add) algorithm, and the relevant principle and the specific working process of the PSOLA algorithm may refer to the prior art, which is not described in detail herein.
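PSOLA itself is too involved for a short example, but the variable-speed pitch adjustment mentioned in the recording process can be illustrated with naive resampling. In the sketch below, playing a recording back `ratio` times faster raises its pitch by the same ratio and also shortens it, which is exactly the side effect that pitch-synchronous methods such as PSOLA are designed to avoid. The function names and the use of linear interpolation are assumptions for the example.

```python
from typing import List


def semitone_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of `semitones` in equal temperament."""
    return 2.0 ** (semitones / 12.0)


def resample(samples: List[float], ratio: float) -> List[float]:
    """Read the input `ratio` times faster using linear interpolation.
    Pitch is multiplied by `ratio`; duration is divided by `ratio`."""
    out_len = int(len(samples) / ratio)
    out: List[float] = []
    for i in range(out_len):
        pos = i * ratio
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out
```

For instance, `resample(recording, semitone_ratio(-3))` would lower a recording by three semitones while lengthening it.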

It should be noted that the instant synthesizer needs to determine a target human voice timbre library from the human voice timbre library and then call timbres in the target library according to the sound features, so it must first obtain the human voice timbre library. The instant synthesizer may record and store the human voice timbre library itself in advance; alternatively, the library may be recorded in advance by other computer equipment, and the instant synthesizer may then obtain the recorded library directly from that equipment.

In addition, a plurality of human voice timbre libraries can be stored in the instant synthesizer in advance, each marked with singer information and singing method information, so that the instant synthesizer can determine the corresponding target human voice timbre library according to the user's selection.

The target human voice timbre library is a timbre library recorded by a target user, so that the singing voice of the conversion user can be converted into the singing voice of the target user. Because a user can use different singing methods (such as bel canto or folk singing) when singing a song, the user can record timbre libraries for different singing methods in advance; in this case the target user and the conversion user are the same person, and the target human voice timbre library is the user's own timbre library for a different singing method. In addition, people who lack sufficient singing skill can replace their own singing voice with another person's voice to beautify it; in this case the target user and the conversion user are not the same person, and the target human voice timbre library is the timbre library recorded by the target user.

It should be noted that all pinyin combinations of Chinese refers to all combinations of an initial with a final. Because each pinyin syllable has 4 tones, recording every tone of every pinyin would be time-consuming and laborious, so the user only needs to record each pinyin in the first preset tone, and the pinyin in the second preset tones is then generated by expansion. Likewise, because each sound can be sung at many pitches, recording every pitch of every pinyin would be time-consuming and laborious, and because users differ in musical training, they may not hold the intended pitch well during recording. The instant synthesizer or computer equipment can therefore judge whether the recorded pitch is the first preset pitch, adjust it to the first preset pitch if it is not, and then expand the pinyin to the second preset pitches.

That is, the first preset tone may be set in advance and is the tone the user uses when recording each pinyin; for example, the first preset tone may be the first tone. The first preset pitch may also be set in advance and is the pitch the user uses when recording each pinyin; for example, the first preset pitch may be middle C. The second preset pitch may also be set in advance and is a pitch to be generated by expansion; for example, the second preset pitch may be pitch A or pitch B. The second preset tone may also be set in advance and is a tone to be generated by expansion; for example, the second preset tone may be the second, third, or fourth tone.

For example, suppose the first preset tone is the first tone, the first preset pitch is middle C, the second preset pitches are pitch A and pitch B, and the second preset tones are the second, third, and fourth tones. To record the pinyin hao, the user records hao in the first tone at middle C. The recording is then expanded to hao at pitch A in the first, second, third, and fourth tones; hao at pitch B in the second, third, and fourth tones; and hao at middle C in the second, third, and fourth tones. The pitch and pinyin of each expanded hao are marked.
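The labeling side of this expansion can be sketched as an enumeration of (pitch, tone, pinyin) keys. The sketch assumes that expansion covers the full combination of every preset pitch with every preset tone, excluding the one entry the user actually recorded; the function name and the pitch labels `"C4"`, `"A"`, and `"B"` are illustrative. Generating the audio for each key (for example with PSOLA) is outside the scope of the example.

```python
from itertools import product
from typing import List, Sequence, Tuple

Feature = Tuple[str, int, str]  # (pitch, tone, pinyin)


def expand_entries(
    pinyin: str,
    first_pitch: str,
    first_tone: int,
    second_pitches: Sequence[str],
    second_tones: Sequence[int],
) -> List[Feature]:
    """List the library keys to generate from one recorded syllable."""
    pitches = [first_pitch, *second_pitches]
    tones = [first_tone, *second_tones]
    recorded = (first_pitch, first_tone, pinyin)
    return [
        (pitch, tone, pinyin)
        for pitch, tone in product(pitches, tones)
        if (pitch, tone, pinyin) != recorded
    ]
```

Called as `expand_entries("hao", "C4", 1, ["A", "B"], [2, 3, 4])`, this yields the keys for every pitch and tone combination of hao other than the recorded middle C, first-tone entry.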

Further, the instant synthesizer may determine the target human voice timbre library from the human voice timbre library as follows: according to the singer information and singing method information selected by the user, the instant synthesizer determines the corresponding target human voice timbre library from the plurality of human voice timbre libraries it stores.

Further, the instant synthesizer may call timbres in the target human voice timbre library according to the sound features and synthesize a new singing sound as follows: the instant synthesizer searches for the sound features in the target human voice timbre library, and when the sound features are found, it calls the timbres corresponding to the sound features and synthesizes a new singing sound.

For example, suppose the sound features are pitch A, third-tone hao; pitch B, fourth-tone duo; and pitch A, second-tone hua, and the singer information and singing method information selected by the user are user A and the vocal singing method. The instant synthesizer finds, among the human voice timbre libraries it stores, the library corresponding to user A's vocal singing method, searches for each of the sound features in that library, calls the corresponding timbres, and synthesizes them into a new singing sound.
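The selection and lookup steps in this example can be sketched with dictionaries. The data layout is an assumption: libraries are keyed by (singer, singing method), each library maps a (pitch, tone, pinyin) feature to pre-generated samples, and synthesis is reduced to simple concatenation of the retrieved timbres.

```python
from typing import Dict, List, Tuple

Feature = Tuple[str, int, str]               # (pitch, tone, pinyin)
TimbreLibrary = Dict[Feature, List[float]]   # feature -> pre-generated samples
LibraryKey = Tuple[str, str]                 # (singer, singing method)


def select_library(
    libraries: Dict[LibraryKey, TimbreLibrary], singer: str, method: str
) -> TimbreLibrary:
    """Pick the target library from the user's singer and method choice."""
    return libraries[(singer, method)]


def synthesize(library: TimbreLibrary, features: List[Feature]) -> List[float]:
    """Look up each feature and join the stored timbres in order."""
    output: List[float] = []
    for feature in features:
        output.extend(library[feature])
    return output
```

Because every timbre is generated ahead of time, synthesis at singing time is only a sequence of dictionary lookups, which is what makes real-time operation plausible.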

In addition, when the instant synthesizer cannot find the sound features in the target human voice timbre library, it can call a timbre whose similarity to the sound features is greater than the preset similarity threshold to synthesize a new singing sound; alternatively, it can return a synthesis failure signal and prompt the user to update the target human voice timbre library.

The similarity threshold may be set in advance and is used to judge whether two sound features are similar: when the similarity between two sound features is greater than the similarity threshold, the two sound features are considered similar, and their corresponding timbres are therefore also similar.

It should be noted that, because Chinese has a relatively large number of pinyin combinations, a user may miss a pinyin combination during recording. Moreover, users do not all pronounce each pinyin in exactly the same way, so if a pronunciation is misrecognized, the pinyin combination to be recorded may also be missed. In either case, the instant synthesizer may later fail to find the corresponding timbre in the target human voice timbre library. When the instant synthesizer cannot find the sound features corresponding to the singing voice in the target human voice timbre library, it can call the timbre corresponding to similar sound features; it can return a synthesis failure signal and prompt the user to update the target human voice timbre library; or it can call the timbre corresponding to similar sound features and, after synthesizing the new singing sound, prompt the user to update the library. In this way, the singing sound can still be synthesized in real time when the target human voice timbre library is incomplete, which improves the success rate of synthesis, and the user learns promptly that the library is incomplete and can update it.

For example, suppose the sound features corresponding to the singing voice are middle C, first-tone shen and middle C, first-tone yin, but the instant synthesizer cannot find these sound features in the target human voice timbre library. The instant synthesizer determines that the sound features whose similarity to them is greater than the similarity threshold are middle C, first-tone sheng and middle C, first-tone yin. It can then call the timbres corresponding to middle C, first-tone sheng and middle C, first-tone yin to synthesize a new singing sound and, after the synthesis is completed, prompt the user to update the target human voice timbre library.
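The fallback behavior can be sketched as follows. The invention does not define how similarity between sound features is measured, so this example assumes one possible measure: candidates must share the same pitch and tone, and the pinyin strings are compared with `difflib.SequenceMatcher` from the Python standard library. The function name and the use of `LookupError` as the synthesis failure signal are also assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple

Feature = Tuple[str, int, str]  # (pitch, tone, pinyin)


def find_timbre(
    library: Dict[Feature, List[float]],
    feature: Feature,
    similarity_threshold: float,
) -> Tuple[Feature, bool]:
    """Return (library key to use, whether the user should be prompted to
    update the library). Raises LookupError as the synthesis failure signal."""
    if feature in library:
        return feature, False
    pitch, tone, pinyin = feature
    best_key: Optional[Feature] = None
    best_score = 0.0
    for key in library:
        if key[0] != pitch or key[1] != tone:
            continue  # assumed: only same pitch and tone count as similar
        score = SequenceMatcher(None, pinyin, key[2]).ratio()
        if score > best_score:
            best_key, best_score = key, score
    if best_key is not None and best_score > similarity_threshold:
        return best_key, True
    raise LookupError("synthesis failed: update the target timbre library")
```

With a threshold of 0.8, a missing first-tone shen at a given pitch falls back to first-tone sheng at the same pitch, and the returned flag tells the caller to prompt the user to update the library afterwards.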

It is worth noting that, in the invention, the singing voice of the conversion user is replaced with timbres from the target human voice timbre library, that is, by homologous substitution, rather than being changed by altering the audio signal, so the singing voice of the conversion user can be thoroughly filtered out and the accuracy of synthesizing a new singing sound is improved. In addition, timbres covering all pinyin and pitches are generated in advance, so no new timbre needs to be generated by variable-speed pitch shifting in real time. The pre-generated timbres can therefore be called while the user is singing, and the new singing sound is synthesized in real time rather than after the user has finished singing, which reduces the calculation time before synthesis, speeds up synthesis of the new singing sound (that is, improves synthesis efficiency), and ensures the timeliness of synthesis.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A singing sound converter, characterized in that it comprises: the system comprises a user singing system, an audio recognition system, an instant synthesizer, a human voice tone library and a player;

and the player is used for playing the new singing sound in real time.

2. The singing voice converter according to claim 1, wherein the detecting in real time the singing voice of a song being performed by a user and transmitting the singing voice to the audio recognition system comprises:

3. The singing voice converter according to claim 1, wherein the identifying the singing voice through a preset neural network model and determining the voice characteristics of the singing voice comprises:

4. The singing voice converter according to claim 1, wherein before the recognizing the singing voice through the preset neural network model, the method further comprises:

5. The singing voice converter according to claim 4, wherein the training of the parameters of the neural network through the set of singing voices to obtain the preset neural network model comprises:

6. The singing sound converter according to claim 1, wherein the sound features include pitch and pinyin, and wherein before the step of determining the target human voice color library from the human voice color library, the method further comprises:

7. The singing sound converter according to claim 1, wherein said invoking timbres in the target human voice timbre library according to the sound features to synthesize new singing sounds comprises:

searching the sound characteristics in the target person sound color library;

8. The singing sound converter according to claim 7, wherein after searching the sound feature in the target person sound color library, further comprising:

9. The singing sound converter according to claim 7, wherein after searching the sound feature in the target person sound color library, further comprising: