CN113362807A - Real-time sound changing method and device and electronic equipment - Google Patents

Real-time sound changing method and device and electronic equipment

Info

Publication number
CN113362807A
Authority
CN
China
Prior art keywords
model
acoustic
target
training sample
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110463732.4A
Other languages
Chinese (zh)
Inventor
戈文硕
刘恺
陈伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Intelligent Technology Co Ltd
Original Assignee
Beijing Sogou Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Intelligent Technology Co Ltd filed Critical Beijing Sogou Intelligent Technology Co Ltd
Priority to CN202110463732.4A
Publication of CN113362807A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 Adapting to target pitch
    • G10L2021/0135 Voice conversion or morphing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephone Function (AREA)

Abstract

The invention discloses a real-time voice-changing method, which includes: obtaining original voice data of a source speaker; extracting original audio recognition features through a speech recognition model; inputting the original audio recognition features into a target voice-changing model and outputting acoustic features of the target speaker; and outputting the acoustic features of the target speaker as the target voice. In this technical solution, the parameter quantity of the speech recognition model is smaller than a first set parameter quantity and the parameter quantity of the target voice-changing model is smaller than a second set parameter quantity, so both models are small models; together with streaming feature-extraction scheduling, the amount of computation can be greatly reduced and real-time voice changing with low response delay can be achieved.

Description

Real-time sound changing method and device and electronic equipment
Technical Field
The present invention relates to the field of speech technology, and in particular to a real-time voice-changing method and apparatus, and an electronic device.
Background
With the rapid development of speech recognition technology, speech recognition is applied ever more widely, for example in real-time speech translation and voice changing. Existing voice-changing techniques generally collect parallel corpora of a source speaker and a target speaker, align the parallel corpora, and train on them to obtain a voice-changing model, which is then used to perform voice conversion.
In the prior art, a recognition-based voice-changing model requires a large amount of parallel corpus to be collected before training, and the resulting voice-changing model is usually a large model, so it is difficult to achieve real-time voice changing on hardware with extremely low memory and computing resources.
Disclosure of Invention
Embodiments of the invention provide a real-time voice-changing method, a real-time voice-changing apparatus, and an electronic device, which can change voice in real time on hardware with extremely low memory and computing resources.
A first aspect of the embodiments of the present invention provides a real-time voice-changing method, where the method includes:
acquiring original voice data of a source speaker;
extracting original audio recognition features of the original voice data through a speech recognition model, wherein the parameter quantity of the speech recognition model is smaller than a first set parameter quantity;
inputting the original audio recognition features into a target voice-changing model, and outputting acoustic features of a target speaker, wherein the parameter quantity of the target voice-changing model is smaller than a second set parameter quantity;
and outputting the acoustic features of the target speaker as the target voice.
Optionally, the training step of the target voice-changing model includes:
acquiring a training sample set, wherein the training sample set comprises voice data of at least one speaker;
for each training sample in the training sample set, inputting the voice data of the training sample into the speech recognition model for feature extraction, extracting the audio recognition features of the training sample, and extracting the acoustic features of the training sample;
and performing model training according to the audio recognition features and the acoustic features of each training sample to obtain the target voice-changing model.
Optionally, the performing model training according to the audio recognition features and the acoustic features of each training sample to obtain the target voice-changing model includes:
for each training sample, using the audio recognition features of the training sample as input data of the model and the acoustic features of the training sample as output data of the model to perform model training, obtaining a trained voice-changing model, and using the trained voice-changing model as the target voice-changing model.
Optionally, after obtaining the trained voice-changing model, the method further includes:
acquiring voice data of the target speaker;
inputting the voice data of the target speaker into the speech recognition model for feature extraction, and extracting the audio recognition features of the target speaker and the acoustic features of the target speaker;
and performing adaptive training on the trained voice-changing model by using the audio recognition features and the acoustic features of the target speaker to obtain an adaptive voice-changing model, and using the adaptive voice-changing model as the target voice-changing model.
Optionally, the outputting the acoustic features of the target speaker as the target voice includes:
inputting the acoustic features of the target speaker into a vocoder to output the target voice.
A second aspect of the embodiments of the present invention further provides a real-time voice-changing apparatus, including:
a voice data acquisition unit, configured to acquire original voice data of a source speaker;
a feature extraction unit, configured to extract original audio recognition features of the original voice data through a speech recognition model, wherein the parameter quantity of the speech recognition model is smaller than a first set parameter quantity;
a model prediction unit, configured to input the original audio recognition features into a target voice-changing model and output acoustic features of a target speaker, wherein the parameter quantity of the target voice-changing model is smaller than a second set parameter quantity;
and a voice output unit, configured to output the acoustic features of the target speaker as the target voice.
Optionally, the apparatus further includes:
a model training unit, configured to acquire a training sample set, the training sample set comprising voice data of at least one speaker; for each training sample in the training sample set, input the voice data of the training sample into the speech recognition model for feature extraction, extract the audio recognition features of the training sample, and extract the acoustic features of the training sample; and perform model training according to the audio recognition features and the acoustic features of each training sample to obtain the target voice-changing model.
Optionally, the model training unit is configured to, for each training sample, use the audio recognition features of the training sample as input data of a model and the acoustic features of the training sample as output data of the model to perform model training, obtain a trained voice-changing model, and use the trained voice-changing model as the target voice-changing model.
Optionally, the model training unit is configured to acquire voice data of the target speaker after obtaining the trained voice-changing model; input the voice data of the target speaker into the speech recognition model for feature extraction, and extract the audio recognition features of the target speaker and the acoustic features of the target speaker; and perform adaptive training on the trained voice-changing model by using the audio recognition features and the acoustic features of the target speaker to obtain an adaptive voice-changing model, and use the adaptive voice-changing model as the target voice-changing model.
Optionally, the voice output unit is configured to input the acoustic features of the target speaker into a vocoder to output the target voice.
A third aspect of the embodiments of the present invention provides an electronic device, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include operation instructions for performing the real-time voice-changing method according to the first aspect.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the real-time voice-changing method provided in the first aspect.
The above technical solution or solutions in the embodiments of the present application have at least the following technical effects:
Based on the above technical solution, the original voice data of the source speaker is input into the speech recognition model for feature extraction, the extracted original audio recognition features are input into the target voice-changing model, the acoustic features of the target speaker are output, and the acoustic features of the target speaker are then output as the target voice. Because the parameter quantity of the speech recognition model is smaller than the first set parameter quantity and the parameter quantity of the target voice-changing model is smaller than the second set parameter quantity, both models are small models. In addition, streaming feature-extraction scheduling is adopted: the features extracted by the speech recognition model are input into the target voice-changing model, and the acoustic features of the target speaker predicted by the target voice-changing model are then input into the vocoder, which reduces the number of feature extraction passes. Therefore, on the basis that the speech recognition model and the target voice-changing model are small models and the number of feature extraction passes is reduced, the amount of computation can be greatly reduced, and real-time voice changing with low response delay can be achieved.
Drawings
Fig. 1 is a schematic flowchart of a real-time sound-changing method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a method for training a target acoustic variation model according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of a target acoustic change model adaptive training method according to an embodiment of the present disclosure;
fig. 4 is a block diagram of a device for changing sound in real time according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the technical solution provided by the embodiments of the present application, a real-time voice-changing method is provided: the original voice data of a source speaker is input into a speech recognition model for feature extraction, the extracted original audio recognition features are input into a target voice-changing model, the acoustic features of a target speaker are output, and the acoustic features of the target speaker are then output as the target voice. Because the parameter quantity of the speech recognition model is smaller than the first set parameter quantity and the parameter quantity of the target voice-changing model is smaller than the second set parameter quantity, both models are small models. In addition, streaming feature-extraction scheduling is adopted: the features extracted by the speech recognition model are input into the target voice-changing model, and the acoustic features of the target speaker predicted by the target voice-changing model are then input into the vocoder, which reduces the number of feature extraction passes. Therefore, on the basis that the speech recognition model and the target voice-changing model are small models and the number of feature extraction passes is reduced, the amount of computation can be greatly reduced, which solves the problem in the prior art that real-time voice changing is difficult to realize on hardware with extremely low memory and computing resources.
The main implementation principles, specific implementations, and corresponding beneficial effects of the technical solutions of the embodiments of the present application are explained in detail below with reference to the accompanying drawings.
Examples
Referring to fig. 1, an embodiment of the present application provides a real-time voice-changing method, where the method includes:
S101, acquiring original voice data of a source speaker;
S102, extracting original audio recognition features of the original voice data through a speech recognition model, wherein the parameter quantity of the speech recognition model is smaller than a first set parameter quantity;
S103, inputting the original audio recognition features into a target voice-changing model, and outputting acoustic features of a target speaker, wherein the parameter quantity of the target voice-changing model is smaller than a second set parameter quantity;
S104, outputting the acoustic features of the target speaker as the target voice.
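Purely as an illustration of how steps S101 to S104 fit together, the following toy PyTorch sketch mirrors the pipeline; the layer sizes, class names (TinyASR, TinyVoiceChanger), and feature dimensions are assumptions for demonstration, not elements prescribed by this disclosure.

    # Toy PyTorch sketch of the S101-S104 pipeline (all sizes and class names are
    # illustrative assumptions, not part of the claimed method).
    import torch
    import torch.nn as nn

    class TinyASR(nn.Module):              # stands in for the small speech recognition model
        def __init__(self, n_fbank=71, hidden=256):
            super().__init__()
            self.lstm = nn.LSTM(n_fbank, hidden, num_layers=3, batch_first=True)
        def forward(self, fbank):          # fbank: (batch, frames, n_fbank)
            hidden_states, _ = self.lstm(fbank)
            return hidden_states           # hidden-layer outputs used as audio recognition features

    class TinyVoiceChanger(nn.Module):     # stands in for the target voice-changing model
        def __init__(self, hidden=256, n_mel=80):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(hidden, 256), nn.ReLU(), nn.Linear(256, n_mel))
        def forward(self, recognition_features):
            return self.net(recognition_features)   # predicted mel spectrum of the target speaker

    asr, vc = TinyASR(), TinyVoiceChanger()
    fbank_chunk = torch.randn(1, 50, 71)   # S101/S102: one streaming chunk of source-speaker features
    mel = vc(asr(fbank_chunk))             # S102 + S103
    print(mel.shape)                       # torch.Size([1, 50, 80]); S104 feeds this to a vocoder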
In step S101, a source speaker is determined, and after the source speaker is determined, the voice data of the source speaker is acquired as the original voice data. The target speaker may be determined before or after the source speaker is determined. The source speaker and the target speaker may be specified by the user, or may be determined according to the actual situation. In the following description, the case where the target speaker is determined first and the source speaker is determined afterwards is taken as an example.
For example, when a confirmation instruction in which the user designates speaker A as the target speaker is received, the target speaker A is determined; after a confirmation instruction in which the user designates speaker B as the source speaker is received, the source speaker B is determined, and the voice data of the source speaker B is collected as the original voice data.
In the embodiments of the present specification, the source speaker and the target speaker are different speakers.
After the original voice data is acquired, step S102 is executed.
Before step S102 is performed, a speech recognition model needs to be trained in advance; the original voice data is then input into the speech recognition model for feature extraction to extract the original audio recognition features, where the parameter quantity of the speech recognition model is smaller than the first set parameter quantity.
Specifically, in order to enable the speech recognition model to be deployed in a terminal for real-time computation, the parameter quantity of the speech recognition model may be controlled to be smaller than the first set parameter quantity, so that the speech recognition model is a small model that can be deployed on a hardware terminal with extremely low memory and computing resources. The first set parameter quantity may be, for example, between 1M and 8M, such as 1M, 5M, or 6M.
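As a hedged illustration of what the parameter-quantity constraint means in practice, the snippet below counts the trainable parameters of a small 3-layer LSTM and compares them against an assumed 5M threshold (one of the example values above, not a mandated limit):

    # Illustrative check of the "first set parameter quantity" constraint.
    import torch.nn as nn

    def parameter_count(model: nn.Module) -> int:
        # Total number of trainable parameters in the model.
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    small_asr = nn.LSTM(input_size=71, hidden_size=256, num_layers=3, batch_first=True)
    FIRST_SET_PARAMETER_QUANTITY = 5_000_000
    assert parameter_count(small_asr) < FIRST_SET_PARAMETER_QUANTITY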
In the embodiment of the present specification, the speech recognition model may be a general-purpose recognition model, for example a connectionist temporal classification (CTC) model, a long short-term memory (LSTM) network, a CNN model, a CLDNN model, or the like; this specification does not specifically limit the choice.
After the general-purpose recognition model is determined, for example as an LSTM, model training is performed using the voice data of at least one speaker to obtain the speech recognition model; the speech recognition model may be, for example, a 3-layer LSTM with a projection layer.
Specifically, after the speech recognition model is obtained through training, the original voice data is input into the speech recognition model for feature extraction, and the features of a designated hidden layer of the speech recognition model are used as the original audio recognition features, where the designated hidden layer comprises the last hidden layer of the speech recognition model. Of course, the designated hidden layer may also include one or more hidden layers before the last hidden layer; for example, the designated hidden layer may be the last hidden layer together with the hidden layer immediately before it.
In the embodiment of the present specification, the original audio recognition feature is usually an fbank feature, which may be, for example, 71-dimensional or 65-dimensional; the acoustic feature is usually a mel spectrum feature, which may be, for example, 80-dimensional or 72-dimensional; and the original audio recognition feature and the acoustic feature are typically different types of sound features. Of course, the original audio recognition feature may also be the same type of sound feature as the acoustic feature but with a different dimensionality, for example a 72-dimensional original audio recognition feature and a 62-dimensional acoustic feature.
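As an illustration only (the disclosure does not prescribe a particular toolkit), the 71-dimensional fbank features and 80-dimensional mel spectrum features mentioned above could be computed with torchaudio as follows:

    # Illustrative feature extraction with torchaudio (library choice is an assumption):
    # 71-dim fbank as recognition-model input, 80-dim mel spectrum as the acoustic feature.
    import torch
    import torchaudio

    waveform = torch.randn(1, 16000)   # stand-in for 1 s of 16 kHz speech

    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, num_mel_bins=71, sample_frequency=16000.0)                  # (frames, 71)

    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=1024, hop_length=256, n_mels=80)(waveform)   # (1, 80, frames)

    print(fbank.shape, mel.shape)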
After the original audio recognition feature is acquired through step S102, step S103 is performed.
Before step S103 is executed, a target voice-changing model needs to be trained; after the target voice-changing model is obtained through training, the original audio recognition features are input into the target voice-changing model and the acoustic features of the target speaker are output, where the parameter quantity of the target voice-changing model is smaller than the second set parameter quantity.
In the embodiment of the present specification, in order to enable the target voice-changing model to be deployed in a terminal for real-time computation, the parameter quantity of the target voice-changing model may be controlled to be smaller than the second set parameter quantity, so that the target voice-changing model is a small model that can be deployed on a hardware terminal with extremely low memory and computing resources. The second set parameter quantity may be, for example, between 0.5M and 4M, such as 0.8M, 1M, or 1.6M.
Specifically, referring to fig. 2, the training step of the target voice-changing model includes:
S201, acquiring a training sample set, wherein the training sample set comprises voice data of at least one speaker;
S202, for each training sample in the training sample set, inputting the voice data of the training sample into the speech recognition model for feature extraction, extracting the audio recognition features of the training sample, and extracting the acoustic features of the training sample;
S203, performing model training according to the audio recognition features and the acoustic features of each training sample to obtain the target voice-changing model.
In step S201, in the process of obtaining the training sample set, voice data of at least one speaker is collected, and a corpus is constructed from the collected voice data; the training sample set is then obtained from the constructed corpus, wherein the training sample set comprises voice data of the target speaker. Of course, the training sample set may also be obtained directly from the collected voice data of the at least one speaker.
After the training sample set is acquired, step S202 is performed.
In step S202, the speech recognition model from step S102 is first obtained; then, for each training sample in the training sample set, the voice data of the training sample is input into the speech recognition model for feature extraction, the audio recognition features of the training sample are extracted, and the acoustic features of the training sample are extracted. In this manner, the audio recognition features and acoustic features of each training sample can be extracted.
Of course, the above operations may also be performed for only a part of the training samples in the training sample set, so that the audio recognition features and acoustic features of each training sample in that part are extracted.
Specifically, when extracting the acoustic features of a training sample, feature extraction may be performed on the voice data of the training sample using mel-scale frequency cepstral coefficient (MFCC) analysis, thereby obtaining the acoustic features of the training sample.
For each training sample, the voice data of the training sample is input into the speech recognition model for feature extraction to extract the audio recognition features of the training sample. For the specific implementation of extracting the audio recognition features of the training samples, reference may be made to the extraction of the original audio recognition features in step S102; for brevity, details are not repeated here.
In this embodiment of the present specification, in the process of extracting the audio recognition features of each training sample, the audio recognition features may include features of the last layer of the speech recognition model and features of the layer before the last layer. In this case, during training, for each training sample, a convolutional layer with a first convolution structure is created for the features of the last layer, and another convolutional layer with a second convolution structure is created for the features of the layer before the last layer, where the first convolution structure and the second convolution structure are different; training is then performed. In this way, in the training process of the voice-changing model, different convolution structures are used for the different features output by the speech recognition model; together with methods such as sub-band adversarial training and pre-training the voice-changing model on multi-speaker data followed by adaptation on target-speaker data, the quality and similarity of the converted voice are improved and the prediction accuracy of the target voice-changing model is ensured.
For example, taking one training sample as an example, the training sample is input into the speech recognition model for feature extraction, and the extracted audio recognition features include an ASR one-hot feature obtained by passing the last layer of the speech recognition model through a softmax layer, and an ASR bottleneck feature obtained from the layer before the output layer of the recognition model; the ASR one-hot feature is input into the convolutional layer with the first convolution structure, and the ASR bottleneck feature is input into the convolutional layer with the second convolution structure, for model training.
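The idea of giving the ASR one-hot features and the ASR bottleneck features different convolution structures can be sketched as below; the channel counts, kernel sizes, and the simple decoder are assumptions for illustration, not the architecture actually claimed:

    # Two input branches with different convolution structures (dimensions are assumptions).
    import torch
    import torch.nn as nn

    class TwoBranchVoiceChanger(nn.Module):
        def __init__(self, n_onehot=1000, n_bottleneck=256, n_mel=80):
            super().__init__()
            # First convolution structure for the ASR one-hot (posterior) features.
            self.onehot_branch = nn.Conv1d(n_onehot, 128, kernel_size=5, padding=2)
            # Second, different convolution structure for the ASR bottleneck features.
            self.bottleneck_branch = nn.Conv1d(n_bottleneck, 128, kernel_size=3, padding=1)
            self.decoder = nn.Sequential(nn.ReLU(), nn.Conv1d(256, n_mel, kernel_size=1))

        def forward(self, onehot_feats, bottleneck_feats):
            # Inputs are (batch, channels, frames); branch outputs are concatenated and decoded.
            merged = torch.cat([self.onehot_branch(onehot_feats),
                                self.bottleneck_branch(bottleneck_feats)], dim=1)
            return self.decoder(merged)                      # predicted mel spectrum

    model = TwoBranchVoiceChanger()
    mel = model(torch.randn(1, 1000, 50), torch.randn(1, 256, 50))
    print(mel.shape)                                          # torch.Size([1, 80, 50])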
After the audio recognition features and the acoustic features of each training sample are acquired, step S203 is performed.
In step S203, for each training sample, the audio recognition features of the training sample are used as input data of the model and the acoustic features of the training sample are used as output data of the model to perform model training, a trained voice-changing model is obtained, and the trained voice-changing model is used as the target voice-changing model.
Specifically, the trained voice-changing model may be obtained through adversarial training. For example, denote the voice-changing model (generator) by G and the discriminator by D. For each training sample, the audio recognition features of the training sample are input into G to obtain output acoustic features; D is then used to discriminate between the output acoustic features and the acoustic features of the training sample. Through the continuous adversarial optimization of G and D, D eventually either cannot distinguish the output acoustic features from the acoustic features of the training sample, or its discrimination rate between the two satisfies a constraint condition; at that point the acoustic features output by G are extremely close to the acoustic features of the training sample, and this G is used as the trained voice-changing model, i.e., the target voice-changing model.
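The adversarial optimization of G and D described above corresponds, in simplified form, to a training loop like the following sketch; the architectures, losses (BCE plus an L1 term), and optimizer settings are illustrative assumptions:

    # Simplified adversarial training loop: voice-changing model G vs. discriminator D.
    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 80))   # recognition feats -> mel
    D = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 1))     # mel -> real/fake score
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()

    for step in range(100):                                  # toy loop on random stand-in data
        recog_feats = torch.randn(16, 256)                   # audio recognition features of a batch
        real_mel = torch.randn(16, 80)                       # true acoustic features of the batch

        # Discriminator step: distinguish real acoustic features from G's output.
        fake_mel = G(recog_feats).detach()
        d_loss = bce(D(real_mel), torch.ones(16, 1)) + bce(D(fake_mel), torch.zeros(16, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator step: fool D while staying close to the real acoustic features.
        fake_mel = G(recog_feats)
        g_loss = bce(D(fake_mel), torch.ones(16, 1)) + nn.functional.l1_loss(fake_mel, real_mel)
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()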
Because the model is trained in an adversarial manner, the accuracy of the output acoustic features predicted by the target voice-changing model obtained through adversarial training is higher.
In this way, after the target voice-changing model is obtained through the training of steps S201 to S203, and because the target voice-changing model is obtained through adversarial training, the accuracy of the output acoustic features predicted by the target voice-changing model is higher; the acoustic features of the target speaker are output once the original audio recognition features are input into the target voice-changing model. Since the accuracy of the output acoustic features predicted by the target voice-changing model is higher, the accuracy of the acoustic features of the target speaker predicted using the target voice-changing model is also improved.
In another embodiment of the present specification, after the trained voice-changing model is obtained, as shown in fig. 3, the method further includes:
S301, acquiring voice data of the target speaker;
S302, inputting the voice data of the target speaker into the speech recognition model for feature extraction, and extracting the audio recognition features of the target speaker and the acoustic features of the target speaker;
S303, performing adaptive training on the trained voice-changing model by using the audio recognition features and the acoustic features of the target speaker to obtain an adaptive voice-changing model, and using the adaptive voice-changing model as the target voice-changing model.
In step S301, the voice data of the target speaker may be acquired according to the target speaker specified in step S101.
After the voice data of the target speaker is acquired, step S302 is performed.
In this step, the voice data of the target speaker may be input into the speech recognition model for feature extraction, so that the audio recognition features of the target speaker and the acoustic features of the target speaker are extracted.
Specifically, when extracting the acoustic features of the target speaker, feature extraction may be performed on the voice data of the target speaker using MFCC analysis, thereby obtaining the acoustic features of the target speaker.
The voice data of the target speaker is input into the speech recognition model for feature extraction to extract the audio recognition features of the target speaker. For the specific implementation of extracting the audio recognition features of the target speaker, reference may be made to the extraction of the original audio recognition features in step S102; for brevity, details are not repeated here.
Step S303 is performed after the audio recognition features and the acoustic features of the target speaker are extracted.
In this step, the trained voice-changing model is adaptively trained using the audio recognition features and the acoustic features of the target speaker to obtain an adaptive voice-changing model, and the adaptive voice-changing model is used as the target voice-changing model.
In other words, after the trained voice-changing model is obtained by performing model training on the voice data of at least one speaker, i.e., after the voice-changing model is pre-trained, the trained voice-changing model is used as a pre-trained model; the pre-trained model is then adaptively trained on the voice data of the target speaker using the same method as in pre-training, so as to obtain the adaptive voice-changing model. The adaptive voice-changing model matches the target speaker more closely, so the prediction accuracy for the target speaker is higher when the adaptive voice-changing model is used for prediction.
In this way, after the target voice-changing model is obtained through the training of steps S301 to S303, the original audio recognition features are input into the target voice-changing model, and the acoustic features of the target speaker are output. Because the target voice-changing model matches the target speaker closely, the prediction accuracy for the target speaker is higher when the adaptive voice-changing model is used for prediction.
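Conceptually, the adaptation step is a fine-tuning pass of the pre-trained voice-changing model on (audio recognition feature, acoustic feature) pairs from the target speaker, as in the sketch below; the learning rate, epoch count, and loss are assumptions rather than values given in this disclosure:

    # Sketch of adapting a pre-trained voice-changing model on target-speaker data
    # (learning rate, epochs, and data handling are illustrative assumptions).
    import torch
    import torch.nn as nn

    def adapt_voice_changer(pretrained_model: nn.Module, target_pairs, lr=1e-5, epochs=10):
        """target_pairs: iterable of (recognition_features, acoustic_features) from the target speaker."""
        optimizer = torch.optim.Adam(pretrained_model.parameters(), lr=lr)
        loss_fn = nn.L1Loss()
        for _ in range(epochs):
            for recog_feats, target_mel in target_pairs:
                pred_mel = pretrained_model(recog_feats)
                loss = loss_fn(pred_mel, target_mel)
                optimizer.zero_grad(); loss.backward(); optimizer.step()
        return pretrained_model      # now the adaptive (target) voice-changing model

    # Example usage with stand-in data:
    model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 80))
    pairs = [(torch.randn(8, 256), torch.randn(8, 80)) for _ in range(4)]
    adapted = adapt_voice_changer(model, pairs)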
After the acoustic features of the target speaker are predicted by the target voice-changing model, step S104 is performed.
In this step, the acoustic features of the target speaker may be input into a vocoder, which may be, for example, a MelGAN vocoder, to be output as the target voice.
Specifically, the acoustic features of the target speaker are input into the vocoder to generate a voice signal carrying the target voice, and the voice signal is output; in this way, the voice of an arbitrary source speaker can be converted into the target voice and output.
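The vocoder stage follows the usual neural-vocoder pattern of mapping mel frames to waveform samples; the ToyMelVocoder below is a hypothetical stand-in (a real MelGAN generator uses a much deeper convolutional stack), shown only to make the data flow concrete:

    # Generic vocoder step (the vocoder class here is a hypothetical placeholder).
    import torch
    import torch.nn as nn

    class ToyMelVocoder(nn.Module):
        def __init__(self, n_mel=80, hop=256):
            super().__init__()
            # One transposed convolution stands in for a real MelGAN-style generator stack.
            self.upsample = nn.ConvTranspose1d(n_mel, 1, kernel_size=hop * 2,
                                               stride=hop, padding=hop // 2)
        def forward(self, mel):                 # mel: (batch, n_mel, frames)
            return torch.tanh(self.upsample(mel)).squeeze(1)   # (batch, samples)

    vocoder = ToyMelVocoder()
    target_mel = torch.randn(1, 80, 50)         # acoustic features predicted for the target speaker
    waveform = vocoder(target_mel)              # voice signal carrying the target voice
    print(waveform.shape)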
For example, in the adaptive training stage of the target voice-changing model, if the target speaker is determined to be A, the voice data of A is first acquired and its features are extracted, yielding 71-dimensional fbank features; the 71-dimensional fbank features are input into the trained speech recognition model, and the corresponding features from a hidden layer (the layer before the last layer) and the last layer of the speech recognition model are obtained as the audio recognition features, denoted A1; 80-dimensional mel spectrum features are also extracted from the voice data of A and denoted A2; the pre-trained voice-changing model is then adaptively trained with A1 as input data and A2 as output data, and the resulting adaptive voice-changing model is used as the target voice-changing model.
In the stage of changing voice with the target voice-changing model, voice data of a source speaker B (the source speaker can be any speaker) is input; the features of B's voice are first extracted, yielding 71-dimensional fbank features; the 71-dimensional fbank features are input into the speech recognition model, and the corresponding features from a hidden layer (the layer before the last layer) and the last layer of the speech recognition model are obtained as the audio recognition features, denoted B1; when B1 is input into the target voice-changing model, the output 80-dimensional mel spectrum features are denoted B2, and B2 represents the acoustic features of speaker A; B2 is then input into the vocoder to be restored into the corresponding sound, i.e., the voice data of B is output as the voice of A.
In practical applications, because the amount of data from the target speaker is small, a voice-changing model trained on it alone is not stable. Therefore, the audio recognition features extracted by the speech recognition model from the voice data of at least one speaker are used as input data, the mel spectra of that voice data are used as output data, and the voice-changing model is pre-trained to obtain a pre-trained voice-changing model; the voice data of target speaker A is then used to adapt the voice-changing model according to the method above.
In the embodiments of this specification, because only audio data of the target speaker is needed to train the target voice-changing model and no parallel corpus from the source speaker is required, compared with the prior art there is no need for parallel corpora or feature alignment, and the data collection cost is lower. The parameter quantities of the target voice-changing model and the speech recognition model are small, so both can be deployed on hardware with low memory and computing resources; an offline real-time voice-changing service can then be deployed on the device side, which performs stably and avoids problems such as network congestion and high server resource consumption that online services are prone to. In addition, streaming feature-extraction scheduling is adopted: the features extracted by the speech recognition model are input into the target voice-changing model, and the acoustic features of the target speaker predicted by the target voice-changing model are then input into the vocoder, so that real-time voice changing with low response delay can be achieved.
Based on the above technical solution, after the original voice data of the source speaker is obtained, the original audio recognition features of the original voice data are extracted through the speech recognition model, the extracted original audio recognition features are input into the target voice-changing model, the acoustic features of the target speaker are output, and the acoustic features of the target speaker are then output as the target voice. Because the parameter quantity of the speech recognition model is smaller than the first set parameter quantity and the parameter quantity of the target voice-changing model is smaller than the second set parameter quantity, both models are small models. In addition, streaming feature-extraction scheduling is adopted: the features extracted by the speech recognition model are input into the target voice-changing model, and the acoustic features of the target speaker predicted by the target voice-changing model are then input into the vocoder, which reduces the number of feature extraction passes. Therefore, on the basis that the speech recognition model and the target voice-changing model are small models and the number of feature extraction passes is reduced, the amount of computation can be greatly reduced, and real-time voice changing with low response delay can be achieved.
Corresponding to the above embodiment of a real-time voice-changing method, an embodiment of the present application also provides a real-time voice-changing apparatus. Referring to fig. 4, the apparatus includes:
a voice data acquisition unit 401, configured to acquire original voice data of a source speaker;
a feature extraction unit 402, configured to extract original audio recognition features of the original voice data through a speech recognition model, wherein the parameter quantity of the speech recognition model is smaller than a first set parameter quantity;
a model prediction unit 403, configured to input the original audio recognition features into a target voice-changing model and output acoustic features of the target speaker, wherein the parameter quantity of the target voice-changing model is smaller than a second set parameter quantity;
a voice output unit 404, configured to output the acoustic features of the target speaker as the target voice.
In an alternative embodiment, the apparatus further includes:
a model training unit, configured to acquire a training sample set, the training sample set comprising voice data of at least one speaker; for each training sample in the training sample set, input the voice data of the training sample into the speech recognition model for feature extraction, extract the audio recognition features of the training sample, and extract the acoustic features of the training sample; and perform model training according to the audio recognition features and the acoustic features of each training sample to obtain the target voice-changing model.
In an alternative embodiment, the model training unit is configured to, for each training sample, use the audio recognition features of the training sample as input data of a model and the acoustic features of the training sample as output data of the model to perform model training, obtain a trained voice-changing model, and use the trained voice-changing model as the target voice-changing model.
In an alternative embodiment, the model training unit is configured to acquire voice data of the target speaker after obtaining the trained voice-changing model; input the voice data of the target speaker into the speech recognition model for feature extraction, and extract the audio recognition features of the target speaker and the acoustic features of the target speaker; and perform adaptive training on the trained voice-changing model by using the audio recognition features and the acoustic features of the target speaker to obtain an adaptive voice-changing model, and use the adaptive voice-changing model as the target voice-changing model.
In an alternative embodiment, the voice output unit 404 is configured to input the acoustic features of the target speaker into a vocoder to output the target voice.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 5 is a block diagram of an electronic device 800 for real-time voice changing, according to an exemplary embodiment. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, or the like.
Referring to fig. 5, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing element 802 may include one or at least one processor 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or at least one module that facilitates interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front-facing camera and rear-facing camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the device 800, the relative positioning of components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the electronic device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer-readable storage medium having instructions therein which, when executed by a processor of a mobile terminal, enable the mobile terminal to perform a real-time voice-changing method, the method comprising:
acquiring original voice data of a source speaker;
extracting original audio recognition features of the original voice data through a speech recognition model, wherein the parameter quantity of the speech recognition model is smaller than a first set parameter quantity;
inputting the original audio recognition features into a target voice-changing model, and outputting acoustic features of a target speaker, wherein the parameter quantity of the target voice-changing model is smaller than a second set parameter quantity;
and outputting the acoustic features of the target speaker as the target voice.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (11)

1. A real-time voice-changing method, the method comprising:
acquiring original voice data of a source speaker;
extracting original audio recognition features of the original voice data through a speech recognition model, wherein the parameter quantity of the speech recognition model is smaller than a first set parameter quantity;
inputting the original audio recognition features into a target voice-changing model, and outputting acoustic features of a target speaker, wherein the parameter quantity of the target voice-changing model is smaller than a second set parameter quantity;
and outputting the acoustic features of the target speaker as the target voice.
2. The method of claim 1, wherein the step of training the target voice-changing model comprises:
acquiring a training sample set, wherein the training sample set comprises voice data of at least one speaker;
for each training sample in the training sample set, inputting the voice data of the training sample into the speech recognition model for feature extraction, extracting the audio recognition features of the training sample, and extracting the acoustic features of the training sample;
and performing model training according to the audio recognition features and the acoustic features of each training sample to obtain the target voice-changing model.
3. The method of claim 2, wherein the performing model training according to the audio recognition features and the acoustic features of each training sample to obtain the target voice-changing model comprises:
for each training sample, using the audio recognition features of the training sample as input data of the model and the acoustic features of the training sample as output data of the model to perform model training, obtaining a trained voice-changing model, and using the trained voice-changing model as the target voice-changing model.
4. The method of claim 3, wherein after obtaining the trained voice-changing model, the method further comprises:
acquiring voice data of the target speaker;
inputting the voice data of the target speaker into the speech recognition model for feature extraction, and extracting the audio recognition features of the target speaker and the acoustic features of the target speaker;
and performing adaptive training on the trained voice-changing model by using the audio recognition features and the acoustic features of the target speaker to obtain an adaptive voice-changing model, and using the adaptive voice-changing model as the target voice-changing model.
5. The method of claim 1, wherein the outputting the acoustic features of the target speaker as the target voice comprises:
inputting the acoustic features of the target speaker into a vocoder to output the target voice.
6. A real-time voice-changing apparatus, comprising:
a voice data acquisition unit, configured to acquire original voice data of a source speaker;
a feature extraction unit, configured to extract original audio recognition features of the original voice data through a speech recognition model, wherein the parameter quantity of the speech recognition model is smaller than a first set parameter quantity;
a model prediction unit, configured to input the original audio recognition features into a target voice-changing model and output acoustic features of a target speaker, wherein the parameter quantity of the target voice-changing model is smaller than a second set parameter quantity;
and a voice output unit, configured to output the acoustic features of the target speaker as the target voice.
7. The apparatus of claim 6, further comprising:
a model training unit, configured to acquire a training sample set, the training sample set comprising voice data of at least one speaker; for each training sample in the training sample set, input the voice data of the training sample into the speech recognition model for feature extraction, extract the audio recognition features of the training sample, and extract the acoustic features of the training sample; and perform model training according to the audio recognition features and the acoustic features of each training sample to obtain the target voice-changing model.
8. The apparatus of claim 7, wherein the model training unit is configured to, for each training sample, perform model training using the audio recognition features of the training sample as input data of a model and the acoustic features of the training sample as output data of the model, to obtain a trained voice-changing model, and to use the trained voice-changing model as the target voice-changing model.
9. The apparatus of claim 8, wherein the model training unit is configured to acquire voice data of the target speaker after obtaining the trained voice-changing model; input the voice data of the target speaker into the speech recognition model for feature extraction, and extract the audio recognition features of the target speaker and the acoustic features of the target speaker; and perform adaptive training on the trained voice-changing model by using the audio recognition features and the acoustic features of the target speaker to obtain an adaptive voice-changing model, and use the adaptive voice-changing model as the target voice-changing model.
10. An electronic device, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include operation instructions for performing the method according to any one of claims 1 to 5.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps corresponding to the method according to any one of claims 1 to 5.
CN202110463732.4A 2021-04-26 2021-04-26 Real-time sound changing method and device and electronic equipment Pending CN113362807A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110463732.4A CN113362807A (en) 2021-04-26 2021-04-26 Real-time sound changing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110463732.4A CN113362807A (en) 2021-04-26 2021-04-26 Real-time sound changing method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN113362807A true CN113362807A (en) 2021-09-07

Family

ID=77525530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110463732.4A Pending CN113362807A (en) 2021-04-26 2021-04-26 Real-time sound changing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113362807A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015099304A (en) * 2013-11-20 2015-05-28 日本電信電話株式会社 Sympathy/antipathy location detecting apparatus, sympathy/antipathy location detecting method, and program
CN107180628A (en) * 2017-05-19 2017-09-19 百度在线网络技术(北京)有限公司 Set up the method, the method for extracting acoustic feature, device of acoustic feature extraction model
CN111508511A (en) * 2019-01-30 2020-08-07 北京搜狗科技发展有限公司 Real-time sound changing method and device
CN111722696A (en) * 2020-06-17 2020-09-29 苏州思必驰信息科技有限公司 Voice data processing method and device for low-power-consumption equipment
CN112259106A (en) * 2020-10-20 2021-01-22 网易(杭州)网络有限公司 Voiceprint recognition method and device, storage medium and computer equipment

Similar Documents

Publication Publication Date Title
CN109801644B (en) Separation method, separation device, electronic equipment and readable medium for mixed sound signal
CN113362812B (en) Voice recognition method and device and electronic equipment
CN110097890B (en) Voice processing method and device for voice processing
CN111583944A (en) Sound changing method and device
CN109360197B (en) Image processing method and device, electronic equipment and storage medium
CN107945806B (en) User identification method and device based on sound characteristics
CN113707134B (en) Model training method and device for model training
CN108364635B (en) Voice recognition method and device
CN113223542B (en) Audio conversion method and device, storage medium and electronic equipment
US11354520B2 (en) Data processing method and apparatus providing translation based on acoustic model, and storage medium
CN110930978A (en) Language identification method and device and language identification device
CN113362813A (en) Voice recognition method and device and electronic equipment
CN107437412B (en) Acoustic model processing method, voice synthesis method, device and related equipment
WO2022147692A1 (en) Voice command recognition method, electronic device and non-transitory computer-readable storage medium
CN111739535A (en) Voice recognition method and device and electronic equipment
CN113345452B (en) Voice conversion method, training method, device and medium of voice conversion model
CN109102813B (en) Voiceprint recognition method and device, electronic equipment and storage medium
CN113409765B (en) Speech synthesis method and device for speech synthesis
CN117642817A (en) Method, device and storage medium for identifying audio data category
CN113362807A (en) Real-time sound changing method and device and electronic equipment
CN113923517A (en) Background music generation method and device and electronic equipment
CN112818841A (en) Method and related device for recognizing user emotion
CN109102810B (en) Voiceprint recognition method and device
CN108345590B (en) Translation method, translation device, electronic equipment and storage medium
CN113345451B (en) Sound changing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination