CN116092478A - Voice emotion conversion method, device, equipment and storage medium - Google Patents

Voice emotion conversion method, device, equipment and storage medium

Info

Publication number
CN116092478A
Authority
CN
China
Prior art keywords
discrete
emotion
pronunciation unit
pronunciation
predicted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310152979.3A
Other languages
Chinese (zh)
Inventor
陈闽川
马骏
王少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310152979.3A priority Critical patent/CN116092478A/en
Publication of CN116092478A publication Critical patent/CN116092478A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to artificial intelligence technology and provides a voice emotion conversion method, device, equipment and storage medium, wherein the method comprises the following steps: encoding an input original speech signal into a first discrete pronunciation unit sequence having an original emotion; converting the first discrete pronunciation unit sequence into a second discrete pronunciation unit sequence having a target emotion according to a target emotion label; performing prosodic feature prediction on the second discrete pronunciation unit sequence based on the target emotion label to obtain predicted prosodic features, the predicted prosodic features including a predicted duration and a predicted fundamental frequency; and synthesizing a target emotion voice signal according to the target emotion label, the second discrete pronunciation unit sequence, the predicted prosodic features and a speaker identity label of the original voice signal. The method and the device discretize spoken audio in a textless manner and capture expressive speech signals that lie outside the text, so that the voice emotion conversion effect is richer and more realistic.

Description

Voice emotion conversion method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for voice emotion conversion.
Background
Emotion and prosody in speech are a fusion of many factors, such as paralinguistic information, intonation, stress, and style. In speech conversion, emotional prosody is modeled with the aim of giving the model the ability to choose a speaking style appropriate for a given context. Prosodic style is difficult to define precisely, but it carries rich information, such as intent and emotion, and affects the speaker's choice of wording and tone.
Existing emotion generation or emotion conversion techniques have difficulty producing convincing results because each of them handles only a subset or a single aspect of these problems. Signal-based emotion conversion methods focus mainly on manipulating parameters of the speech signal and can only address changes at the timbre and prosody level. Current modeling methods struggle to model non-text signals in emotional speech, such as pauses, laughter, and other mouth sounds, so expressive speech signals that lie outside the text are difficult to capture and the voice emotion conversion effect is not ideal.
Disclosure of Invention
The present application aims to solve the technical problem in the prior art that voice emotion conversion solutions capture only a single aspect or a partial signal and lose the rest, so that the emotion conversion effect is not ideal. The application provides a voice emotion conversion method, device, equipment and storage medium whose main purpose is to improve the voice emotion conversion effect.
In order to achieve the above object, the present application provides a method for converting speech emotion, which is applied to a speech emotion conversion system, and the method includes:
encoding an input original voice signal to obtain a first discrete pronunciation unit sequence with original emotion;
converting the first discrete pronunciation unit sequence into a second discrete pronunciation unit sequence with target emotion according to the target emotion label, wherein the second discrete pronunciation unit sequence and the first discrete pronunciation unit sequence have the same vocabulary content;
based on the target emotion label, performing prosody characteristic prediction on the second discrete pronunciation unit sequence to obtain predicted prosody characteristics, wherein the predicted prosody characteristics comprise predicted duration and predicted fundamental frequency;
and synthesizing the target emotion voice signal according to the target emotion label, the second discrete pronunciation unit sequence, the predicted prosody characteristic and the speaker identity label of the original voice signal.
In addition, in order to achieve the above object, the present application further provides a device for speech emotion conversion, which includes:
the coding module is used for coding the input original voice signal to obtain a first discrete pronunciation unit sequence with original emotion;
The emotion conversion module is used for converting the first discrete pronunciation unit sequence into a second discrete pronunciation unit sequence with target emotion according to the target emotion label, wherein the second discrete pronunciation unit sequence and the first discrete pronunciation unit sequence have the same vocabulary content;
the prediction module is used for predicting the prosody characteristics of the second discrete pronunciation unit sequence based on the target emotion label to obtain predicted prosody characteristics, wherein the predicted prosody characteristics comprise predicted duration and predicted fundamental frequency;
and the synthesis module is used for synthesizing the target emotion voice signal according to the target emotion label, the second discrete pronunciation unit sequence, the predicted prosody characteristic and the speaker identity label of the original voice signal.
To achieve the above object, the present application further provides a computer device including a memory, a processor, and computer readable instructions stored on the memory and executable on the processor, the processor implementing the steps of the above method of speech emotion conversion when executing the computer readable instructions.
To achieve the above object, the present application further provides a computer-readable storage medium having computer-readable instructions stored thereon, which, when executed by a processor, cause the processor to perform the steps of the above method of speech emotion conversion.
The voice emotion conversion method, device, equipment and storage medium break the dependence on text in the prior art. Instead of learning speaking content only from written text, they learn both speaking content and emotional prosody from the original audio signal through discrete pronunciation units, capture expressive speech signals that lie outside the text, convert the discrete pronunciation units of the original emotion into discrete pronunciation units with the target emotion, and finally splice the discrete pronunciation unit sequence with the target emotion into a voice signal with the target emotion according to the predicted prosodic features. The method and the device discretize spoken audio in a textless manner and capture expressive speech signals outside the text, so that the voice emotion conversion effect is richer and more realistic.
Drawings
FIG. 1 is a flowchart illustrating a method for speech emotion conversion according to an embodiment of the present application;
FIG. 2 is a block diagram illustrating an apparatus for speech emotion conversion in an embodiment of the present application;
fig. 3 is a block diagram showing an internal structure of a computer device according to an embodiment of the present application.
The realization, functional characteristics and advantages of the present application will be further described with reference to the embodiments, referring to the attached drawings.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present application based on the embodiments herein. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In daily communication, people often use non-verbal signals such as intonation, pauses, stress, rhythm, laughter, crying, and similar vocalizations to enhance the interactive effect of a dialogue. The same sentence spoken when happy, angry, dejected, or listless sounds very different even though the content is identical. It follows that the emotion of a speech signal is strongly related to the non-verbal signals it contains.
FIG. 1 is a flowchart illustrating a method of voice emotion conversion according to an embodiment of the present application. By way of example, the method is applied to a voice emotion conversion system. The voice emotion conversion method comprises the following steps S100 to S400.
S100: and encoding the input original voice signal to obtain a first discrete pronunciation unit sequence with the original emotion.
Specifically, textless NLP technology encodes the input original speech signal, i.e., the audio, to obtain the first discrete pronunciation unit sequence. This is equivalent to decomposing and converting the audio, which is a sequence of continuous values, into a sequence of discrete values, where the first discrete pronunciation unit sequence includes a plurality of first discrete pronunciation units. The first discrete pronunciation unit sequence has the original emotion of the original speech signal.
The use of a sequence of discrete units to represent speech is intended to capture non-verbal utterances, as well as to facilitate better modeling and sampling.
S200: and converting the first discrete pronunciation unit sequence into a second discrete pronunciation unit sequence with target emotion according to the target emotion label, wherein the second discrete pronunciation unit sequence and the first discrete pronunciation unit sequence have the same vocabulary content.
Specifically, the target emotion label represents the target emotion. On the premise that the vocabulary content of the speech remains unchanged, the first discrete pronunciation unit sequence with the original emotion is converted, or translated, into the second discrete pronunciation unit sequence with the target emotion. The conversion may delete and/or replace some of the first discrete pronunciation units and add or insert some new discrete pronunciation units, so that the resulting second discrete pronunciation unit sequence has the target emotion. The second discrete pronunciation unit sequence keeps the vocabulary content of the first sequence unchanged; the emotion is changed by altering the sequence of discrete pronunciation units itself.
The second discrete pronunciation units included in the second discrete pronunciation unit sequence are not identical to the first discrete pronunciation units included in the first discrete pronunciation unit sequence, and the two sequences may or may not contain the same number of units.
This step converts the sequence from the original emotion to the target emotion by modifying the first discrete pronunciation unit sequence.
S300: and carrying out prosody characteristic prediction on the second discrete pronunciation unit sequence based on the target emotion label to obtain predicted prosody characteristics, wherein the predicted prosody characteristics comprise predicted duration and predicted fundamental frequency.
Specifically, phoneme duration directly affects pronunciation length and the overall prosody of speech. The fundamental frequency (pitch, or F0) is associated with the perceived tone and intonation. Because the second discrete pronunciation unit sequence differs from the first, prosodic features must be predicted for the newly generated second discrete pronunciation unit sequence. During prosodic feature prediction, the target emotion label is used as one of the prediction conditions so that it influences the prediction result.
The predicted prosodic features include a sub-predicted duration and a sub-predicted fundamental frequency for each second discrete pronunciation unit.
Prosody is a feature that affects emotion, and this step alters the emotion of the speech to be synthesized by re-predicting the prosodic features of the second discrete pronunciation units.
S400: and synthesizing the target emotion voice signal according to the target emotion label, the second discrete pronunciation unit sequence, the predicted prosody characteristic and the speaker identity label of the original voice signal.
Specifically, to ensure that the timbre of the converted speech is unchanged, i.e., it is still the timbre of the speaker in the original speech signal, a speaker identity label needs to be provided as input. Both the speaker identity label and the target emotion label serve as conditioning features. The vocoder splices the target emotion label, the second discrete pronunciation unit sequence, the predicted prosodic features and the speaker identity label of the original voice signal to obtain the voice signal with the target emotion, which is a continuous signal. The perceived emotion of the speech, i.e., speech emotion conversion, is thereby realized while the vocabulary content and the speaker identity are retained.
This embodiment breaks the traditional dependence on text. Instead of learning speaking content only from written text, it learns speaking content and emotional prosody from the original audio signal through discrete pronunciation units, captures expressive speech signals that lie outside the text, converts the discrete pronunciation units of the original emotion into discrete pronunciation units with the target emotion, and finally splices the discrete pronunciation unit sequence with the target emotion according to the predicted prosodic features to obtain the voice signal with the target emotion. By discretizing spoken audio in a textless manner, capturing expressive speech signals outside the text, and performing textless voice emotion conversion on the decomposed discrete speech representation while retaining the vocabulary content of the speech signal and the speaker's timbre, this embodiment makes the voice emotion conversion effect richer, more realistic and more natural.
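Viewed end to end, steps S100 to S400 form a simple pipeline. The sketch below (Python) is purely illustrative; the callable parameters encode_to_units, convert_emotion, predict_prosody and synthesize are hypothetical stand-ins for the components described in the following embodiments, not an implementation fixed by this application.

    from typing import Callable

    def voice_emotion_conversion(
        waveform,
        target_emotion: int,
        speaker_id: int,
        encode_to_units: Callable,   # S100: waveform -> discrete pronunciation unit ids
        convert_emotion: Callable,   # S200: (units, target emotion) -> converted units
        predict_prosody: Callable,   # S300: (units, target emotion) -> (durations, f0)
        synthesize: Callable,        # S400: (units, durations, f0, emotion, speaker) -> waveform
    ):
        units_src = encode_to_units(waveform)                       # first discrete unit sequence
        units_tgt = convert_emotion(units_src, target_emotion)      # second discrete unit sequence
        durations, f0 = predict_prosody(units_tgt, target_emotion)  # predicted prosodic features
        return synthesize(units_tgt, durations, f0, target_emotion, speaker_id)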
In one embodiment, the speech emotion conversion system includes a speech pre-training model;
the step S100 specifically includes:
sampling an input original voice signal by utilizing a voice pre-training model to obtain a spectrum characteristic representation;
the spectral feature representation is digitized, and a first sequence of discrete pronunciation units is obtained from the obtained plurality of first discrete pronunciation units.
Specifically, the first sequence of discrete pronunciation units includes a plurality of first discrete pronunciation units.
The original speech signal is a speech waveform. The speech pre-training model is obtained through self-supervised or unsupervised learning and may be, for example, a wav2vec 2.0 model or a HuBERT model, without being limited thereto.
Self-supervised learning (Self-supervised Learning) is a machine learning approach that mines supervision signals directly from large-scale unlabeled data and then performs supervised training on them (it can be regarded as a special case of unsupervised learning). Self-supervised learning does use labels, but the labels come from the data itself rather than from manual annotation. Self-supervised methods include context-based, temporal and contrastive approaches, among others.
Unsupervised learning (Unsupervised Learning) is a machine learning approach that learns a predictive model from unlabeled data; in essence it learns the statistical regularities or latent structure of the data. Unsupervised methods include clustering, K-means, PCA and the like.
The wav2vec 2.0 model encodes the original audio into a sequence of frame features, which are then converted into corresponding discrete features. The sequence of frame features is a spectral feature representation.
The HuBERT model is a self-supervised model for masked prediction on continuous audio signals, similar to BERT. HuBERT obtains its training targets by K-means clustering: in the first iteration the clusters are computed on MFCC features, which are a spectral feature representation, and in subsequent iterations on the intermediate-layer HuBERT features of the model obtained in the previous iteration, and so on. HuBERT borrows the masked language model loss from BERT and uses a Transformer to predict the discrete ids at the masked positions. The iterative procedure generates the training targets, i.e., a discrete id for each frame: K-means clustering on the MFCC features of the speech produces the ids used to train the first-generation HuBERT model, and then the intermediate representations of the previously trained model are clustered to generate new ids for the next round. In this way the HuBERT model converts the continuous speech signal into discrete labels via the K-means algorithm and models them as indices. K-means is a clustering algorithm from unsupervised learning. With the HuBERT model, the original speech signal can be discretized into discrete pronunciation units of a number of different types.
The speech pre-training model is trained on a large amount of unlabeled data and can be applied to various downstream tasks. Extracting discrete representations from speech signals with such a pre-trained model greatly improves the efficiency and accuracy of discrete feature extraction and reduces the tedious work of sample collection, labeling and model training.
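As a concrete illustration of this embodiment, the sketch below extracts discrete pronunciation units from a waveform with a pre-trained HuBERT model and a K-means codebook. It is a minimal sketch under stated assumptions: the torchaudio HUBERT_BASE checkpoint, the choice of hidden layer, the 100-cluster codebook and the collapsing of consecutive repeats are illustrative choices, not details fixed by this application.

    import torch
    import torchaudio
    from sklearn.cluster import KMeans

    bundle = torchaudio.pipelines.HUBERT_BASE        # assumed pre-trained speech model
    hubert = bundle.get_model().eval()

    def extract_units(waveform: torch.Tensor, kmeans: KMeans, layer: int = 6):
        """waveform: (1, num_samples) at bundle.sample_rate -> list of discrete unit ids."""
        with torch.no_grad():
            features, _ = hubert.extract_features(waveform)  # per-layer features, each (1, T, D)
        frames = features[layer].squeeze(0).cpu().numpy()    # (T, D) frame-level representation
        units = kmeans.predict(frames)                       # (T,) discrete pronunciation unit ids
        # Collapse consecutive repeats so each unit marks a change point (optional).
        deduped = [int(units[0])]
        for u in units[1:]:
            if int(u) != deduped[-1]:
                deduped.append(int(u))
        return deduped

    # The codebook is fit once, offline, on pooled frame features from unlabeled audio:
    # kmeans = KMeans(n_clusters=100).fit(pooled_frames)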
In one embodiment, the first sequence of discrete pronunciation units comprises a linguistic pronunciation unit and a first non-linguistic pronunciation unit, and the speech emotion conversion system comprises an emotion conversion model;
the step S200 specifically includes:
according to the target emotion labels, converting a first non-language pronunciation unit with original emotion in a first discrete pronunciation unit sequence into a second non-language pronunciation unit with target emotion through conversion processing by using an emotion conversion model, and meanwhile, reserving the language pronunciation unit to obtain a second discrete pronunciation unit sequence.
Specifically, the language pronunciation units are discrete pronunciation units corresponding to text or vocabulary content, for example, "hello", "thank you", etc. The non-verbal pronunciation units are discrete pronunciation units that do not correspond to text or lexical content, e.g., crying, laughter, etc.
Discrete pronunciation units containing rich information are extracted from the voice signal, and language discrete pronunciation units and non-language discrete pronunciation units are captured. The non-linguistic pronunciation, i.e., the non-textual pronunciation, carries rich emotion signals, so this embodiment changes emotion by changing the non-linguistic pronunciation.
That the first discrete pronunciation unit sequence has the target emotion is judged from an overall perspective; it is possible that some of the first discrete pronunciation units carry the target emotion tag while others do not.
The language pronunciation units and the first non-language pronunciation units are different kinds of first discrete pronunciation units. Through the emotion conversion model, at least one of the first non-language pronunciation units contained in the first discrete pronunciation unit sequence is changed by performing at least one of deleting, replacing and inserting operations according to the target emotion label, thereby obtaining the second non-language pronunciation units. The second non-language pronunciation units may include some, none, or all of the first non-language pronunciation units, and may include newly inserted non-language pronunciation units, depending on the conversion performed by the emotion conversion model. A new non-language pronunciation unit is a pronunciation unit carrying the target emotion label, whereas a first non-language pronunciation unit carries the original emotion label.
Converting the input audio to the target emotion is thus framed as an end-to-end sequence translation problem, and emotion can be converted more easily by inserting, deleting and replacing some non-verbal audio signals, i.e., non-language pronunciation units.
In this embodiment, emotion is changed by deleting, inserting and replacing at least one of the non-language pronunciation units, which realizes emotion conversion of the non-text, non-verbal signals in the speech. The model can therefore not only track spectral and parameter variations of the signal but also model non-verbal vocalization.
In one embodiment, the conversion process includes insert, replace, and delete;
converting a first non-linguistic pronunciation unit having an original emotion in a first sequence of discrete pronunciation units into a second non-linguistic pronunciation unit having a target emotion by a conversion process using an emotion conversion model, comprising:
performing emotion recognition and identification on the first discrete pronunciation units by using the emotion conversion model to obtain an emotion identification label for each first non-language pronunciation unit; and
acquiring a target non-language pronunciation unit whose emotion identification label is the target emotion label, and performing at least one of the following operations to obtain second non-language pronunciation units with the target emotion: deleting a first non-language pronunciation unit whose emotion identification label is the original emotion label, replacing a first non-language pronunciation unit whose emotion identification label is the original emotion label with the target non-language pronunciation unit, and inserting the target non-language pronunciation unit.
Specifically, the emotion conversion model has an emotion identification or emotion recognition function and an emotion conversion function. The emotion conversion model can be constructed by adopting a sequence-to-sequence Transformer model structure. The emotion markup tags generated can be used to control speech synthesis in new ways, such as changing speed and emotion style, which are independent of the lexical content of the speech.
During model training, the training data are discrete pronunciation units of sample audio data with various emotion styles and tones. The sample audio data carry no explicit emotion class labels (such as happiness, anger, etc.), so the model must learn labels autonomously, i.e., soft labels. Because speech contains many kinds of emotion, laughter, crying and other vocalizations, the emotions cannot be exhaustively classified by hand; instead the model learns autonomously from high-dimensional features in the data to obtain various emotion identification labels, so that a model capable of representing diverse emotion styles and tones can be trained.
The emotion conversion model structure can generate interpretable soft labels from the data by itself. These labels can be used to express various kinds of control and transfer tasks and can improve the expressiveness of long-sentence synthesis. The emotion style identification can be applied directly to noisy, unlabeled data, thereby enabling a highly scalable speech conversion system.
More specifically, the emotion conversion model can classify the discrete feature representations in the speech, i.e., the first discrete pronunciation units, using the learned label information, and latently partition and identify the emotion style signals among them. After the emotion conversion model matches a target non-language pronunciation unit corresponding to the target emotion label, it uses that target non-language pronunciation unit to replace first non-language pronunciation units carrying the original emotion label, inserts it into the sequence, or deletes first non-language pronunciation units carrying the original emotion label. The first discrete pronunciation unit sequence is thus changed through deletion, replacement and insertion to obtain the second discrete pronunciation unit sequence, where the target non-language pronunciation unit is a discrete pronunciation unit learned and classified during training of the emotion conversion model.
This embodiment achieves emotion conversion of discrete pronunciation units through a learnable mechanism of inserting, deleting and replacing non-verbal sounds while preserving the vocabulary content (e.g., deleting the non-language pronunciation units corresponding to laughter and inserting non-language pronunciation units corresponding to crying while keeping the message content). Current speech synthesis approaches struggle with the emotional expression of long sentences and model the emotion of individual words within a sentence poorly. By modeling discrete speech units together with emotion embedding soft labels, this embodiment achieves better fine-grained and long-range modeling.
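Since the description frames the conversion as an end-to-end sequence translation problem solved with a sequence-to-sequence Transformer, a minimal sketch of such a unit translator is given below. The vocabulary size, the number of emotion classes, the model width and the way the target emotion embedding is injected are assumptions made here for illustration, not details fixed by this application.

    import torch
    import torch.nn as nn

    class UnitEmotionTranslator(nn.Module):
        def __init__(self, num_units=100, num_emotions=8, d_model=256):
            super().__init__()
            self.unit_emb = nn.Embedding(num_units, d_model)
            self.emotion_emb = nn.Embedding(num_emotions, d_model)
            self.seq2seq = nn.Transformer(
                d_model=d_model, nhead=4,
                num_encoder_layers=3, num_decoder_layers=3,
                batch_first=True,
            )
            self.out = nn.Linear(d_model, num_units)

        def forward(self, src_units, tgt_units, target_emotion):
            # Prepend the target-emotion embedding to the source units so the decoder
            # can insert/delete/replace non-verbal units to match the requested emotion.
            emo = self.emotion_emb(target_emotion).unsqueeze(1)         # (B, 1, D)
            src = torch.cat([emo, self.unit_emb(src_units)], dim=1)     # (B, 1+S, D)
            tgt = self.unit_emb(tgt_units)                               # (B, T, D)
            causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
            hidden = self.seq2seq(src, tgt, tgt_mask=causal)             # (B, T, D)
            return self.out(hidden)                                      # logits over unit ids

    # Usage (teacher forcing): logits = model(src_units, tgt_in, emotion_ids), where
    # tgt_in is the target unit sequence shifted right; train with cross-entropy
    # against the unshifted targets, e.g.
    # loss = nn.functional.cross_entropy(logits.transpose(1, 2), tgt_out)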
In one embodiment, the speech emotion conversion system includes a duration prediction model, and the second sequence of discrete pronunciation units includes a plurality of second discrete pronunciation units;
the step S300 specifically includes:
based on the target emotion labels, predicting the duration of each second discrete pronunciation unit by using a duration prediction model, and obtaining the predicted duration of the second discrete pronunciation unit sequence according to the obtained sub-predicted duration of the second discrete pronunciation units.
Specifically, the duration prediction model is a CNN model; the sub-predicted duration of each second discrete pronunciation unit is obtained with a convolutional neural network (CNN).
During training, the duration prediction model learns the mapping between discrete pronunciation units and durations: the discrete pronunciation units corresponding to sample speech data are used as input, the real durations of those discrete pronunciation units are used as supervision labels, and the model is trained by minimizing the mean square error (MSE) between the predicted durations it outputs and the real durations.
The real durations of the discrete pronunciation units can be obtained with an external tool, such as FFmpeg or librosa.
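A minimal sketch of such a CNN duration predictor is shown below, consistent with this embodiment: unit embeddings conditioned on the target emotion label pass through a small convolutional stack and a linear head that outputs one duration per second discrete pronunciation unit, trained with an MSE loss. The layer sizes and the additive emotion conditioning are illustrative assumptions.

    import torch
    import torch.nn as nn

    class DurationPredictor(nn.Module):
        def __init__(self, num_units=100, num_emotions=8, d_model=256):
            super().__init__()
            self.unit_emb = nn.Embedding(num_units, d_model)
            self.emotion_emb = nn.Embedding(num_emotions, d_model)
            self.conv = nn.Sequential(
                nn.Conv1d(d_model, d_model, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv1d(d_model, d_model, kernel_size=3, padding=1), nn.ReLU(),
            )
            self.proj = nn.Linear(d_model, 1)

        def forward(self, units, target_emotion):
            # units: (B, T) unit ids, target_emotion: (B,) emotion ids
            x = self.unit_emb(units) + self.emotion_emb(target_emotion).unsqueeze(1)
            x = self.conv(x.transpose(1, 2)).transpose(1, 2)   # (B, T, D)
            return self.proj(x).squeeze(-1)                    # (B, T) durations in frames

    # Training step against ground-truth per-unit durations:
    # pred = model(units, emotion_ids)
    # loss = nn.functional.mse_loss(pred, true_durations.float())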
In one embodiment, the speech emotion conversion system further includes a fundamental frequency prediction model,
The step S300 specifically further includes:
and taking the second discrete pronunciation unit sequence carrying the sub-prediction time length of all the second discrete pronunciation units and the target emotion labels as inputs of a fundamental frequency prediction model, predicting the fundamental frequency of each second discrete pronunciation unit by using the fundamental frequency prediction model, and obtaining the predicted fundamental frequency of the second discrete pronunciation unit sequence according to the obtained sub-fundamental frequency of the second discrete pronunciation unit.
Specifically, the fundamental frequency prediction model may also use a CNN model, the fundamental frequency being F0. The fundamental frequency prediction model learns the mapping between discrete pronunciation units and fundamental frequency: the discrete pronunciation units corresponding to sample speech data are used as input, the real fundamental frequency of those discrete pronunciation units is used as a supervision label, and the model is trained by minimizing the binary cross entropy (BCE) between the predicted fundamental frequency it outputs and the real fundamental frequency.
The real fundamental frequency of a discrete pronunciation unit can be extracted with the YAAPT algorithm or the YIN algorithm.
Because the mapping between discrete pronunciation units and prosodic features depends on the target emotion, the target emotion label must also be provided as input to the fundamental frequency prediction model, so that the fundamental frequency is predicted under the condition of the target emotion label and the accuracy is higher.
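A minimal sketch of such a fundamental frequency predictor is given below. Following this embodiment, it consumes the second discrete pronunciation unit sequence, the per-unit sub-predicted durations and the target emotion label. Because the description trains it with a BCE criterion, the sketch assumes F0 targets normalized into [0, 1] (for example by a fixed Hz range) so that BCE is well defined; that normalization, the layer sizes and the additive conditioning are assumptions, not details fixed by this application.

    import torch
    import torch.nn as nn

    class F0Predictor(nn.Module):
        def __init__(self, num_units=100, num_emotions=8, d_model=256):
            super().__init__()
            self.unit_emb = nn.Embedding(num_units, d_model)
            self.emotion_emb = nn.Embedding(num_emotions, d_model)
            self.dur_proj = nn.Linear(1, d_model)
            self.conv = nn.Sequential(
                nn.Conv1d(d_model, d_model, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv1d(d_model, d_model, kernel_size=3, padding=1), nn.ReLU(),
            )
            self.proj = nn.Linear(d_model, 1)

        def forward(self, units, durations, target_emotion):
            # units: (B, T), durations: (B, T), target_emotion: (B,)
            x = (self.unit_emb(units)
                 + self.dur_proj(durations.unsqueeze(-1).float())
                 + self.emotion_emb(target_emotion).unsqueeze(1))
            x = self.conv(x.transpose(1, 2)).transpose(1, 2)
            return torch.sigmoid(self.proj(x)).squeeze(-1)     # (B, T) normalized F0

    # Training, with f0_target normalized into [0, 1]:
    # loss = nn.functional.binary_cross_entropy(model(units, durs, emo), f0_target)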
In one embodiment, a speech emotion conversion system includes a vocoder;
the step S400 specifically includes:
performing expansion processing on the second discrete pronunciation unit sequence by using the predicted time length to obtain a predicted discrete pronunciation unit sequence;
taking the predicted discrete pronunciation unit sequence, the predicted fundamental frequency, the target emotion label and the speaker identity label of the original voice signal as the input of the vocoder;
and splicing the predicted discrete pronunciation unit sequence, the predicted fundamental frequency, the target emotion label and the speaker identity label along a time axis by using the vocoder to obtain the target emotion voice signal.
Specifically, the vocoder may be a neural vocoder such as WaveNet or HiFi-GAN. HiFi-GAN consists of a generator G and a set of discriminators D. The generator G splices the predicted discrete pronunciation unit sequence, the predicted fundamental frequency, the target emotion label and the speaker identity label along the time axis and feeds the spliced signal into a series of convolutional layers that output a one-dimensional signal, yielding the target emotion voice signal.
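The sketch below illustrates one way the expansion and splicing described above could be prepared as vocoder conditioning: each unit embedding is repeated according to its predicted duration, the per-unit F0 is expanded the same way, and the emotion and speaker embeddings are broadcast over the resulting frames before all time-aligned streams are stacked as input channels for a HiFi-GAN-style generator. Realizing the time-axis splicing as a channel-wise stack of time-aligned streams, and all tensor shapes, are assumptions; the actual generator and discriminators are not reproduced here.

    import torch

    def build_vocoder_input(unit_emb, f0, durations, emotion_vec, speaker_vec):
        """unit_emb: (T, D) per-unit embeddings, f0/durations: (T,),
        emotion_vec: (E,), speaker_vec: (S,). Returns (1, D+1+E+S, frames)."""
        frames = durations.float().round().long().clamp(min=1)              # frames per unit
        expanded_units = torch.repeat_interleave(unit_emb, frames, dim=0)   # (F, D)
        expanded_f0 = torch.repeat_interleave(f0, frames).unsqueeze(-1)     # (F, 1)
        n = expanded_units.size(0)
        emo = emotion_vec.unsqueeze(0).expand(n, -1)                        # (F, E)
        spk = speaker_vec.unsqueeze(0).expand(n, -1)                        # (F, S)
        cond = torch.cat([expanded_units, expanded_f0, emo, spk], dim=-1)   # (F, D+1+E+S)
        return cond.transpose(0, 1).unsqueeze(0)                            # (1, channels, F)

    # A HiFi-GAN-like generator would then map this conditioning sequence to a
    # one-dimensional waveform through a stack of transposed 1-D convolutions
    # (torch.nn.ConvTranspose1d) followed by residual blocks.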
This voice emotion conversion scheme combines speech self-supervised representation learning with unsupervised learning, encoding continuous speech signals into discrete pronunciation units with a speech pre-training model trained on a large amount of unlabeled data. This addresses the problems that labeled data for emotional speech synthesis and conversion is scarce, that emotion categories are hard to delimit, and that non-text signals are hard to model. Emotion conversion is performed on the non-text, non-language pronunciation units through the deletion, replacement and insertion mechanism, and further on the discrete pronunciation units through prosodic feature prediction, so that efficient and rich emotion conversion and speech synthesis are achieved while the vocabulary content and the speaker's timbre remain unchanged.
In the prior art, emotion conversion is performed through text; because intonation and non-verbal expression are lost in text, together with much of the expressiveness of spoken language, the emotion conversion effect is poor. In the present application, a self-supervised speech representation method learns discrete units directly from the original audio, which removes the dependence on text and conveys rich semantic and phonetic information beyond the phonemes, so that speech emotion conversion is more natural and fluent.
Fig. 2 is a block diagram of a voice emotion conversion device according to an embodiment of the present application. Referring to fig. 2, the apparatus includes:
the encoding module 100 is configured to encode an input original voice signal to obtain a first discrete pronunciation unit sequence with an original emotion;
the emotion conversion module 200 is configured to convert the first discrete pronunciation unit sequence into a second discrete pronunciation unit sequence with the target emotion according to the target emotion tag, where the second discrete pronunciation unit sequence and the first discrete pronunciation unit sequence have the same vocabulary content;
the prediction module 300 is configured to predict prosodic features of the second discrete pronunciation unit sequence based on the target emotion label, so as to obtain predicted prosodic features, where the predicted prosodic features include a predicted duration and a predicted fundamental frequency;
The synthesizing module 400 is configured to synthesize a target emotion voice signal according to the target emotion label, the second discrete pronunciation unit sequence, the predicted prosody characteristic, and the speaker identity label of the original voice signal.
In one embodiment, the speech emotion conversion system includes a speech pre-training model;
the encoding module 100 specifically includes:
the sampling module is used for sampling an input original voice signal by utilizing the voice pre-training model to obtain a spectrum characteristic representation;
and the discrete module is used for digitizing the spectrum characteristic representation and obtaining a first discrete pronunciation unit sequence according to the obtained plurality of first discrete pronunciation units.
In one embodiment, the first sequence of discrete pronunciation units comprises a linguistic pronunciation unit and a first non-linguistic pronunciation unit, and the speech emotion conversion system comprises an emotion conversion model;
the emotion conversion module 200 is specifically configured to convert, according to the target emotion tag, a first non-language pronunciation unit with an original emotion in the first discrete pronunciation unit sequence into a second non-language pronunciation unit with the target emotion through conversion processing by using an emotion conversion model, and meanwhile, reserve the language pronunciation unit to obtain a second discrete pronunciation unit sequence.
In one embodiment, the conversion process includes insert, replace, and delete;
the emotion conversion module 200 specifically includes:
the emotion recognition module is used for carrying out emotion recognition and identification on the first discrete pronunciation units by utilizing the emotion conversion model to obtain emotion identification labels of each first non-language pronunciation unit,
the conversion module is used for acquiring a target non-language pronunciation unit whose emotion identification label is the target emotion label, and performing at least one of the following operations to obtain a second non-language pronunciation unit with the target emotion: deleting a first non-language pronunciation unit whose emotion identification label is the original emotion label, replacing a first non-language pronunciation unit whose emotion identification label is the original emotion label with the target non-language pronunciation unit, and inserting the target non-language pronunciation unit.
In one embodiment, the speech emotion conversion system includes a duration prediction model, and the second sequence of discrete pronunciation units includes a plurality of second discrete pronunciation units;
the prediction module 300 specifically includes:
the duration prediction module is used for predicting the duration of each second discrete pronunciation unit by using a duration prediction model based on the target emotion label, and obtaining the predicted duration of the second discrete pronunciation unit sequence according to the obtained sub-predicted duration of the second discrete pronunciation unit.
In one embodiment, the speech emotion conversion system further includes a fundamental frequency prediction model,
the prediction module 300 further includes:
the fundamental frequency prediction module is used for taking a second discrete pronunciation unit sequence carrying sub-prediction time lengths of all the second discrete pronunciation units and a target emotion label as input of a fundamental frequency prediction model, predicting the fundamental frequency of each second discrete pronunciation unit by using the fundamental frequency prediction model, and obtaining the predicted fundamental frequency of the second discrete pronunciation unit sequence according to the obtained sub-fundamental frequency of the second discrete pronunciation unit.
In one embodiment, a speech emotion conversion system includes a vocoder;
the synthesis module 400 specifically includes:
the expansion module is used for carrying out expansion processing on the second discrete pronunciation unit sequence by utilizing the predicted time length to obtain a predicted discrete pronunciation unit sequence;
the input module is used for taking the predicted discrete pronunciation unit sequence, the predicted fundamental frequency, the target emotion label and the speaker identity label of the original voice signal as the input of the vocoder;
and the vocoder module is used for splicing the predicted discrete pronunciation unit sequence, the predicted fundamental frequency, the target emotion label and the speaker identity label along the time axis by utilizing the vocoder to obtain the target emotion voice signal.
The embodiment of the application can acquire and process the related data based on the artificial intelligence technology. Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not limit the implementation of the embodiments of the present application in any way.
The meaning of "first" and "second" in the above modules/units is merely to distinguish different modules/units, and is not used to limit which module/unit has higher priority or other limiting meaning. Furthermore, the terms "comprises," "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules that are expressly listed or inherent to such process, method, article, or apparatus, but may include other steps or modules that may not be expressly listed or inherent to such process, method, article, or apparatus, and the partitioning of such modules by means of such elements is only a logical partitioning and may be implemented in a practical application.
For specific limitations on the apparatus for speech emotion conversion, reference may be made to the limitations of the method for speech emotion conversion above, which are not repeated here. The various modules in the above speech emotion conversion apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded, in hardware form, in or independent of a processor of the computer device, or stored, in software form, in a memory of the computer device, so that the processor can call and execute the operations corresponding to the above modules.
Fig. 3 is a block diagram showing an internal structure of a computer device according to an embodiment of the present application. The computer device may specifically be a speech emotion conversion system. As shown in fig. 3, the computer device includes a processor, a memory, a network interface, an input device, and a display screen connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory includes a storage medium and an internal memory. The storage medium may be a nonvolatile storage medium or a volatile storage medium. The storage medium stores an operating system and may also store computer readable instructions that, when executed by the processor, cause the processor to implement a method of speech emotion conversion. The internal memory provides an environment for the execution of an operating system and computer-readable instructions in the storage medium. The internal memory may also have stored therein computer readable instructions that, when executed by the processor, cause the processor to perform a method of speech emotion conversion. The network interface of the computer device is for communicating with an external server via a network connection. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
In one embodiment, a computer device is provided that includes a memory, a processor, and computer readable instructions (e.g., a computer program) stored on the memory and executable on the processor. When the processor executes the computer readable instructions, it performs the steps of the method of speech emotion conversion in the above embodiments, such as steps S100-S400 shown in fig. 1, and the other extensions of the method and of related steps. Alternatively, the processor, when executing the computer readable instructions, implements the functions of the modules/units of the apparatus for speech emotion conversion in the above-described embodiments, such as the functions of modules 100 through 400 shown in fig. 2. In order to avoid repetition, a detailed description is omitted here.
The processor may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like, the processor being a control center of the computer device, and the various interfaces and lines connecting the various parts of the overall computer device.
The memory may be used to store computer-readable instructions and/or modules that, by being executed or executed by the processor, implement various functions of the computer device by invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data (such as audio data, video data, etc.) created according to the use of the cellular phone, etc.
The memory may be integrated with the processor or may be separate from the processor.
It will be appreciated by those skilled in the art that the structure shown in fig. 3 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer readable storage medium is provided, on which computer readable instructions are stored, which when executed by a processor, implement the steps of the method of speech emotion conversion in the above embodiments, such as steps S100-S400 shown in fig. 1, and other extensions of the method and related steps. Alternatively, the computer readable instructions, when executed by a processor, implement the functions of the modules/units of the apparatus for speech emotion conversion in the above-described embodiments, such as the functions of modules 100 through 400 shown in fig. 2. In order to avoid repetition, a description thereof is omitted.
Those of ordinary skill in the art will appreciate that implementing all or part of the processes of the above described embodiments may be accomplished by instructing the associated hardware by way of computer readable instructions stored in a computer readable storage medium, which when executed, may comprise processes of embodiments of the above described methods. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, apparatus, article or method that comprises the element.
The foregoing embodiment numbers of the present application are merely for description and do not represent advantages or disadvantages of the embodiments. From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment methods may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware, but in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present application may be embodied essentially, or in the part contributing to the prior art, in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) as described above, including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims of the present application.

Claims (10)

1. A method for speech emotion conversion, applied to a speech emotion conversion system, the method comprising:
encoding an input original voice signal to obtain a first discrete pronunciation unit sequence with original emotion;
converting the first discrete pronunciation unit sequence into a second discrete pronunciation unit sequence with target emotion according to the target emotion label, wherein the second discrete pronunciation unit sequence and the first discrete pronunciation unit sequence have the same vocabulary content;
based on the target emotion label, performing prosodic feature prediction on the second discrete pronunciation unit sequence to obtain predicted prosodic features, wherein the predicted prosodic features comprise predicted duration and predicted fundamental frequency;
and synthesizing a target emotion voice signal according to the target emotion label, the second discrete pronunciation unit sequence, the predicted prosody characteristic and the speaker identity label of the original voice signal.
2. The method of claim 1, wherein the speech emotion conversion system comprises a speech pre-training model;
the encoding of the input original speech signal to obtain a first discrete pronunciation unit sequence with original emotion comprises the following steps:
sampling an input original voice signal by utilizing the voice pre-training model to obtain a spectrum characteristic representation;
and digitizing the spectrum characteristic representation, and obtaining a first discrete pronunciation unit sequence according to the obtained plurality of first discrete pronunciation units.
3. The method of claim 1, wherein the first sequence of discrete pronunciation units comprises a linguistic pronunciation unit and a first non-linguistic pronunciation unit, and wherein the speech emotion conversion system comprises an emotion conversion model;
the converting the first discrete pronunciation unit sequence into a second discrete pronunciation unit sequence with target emotion according to the target emotion label comprises the following steps:
according to the target emotion label, converting a first non-language pronunciation unit with original emotion in the first discrete pronunciation unit sequence into a second non-language pronunciation unit with target emotion through conversion processing by using the emotion conversion model, and simultaneously reserving the language pronunciation unit to obtain a second discrete pronunciation unit sequence.
4. A method according to claim 3, wherein the conversion process includes insertion, substitution and deletion;
the converting, by using the emotion conversion model, the first non-language pronunciation unit with the original emotion in the first discrete pronunciation unit sequence into the second non-language pronunciation unit with the target emotion through conversion processing, including:
carrying out emotion recognition and identification on the first discrete pronunciation units by utilizing the emotion conversion model to obtain emotion identification labels of each first non-language pronunciation unit,
acquiring a target non-language pronunciation unit whose emotion identification label is the target emotion label, and performing at least one of the following operations to obtain a second non-language pronunciation unit with the target emotion: deleting a first non-language pronunciation unit whose emotion identification label is the original emotion label, replacing a first non-language pronunciation unit whose emotion identification label is the original emotion label with the target non-language pronunciation unit, and inserting the target non-language pronunciation unit.
5. The method of claim 1, wherein the speech emotion conversion system includes a duration prediction model, and the second discrete pronunciation unit sequence includes a plurality of second discrete pronunciation units;
the performing prosodic feature prediction on the second discrete pronunciation unit sequence based on the target emotion label to obtain the predicted prosodic features comprises:
and predicting the duration of each second discrete pronunciation unit by using the duration prediction model based on the target emotion label, and obtaining the predicted duration of the second discrete pronunciation unit sequence from the obtained sub-predicted durations of the second discrete pronunciation units.
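For illustration only, a minimal sketch of the duration prediction step in claim 5. The duration_model is a hypothetical regressor conditioned on the unit id and the target emotion label; it is not the model disclosed by the patent:

```python
# Sketch of claim 5: one sub-predicted duration per second discrete unit.
def predict_durations(second_units, target_emotion, duration_model):
    sub_durations = []
    for unit in second_units:
        # Each sub-predicted duration is interpreted here as a frame count.
        frames = max(1, round(duration_model(unit, target_emotion)))
        sub_durations.append(frames)
    return sub_durations   # predicted duration of the whole sequence
```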
6. The method of claim 5, wherein the speech emotion conversion system further comprises a fundamental frequency prediction model;
the performing prosodic feature prediction on the second discrete pronunciation unit sequence based on the target emotion label to obtain the predicted prosodic features further comprises:
and taking the second discrete pronunciation unit sequence carrying the sub-predicted durations of all the second discrete pronunciation units, together with the target emotion label, as inputs of the fundamental frequency prediction model, predicting the fundamental frequency of each second discrete pronunciation unit by using the fundamental frequency prediction model, and obtaining the predicted fundamental frequency of the second discrete pronunciation unit sequence from the obtained sub-fundamental frequencies of the second discrete pronunciation units.
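For illustration only, a minimal sketch of the fundamental frequency prediction step in claim 6. The f0_model is a hypothetical predictor that, given a unit, its sub-predicted duration and the target emotion label, returns one F0 value per frame; it is an assumption, not the patent's model:

```python
# Sketch of claim 6: frame-level F0 per second discrete unit, conditioned
# on the sub-predicted durations and the target emotion label.
def predict_f0(second_units, sub_durations, target_emotion, f0_model):
    sub_f0 = []
    for unit, frames in zip(second_units, sub_durations):
        # One F0 value (e.g., in Hz) for every frame of this unit.
        sub_f0.extend(f0_model(unit, frames, target_emotion))
    return sub_f0          # predicted fundamental frequency of the sequence
```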
7. The method of claim 1, wherein the speech emotion conversion system comprises a vocoder;
The synthesizing the target emotion voice signal according to the target emotion label, the second discrete pronunciation unit sequence, the predicted prosody characteristic and the speaker identity label of the original voice signal comprises the following steps:
performing expansion processing on the second discrete pronunciation unit sequence by using the predicted duration to obtain a predicted discrete pronunciation unit sequence;
taking the predicted discrete pronunciation unit sequence, the predicted fundamental frequency, the target emotion label and the speaker identity label of the original voice signal as inputs of the vocoder;
and splicing the predicted discrete pronunciation unit sequence, the predicted fundamental frequency, the target emotion label and the speaker identity label along a time axis by using the vocoder to obtain a target emotion voice signal.
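For illustration only, a minimal sketch of the synthesis step in claim 7: each unit is expanded (repeated) according to its sub-predicted duration, and the expanded sequence, predicted fundamental frequency, target emotion label and speaker identity label are passed to a vocoder, which the claim describes as splicing these inputs along a time axis. The vocoder interface below is a hypothetical stand-in, not the one used in the patent:

```python
# Sketch of claim 7: length regulation followed by vocoder synthesis.
def synthesize(second_units, sub_durations, f0, target_emotion,
               speaker_id, vocoder):
    # Expansion processing: repeat every unit by its predicted duration so
    # the unit sequence and the frame-level F0 contour share one time axis.
    predicted_units = []
    for unit, frames in zip(second_units, sub_durations):
        predicted_units.extend([unit] * frames)
    # The vocoder combines the predicted unit sequence, predicted F0,
    # target emotion label and speaker identity label into a waveform.
    return vocoder(predicted_units, f0, target_emotion, speaker_id)
```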
8. An apparatus for speech emotion conversion, said apparatus comprising:
the coding module is used for coding the input original voice signal to obtain a first discrete pronunciation unit sequence with original emotion;
the emotion conversion module is used for converting the first discrete pronunciation unit sequence into a second discrete pronunciation unit sequence with target emotion according to a target emotion label, wherein the second discrete pronunciation unit sequence and the first discrete pronunciation unit sequence have the same vocabulary content;
The prediction module is used for predicting the prosody characteristics of the second discrete pronunciation unit sequence based on the target emotion label to obtain predicted prosody characteristics, wherein the predicted prosody characteristics comprise predicted duration and predicted fundamental frequency;
and the synthesis module is used for synthesizing the target emotion voice signal according to the target emotion label, the second discrete pronunciation unit sequence, the predicted prosody characteristic and the speaker identity label of the original voice signal.
9. A computer device comprising a memory, a processor and computer readable instructions stored on the memory and executable on the processor, wherein the processor, when executing the computer readable instructions, performs the steps of the method of speech emotion conversion of any of claims 1-7.
10. A computer readable storage medium having computer readable instructions stored thereon, which when executed by a processor cause the processor to perform the steps of the method of speech emotion conversion of any of claims 1-7.
CN202310152979.3A 2023-02-16 2023-02-16 Voice emotion conversion method, device, equipment and storage medium Pending CN116092478A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310152979.3A CN116092478A (en) 2023-02-16 2023-02-16 Voice emotion conversion method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310152979.3A CN116092478A (en) 2023-02-16 2023-02-16 Voice emotion conversion method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116092478A true CN116092478A (en) 2023-05-09

Family

ID=86212014

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310152979.3A Pending CN116092478A (en) 2023-02-16 2023-02-16 Voice emotion conversion method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116092478A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117894294A (en) * 2024-03-14 2024-04-16 暗物智能科技(广州)有限公司 Personification auxiliary language voice synthesis method and system

Similar Documents

Publication Publication Date Title
US11929059B2 (en) Method, device, and computer readable storage medium for text-to-speech synthesis using machine learning on basis of sequential prosody feature
EP4007997B1 (en) Controlling expressivity in end-to-end speech synthesis systems
US20230064749A1 (en) Two-Level Speech Prosody Transfer
KR20190085879A (en) Method of multilingual text-to-speech synthesis
Kaur et al. Conventional and contemporary approaches used in text to speech synthesis: A review
Malcangi Text-driven avatars based on artificial neural networks and fuzzy logic
Triantafyllopoulos et al. An overview of affective speech synthesis and conversion in the deep learning era
CN113761841B (en) Method for converting text data into acoustic features
JP2024505076A (en) Generate diverse, natural-looking text-to-speech samples
CN116092478A (en) Voice emotion conversion method, device, equipment and storage medium
CN116580691A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116312463A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN113257225A (en) Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics
Ronanki Prosody generation for text-to-speech synthesis
CN113628609A (en) Automatic audio content generation
KR20080011859A (en) Method for predicting sentence-final intonation and text-to-speech system and method based on the same
CN113192483B (en) Method, device, storage medium and equipment for converting text into voice
Gong et al. A Review of End-to-End Chinese–Mandarin Speech Synthesis Techniques
Wadhwa et al. Emotionally Intelligent Image to Audible Prompt Generation for Visually Challenged People Using AI
Coto-Jiménez et al. Hidden Markov Models for artificial voice production and accent modification
Ghafoor et al. Isolated Words Speech Recognition System for Brahvi Language using Recurrent Neural Network
Kokate et al. An Algorithmic Approach to Audio Processing and Emotion Mapping
Dev et al. CTC-Based End-to-End Speech Recognition for Low Resource Language Sanskrit
CN114283781A (en) Speech synthesis method and related device, electronic equipment and storage medium
CN116416966A (en) Text-to-speech synthesis method, apparatus, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination