CN116543780A - Model updating method and device, voice conversion method and device and storage medium

Info

Publication number
CN116543780A
Authority
CN
China
Prior art keywords
sample
audio
phoneme
vector
initial
Prior art date
Legal status
Pending
Application number
CN202310638552.4A
Other languages
Chinese (zh)
Inventor
张旭龙
王健宗
唐怀朕
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310638552.4A
Publication of CN116543780A

Classifications

    • G10L 21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L 15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/063: Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 17/02: Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/18: Artificial neural networks; connectionist approaches
    • G10L 21/013: Adapting to target pitch
    • G10L 25/30: Speech or voice analysis techniques characterised by the use of neural networks
    • G10L 2015/025: Phonemes, fenemes or fenones being the recognition units
    • G10L 2015/0635: Training: updating or merging of old and new templates; mean values; weighting
    • G10L 2021/0135: Voice conversion or morphing
    • Y02T 10/40: Engine management systems (climate change mitigation technologies related to transportation)

Abstract

The application provides a model updating method and device, a voice conversion method and device, and a storage medium, and belongs to the technical field of financial technology. The method comprises the following steps: acquiring sample voice data; inputting the sample voice data into a neural network model; encoding the sample voice data through an encoding network to obtain an initial audio feature vector; performing an index query on the initial audio feature vector based on a preset codebook to obtain an audio frame index, and extracting phoneme features from the initial audio feature vector based on the audio frame index to obtain an initial phoneme feature vector; performing voice alignment on the initial phoneme feature vector to obtain a sample audio embedding vector; decoding the sample audio embedding vector and a speaking style embedding vector through a decoding network to obtain synthesized voice data; and updating parameters of the neural network model based on the synthesized voice data and the sample voice data to obtain a voice conversion model. The method and the device can improve the accuracy of voice conversion by the model.

Description

Model updating method and device, voice conversion method and device and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method and apparatus for updating a model, a method and apparatus for converting speech, and a storage medium.
Background
With the rapid development of artificial intelligence technology, intelligent voice interaction has been widely applied in fields such as finance, logistics and customer service, and functions such as intelligent marketing, intelligent collection and content navigation have improved the service level of enterprise customer service.
Currently, conversation robots are often adopted in financial service scenes such as intelligent customer service, shopping guide and the like to provide corresponding service support for various objects. The conversational speech used by these conversational robots is often generated based on speech conversion.
Voice conversion generally refers to changing the speech of a conversation robot from one speaker's speaking style to another speaker's speaking style without changing the speech content. In existing voice conversion methods that use a neural network model, the model cannot adequately learn the actual speech content and the style characteristics of the speaker, so the accuracy of voice conversion is poor. How to improve the accuracy of voice conversion by the model has therefore become a technical problem to be solved urgently.
Disclosure of Invention
The embodiments of the present application mainly aim to provide a model updating method and device, a voice conversion method and device, an electronic device and a storage medium, so as to improve the accuracy of voice conversion by the model.
To achieve the above object, a first aspect of an embodiment of the present application proposes a model updating method, including:
acquiring sample voice data of a sample speaking object;
inputting the sample voice data into a preset neural network model, wherein the neural network model comprises an encoding network and a decoding network;
encoding the sample voice data through the encoding network to obtain an initial audio feature vector;
performing an index query on the initial audio feature vector based on a preset codebook to obtain an audio frame index, and performing phoneme feature extraction on the initial audio feature vector based on the audio frame index to obtain an initial phoneme feature vector;
performing voice alignment on the initial phoneme feature vector to obtain a sample audio embedding vector;
decoding the sample audio embedded vector and a pre-acquired speaking style embedded vector through the decoding network to obtain synthesized voice data, wherein the speaking style embedded vector includes speaking style information of the sample speaking object;
and updating parameters of the neural network model based on the synthesized voice data and the sample voice data to obtain a voice conversion model.
In some embodiments, the indexing query is performed on the initial audio feature vector based on a preset codebook to obtain an audio frame index, and the extracting of phoneme features is performed on the initial audio feature vector based on the audio frame index to obtain an initial phoneme feature vector, which includes:
dividing the initial audio feature vector to obtain a plurality of audio frame vectors;
performing an index query on the audio frame vectors based on the reference vectors of the preset codebook to obtain an audio frame index corresponding to each audio frame vector, wherein the preset codebook comprises reference vectors and audio frame indexes in one-to-one correspondence, as well as a mapping relation between each audio frame index and a phoneme feature;
extracting phoneme features corresponding to the audio frame vectors according to the audio frame indexes and the mapping relation;
and merging the phoneme features of all the audio frame vectors to obtain the initial phoneme feature vector.
In some embodiments, the performing speech alignment on the initial phoneme feature vector to obtain a sample audio embedding vector includes:
Performing duration prediction on the initial phoneme feature vector based on a preset time predictor to obtain a duration sequence of the initial phoneme feature vector;
and carrying out voice alignment on the initial phoneme characteristic vector according to the duration time sequence to obtain the sample audio embedding vector.
In some embodiments, the performing duration prediction on the initial phoneme feature vector based on a preset time predictor to obtain a duration sequence of the initial phoneme feature vector includes:
dividing the initial phoneme feature vector to obtain a plurality of phoneme feature fragments;
identifying the phoneme characteristic fragments based on a preset time predictor to obtain the phoneme category of the initial phoneme characteristic vector and the number of phonemes of each phoneme category;
and obtaining the duration sequence according to the phoneme category and the phoneme number.
In some embodiments, the performing speech alignment on the initial phoneme feature vector according to the duration sequence to obtain the sample audio embedding vector includes:
embedding the initial phoneme feature vector to obtain an audio text embedded vector;
Dividing the audio text embedded vector according to the duration time sequence to obtain intermediate vectors of each phoneme category, wherein the number of the intermediate vectors is equal to the number of the phonemes;
carrying out mean value calculation on the intermediate vectors to obtain candidate vectors of each phoneme category;
copying the candidate vectors according to the number of the phonemes to obtain target vectors of each phoneme category, wherein the number of the target vectors is equal to the number of the phonemes;
and performing splicing processing on all the target vectors to obtain the sample audio embedded vector.
In some embodiments, before the decoding processing is performed on the sample audio embedded vector and the pre-acquired speaking style embedded vector through the decoding network to obtain the synthesized speech data, the model updating method further includes acquiring the speaking style embedded vector, and specifically includes:
inputting the sample voice data into a preset voiceprint recognition model, wherein the voiceprint recognition model comprises a segmentation layer and a hiding layer;
dividing the sample voice data based on the dividing layer to obtain a plurality of sample voice fragments;
Carrying out style recognition on each sample voice fragment based on the hidden layer to obtain a plurality of initial style embedded vectors;
and carrying out mean value calculation on the plurality of initial style embedded vectors to obtain the speaking style embedded vector.
To achieve the above object, a second aspect of the embodiments of the present application proposes a voice conversion method, including:
acquiring target speaking style information of a target speaking object and original voice data to be processed;
and inputting the original voice data and the target speaking style information into a voice conversion model to perform voice conversion to obtain target voice data, wherein the voice conversion model is obtained by the model updating method according to the first aspect.
To achieve the above object, a third aspect of the embodiments of the present application proposes a model updating apparatus, the apparatus including:
the data acquisition module is used for acquiring sample voice data of a sample speaking object;
the input module is used for inputting the sample voice data into a preset neural network model, wherein the neural network model comprises an encoding network and a decoding network;
the coding module is used for coding the sample voice data through the coding network to obtain an initial audio feature vector;
The query module is used for carrying out index query on the initial audio feature vector based on a preset codebook to obtain an audio frame index, and carrying out phoneme feature extraction on the initial audio feature vector based on the audio frame index to obtain an initial phoneme feature vector;
the voice alignment module is used for carrying out voice alignment on the initial phoneme characteristic vector to obtain a sample audio embedding vector;
the decoding module is used for decoding the sample audio embedded vector and the pre-acquired speaking style embedded vector through the decoding network to obtain synthesized voice data; wherein the speaking style embedding vector includes speaking style information of the sample speaking object;
and the model updating module is used for updating parameters of the neural network model based on the synthesized voice data and the sample voice data to obtain a voice conversion model.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor executes the computer program to implement the method described in the first aspect or the method described in the second aspect.
To achieve the above object, a fifth aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method described in the first aspect or the method described in the second aspect.
According to the model updating method, the voice conversion method, the model updating device, the electronic device and the storage medium provided by the present application, sample voice data of a sample speaking object is acquired and input into a preset neural network model that comprises an encoding network and a decoding network. The sample voice data is encoded through the encoding network to obtain an initial audio feature vector. An index query is performed on the initial audio feature vector based on a preset codebook to obtain an audio frame index, and phoneme features are extracted from the initial audio feature vector based on the audio frame index to obtain an initial phoneme feature vector. Voice alignment is performed on the initial phoneme feature vector to obtain a sample audio embedded vector, and the sample audio embedded vector and a pre-acquired speaking style embedded vector are decoded through the decoding network to obtain synthesized voice data, so that the synthesized voice data can contain voice content and speaking style characteristics that are close to those of the sample voice data. Finally, the parameters of the neural network model are updated based on the synthesized voice data and the sample voice data, so that the model focuses on learning the similarity between the synthesized voice data and the sample voice data in voice content and voice style. This effectively improves the updating effect of the model and the accuracy of voice conversion by the resulting voice conversion model. Furthermore, in intelligent conversations about products such as financial products, the synthesized voice expressed by the conversation robot can better fit the conversation style preference of the conversation object, and the conversation can be conducted in a manner and style that the conversation object finds more engaging. This improves conversation quality and effectiveness, enables intelligent voice conversation services, and improves service quality and customer satisfaction, thereby increasing business yield.
Drawings
FIG. 1 is a flow chart of a model update method provided by an embodiment of the present application;
FIG. 2 is a flowchart of step S104 in FIG. 1;
FIG. 3 is a flowchart of step S105 in FIG. 1;
FIG. 4 is a flowchart of step S301 in FIG. 3;
FIG. 5 is a flowchart of step S302 in FIG. 3;
FIG. 6 is another flow chart of a model update method provided by an embodiment of the present application;
FIG. 7 is a flowchart of a voice conversion method according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a model updating device according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
First, several nouns referred to in this application are parsed:
Artificial intelligence (AI): a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, language recognition, image recognition, natural language processing and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is also a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
Natural language processing (NLP): a branch of artificial intelligence at the intersection of computer science and linguistics, often referred to as computational linguistics, which processes, understands and applies human languages (e.g., Chinese, English, etc.). Natural language processing includes syntactic analysis, semantic analysis, discourse understanding, and the like. It is commonly used in technical fields such as machine translation, recognition of handwritten and printed characters, speech recognition and text-to-speech conversion, information intent recognition, information extraction and filtering, text classification and clustering, and public opinion analysis and opinion mining, and it draws on data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research and linguistic research related to language computation.
Phoneme (Phone): the smallest speech unit divided according to the natural attributes of speech. Phonemes are analyzed according to the articulatory actions within a syllable, and one articulatory action constitutes one phoneme.
Fourier transform: a representation that expresses a function satisfying certain conditions as a trigonometric function (sine and/or cosine function) or as a linear combination of their integrals. In different fields of research, the Fourier transform has many variants, such as the continuous Fourier transform and the discrete Fourier transform.
Mel-frequency cepstral coefficients (MFCC): a set of key coefficients used to build the mel-frequency cepstrum. From a segment of a music signal, a set of cepstra sufficient to represent the signal is obtained, and the mel-frequency cepstral coefficients are the coefficients of the cepstrum derived on the mel scale. Unlike the general cepstrum, the most notable characteristic of the mel-frequency cepstrum is that its frequency bands are evenly distributed on the mel scale, which is closer to the human nonlinear auditory system than the linearly spaced bands of the ordinary cepstrum representation. For example, in audio compression techniques, the mel-frequency cepstrum is often used for processing.
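As a concrete illustration (not taken from the patent), a set of mel-frequency cepstral coefficients for a speech segment can be computed with the librosa library roughly as follows; the file path, sampling rate and number of coefficients are assumptions chosen for the example:

```python
# Hedged sketch: computing MFCC features for an assumed audio file "sample.wav".
import librosa

# Load the waveform and resample it to 16 kHz, a common rate for speech models.
waveform, sample_rate = librosa.load("sample.wav", sr=16000)

# Compute 13 mel-frequency cepstral coefficients per frame.
mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)
```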
Encoding (Encoder): converts an input sequence into a vector of fixed length.
Decoding (Decoder): converts the previously generated fixed vector into an output sequence. The input sequence can be words, speech, images or video; the output sequence can be text or images.
With the rapid development of artificial intelligence technology, intelligent voice interaction has been widely applied in fields such as finance, logistics and customer service, and functions such as intelligent marketing, intelligent collection and content navigation have improved the service level of enterprise customer service.
Currently, conversation robots are often adopted in financial service scenes such as intelligent customer service, shopping guide and the like to provide corresponding service support for various objects. The conversational speech used by these conversational robots is often generated based on speech conversion.
Taking an insurance service robot as an example, it is often necessary to fuse the description text of an insurance product with the speaking style of a fixed object to generate description voice of the insurance product in that object's voice. When the insurance service robot talks with interested objects, the description voice is automatically invoked to introduce the insurance product to those objects. When the insurance service robot needs to talk with newly added potential objects, the existing speaking style of the description voice can be replaced: without changing the voice content, the speaking style in the description voice is replaced with the speaking style of object A or object B, so that the converted description voice matches the conversation preference of the newly added potential objects.
Voice conversion generally refers to changing the speech of a conversation robot from one speaker's speaking style to another speaker's speaking style without changing the speech content. In existing voice conversion methods that use a neural network model, the model cannot adequately learn the actual speech content and the style characteristics of the speaker, so the accuracy of voice conversion is poor. How to improve the accuracy of voice conversion by the model has therefore become a technical problem to be solved urgently.
Based on this, the embodiment of the application provides a model updating method, a voice conversion method, a model updating device, electronic equipment and a storage medium, which aim to improve the accuracy of model to voice conversion.
The model updating method, the voice conversion method, the model updating device, the electronic device and the storage medium provided in the embodiments of the present application are specifically described through the following embodiments, and the model updating method in the embodiments of the present application is described first.
The embodiments of the present application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The embodiments of the present application provide a model updating method and a voice conversion method, and relate to the technical field of artificial intelligence. The model updating method and the voice conversion method provided by the embodiments of the present application can be applied to a terminal, can be applied to a server side, and can also be software running on the terminal or the server side. In some embodiments, the terminal may be a smart phone, a tablet computer, a notebook computer, a desktop computer, etc.; the server side may be configured as an independent physical server, a server cluster or distributed system composed of a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms; the software may be an application that implements the model updating method and the voice conversion method, but is not limited to the above forms.
It should be noted that, in each specific embodiment of the present application, when related processing is required according to data related to user identity or characteristics, such as user information, user behavior data, user voice data, user history data, and user location information, the permission or consent of the user is obtained first, and the collection, use, processing, and the like of these data all comply with related laws and regulations and standards. In addition, when the embodiment of the application needs to acquire the sensitive personal information of the user, the independent permission or independent consent of the user is acquired through a popup window or a jump to a confirmation page or the like, and after the independent permission or independent consent of the user is explicitly acquired, necessary user related data for enabling the embodiment of the application to normally operate is acquired.
The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Fig. 1 is an optional flowchart of a method for updating a model according to an embodiment of the present application, where the method in fig. 1 may include, but is not limited to, steps S101 to S107.
Step S101, sample speech data of a sample speaking object is obtained;
step S102, inputting sample voice data into a preset neural network model, wherein the neural network model comprises an encoding network and a decoding network;
step S103, coding the sample voice data through a coding network to obtain an initial audio feature vector;
step S104, carrying out index query on the initial audio feature vector based on a preset codebook to obtain an audio frame index, and carrying out phoneme feature extraction on the initial audio feature vector based on the audio frame index to obtain an initial phoneme feature vector;
step S105, performing voice alignment on the initial phoneme feature vector to obtain a sample audio embedding vector;
step S106, decoding the sample audio embedded vector and the pre-acquired speaking style embedded vector through a decoding network to obtain synthesized voice data; wherein the speaking style embedding vector includes speaking style information of the sample speaking object;
and step S107, updating parameters of the neural network model based on the synthesized voice data and the sample voice data to obtain a voice conversion model.
Steps S101 to S107 illustrated in the embodiment of the present application are performed by acquiring sample speech data of a sample speaking object; inputting the sample voice data into a preset neural network model, wherein the neural network model comprises an encoding network and a decoding network; the method comprises the steps of performing coding processing on sample voice data through a coding network to obtain an initial audio feature vector; index inquiry is carried out on the initial audio feature vector based on a preset codebook to obtain an audio frame index, and phoneme feature extraction is carried out on the initial audio feature vector based on the audio frame index to obtain an initial phoneme feature vector; performing voice alignment on the initial phoneme feature vector to obtain a sample audio embedded vector, and decoding the sample audio embedded vector and a pre-acquired speaking style embedded vector through a decoding network to obtain synthesized voice data, wherein the synthesized voice data can contain voice content and speaking style characteristics which are relatively close to those of the sample audio data; finally, parameter updating is carried out on the neural network model based on the synthesized voice data and the sample voice data, so that the model is more focused on learning the similarity of the synthesized voice data and the sample voice data in voice content and voice style, the updating effect of the model can be effectively improved, and the accuracy of voice conversion by the voice conversion model is improved.
In step S101 of some embodiments, a web crawler may be written and a data source set so that data can be crawled in a targeted manner to obtain sample voice data of a sample speaking object. The data source may be various types of network platforms, social media, or specific audio databases; the sample voice data may be musical material, lecture reports, chat dialogues and the like of the sample speaking object. The sample voice data includes sample audio content and sample acoustic features, and the sample acoustic features include timbre information, pitch information and the like of the sample speaking object.
For example, in the field of financial transactions, the sample voice data is audio data containing dialogues commonly used in the financial field; in an insurance promotion scenario, the sample voice data is audio data containing descriptions of the risk, cost, applicable population, etc. of a certain insurance product.
In a specific example, the sample audio content of the sample voice data is "consulting credit card problem", "credit card with high preference amount", "transacting deposit business", or the like, and the sample acoustic feature is "speech speed is normal".
In step S102 of some embodiments, the sample voice data is input into a preset neural network model that includes an encoding network and a decoding network. The encoding network is mainly used to perform voice reconstruction and voice alignment on the input voice data: it extracts the phoneme features in the input voice data and adjusts their length according to the phoneme features and the acquired phoneme durations so that the phoneme features can be aligned with the voice data. The decoding network is mainly used to jointly decouple the aligned phoneme features and preset reference speaking style features, converting the voice style of the input voice data into the reference speaking style, so that the original speaking object is converted into the reference speaking object without changing the voice content of the input voice data; that is, the phoneme features of the input voice data and the reference speaking style features of the reference speaking object are fused to form new voice data. The trained neural network model can thereby achieve a better voice conversion effect.
In step S103 of some embodiments, the encoding network is used to encode the sample speech data, extract the audio content features in the sample speech data, and obtain an initial audio feature vector, where the initial audio feature vector is a continuous vector, so that the audio content information in the sample speech data can be extracted more conveniently, interference caused by updating the model by other redundant information is eliminated, and updating accuracy of the model can be improved.
Referring to fig. 2, in some embodiments, step S104 may include, but is not limited to, steps S201 to S204:
step S201, dividing the initial audio feature vector to obtain a plurality of audio frame vectors;
step S202, carrying out index query on the audio frame vectors based on reference vectors of a preset codebook to obtain audio frame indexes corresponding to each audio frame vector, wherein the preset codebook comprises one-to-one corresponding reference vectors and audio frame indexes, and mapping relations between each audio frame index and phoneme features;
step S203, extracting phoneme features corresponding to the audio frame vectors according to the audio frame indexes and the mapping relation;
step S204, merging the phoneme features of all the audio frame vectors to obtain an initial phoneme feature vector.
In step S201 of some embodiments, the initial audio feature vector may be subjected to a segmentation process by using a vector quantization technique, and the continuous initial audio feature vector is converted into a plurality of discrete vectors, so as to obtain a plurality of audio frame vectors, where the audio frame vectors include the audio content features of the sample speech data at each time frame.
In step S202 of some embodiments, the preset codebook is an initialized vector set; it includes a plurality of reference vectors and audio frame indexes, the reference vectors and the audio frame indexes are in one-to-one correspondence, and there is a one-to-one mapping relationship between each audio frame index and a phoneme feature. For example, the preset codebook may include 128 reference vectors, each corresponding to one audio frame index, so the index range of the audio frame indexes is 0 to 127, and each audio frame index corresponds to one phoneme feature. Therefore, an index query can be performed on the audio frame vectors based on the reference vectors of the preset codebook: the cosine similarity between each audio frame vector and each reference vector in the preset codebook is calculated, and the audio frame index corresponding to the most similar reference vector (i.e., the one with the smallest cosine distance) is used as the audio frame index of that audio frame vector.
In step S203 of some embodiments, since there is a one-to-one mapping relationship between each audio frame index and the phoneme feature, after determining the audio frame index corresponding to each audio frame vector, extracting the phoneme feature corresponding to the audio frame index according to the mapping relationship, and taking the extracted phoneme feature as the phoneme feature of the audio frame vector.
In step S204 of some embodiments, according to the sequence of the audio frame vectors in the time dimension, the phoneme features obtained by the query are spliced in sequence to obtain an initial phoneme feature vector.
For example, after the sample voice data M is encoded through the encoding network, an initial audio feature vector Z is obtained. The initial audio feature vector is divided into a plurality of audio frame vectors P using a vector quantization technique, and the reference vector corresponding to each audio frame vector P is queried frame by frame in the preset codebook; that is, the cosine similarity between each audio frame vector and each reference vector in the preset codebook is compared, and the audio frame index of the most similar reference vector (the one with the smallest cosine distance) is used as the audio frame index of that audio frame vector. The phoneme feature corresponding to that audio frame index is then queried and used as the phoneme feature of the audio frame vector P. Finally, according to the order of the audio frame vectors in the time dimension, the queried phoneme features are spliced in sequence to obtain the initial phoneme feature vector Q.
Through the steps S201 to S204, the phoneme features corresponding to the sample speech data can be conveniently queried according to the preset codebook, and the speech text content information represented by the sample speech data is obtained, so that the speech conversion can be performed based on the obtained speech text content information (i.e. the initial phoneme feature vector), which is helpful for training the learning ability of the model on the text content information of the sample speech data and improving the speech conversion performance of the model.
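To make the codebook lookup of steps S201 to S204 more concrete, the following sketch shows one possible implementation in PyTorch. It is only an illustration under assumed shapes: the codebook size, the feature dimensions and the phoneme feature table are invented for the example and are not prescribed by the patent.

```python
# Hedged sketch of the codebook lookup in steps S201-S204 (shapes are assumptions).
import torch
import torch.nn.functional as F

def lookup_phoneme_features(frame_vectors, codebook, phoneme_table):
    """frame_vectors: (T, D) audio frame vectors; codebook: (K, D) reference vectors;
    phoneme_table: (K, P) phoneme feature associated with each audio frame index."""
    # Cosine similarity between every frame vector and every reference vector.
    similarity = F.cosine_similarity(
        frame_vectors.unsqueeze(1), codebook.unsqueeze(0), dim=-1
    )  # (T, K)
    # Pick the most similar reference vector (smallest cosine distance) per frame.
    frame_indices = similarity.argmax(dim=-1)  # (T,) audio frame indexes
    # Map each audio frame index to its phoneme feature; keeping the frame order
    # corresponds to splicing the per-frame phoneme features in time.
    return phoneme_table[frame_indices]        # (T, P) initial phoneme feature vector

frames = torch.randn(7, 256)           # 7 frames with 256-dim features
codebook = torch.randn(128, 256)       # 128 reference vectors
phoneme_table = torch.randn(128, 64)   # one phoneme feature per index
print(lookup_phoneme_features(frames, codebook, phoneme_table).shape)  # torch.Size([7, 64])
```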
Referring to fig. 3, in some embodiments, step S105 may include, but is not limited to, steps S301 to S302:
step S301, carrying out duration prediction on the initial phoneme feature vector based on a preset time predictor to obtain a duration sequence of the initial phoneme feature vector;
step S302, performing voice alignment on the initial phoneme feature vector according to the duration sequence to obtain a sample audio embedding vector.
In step S301 of some embodiments, the preset time predictor may include a length adjuster, three multi-head attention layers and a fully connected layer. The length adjuster is mainly used to simulate and adjust the feature length of the initial phoneme feature vector so that the adjusted initial phoneme feature vector can be aligned with the speech frames. The multi-head attention layers and the fully connected layer are mainly used to extract temporal feature information from the initial phoneme feature vector and, according to the importance of the different temporal feature information, to comprehensively analyze and predict the phoneme categories of the initial phoneme feature vector and the number of phonemes corresponding to each phoneme category. Finally, the duration sequence of the initial phoneme feature vector is constructed from the phoneme categories and the number of phonemes of each category.
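A minimal sketch of such a time predictor is given below. The wiring of the three multi-head attention layers, the residual connections, the layer sizes and the number of phoneme classes are assumptions made for illustration and are not the patent's exact design:

```python
# Illustrative duration predictor: three multi-head attention layers plus a
# fully connected layer that scores each frame against assumed phoneme classes.
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    def __init__(self, feature_dim=64, num_heads=4, num_phoneme_classes=60):
        super().__init__()
        self.attention_layers = nn.ModuleList(
            [nn.MultiheadAttention(feature_dim, num_heads, batch_first=True) for _ in range(3)]
        )
        self.classifier = nn.Linear(feature_dim, num_phoneme_classes)

    def forward(self, phoneme_features):            # (batch, frames, feature_dim)
        x = phoneme_features
        for attention in self.attention_layers:
            # Self-attention extracts temporal feature information across frames.
            attended, _ = attention(x, x, x)
            x = x + attended                         # residual connection (assumption)
        return self.classifier(x)                    # (batch, frames, num_phoneme_classes)

logits = DurationPredictor()(torch.randn(1, 7, 64))
frame_phoneme_ids = logits.argmax(dim=-1)            # per-frame phoneme category prediction
```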
In step S302 of some embodiments, before the initial phoneme feature vector is aligned, an embedding process is performed on it: the initial phoneme feature vector is mapped to a fixed vector space to obtain an audio text embedded vector containing the speech content information. The audio text embedded vector is divided into a plurality of embedded vector segments, each corresponding to one frame of phoneme features of the sample voice data in the time dimension, so that the number of embedded vector segments equals the number of frames of the sample voice data; the embedded vector segments are then classified by phoneme category to obtain an intermediate vector for each phoneme category. Further, the vector mean of the intermediate vectors of each phoneme category is calculated, and the mean vector is copied according to the number of phonemes of that category to obtain the target vectors of the category. Finally, the target vectors of all phoneme categories are spliced, so that voice alignment of the initial phoneme feature vector is achieved and the sample audio embedded vector is obtained.
Through the steps S301 to S302, the number of elements and the element value of the duration sequence can be determined according to the phoneme information of the sample speech data, and the alignment processing is performed on the speech content information of the initial phoneme feature vector and the audio length of the sample speech data according to the element condition of the duration sequence, so as to obtain the sample audio embedded vector which can represent the text content feature of the sample speech data and has the audio length consistent with the sample speech data, thereby being beneficial to adjusting the speech length of the generated synthesized speech data and improving the accuracy of speech conversion.
Referring to fig. 4, in some embodiments, step S301 may include, but is not limited to, steps S401 to S403:
step S401, carrying out segmentation processing on the initial phoneme feature vector to obtain a plurality of phoneme feature fragments;
step S402, identifying the phoneme characteristic fragments based on a preset time predictor to obtain the phoneme category of the initial phoneme characteristic vector and the number of phonemes of each phoneme category;
step S403, obtaining a duration sequence according to the phoneme category and the number of phonemes.
In step S401 of some embodiments, the initial phoneme feature vector is first split frame by frame using the duration predictor to obtain a plurality of phoneme feature segments, each corresponding to one mel-frequency cepstrum frame.
In step S402 of some embodiments, the feature length of the initial phoneme feature vector is simulated and adjusted by the length adjuster in the duration predictor; the temporal feature information in the initial phoneme feature vector is extracted by the multi-head attention layers and the fully connected layer; and the phoneme categories of the initial phoneme feature vector and the number of phonemes corresponding to each phoneme category are predicted according to the importance of the different temporal feature information. The phoneme categories are the categories of phonemes contained in the initial phoneme feature vector, and the number of occurrences of each phoneme is the number of phonemes of that category.
In step S403 of some embodiments, when constructing the duration sequence, the phoneme category may be taken as the element category number of the duration sequence, and the phoneme number may be taken as the element value of each element. For example, if the number of frames of a certain sample speech data is 7, the initial phoneme feature vector contains 7 phoneme features, wherein the initial phoneme feature vector contains three phoneme categories, namely, a phoneme category a, a phoneme category B and a phoneme category C, and the number of phonemes of the phoneme category a is 2, the number of phonemes of the phoneme category B is 1, and the number of phonemes of the phoneme category C is 4, and the duration sequence of the sample speech data is [2,1,4].
Through the above steps S401 to S403, the number of elements and the element values of the duration sequence can be determined according to the phoneme information of the initial phoneme feature vector, so that the durations of the phonemes of the initial phoneme feature vector are converted into a sequence representation. In the subsequent voice conversion process, the voice length can then be adjusted based on the duration sequence, which improves the voice consistency of the converted synthesized voice data and the voice conversion effect.
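As a small illustration (an assumed helper, not part of the patent), per-frame phoneme category predictions can be turned into a duration sequence by run-length counting, reproducing the [2, 1, 4] example above:

```python
# Build a duration sequence from per-frame phoneme category labels.
from itertools import groupby

def duration_sequence(frame_phoneme_labels):
    """Run-length encode consecutive identical phoneme categories."""
    categories, durations = [], []
    for category, run in groupby(frame_phoneme_labels):
        categories.append(category)
        durations.append(sum(1 for _ in run))
    return categories, durations

categories, durations = duration_sequence(["A", "A", "B", "C", "C", "C", "C"])
print(categories)  # ['A', 'B', 'C']
print(durations)   # [2, 1, 4]
```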
Referring to fig. 5, in some embodiments, step S302 may include, but is not limited to, steps S501 to S504:
Step S501, embedding an initial phoneme feature vector to obtain an audio text embedded vector;
step S502, performing segmentation processing on the audio text embedded vectors according to the duration time sequence to obtain intermediate vectors of each phoneme class, wherein the number of the intermediate vectors is equal to the number of the phonemes;
step S503, carrying out mean value calculation on the intermediate vectors to obtain candidate vectors of each phoneme category;
step S504, copying the candidate vectors according to the number of the phonemes to obtain target vectors of each phoneme category, wherein the number of the target vectors is equal to the number of the phonemes;
and step S505, performing splicing processing on all the target vectors to obtain a sample audio embedded vector.
In step S501 of some embodiments, an embedding process is performed on the initial phoneme feature vector through an encoding network, and the initial phoneme feature vector is mapped to a fixed vector space to obtain an audio text embedding vector, where the audio text embedding vector contains speech text content information of sample speech data.
In step S502 of some embodiments, the audio text embedding vector is subjected to a segmentation process according to the sum of element values of the duration sequence, resulting in an intermediate vector for each phoneme class. Specifically, if a certain duration sequence is [2,1,4], the sum of element values is 2+1+4=7, the audio text embedded vector is divided into 7 intermediate vectors, each corresponding to one frame segment of the sample speech data. Since one frame segment corresponds to one phoneme feature, one intermediate vector corresponds to one phoneme feature, and the number of intermediate vectors is equal to the number of phonemes. For example, the number of phonemes of the phoneme class a is 2, and there are two intermediate vectors of the phoneme class a.
In step S503 of some embodiments, all intermediate vectors belonging to the same phoneme category are averaged to obtain a candidate vector corresponding to each phoneme category. Specifically, vector summation is performed on all intermediate vectors of a certain phoneme class, the number of phonemes of the phoneme class is determined at the same time, quotient is performed on the vector summation result and the number of phonemes to obtain an average vector of the phoneme class, and the average vector is used as a candidate vector of the phoneme class.
In step S504 of some embodiments, the candidate vector is copied according to the number of phonemes of each phoneme category to obtain the target vectors of each phoneme category. For example, if a phoneme category contains n phonemes, the number of phonemes of that category is n, and the candidate vector of the category is copied n times to obtain the target vectors of the category. It follows that the number of target vectors of a phoneme category is equal to its number of phonemes.
In step S505 of some embodiments, all target vectors belonging to the same sample speech data are spliced, so that the speech alignment of the initial phoneme feature vector can be achieved, and a sample audio embedding vector is obtained.
Through the steps S501 to S505, the voice content information of the sample voice data and the audio length can be aligned according to the element types and the element values of the duration time sequence, so that the model can better learn the voice content information of the sample voice data, the feature constraint of model training is improved, the learning and generalization capability of the model is improved, and the model can better realize voice conversion.
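A minimal sketch of the alignment in steps S501 to S505 is shown below, assuming the audio text embedded vector and the duration sequence are already available; the tensor shapes are purely illustrative:

```python
# Hedged sketch of steps S501-S505: split by durations, average each phoneme
# category's frames, copy the average back to the category's phoneme count, concatenate.
import torch

def align_embedding(audio_text_embedding, durations):
    """audio_text_embedding: (T, D), with T equal to the sum of the durations."""
    segments = torch.split(audio_text_embedding, durations, dim=0)  # one chunk per category
    aligned = []
    for segment in segments:
        candidate = segment.mean(dim=0, keepdim=True)          # candidate vector of the category
        aligned.append(candidate.repeat(segment.size(0), 1))   # copy it "phoneme count" times
    return torch.cat(aligned, dim=0)                           # (T, D) sample audio embedded vector

sample_audio_embedding = align_embedding(torch.randn(7, 64), [2, 1, 4])
print(sample_audio_embedding.shape)  # torch.Size([7, 64])
```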
Referring to fig. 6, before step S106 of some embodiments, the method for updating a model further includes, but is not limited to, steps S601 to S604:
step S601, inputting sample voice data into a preset voiceprint recognition model, wherein the voiceprint recognition model comprises a segmentation layer and a hidden layer;
step S602, dividing the sample voice data based on the dividing layer to obtain a plurality of sample voice fragments;
step S603, carrying out style recognition on each sample voice fragment based on the hidden layer to obtain a plurality of initial style embedded vectors;
in step S604, mean value calculation is performed on the plurality of initial style embedding vectors to obtain a speaking style embedding vector.
In step S601 of some embodiments, sample voice data may be directly input into a preset voiceprint recognition model, where the voiceprint recognition model may be constructed based on a deep convolutional neural network, and the voiceprint recognition model includes a segmentation layer and a hidden layer.
In step S602 of some embodiments, the sample voice data is split frame by frame into audio segments based on the segmentation layer, so as to obtain a plurality of sample voice segments.
In step S603 of some embodiments, style feature extraction is performed by the hidden layer on each sample voice segment in turn, following the order of the sample voice data in the time dimension, so as to obtain an initial style embedded vector corresponding to each sample voice segment. The initial style embedded vector includes the speaking style characteristics of the sample voice segment, for example feature information of the sample voice segment in terms of pitch, frequency, tone and other aspects of the speech.
In step S604 of some embodiments, the mean style embedded vector of the sample voice data is obtained by averaging the plurality of initial style embedded vectors, and the mean style embedded vector is used as the speaking style embedded vector of the sample speaking object. The speaking style embedded vector includes the speaking style characteristics of the sample speaking object; for example, it may characterize whether the sample speaking object speaks rapidly or slowly, in a high-pitched or deep voice, and so on.
Through the steps S601 to S604, the speaking style information of the sample voice data can be conveniently extracted, the extracted speaking style information is used for training and updating the model, the learning ability of the model on the characteristics of the voice style is improved, and the effect of the model on voice conversion is improved.
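For illustration only, the following sketch mimics steps S601 to S604 with a toy voiceprint model; the segment length, the GRU-based hidden layer and all dimensions are assumptions standing in for the real voiceprint recognition model:

```python
# Toy stand-in for the voiceprint recognition model: segment the speech frames,
# embed each segment, and average the per-segment embeddings.
import torch
import torch.nn as nn

class ToyVoiceprintModel(nn.Module):
    def __init__(self, frame_dim=80, embed_dim=128, segment_frames=50):
        super().__init__()
        self.segment_frames = segment_frames
        self.hidden = nn.GRU(frame_dim, embed_dim, batch_first=True)

    def forward(self, speech_frames):                    # (frames, frame_dim)
        segments = torch.split(speech_frames, self.segment_frames, dim=0)
        initial_style_embeddings = []
        for segment in segments:
            _, h = self.hidden(segment.unsqueeze(0))     # style recognition per segment
            initial_style_embeddings.append(h[-1, 0])    # (embed_dim,)
        # The mean of the per-segment embeddings is the speaking style embedded vector.
        return torch.stack(initial_style_embeddings).mean(dim=0)

style_embedding = ToyVoiceprintModel()(torch.randn(230, 80))
print(style_embedding.shape)  # torch.Size([128])
```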
In step S106 of some embodiments, the sample audio embedded vector and the speaking style embedded vector are spliced to obtain a synthesized audio vector. The spliced vector is decoded by the decoding network into synthesized mel-frequency cepstrum data, which is then converted into waveform form by a vocoder or similar component in the decoding network to obtain the synthesized voice data. Because the speaking style embedded vector carries the speaking style information of the sample speaking object, the synthesized voice data contains voice content close to the sample voice data together with the speaking style of the sample speaking object, so the resulting synthesized voice data has better audio quality.
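A possible shape of this decoding step is sketched below; the GRU-based decoder, the mel dimension, and the vocoder interface are assumptions introduced for illustration rather than details taken from the embodiments:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Illustrative decoding network: splice the sample audio embedding with the speaking
    style embedding, predict mel frames, then hand them to a vocoder (all sizes assumed)."""

    def __init__(self, content_dim: int = 256, style_dim: int = 256, mel_dim: int = 80):
        super().__init__()
        self.net = nn.GRU(content_dim + style_dim, 256, batch_first=True)
        self.to_mel = nn.Linear(256, mel_dim)

    def forward(self, content: torch.Tensor, style: torch.Tensor, vocoder) -> torch.Tensor:
        # content: (T, content_dim); style: (style_dim,) broadcast over the time axis
        style_seq = style.expand(content.size(0), -1)
        fused = torch.cat([content, style_seq], dim=-1).unsqueeze(0)  # vector splicing
        hidden, _ = self.net(fused)
        mel = self.to_mel(hidden).squeeze(0)                          # synthesized mel features
        return vocoder(mel)                                           # waveform-domain synthesized speech
```

The style embedding is broadcast across all time steps before splicing, so every frame of the synthesized mel features carries the speaking style information.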
In step S107 of some embodiments, the process of performing the loss calculation on the synthesized voice data and the sample voice data by the preset loss function may be expressed as shown in formula (1):
L_recon = ||x − x′||_1    Formula (1)
wherein L_recon is the model loss value, x is the sample voice data, and x′ is the synthesized voice data. The magnitude of the model loss value L_recon clearly reflects how similar the sample voice data and the synthesized voice data are, and at the same time reflects how far the training of the model has progressed.
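Formula (1) is simply the L1 distance between the sample and synthesized speech. A minimal sketch follows; whether the distance is summed or averaged over frames is not specified in the embodiments, so the summed form is assumed:

```python
import torch

def recon_loss(x: torch.Tensor, x_synth: torch.Tensor) -> torch.Tensor:
    """Model loss value of Formula (1): L_recon = ||x - x'||_1."""
    return torch.sum(torch.abs(x - x_synth))
```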
Further, since the sample voice data comes from the sample speaking object and the speaking style embedded vector used for training also comes from the sample speaking object, the synthesized voice data produced by the neural network model should be as close as possible to the sample voice data; that is, the model loss value should be made as small as possible. The parameters of the neural network model are therefore updated based on the model loss value, so that the synthesized voice data it produces moves closer to the sample voice data. When, after multiple rounds of parameter updating, the model loss value is smaller than or equal to a preset loss threshold, the synthesized voice data is sufficiently similar to the sample voice data and the voice conversion performance of the neural network model meets the current requirement, so updating stops and the voice conversion model is obtained.
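The parameter-updating loop described above might look like the following sketch; the optimizer, the loss threshold value, the batch interface, and the model call signature are all assumptions:

```python
import torch

def update_until_converged(model, sample_batches, style_vec,
                           loss_threshold: float = 0.05, lr: float = 1e-4) -> None:
    """Update the neural network model until the model loss value drops to the preset threshold."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for x in sample_batches:                        # sample voice data of the sample speaking object
        x_synth = model(x, style_vec)               # synthesized voice data
        loss = torch.sum(torch.abs(x - x_synth))    # Formula (1)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() <= loss_threshold:           # similarity is good enough; stop updating
            break
```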
The model updating method of this embodiment acquires sample voice data of a sample speaking object and inputs it into a preset neural network model comprising an encoding network and a decoding network. The sample voice data is encoded by the encoding network to obtain an initial audio feature vector; an index query is performed on the initial audio feature vector based on a preset codebook to obtain an audio frame index, and phoneme features are extracted from the initial audio feature vector based on the audio frame index to obtain an initial phoneme feature vector. Speech alignment of the initial phoneme feature vector yields a sample audio embedded vector that represents the text content features of the sample voice data and whose audio length matches that of the sample voice data, which helps control the speech length of the generated synthesized voice data. The sample audio embedded vector and the pre-acquired speaking style embedded vector are decoded by the decoding network to obtain synthesized voice data, so that the synthesized voice data contains voice content and speaking style characteristics close to those of the sample voice data. Finally, the parameters of the neural network model are updated based on the synthesized voice data and the sample voice data, so that the model focuses on learning the similarity between the synthesized and sample voice data in both content and style; this effectively improves the updating effect of the model and the accuracy of voice conversion performed by the resulting voice conversion model.
Referring to fig. 7, the embodiment of the present application further provides a voice conversion method, which may include, but is not limited to, steps S701 to S702:
step S701, obtaining target speaking style information of a target speaking object and original voice data to be processed;
step S702, inputting the original speech data and the target speaking style information into a speech conversion model for speech conversion to obtain target speech data, wherein the speech conversion model is obtained according to the model updating method of the first aspect.
In step S701 of some embodiments, the original audio data to be processed and the target speaking style information of the target speaking object may be obtained by writing a web crawler, setting a data source, and performing targeted crawling of the data. The data sources may be various network platforms, social media, or specific audio databases, and the original audio data may be, for example, a speaking object's musical material, a lecture report, or a chat conversation. The original audio data and the target speaking style information may also be acquired in other ways; this is not limited here. In addition, the target speaking style information may be derived from acoustic feature extraction of the target speaking object's speech audio, for example by obtaining audio data of the target speaking object from a network platform, social media, or an audio database and extracting speaking style information such as pitch and timbre characteristics through a voiceprint recognition model or d-vector technology.
In step S702 of some embodiments, the original audio data and the speaking style information of the target speaking object are input into the voice conversion model. The model extracts the speech content of the original audio data and fuses it with the speaking style information of the target speaking object, thereby converting the speaking style characteristics of the original audio data and obtaining the target audio data.
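At inference time, steps S701 and S702 amount to encoding the original audio and decoding it with the target style. The sketch below assumes the trained voice conversion model exposes separate encode and decode calls (the embodiments do not name such an interface), with the target style embedding obtained, for example, from a voiceprint model such as the one sketched earlier:

```python
import torch

def convert_voice(conversion_model, original_waveform: torch.Tensor,
                  target_style_vec: torch.Tensor) -> torch.Tensor:
    """Steps S701-S702: keep the speech content of the original audio, swap in the target style."""
    with torch.no_grad():
        content = conversion_model.encode(original_waveform)        # speech content of the original audio
        return conversion_model.decode(content, target_style_vec)   # target voice data
```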
In this voice conversion method, the original audio data is encoded and speech-aligned by the encoding network of the voice conversion model to obtain a target audio embedded vector, and the speech-aligned target audio embedded vector is decoded jointly with the target speaking style information of the target speaking object into new audio data, i.e., the target audio data. The speech content of the target audio data is therefore identical to that of the original audio data, while the target audio data carries the voice characteristics of the target speaking object, such as its timbre and pitch. In this way the original speaking object of the original audio data is converted into the target speaking object without changing the speech content of the original audio data, which effectively improves the voice conversion effect. Further, in intelligent dialogues about insurance products, financial products, and the like, the synthesized speech produced by a dialogue robot can better match the dialogue-style preference of the dialogue object, so the conversation is conducted in a manner and style the dialogue object is more receptive to, improving dialogue quality and effectiveness, service quality, and customer satisfaction.
Referring to fig. 8, an embodiment of the present application further provides a model updating device, which may implement the above model updating method, where the device includes:
a data acquisition module 801, configured to acquire sample speech data of a sample speaking object;
an input module 802, configured to input sample voice data into a preset neural network model, where the neural network model includes an encoding network and a decoding network;
the encoding module 803 is configured to encode the sample speech data through an encoding network to obtain an initial audio feature vector;
the query module 804 is configured to perform index query on the initial audio feature vector based on a preset codebook to obtain an audio frame index, and perform phoneme feature extraction on the initial audio feature vector based on the audio frame index to obtain an initial phoneme feature vector;
a speech alignment module 805, configured to perform speech alignment on the initial phoneme feature vector to obtain a sample audio embedding vector;
the decoding module 806 is configured to decode, through a decoding network, the sample audio embedded vector and the pre-acquired speaking style embedded vector to obtain synthesized speech data; wherein the speaking style embedding vector includes speaking style information of the sample speaking object;
The model updating module 807 is configured to update parameters of the neural network model based on the synthesized voice data and the sample voice data, to obtain a voice conversion model.
The specific implementation manner of the model updating device is basically the same as that of the specific embodiment of the model updating method, and is not described herein.
In addition, an embodiment of the present application further provides a voice conversion device, which may implement the voice conversion method, where the device includes:
the acquisition module is used for acquiring target speaking style information of a target speaking object and original voice data to be processed;
the voice conversion module is used for inputting the original voice data and the target speaking style information into the voice conversion model to perform voice conversion to obtain target voice data, wherein the voice conversion model is obtained according to the model updating device.
The specific implementation of the voice conversion device is basically the same as the specific embodiment of the voice conversion method, and will not be described herein.
The embodiment of the application also provides an electronic device, which comprises: a memory, a processor, a program stored on the memory and capable of running on the processor, and a data bus for realizing connection communication between the processor and the memory, wherein, when the program is executed by the processor, the model updating method or the voice conversion method is implemented. The electronic device can be any intelligent terminal, including a tablet computer, a vehicle-mounted computer, and the like.
Referring to fig. 9, fig. 9 illustrates a hardware structure of an electronic device according to another embodiment, the electronic device includes:
the processor 901 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided by the embodiments of the present application;
the memory 902 may be implemented in the form of a Read Only Memory (ROM), a static storage device, a dynamic storage device, or a Random Access Memory (RAM). The memory 902 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present application are implemented by software or firmware, the relevant program code is stored in the memory 902, and the processor 901 invokes it to perform the model updating method or the voice conversion method of the embodiments of the present application;
an input/output interface 903 for inputting and outputting information;
the communication interface 904 is configured to implement communication interaction between the device and other devices, and may implement communication in a wired manner (e.g. USB, network cable, etc.) or in a wireless manner (e.g. mobile network, Wi-Fi, Bluetooth, etc.);
A bus 905 that transfers information between the various components of the device (e.g., the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);
wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 are communicatively coupled to each other within the device via a bus 905.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to realize the model updating method or the voice conversion method.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments of the present application provide a model updating method, a voice conversion method, a model updating apparatus, an electronic device, and a computer-readable storage medium. Sample voice data of a sample speaking object is acquired and input into a preset neural network model comprising an encoding network and a decoding network. The sample voice data is encoded by the encoding network to obtain an initial audio feature vector; an index query is performed on the initial audio feature vector based on a preset codebook to obtain an audio frame index, and phoneme features are extracted from the initial audio feature vector based on the audio frame index to obtain an initial phoneme feature vector. Speech alignment of the initial phoneme feature vector yields a sample audio embedded vector that represents the text content features of the sample voice data and whose audio length matches that of the sample voice data, which helps control the speech length of the generated synthesized voice data. The sample audio embedded vector and the pre-acquired speaking style embedded vector are decoded by the decoding network to obtain synthesized voice data, so that the synthesized voice data contains voice content and speaking style characteristics close to those of the sample voice data. Finally, the parameters of the neural network model are updated based on the synthesized voice data and the sample voice data, so that the model focuses on learning the similarity between the synthesized and sample voice data in both content and style, which effectively improves the updating effect of the model and the accuracy of voice conversion performed by the voice conversion model. The voice conversion model converts the original speaking object of the original audio data into the target speaking object without changing the speech content of the original audio data, representing both the speech content and the speaking characteristics of the target speaking object well and effectively improving the voice conversion effect. Further, in intelligent dialogues about insurance products, financial products, and the like, the synthesized speech produced by a dialogue robot can better match the dialogue-style preference of the dialogue object, so that the conversation is conducted in a manner and style the dialogue object is more receptive to, enabling intelligent voice dialogue services and improving dialogue quality and effectiveness, service quality, and customer satisfaction, thereby increasing the business success rate.
The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and as those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the solutions shown in fig. 1-7 are not limiting to embodiments of the present application and may include more or fewer steps than shown, or certain steps may be combined, or different steps.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of" and similar expressions refer to any combination of the listed items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may be single or plural.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including multiple instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing a program.
Preferred embodiments of the present application are described above with reference to the accompanying drawings, and thus do not limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A method of model updating, the method comprising:
acquiring sample voice data of a sample speaking object;
inputting the sample voice data into a preset neural network model, wherein the neural network model comprises an encoding network and a decoding network;
the sample voice data is coded through the coding network, so that an initial audio feature vector is obtained;
index query is carried out on the initial audio feature vector based on a preset codebook to obtain an audio frame index, and phoneme feature extraction is carried out on the initial audio feature vector based on the audio frame index to obtain an initial phoneme feature vector;
performing voice alignment on the initial phoneme feature vector to obtain a sample audio embedding vector;
the sample audio embedded vector and the pre-acquired speaking style embedded vector are decoded through the decoding network to obtain synthesized voice data; wherein the speaking style embedding vector includes speaking style information of the sample speaking object;
and updating parameters of the neural network model based on the synthesized voice data and the sample voice data to obtain a voice conversion model.
2. The method for updating a model according to claim 1, wherein the indexing the initial audio feature vector based on a preset codebook to obtain an audio frame index, and extracting phoneme features from the initial audio feature vector based on the audio frame index to obtain an initial phoneme feature vector comprises:
dividing the initial audio feature vector to obtain a plurality of audio frame vectors;
index inquiry is carried out on the audio frame vectors based on the reference vectors of the preset codebook to obtain audio frame indexes corresponding to each audio frame vector, wherein the preset codebook comprises reference vectors and audio frame indexes which are in one-to-one correspondence, and mapping relations between each audio frame index and phoneme characteristics;
extracting phoneme features corresponding to the audio frame vectors according to the audio frame indexes and the mapping relation;
and merging the phoneme features of all the audio frame vectors to obtain the initial phoneme feature vector.
3. The method of claim 1, wherein said performing speech alignment on said initial phoneme feature vector to obtain a sample audio embedding vector comprises:
performing duration prediction on the initial phoneme feature vector based on a preset time predictor to obtain a duration sequence of the initial phoneme feature vector;
and carrying out voice alignment on the initial phoneme characteristic vector according to the duration time sequence to obtain the sample audio embedding vector.
4. A model updating method according to claim 3, wherein the performing duration prediction on the initial phoneme feature vector based on a preset time predictor to obtain a duration sequence of the initial phoneme feature vector comprises:
dividing the initial phoneme feature vector to obtain a plurality of phoneme feature fragments;
identifying the phoneme characteristic fragments based on a preset time predictor to obtain the phoneme category of the initial phoneme characteristic vector and the number of phonemes of each phoneme category;
and obtaining the duration sequence according to the phoneme category and the phoneme number.
5. The method of model updating according to claim 4, wherein the performing speech alignment on the initial phoneme feature vector according to the duration sequence to obtain the sample audio embedding vector comprises:
embedding the initial phoneme feature vector to obtain an audio text embedded vector;
dividing the audio text embedded vector according to the duration time sequence to obtain intermediate vectors of each phoneme category, wherein the number of the intermediate vectors is equal to the number of the phonemes;
carrying out mean value calculation on the intermediate vectors to obtain candidate vectors of each phoneme category;
copying the candidate vectors according to the number of the phonemes to obtain target vectors of each phoneme category, wherein the number of the target vectors is equal to the number of the phonemes;
and performing splicing processing on all the target vectors to obtain the sample audio embedded vector.
6. The method according to any one of claims 1 to 5, wherein before decoding the sample audio embedded vector and the pre-acquired speaking style embedded vector through the decoding network to obtain the synthesized speech data, the method further comprises acquiring the speaking style embedded vector, and specifically comprises:
inputting the sample voice data into a preset voiceprint recognition model, wherein the voiceprint recognition model comprises a segmentation layer and a hiding layer;
dividing the sample voice data based on the dividing layer to obtain a plurality of sample voice fragments;
carrying out style recognition on each sample voice fragment based on the hidden layer to obtain a plurality of initial style embedded vectors;
and carrying out mean value calculation on the plurality of initial style embedded vectors to obtain the speaking style embedded vector.
7. A method of speech conversion, the method comprising:
acquiring target speaking style information of a target speaking object and original voice data to be processed;
inputting the original voice data and the target speaking style information into a voice conversion model for voice conversion to obtain target voice data, wherein the voice conversion model is obtained according to the model updating method of any one of claims 1 to 6.
8. A model updating apparatus, characterized in that the model updating apparatus comprises:
the data acquisition module is used for acquiring sample voice data of a sample speaking object;
the input module is used for inputting the sample voice data into a preset neural network model, wherein the neural network model comprises an encoding network and a decoding network;
the coding module is used for coding the sample voice data through the coding network to obtain an initial audio feature vector;
the query module is used for carrying out index query on the initial audio feature vector based on a preset codebook to obtain an audio frame index, and carrying out phoneme feature extraction on the initial audio feature vector based on the audio frame index to obtain an initial phoneme feature vector;
the voice alignment module is used for carrying out voice alignment on the initial phoneme characteristic vector to obtain a sample audio embedding vector;
the decoding module is used for decoding the sample audio embedded vector and the pre-acquired speaking style embedded vector through the decoding network to obtain synthesized voice data; wherein the speaking style embedding vector includes speaking style information of the sample speaking object;
and the model updating module is used for updating parameters of the neural network model based on the synthesized voice data and the sample voice data to obtain a voice conversion model.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements:
the model updating method according to any one of claims 1 to 6;
or,
the speech conversion method of claim 7.
10. A computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements:
the model updating method according to any one of claims 1 to 6;
or,
the speech conversion method of claim 7.
CN202310638552.4A 2023-05-31 2023-05-31 Model updating method and device, voice conversion method and device and storage medium Pending CN116543780A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310638552.4A CN116543780A (en) 2023-05-31 2023-05-31 Model updating method and device, voice conversion method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310638552.4A CN116543780A (en) 2023-05-31 2023-05-31 Model updating method and device, voice conversion method and device and storage medium

Publications (1)

Publication Number Publication Date
CN116543780A true CN116543780A (en) 2023-08-04

Family

ID=87448823

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310638552.4A Pending CN116543780A (en) 2023-05-31 2023-05-31 Model updating method and device, voice conversion method and device and storage medium

Country Status (1)

Country Link
CN (1) CN116543780A (en)

Similar Documents

Publication Publication Date Title
CN115641834A (en) Voice synthesis method and device, electronic equipment and storage medium
CN116543768A (en) Model training method, voice recognition method and device, equipment and storage medium
CN114021582B (en) Spoken language understanding method, device, equipment and storage medium combined with voice information
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116578688A (en) Text processing method, device, equipment and storage medium based on multiple rounds of questions and answers
CN112434514A (en) Multi-granularity multi-channel neural network based semantic matching method and device and computer equipment
CN115394321A (en) Audio emotion recognition method, device, equipment, storage medium and product
CN116611459B (en) Translation model training method and device, electronic equipment and storage medium
CN117275466A (en) Business intention recognition method, device, equipment and storage medium thereof
CN116702736A (en) Safe call generation method and device, electronic equipment and storage medium
CN116386594A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116644765A (en) Speech translation method, speech translation device, electronic device, and storage medium
CN116312463A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116580704A (en) Training method of voice recognition model, voice recognition method, equipment and medium
CN113836308B (en) Network big data long text multi-label classification method, system, device and medium
CN115273805A (en) Prosody-based speech synthesis method and apparatus, device, and medium
CN115995225A (en) Model training method and device, speech synthesis method and device and storage medium
CN116543780A (en) Model updating method and device, voice conversion method and device and storage medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN116564274A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN115641860A (en) Model training method, voice conversion method and device, equipment and storage medium
CN116469372A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN115206333A (en) Voice conversion method, voice conversion device, electronic equipment and storage medium
CN116645961A (en) Speech recognition method, speech recognition device, electronic apparatus, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination